
Research Article
Two-Phase Iteration for Value Function Approximation and Hyperparameter Optimization in Gaussian-Kernel-Based Adaptive Critic Design

Xin Chen,1 Penghuan Xie,2 Yonghua Xiong,1 Yong He,1 and Min Wu1

1 School of Automation, China University of Geosciences, Wuhan, Hubei 430074, China
2 School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China

Correspondence should be addressed to Yonghua Xiong; xiongyh@cug.edu.cn

Received 7 January 2015; Accepted 26 May 2015

Academic Editor: Simon X. Yang

Copyright © 2015 Xin Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Hindawi Publishing Corporation, Mathematical Problems in Engineering, Volume 2015, Article ID 760459, 14 pages, http://dx.doi.org/10.1155/2015/760459

Adaptive Dynamic Programming (ADP) with critic-actor architecture is an effective way to perform online learning control. To avoid the subjectivity in the design of a neural network that serves as a critic network, kernel-based adaptive critic design (ACD) was developed recently. There are two essential issues for a static kernel-based model: how to determine proper hyperparameters in advance and how to select the right samples to describe the value function. They both rely on the assessment of sample values. Based on the theoretical analysis, this paper presents a two-phase simultaneous learning method for a Gaussian-kernel-based critic network. It is able to estimate the values of samples without infinitely revisiting them, and the hyperparameters of the kernel model are optimized simultaneously. Based on the estimated sample values, the sample set can be refined by adding alternatives or deleting redundancies. Combining this critic design with an actor network, we present a Gaussian-kernel-based Adaptive Dynamic Programming (GK-ADP) approach. Simulations are used to verify its feasibility, particularly the necessity of two-phase learning, the convergence characteristics, and the improvement of the system performance by using a varying sample set.

1. Introduction

Reinforcement learning (RL) is an interactive machine learning method for solving sequential decision problems. It is well known as an important learning method in unknown or dynamic environments. Different from supervised learning and unsupervised learning, RL interacts with the environment through a trial mechanism and modifies its action policies to maximize the payoffs [1]. From a theoretical point of view, it is strongly connected with direct and indirect adaptive optimal control methods [2].

Traditional RL research focused on discrete state/action systems, where the state/action only takes on a finite number of prescribed discrete values. The learning space grows exponentially as the number of states and the number of allowed actions increase. This leads to the so-called curse of dimensionality (CoD) [2]. In order to mitigate the CoD problem, function approximation and generalization methods [3] are introduced to store the optimal value and the optimal control as functions of the state vector. Generalization methods based on parametric models such as neural networks [4-6] have become one of the popular means to solve RL problems in continuous environments.

Currently, research on RL in continuous environments to construct learning systems for nonlinear optimal control has attracted the attention of researchers and scholars in the control domain, for the reason that it can modify its policy based only on the value function, without knowing the model structure or the parameters in advance. A family of new RL techniques known as Approximate or Adaptive Dynamic Programming (ADP) (also known as Neurodynamic Programming or Adaptive Critic Designs (ACDs)) has received more and more research interest [7, 8]. ADPs are based on the actor-critic structure, in which there is a critic assessing the value of the action or control policy applied by an actor, and an actor modifying its action based on the assessment



of values. In the literature, ADP approaches are categorized into the following main schemes: heuristic dynamic programming (HDP), dual heuristic programming (DHP), globalized dual heuristic programming (GDHP), and their action-dependent versions [9, 10].

ADP research always adopts multilayer perceptron neural networks (MLPNNs) as the critic design. Vrabie and Lewis proposed an online approach to continuous-time direct adaptive optimal control, which made use of neural networks to parametrically represent the control policy and the performance of the control system [11]. Liu et al. solved the constrained optimal control problem of unknown discrete-time nonlinear systems based on the iterative ADP algorithm via the GDHP technique with three neural networks [12]. In fact, different kinds of neural networks (NNs) play important roles in ADP algorithms, such as radial basis function NNs [3], wavelet basis function NNs [13], and echo state networks [14].

Besides the benefits brought by NNs, ADP methods always suffer from some problems concerning the design of NNs. On the one hand, the learning control performance greatly depends on the empirical design of the critic networks, especially the manual setting of the hidden layer or the basis functions. On the other hand, due to the local minima in neural network training, how to improve the quality of the final policies is still an open problem [15].

As we can see, it is difficult to evaluate the effectiveness of a parametric model when the knowledge of the model's order or of the nonlinear characteristics of the system is not sufficient. Compared with parametric modeling methods, nonparametric modeling methods, especially kernel methods [16, 17], do not need to set the model structure in advance. Hence kernel machines have been widely studied to realize nonlinear and nonparametric modeling. Engel et al. proposed the kernel recursive least-squares algorithm to construct minimum mean-squared-error solutions to nonlinear least-squares problems [18]. As popular kernel machines, support vector machines (SVMs) have also been applied to nonparametric modeling problems. Dietterich and Wang combined linear programming with SVMs to find value function approximations [19]. Similar research was published in [20], in which least-squares SVMs were used. Nevertheless, both focused on discrete state/action spaces and more or less lacked theoretical results on the obtained policies.

In addition to SVMs, Gaussian processes (GPs) have become an alternative generalization method. GP models are powerful nonparametric tools for approximate Bayesian inference and learning. In comparison with other popular nonlinear architectures, such as multilayer perceptrons, their behavior is conceptually simpler to understand, and model fitting can be achieved without resorting to nonconvex optimization routines [21, 22]. In [23], Engel et al. first applied GPs in temporal-difference (TD) learning for MDPs with stochastic rewards and deterministic transitions. They derived a new GPTD algorithm in [24] that overcame the limitation of deterministic transitions and extended GPTD to the estimation of state-action values. The GPTD algorithm addresses only value approximation, so it should be combined with actor-critic methods or other policy iteration methods to solve learning control problems.

An alternative approach employing GPs in RL is the model-based value iteration or policy iteration method, in which a GP model is used to model the system dynamics and represent the value function [25]. In [26], an approximate value-function-based RL algorithm named Gaussian process dynamic programming (GPDP) was presented, which built the dynamic transition model, the value function model, and the action policy model, respectively, using GPs. In this way, the sample set is adjusted to a reasonable shape, with high sample densities near the equilibriums or the places where the value function changes dramatically. Thus it is good at controlling nonlinear systems with complex dynamics. A major shortcoming, even if the relatively high computational cost is endurable, is that the states in the sample set need to be revisited again and again in order to update their value functions. Since this condition is impractical in real implementations, it diminishes the appeal of employing this method.

Kernel-based methods have also been introduced to ADP. In [15], a novel framework of ACDs with sparse kernel machines was presented by integrating kernel methods into the critic network. A sparsification method based on the approximately linear dependence (ALD) analysis was used to sparsify the kernel machines. Obviously, this method can overcome the difficulty of presetting the model structure in parametric models and realize actor-critic learning online [27]. However, the selection of samples based on the ALD analysis is an offline procedure that does not consider the distribution of the value function. Therefore, the data samples cannot be adjusted online, which makes the method more suitable for control systems with smooth dynamics, where the value function changes gently.

We think GPDP and ACDs with sparse kernel machines are complementary. As indicated in [28], the prediction of a GP is viewed as a linear combination of the covariances between the new point and the samples. Hence it seems reasonable to introduce a kernel machine with GPs to build the critic network in ACDs, provided the values of the samples are known or at least can be assessed numerically. Then the sample set can be adjusted online during critic-actor learning.

The major problem here is how to realize value function learning and GP model updating simultaneously, especially under the condition that the samples of the state-action space can hardly be revisited infinitely often in order to approximate their values. To tackle this problem, a two-phase iteration is developed in order to get the optimal control policy for a system whose dynamics are unknown a priori.

2. Description of the Problem

In general, ADP is an actor-critic method, which approximates the value functions and policies to encourage the realization of generalization in MDPs with large or continuous spaces. The critic design plays the most important role, because it determines how the actor optimizes its action. Hence we give a brief introduction to both kernel-based ACDs and GPs in order to derive a clear description of the theoretical problem.


2.1. Kernel-Based ACDs. Kernel-based ACDs mainly consist of a critic network, a kernel-based feature learning module, a reward function, an actor network/controller, and a model of the plant. The critic constructed by the kernel machine is used to approximate the value functions or their derivatives. Then the output of the critic is used in the training process of the actor, so that policy gradients can be computed. As the actor finally converges, the optimal action policy mapping states to actions is described by this actor.

A traditional neural network based on the kernel machine and the samples serves as the model of the value function, just as the following equation shows, and a recursive algorithm such as KLSTD [29] serves as value function approximation:

$$Q(x) = \sum_{i=1}^{L} \alpha_i k(x, x_i), \tag{1}$$

where $x$ and $x_i$ represent state-action pairs $(s, a)$ and $(s_i, a_i)$, $\alpha_i$, $i = 1, 2, \ldots, L$, are the weights, $(s_i, a_i)$ represents the selected state-action pairs in the sample set, and $k(x, x_i)$ is a kernel function.

The key of critic learning is the update of the weight vector $\alpha$. The value function modeling in continuous state/action space is a regression problem. From the view of the model, the NN structure is a reproducing kernel space spanned by the samples, in which the value function is expressed as a linear regression function. Obviously, as the basis of the Hilbert space, how to select samples determines the VC dimension of identification as well as the performance of value function approximation.
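As a concrete illustration of the critic model (1), the following minimal Python sketch (ours, not from the paper; the Gaussian form of the kernel anticipates (3) below, and all names are assumed) evaluates $Q$ for a state-action pair.

```python
import numpy as np

def gaussian_kernel(x, xi, lam, v1=1.0):
    """Gaussian kernel between two state-action vectors, cf. (3); lam is the diagonal of Lambda."""
    diff = np.asarray(x) - np.asarray(xi)
    return v1 * np.exp(-0.5 * np.sum(diff ** 2 / lam))

def critic_q(x, samples, alpha, lam, v1=1.0):
    """Kernel critic of (1): Q(x) = sum_i alpha_i * k(x, x_i)."""
    k = np.array([gaussian_kernel(x, xi, lam, v1) for xi in samples])
    return float(k @ alpha)

# toy usage: L = 3 samples in a 2-D state x 1-D action space
samples = np.array([[0.0, 0.0, 0.0],
                    [0.1, -0.2, 0.5],
                    [-0.3, 0.4, -0.5]])
alpha = np.array([1.0, -0.5, 0.2])   # weights of the critic
lam = np.array([0.5, 0.5, 0.5])      # length-scale hyperparameters
print(critic_q([0.05, 0.1, 0.2], samples, alpha, lam))
```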

If only ALD-based kernel sparsification is used, sample selection considers only the approximate linear independence of the samples. So it is hard to evaluate how good the sample set is, because the sample selection does not consider the distribution of the value function, and the performance of the ALD analysis is seriously affected by the hyperparameters of the kernel function, which are predetermined empirically and fixed during learning.

If the hyperparameters can be optimized online, and the value function w.r.t. the samples can be evaluated by iteration algorithms, the critic network will be optimized not only by value approximation but also by hyperparameter optimization. Moreover, with approximated sample values there is a direct way to evaluate the validity of the sample set in order to regulate the set online. Thus, in this paper we turn to Gaussian processes to construct the criterion for samples and hyperparameters.

2.2. GP-Based Value Function Model. For an MDP, the data samples $x_i$ and the corresponding $Q$ values can be collected by observing the MDP. Here $x_i$ is the state-action pair $(s_i, a_i)$, $s \in R^N$, $a \in R^M$, and $Q(x)$ is the value function defined as

$$Q^*(x) = R(x) + \beta\int p(s' \mid x)\,V(s')\,ds', \tag{2}$$

where $V(s) = \max_a Q(x)$.

Given a sample set $\{X_L, y\}$ collected from a continuous dynamic system, where $y = Q(X_L) + \varepsilon$, $\varepsilon \sim N(0, v_0)$, Gaussian regression with the covariance function shown in the following equation is a well-known modeling technique to infer the $Q$ function:

$$k(x_p, x_q) = v_1\exp\!\left[-\tfrac{1}{2}(x_p - x_q)^T\Lambda^{-1}(x_p - x_q)\right], \tag{3}$$

where $\Lambda = \mathrm{diag}([w_1, w_2, \ldots, w_{N+M}])$.

Assuming additive independent identically distributed Gaussian noise $\varepsilon$ with variance $v_0$, the prior on the noisy observations becomes

$$K_L = K(X_L, X_L) + v_0 I_L, \tag{4}$$

where

$$K(X_L, X_L) = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_L)\\ k(x_2, x_1) & \ddots & & \vdots\\ \vdots & & \ddots & \vdots\\ k(x_L, x_1) & \cdots & \cdots & k(x_L, x_L)\end{bmatrix}. \tag{5}$$

The parameters $w_d$, $v_0$, and $v_1$ are the hyperparameters of the Gaussian regression and are collected within the vector $\theta = [w_1 \cdots w_{N+M}\ v_1\ v_0]^T$.

For an arbitrary input $x^*$, the predictive distribution of the function value is Gaussian distributed, with mean and variance given by

$$E[Q(x^*)] = K(x^*, X_L)\,K_L^{-1}\,y, \tag{6}$$

$$\mathrm{var}[Q(x^*)] = k(x^*, x^*) - K(x^*, X_L)\,K_L^{-1}\,K(X_L, x^*). \tag{7}$$

Comparing (6) to (1), we find that, if we let $\alpha = K_L^{-1}y$, the neural network is also regarded as Gaussian regression. In other words, the critic network can be constructed based on the Gaussian kernel machine if the following conditions are satisfied.

Condition 1. The hyperparameters $\theta$ of the Gaussian kernel are known.

Condition 2. The values $y$ w.r.t. all sample states $X_L$ are known.

With a Gaussian-kernel-based critic network, the sample state-action pairs and the corresponding values are known. Then a criterion, such as the comprehensive utility proposed in [26], can be set up in order to refine the sample set online. At the same time, it is convenient to optimize the hyperparameters in order to approximate the $Q$ value function more accurately. Thus, besides the advantages brought by the kernel machine, the critic based on the Gaussian kernel will be better at approximating value functions.
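For illustration only, the predictions (6)-(7) and the identification $\alpha = K_L^{-1}y$ can be written in a few lines of numpy; this sketch is ours and all variable names are assumed.

```python
import numpy as np

def kernel_matrix(XA, XB, lam, v1):
    """K(XA, XB) with the Gaussian covariance (3); lam is the diagonal of Lambda."""
    d = XA[:, None, :] - XB[None, :, :]
    return v1 * np.exp(-0.5 * np.sum(d ** 2 / lam, axis=2))

def gp_predict(x_star, XL, y, lam, v1, v0):
    """Predictive mean (6) and variance (7) at a single input x_star."""
    KL = kernel_matrix(XL, XL, lam, v1) + v0 * np.eye(len(XL))   # eq. (4)
    k_star = kernel_matrix(x_star[None, :], XL, lam, v1)          # 1 x L
    alpha = np.linalg.solve(KL, y)                                # alpha = K_L^{-1} y
    mean = k_star @ alpha                                         # eq. (6)
    var = v1 - k_star @ np.linalg.solve(KL, k_star.T)             # eq. (7)
    return mean.item(), var.item()

# toy usage: 4 random samples in a 3-D state-action space
rng = np.random.default_rng(0)
XL, y = rng.normal(size=(4, 3)), rng.normal(size=4)
print(gp_predict(np.zeros(3), XL, y, lam=np.ones(3), v1=0.5, v0=1e-3))
```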

Consider Condition 1. If the values $y$ w.r.t. $X_L$ are known (note that this is indeed Condition 2), the common way to get the hyperparameters is by evidence maximization, where the log-evidence is given by

$$L(\theta) = -\frac{1}{2}y^T K_L^{-1}y - \frac{1}{2}\log\big|K_L\big| - \frac{n}{2}\log 2\pi. \tag{8}$$


It requires the calculation of the derivative of $L(\theta)$ w.r.t. each $\theta_i$, given by

$$\frac{\partial L(\theta)}{\partial\theta_i} = -\frac{1}{2}\mathrm{tr}\!\left[K_L^{-1}\frac{\partial K_L}{\partial\theta_i}\right] + \frac{1}{2}\alpha^T\frac{\partial K_L}{\partial\theta_i}\alpha, \tag{9}$$

where $\mathrm{tr}[\cdot]$ denotes the trace and $\alpha = K_L^{-1}y$.

Consider Condition 2. For an unknown system, if $K_L$ is known, that is, the hyperparameters are known (note that this is indeed Condition 1), the update of the critic network is transferred to the update of the values w.r.t. the samples by using value iteration.

According to the analysis, both conditions are interdependent. That means the update of the critic depends on known hyperparameters, and the optimization of the hyperparameters depends on accurate sample values.

Hence we need a comprehensive iteration method to realize value approximation and optimization of the hyperparameters simultaneously. A direct way is to update them alternately. Unfortunately, this way is not reasonable, because these two processes are tightly coupled. For example, temporal difference errors drive the value approximation, but the change of the weights $\alpha$ will cause the change of the hyperparameters $\theta$ simultaneously; then it is difficult to tell whether a temporal difference error is induced by observation or by the change of the Gaussian regression model.

To solve this problem, a kind of two-phase value iteration for the critic network is presented in the next section, and the conditions of convergence are analyzed.
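Before presenting the two-phase iteration, the evidence side of the coupling, (8)-(9), can be made concrete with the following rough sketch (ours; $\partial K_L/\partial\theta_i$ is approximated by finite differences purely for brevity, and all names are assumed).

```python
import numpy as np

def log_evidence(KL, y):
    """L(theta) of (8) for a given K_L and sample values y."""
    n = len(y)
    sign, logdet = np.linalg.slogdet(KL)
    return -0.5 * y @ np.linalg.solve(KL, y) - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

def evidence_gradient(theta, y, build_KL, eps=1e-6):
    """dL/dtheta_i of (9); dK_L/dtheta_i is taken by finite differences."""
    KL = build_KL(theta)
    KL_inv = np.linalg.inv(KL)
    alpha = KL_inv @ y
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        th = theta.copy(); th[i] += eps
        dK = (build_KL(th) - KL) / eps
        grad[i] = -0.5 * np.trace(KL_inv @ dK) + 0.5 * alpha @ dK @ alpha
    return grad

# toy usage: theta = [w, v1, v0] for a 1-D input, 3 samples
X = np.array([[0.0], [0.5], [1.0]])
y = np.array([0.1, 0.4, 0.2])
def build_KL(theta):
    w, v1, v0 = theta
    d = X - X.T
    return v1 * np.exp(-0.5 * d ** 2 / w) + v0 * np.eye(len(X))
theta = np.array([1.0, 0.5, 0.1])
print(log_evidence(build_KL(theta), y), evidence_gradient(theta, y, build_KL))
```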

3. Two-Phase Value Iteration for Critic Network Approximation

First, a proposition is given to describe the relationship between the hyperparameters and the sample value function.

Proposition 1. The hyperparameters are optimized by evidence maximization according to the samples and their Q values, where the derivative of the log-evidence satisfies

$$\frac{\partial L(\theta)}{\partial\theta_i} = -\frac{1}{2}\mathrm{tr}\!\left[K_L^{-1}\frac{\partial K_L}{\partial\theta_i}\right] + \frac{1}{2}\alpha^T\frac{\partial K_L}{\partial\theta_i}\alpha = 0, \quad i = 1, \ldots, D, \tag{10}$$

where $\alpha = K_L^{-1}y$. It can be proved that, for arbitrary hyperparameters $\theta$, if $\alpha \neq 0$, (10) defines an implicit function, or a continuously differentiable function, as follows:

$$\theta = f(\alpha) = \big[f_1(\alpha)\ f_2(\alpha)\ \cdots\ f_D(\alpha)\big]^T. \tag{11}$$

Then the two-phase value iteration for the critic network is described by the following theorem.

Theorem 2. Given the following conditions:

(1) the system is bounded-input bounded-output (BIBO) stable;
(2) the immediate reward $r$ is bounded;
(3) for $k = 1, 2$, $\eta^k_t \to 0$ and $\sum_{t=1}^{\infty}\eta^k_t = \infty$;

the following iteration process is convergent:

$$\begin{aligned}
\alpha_{t+1} &= \alpha_t,\\
\theta_{t+1} &= \theta_t + \eta^1_t\big(f(\alpha_t) - \theta_t\big),\\
\alpha_{t+2} &= \alpha_{t+1} + \eta^2_t\,k^{\aleph}_{\theta_t}(x, X_L)\cdot\big[r + \beta V(\theta_t, x') - k_{\theta_t}(x, X_L)\,\alpha_{t+1}\big],\\
\theta_{t+2} &= \theta_{t+1},
\end{aligned} \tag{12}$$

where $V(\theta_t, x') = \max_a k_{\theta_t}(x', X_L)\,\alpha(t)$ and $k^{\aleph}_{\theta_t}(x, X_L)$ is a kind of pseudoinversion of $k_{\theta_t}(x, X_L)$; that is, $k_{\theta_t}(x, X_L)\,k^{\aleph}_{\theta_t}(x, X_L) = 1$.

Proof. From (12) it is clear that the two phases include the update of the hyperparameters in phase 1, which is viewed as the update of the generalization model, and the update of the samples' values in phase 2, which is viewed as the update of the critic network.

The convergence of the iterative algorithm is proved based on the stochastic approximation Lyapunov method.

Define

$$P_t k_{\theta_t}(x, X_L)\,\alpha_t = r + \beta V(\theta_t, x') = r + \beta\max_a k_{\theta_t}(x', X_L)\,\alpha(t). \tag{13}$$

Equation (12) is rewritten as

$$\begin{aligned}
\alpha_{t+1} &= \alpha_t,\\
\theta_{t+1} &= \theta_t + \eta^1_t\big(f(\alpha_t) - \theta_t\big),\\
\alpha_{t+2} &= \alpha_{t+1} + \eta^2_t\,k^{\aleph}_{\theta_t}(x, X_L)\cdot\big[P_{t+1}k_{\theta_t}(x, X_L)\,\alpha_{t+1} - k_{\theta_t}(x, X_L)\,\alpha_{t+1}\big],\\
\theta_{t+2} &= \theta_{t+1}.
\end{aligned} \tag{14}$$

Define the approximation errors as

$$\varphi^1_{t+2} = \varphi^1_{t+1} = \theta_{t+1} - f(\alpha_{t+1}) = \theta_t - f(\alpha_{t+1}) + \eta^1_t\big[f(\alpha_t) - \theta_t\big] = (1 - \eta^1_t)\,\varphi^1_t, \tag{15}$$

$$\begin{aligned}
\varphi^2_{t+2} &= (\alpha_{t+2} - \alpha^*) = (\alpha_{t+1} - \alpha^*) + \eta^2_t\,k^{\aleph}_{\theta_t}(x, X_L)\cdot\big[P_{t+1}k_{\theta_t}(x, X_L)\,\alpha_{t+1} - P_{t+1}k_{\theta_t}(x, X_L)\,\alpha^*\\
&\qquad + P_{t+1}k_{\theta_t}(x, X_L)\,\alpha^* - k_{\theta_t}(x, X_L)\,\alpha_{t+1}\big]\\
&= \varphi^2_{t+1} + \eta^2_t\,k^{\aleph}_{\theta_t}(x, X_L)\cdot\big[P_{t+1}k_{\theta_t}(x, X_L)\,\alpha_{t+1} - P_{t+1}k_{\theta_t}(x, X_L)\,\alpha^*\big]\\
&\qquad + \eta^2_t\,k^{\aleph}_{\theta_t}(x, X_L)\cdot\big[P_{t+1}k_{\theta_t}(x, X_L)\,\alpha^* - k_{\theta_t}(x, X_L)\,\alpha_{t+1}\big].
\end{aligned} \tag{16}$$

Further define

$$\begin{aligned}
\lambda_1(x, \alpha_{t+1}, \alpha^*) &= k^{\aleph}_{\theta_t}(x, X_L)\cdot\big[P_{t+1}k_{\theta_t}(x, X_L)\,\alpha_{t+1} - P_{t+1}k_{\theta_t}(x, X_L)\,\alpha^*\big],\\
\lambda_2(x, \alpha_{t+1}, \alpha^*) &= k^{\aleph}_{\theta_t}(x, X_L)\cdot\big[P_{t+1}k_{\theta_t}(x, X_L)\,\alpha^* - k_{\theta_t}(x, X_L)\,\alpha_{t+1}\big].
\end{aligned} \tag{17}$$

Thus (16) is reexpressed as

$$\varphi^2_{t+2} = \varphi^2_{t+1} + \eta^2_t\,\lambda_1(x, \alpha_{t+1}, \alpha^*) + \eta^2_t\,\lambda_2(x, \alpha_{t+1}, \alpha^*). \tag{18}$$

Let $\varphi_t = [(\varphi^1_t)^T\ (\varphi^2_t)^T]^T$; then the two-phase iteration has the shape of a stochastic approximation; that is,

$$\begin{aligned}
\begin{bmatrix}\varphi^1_{t+1}\\ \varphi^2_{t+1}\end{bmatrix} &= \begin{bmatrix}\varphi^1_t\\ \varphi^2_t\end{bmatrix} + \begin{bmatrix}-\eta^1_t\,\varphi^1_t\\ 0\end{bmatrix},\\
\begin{bmatrix}\varphi^1_{t+2}\\ \varphi^2_{t+2}\end{bmatrix} &= \begin{bmatrix}\varphi^1_{t+1}\\ \varphi^2_{t+1}\end{bmatrix} + \begin{bmatrix}0\\ \eta^2_t\,\lambda_1(x, \alpha_{t+1}, \alpha^*) + \eta^2_t\,\lambda_2(x, \alpha_{t+1}, \alpha^*)\end{bmatrix}.
\end{aligned} \tag{19}$$

Define

$$V(\varphi_t) = \varphi_t^T\Lambda\varphi_t = \big[(\varphi^1_t)^T\ (\varphi^2_t)^T\big]\begin{bmatrix}I_D & 0\\ 0 & k^T_{\theta_t}(x^*, X_L)\,k_{\theta_t}(x^*, X_L)\end{bmatrix}\begin{bmatrix}\varphi^1_t\\ \varphi^2_t\end{bmatrix}, \tag{20}$$

where $D$ is the scale of the hyperparameters and $x^*$ will be defined later. Let $k^*_t$ represent $k^T_{\theta_t}(x^*, X_L)\,k_{\theta_t}(x^*, X_L)$ for short.

Obviously, (20) is a positive definite matrix and $|x^*| < \infty$, $|k^*_t| < \infty$. Hence $V(\varphi_t)$ is a Lyapunov functional. At moment $t$, the conditional expectation of the positive definite matrix is

$$E_t V_{t+2}(\varphi_{t+2}) = E_t\big[\varphi^T_{t+2}\Lambda\varphi_{t+2}\big] = E_t\big[(\varphi^1_{t+2})^T\ (\varphi^2_{t+2})^T\big]\begin{bmatrix}I_D & 0\\ 0 & k^*_t\end{bmatrix}\begin{bmatrix}\varphi^1_{t+2}\\ \varphi^2_{t+2}\end{bmatrix}. \tag{21}$$

It is easy to compute the first-order Taylor expansion of (21) as

$$E_t V_{t+2}(\varphi_{t+2}) = E_t V_{t+1}(\varphi_{t+1}) + \eta^2_t E_t\big[V'_{t+1}(\varphi_{t+1})(\varphi_{t+2} - \varphi_{t+1})\big] + (\eta^2_t)^2 C_2\,E_t\big[(\varphi_{t+2} - \varphi_{t+1})^T(\varphi_{t+2} - \varphi_{t+1})\big], \tag{22}$$

in which the first item on the right of (22) is

$$E_t V_{t+1}(\varphi_{t+1}) = V_t(\varphi_t) + \eta^1_t E_t\big[V'_t(\varphi_t)(\varphi_{t+1} - \varphi_t)\big] + (\eta^1_t)^2 C_1\,E_t\big[(\varphi_{t+1} - \varphi_t)^T(\varphi_{t+1} - \varphi_t)\big]. \tag{23}$$

Substituting (23) into (22) yields

$$E_t V_{t+2}(\varphi_{t+2}) - V_t(\varphi_t) = \eta^1_t E_t\big[V'_t(\varphi_t)\,Y^1_t\big] + \eta^2_t E_t\big[V'_{t+1}(\varphi_{t+1})\,Y^2_{t+1}\big] + (\eta^1_t)^2 C_1\,E_t\big[(Y^1_t)^2\big] + (\eta^2_t)^2 C_2\,E_t\big[(Y^2_{t+1})^2\big], \tag{24}$$

where

$$Y^1_t = \begin{bmatrix}-\varphi^1_t\\ 0\end{bmatrix}, \qquad Y^2_{t+1} = \begin{bmatrix}0\\ P_{t+1}k_{\theta_t}(x, X_L)\,\alpha_{t+1} - k_{\theta_t}(x, X_L)\,\alpha_{t+1}\end{bmatrix} = \begin{bmatrix}0\\ \lambda_1(x, \alpha_{t+1}, \alpha^*) + \lambda_2(x, \alpha_{t+1}, \alpha^*)\end{bmatrix}. \tag{25}$$

Consider the last two items of (24) first. If the immediate reward is bounded and the infinite discounted reward is applied, the value function is bounded. Hence $|\alpha_t| < \infty$.

If the system is BIBO stable, then $\exists\,\Sigma_0, \Sigma_1 \subset R^{n+m}$, where $n, m$ are the dimensions of the state space and action space, respectively, such that $\forall x_0 \in \Sigma_0$, $x_\infty \in \Sigma_1$; that is, the policy space is bounded.

According to Proposition 1, when the policy space is bounded and $|\alpha_t| < \infty$, $|\theta_t| < \infty$, there exists a constant $c_1$ so that

$$E\big|V'_{t+1}(\varphi_{t+1})\,Y^2_{t+1}\big| \le c_1. \tag{26}$$

In addition,

$$E\big|Y^2_{t+1}\big|^2 = E\big|[\lambda_1(x, \alpha_{t+1}, \alpha^*) + \lambda_2(x, \alpha_{t+1}, \alpha^*)]^2\big| \le 2E\big|\lambda_1^2(x, \alpha_{t+1}, \alpha^*)\big| + 2E\big|\lambda_2^2(x, \alpha_{t+1}, \alpha^*)\big| \le c_2\varphi^2_{t+1} + c_3\varphi^2_{t+1} \le c_4 V(\varphi_{t+1}). \tag{27}$$

From (26) and (27) we know that

$$E\big|Y^2_{t+1}\big|^2 + E\big|V'_{t+1}(\varphi_{t+1})\,Y^2_{t+1}\big| \le C_3 V(\varphi_{t+1}) + C_3, \tag{28}$$

where $C_3 = \max\{c_1, c_4\}$. Similarly,

$$E\big|Y^1_t\big|^2 + E\big|V'_t(\varphi_t)\,Y^1_t\big| \le C_4 V(\varphi_t) + C_4. \tag{29}$$


According to Lemma 5.4.1 in [30], $EV(\varphi_t) < \infty$, $E|Y^1_t|^2 < \infty$, and $E|Y^2_{t+1}|^2 < \infty$.

Now we focus on the first two items on the right of (24). The first item, $E_t[V'_t(\varphi_t)Y^1_t]$, is computed as

$$E_t\big[V'_t(\varphi_t)\,Y^1_t\big] = E_t\big[(\varphi^1_t)^T\ (\varphi^2_t)^T\big]\begin{bmatrix}I_D & 0\\ 0 & k^*_t\end{bmatrix}\begin{bmatrix}-\varphi^1_t\\ 0\end{bmatrix} = -\big|\varphi^1_t\big|^2. \tag{30}$$

For the second item, $E_t[V'_{t+1}(\varphi_{t+1})Y^2_{t+1}]$, when the state transition function is time invariant, it is true that $P_{t+1}K_{\theta_t}(x, X_L)\alpha_{t+1} = P_t K_{\theta_t}(x, X_L)\alpha_{t+1}$, $\alpha_{t+1} = \alpha_t$. Then we have

$$\begin{aligned}
E_t\lambda_1(x, \alpha_{t+1}, \alpha^*) &= E_t\big[k^{\aleph}_{\theta_t}(x, X_L)\,P_t k_{\theta_t}(x, X_L)\,\alpha_t - k^{\aleph}_{\theta_t}(x, X_L)\,P_t k_{\theta_t}(x, X_L)\,\alpha^*\big]\\
&= k^{\aleph}_{\theta_t}(x, X_L)\cdot E\big[P_t k_{\theta_t}(x, X_L)\,\alpha_t - P_t k_{\theta_t}(x, X_L)\,\alpha^*\big]\\
&\le k^{\aleph}_{\theta_t}(x, X_L)\cdot\big\|E\big[P_t k_{\theta_t}(x, X_L)\,\alpha_t - P_t k_{\theta_t}(x, X_L)\,\alpha^*\big]\big\|. \tag{31}
\end{aligned}$$

This inequality holds because $k^{\aleph}_{\theta_t}(x, X)$ is positive, that is, $k^{\aleph}_{\theta_t,i} > 0$, $i = 1, \ldots, L$. Define the norm $\|\Delta(\theta, \cdot)\| = \max_x|\Delta(\theta, x)|$, where $|\Delta(\cdot)|$ is the matrix norm derived from the vector 1-norm, and $x^* = \arg\max_x|\Delta(\theta, x)|$. Then

$$\begin{aligned}
\big\|E\big[P_t k_{\theta_t}(x, X_L)\,\alpha_t - P_t k_{\theta_t}(x, X_L)\,\alpha^*\big]\big\| &= \max_x E_t\big|P_t k_{\theta_t}(x, X_L)\,\alpha_t - P_t k_{\theta_t}(x, X_L)\,\alpha^*\big|\\
&= \max_x\Big|E_t\Big[\big(r + \beta\max_{a'}k_{\theta_t}(x', X_L)\,\alpha_t\big) - \big(r + \beta\max_{a'}k_{\theta_t}(x', X_L)\,\alpha^*\big)\Big]\Big|\\
&= \beta\max_x\Big|E_t\Big[\max_{a'}k_{\theta_t}(x', X_L)\,\alpha_t - \max_{a'}k_{\theta_t}(x', X_L)\,\alpha^*\Big]\Big|\\
&= \beta\max_x\Big|\int_{s'}p(s' \mid x)\Big[\max_{a'}k_{\theta_t}(x', X_L)\,\alpha_t - \max_{a'}k_{\theta_t}(x', X_L)\,\alpha^*\Big]ds'\Big|\\
&\le \beta\max_x\int_{s'}p(s' \mid x)\Big|\max_{a'}k_{\theta_t}(x', X_L)\,\alpha_t - \max_{a'}k_{\theta_t}(x', X_L)\,\alpha^*\Big|ds'\\
&\le \beta\max_x\int_{s'}p(s' \mid x)\max_{a'}\big|k_{\theta_t}(x', X_L)\,\alpha_t - k_{\theta_t}(x', X_L)\,\alpha^*\big|ds'\\
&= \beta\max_x\int_{s'}p(s' \mid x)\max_{a'}\big|k_{\theta_t}(x', X_L)(\alpha_t - \alpha^*)\big|ds'\\
&\le \beta\max_x\big|k_{\theta_t}(x, X_L)(\alpha_t - \alpha^*)\big| = \beta\big|k_{\theta_t}(x^*, X_L)\cdot\varphi^2_t\big|. \tag{32}
\end{aligned}$$

Hence

$$E_t\lambda_1(x, \alpha_{t+1}, \alpha^*) \le \beta\,k^{\aleph}_{\theta_t}(x^*, X_L)\big|k_{\theta_t}(x^*, X_L)\,\varphi^2_t\big|. \tag{33}$$

On the other hand,

$$\begin{aligned}
E_t\lambda_2(x, \alpha_t, \alpha^*) &= E_t\big[k^{\aleph}_{\theta_t}(x, X_L)\,P_t k_{\theta_t}(x, X_L)\,\alpha^* - k^{\aleph}_{\theta_t}(x, X_L)\,k_{\theta_t}(x, X_L)\,\alpha_t\big]\\
&= E_t\big[k^{\aleph}_{\theta_t}(x, X_L)\,P_t k_{\theta_t}(x, X_L)\,\alpha^* - k^{\aleph}_{\theta_t}(x, X_L)\,k_{\theta_t}(x, X_L)\,\alpha^*\big]\\
&\qquad + k^{\aleph}_{\theta_t}(x, X_L)\,k_{\theta_t}(x, X_L)\,\alpha^* - k^{\aleph}_{\theta_t}(x, X_L)\,k_{\theta_t}(x, X_L)\,\alpha_t\\
&= -k^{\aleph}_{\theta_t}(x, X_L)\big[k_{\theta_t}(x, X_L)\,\varphi^2_t - E_t\lambda_3(x, \theta_t, \theta^*)\big],
\end{aligned} \tag{34}$$

where $\lambda_3(x, \theta_t, \theta^*) = P_t k_{\theta_t}(x, X_L)\,\alpha^* - k_{\theta_t}(x, X_L)\,\alpha^*$ is the value function error caused by the estimated hyperparameters.

Substituting (33) and (34) into $E_t[V'_{t+1}(\varphi_{t+1})Y^2_{t+1}]$ yields

$$\begin{aligned}
E_t\big[V'_{t+1}(\varphi_{t+1})\,Y^2_{t+1}\big] &= E_t\big[k_{\theta_t}(x^*, X_L)\,\varphi^2_t\big]^T\cdot k_{\theta_t}(x^*, X_L)\big[\lambda_1(x, \alpha_{t+1}, \alpha^*) + \lambda_2(x, \alpha_{t+1}, \alpha^*)\big]\\
&= \big[k_{\theta_t}(x^*, X_L)\,\varphi^2_t\big]^T k_{\theta_t}(x^*, X_L)\cdot\big[E_t\lambda_1(x, \alpha_{t+1}, \alpha^*) + E_t\lambda_2(x, \alpha_{t+1}, \alpha^*)\big]\\
&\le \beta\big[k_{\theta_t}(x^*, X_L)\,\varphi^2_t\big]^T k_{\theta_t}(x^*, X_L)\,k^{\aleph}_{\theta_t}(x^*, X_L)\cdot\big|k_{\theta_t}(x^*, X_L)\,\varphi^2_t\big|\\
&\qquad - \big[k_{\theta_t}(x^*, X_L)\,\varphi^2_t\big]^T k_{\theta_t}(x^*, X_L)\cdot k^{\aleph}_{\theta_t}(x^*, X_L)\big[k_{\theta_t}(x^*, X_L)\,\varphi^2_t - E_t\lambda_3(x^*, \theta_t, \theta^*)\big]\\
&\le -(1 - \beta)\big|k_{\theta_t}(x^*, X_L)\,\varphi^2_t\big|^2 + \Big|\big[k_{\theta_t}(x^*, X_L)\,\varphi^2_t\big]^T\cdot E_t\lambda_3(x^*, \theta_t, \theta^*)\Big|. \tag{35}
\end{aligned}$$


Obviously, if the following inequality is satisfied,

$$\eta^2_t\Big|\big[k_{\theta_t}(x^*, X_L)\,\varphi^2_t\big]^T E_t\lambda_3(x^*, \theta_t, \theta^*)\Big| < \eta^1_t\big|\varphi^1_t\big|^2 + \eta^2_t(1 - \beta)\big|k_{\theta_t}(x^*, X_L)\,\varphi^2_t\big|^2, \tag{36}$$

there exists a positive constant $\delta$ such that the first two items on the right of (24) satisfy

$$\eta^1_t E_t\big[V'_t(\varphi_t)\,Y^1_t\big] + \eta^2_t E_t\big[V'_{t+1}(\varphi_{t+1})\,Y^2_{t+1}\big] \le -\delta. \tag{37}$$

According to Theorem 5.4.2 in [30], the iterative process is convergent; namely, $\varphi_t \xrightarrow{\text{w.p.1}} 0$.

Remark 3. Let us check the final convergence position. Obviously, $\varphi_t \xrightarrow{\text{w.p.1}} 0$ implies $[\alpha_t^T\ \theta_t^T]^T \xrightarrow{\text{w.p.1}} [(\alpha^*)^T\ f(\alpha^*)^T]^T$. This means the equilibrium of the critic network satisfies $E_t[P_t K_{\theta^*}(x, X_L)\alpha^*] = K_{\theta^*}(x, X_L)\alpha^*$, where $\theta^* = f(\alpha^*)$. And the equilibrium of the hyperparameters, $\theta^* = f(\alpha^*)$, is the solution of the evidence maximization $\max_\theta L(\theta)$, with the log-evidence given by (8) and $y = K_{\theta^*}(X_L, X_L)\alpha^*$.

Remark 4. It is clear that the selection of the samples is one of the key issues. Since all samples now have values according to the two-phase iteration, it is convenient, according to an information-based criterion, to evaluate the samples and refine the set by arranging relatively more samples near the equilibrium or where the gradient of the value function is large, so that the sample set better describes the distribution of the value function.

Remark 5. Since the two-phase iteration belongs to value iteration, the initial policy of the algorithm does not need to be stable. To ensure BIBO stability in practice, we need a mechanism to clamp the output of the system, even though the system will then no longer be smooth.

Theorem 2 gives the iteration principle for critic network learning. Based on the critic network, with a proper objective defined, such as minimizing the expected total discounted reward [15], an HDP or DHP update can be applied to optimize the actor network. Since this paper focuses on ACD, and, more importantly, the update process of the actor network does not affect the convergence of the critic network (though it may induce premature or local optimization), the gradient update of the actor is not necessary. Hence a simple optimum seeking is applied to get the optimal actions w.r.t. the sample states, and then an actor network based on Gaussian kernels is generated from these optimal state-action pairs.
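As an illustration of this optimum-seeking construction of the actor, the following sketch is our own reading under the assumptions of a scalar action and a discretized action grid; it is not the paper's exact recipe.

```python
import numpy as np

def build_actor(S_L, XL, alpha, theta, actions):
    """Optimum seeking plus a Gaussian-kernel actor (a sketch, assuming a scalar
    action): for each sample state s_i, search the action grid for
    a_i* = argmax_a Q(s_i, a) under the critic (1), then fit a Gaussian
    regression from the sample states to these optimal actions."""
    lam, v1, v0 = theta[:-2], theta[-2], theta[-1]
    N = S_L.shape[1]                                    # state dimension

    def kmat(A, B, scales):
        d = A[:, None, :] - B[None, :, :]
        return v1 * np.exp(-0.5 * np.sum(d ** 2 / scales, axis=2))

    def q(s, a):                                        # critic of (1)
        return (kmat(np.append(s, a)[None, :], XL, lam) @ alpha).item()

    a_star = np.array([max(actions, key=lambda a: q(s, a)) for s in S_L])

    # actor network: Gaussian regression over states, reusing the state length scales
    K_ss = kmat(S_L, S_L, lam[:N]) + max(v0, 1e-6) * np.eye(len(S_L))
    w = np.linalg.solve(K_ss, a_star)
    return lambda s: (kmat(np.asarray(s)[None, :], S_L, lam[:N]) @ w).item()
```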

Up to now, we have built a critic-actor architecture, which is named Gaussian-kernel-based Approximate Dynamic Programming (GK-ADP for short) and shown in Algorithm 1. A span K is introduced so that the hyperparameters are updated every K updates of α. If K > 1, the two phases are asynchronous. From the view of stochastic approximation, this asynchronous learning does not change the convergence condition (36) but benefits the computational cost of hyperparameter learning.

4. Simulation and Discussion

In this section, we propose some numerical simulations about continuous control to illustrate the special properties and the feasibility of the algorithm, including the necessity of two-phase learning, the specific properties compared with traditional kernel-based ACDs, and the performance enhancement resulting from online refinement of the sample set.

Before further discussion, we first give the common setup of all simulations:

(i) The span K = 50.
(ii) To make the output bounded, once a state is out of the boundary, the system is reset randomly.
(iii) The exploration-exploitation tradeoff is left out of account here. During the learning process, all actions are selected randomly within the limited action space. Thus the behavior of the system is totally unordered during learning.
(iv) The same ALD-based sparsification as in [15] is used to determine the initial sample set, in which $w_i = 1.8$, $v_0 = 0$, and $v_1 = 0.5$ empirically (a sketch of this test follows the list).
(v) The sampling time and control interval are set to 0.02 s.
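The ALD-based test used to build the initial sample set can be sketched roughly as follows (our own reading of the standard ALD criterion; the exact threshold and ordering used in [15] may differ).

```python
import numpy as np

def gaussian_k(xa, xb, w=1.8, v1=0.5):
    """Gaussian kernel with the empirical parameters used for sparsification."""
    return v1 * np.exp(-0.5 * np.sum((xa - xb) ** 2) / w)

def ald_sparsify(candidates, threshold=1e-3):
    """Keep a candidate only if it is not approximately linearly dependent
    (in feature space) on the samples already kept."""
    dictionary = [candidates[0]]
    for x in candidates[1:]:
        K = np.array([[gaussian_k(a, b) for b in dictionary] for a in dictionary])
        k_vec = np.array([gaussian_k(a, x) for a in dictionary])
        c = np.linalg.solve(K + 1e-9 * np.eye(len(K)), k_vec)
        delta = gaussian_k(x, x) - k_vec @ c          # ALD residual
        if delta > threshold:
            dictionary.append(x)
    return np.array(dictionary)

# toy usage: sparsify 200 random state-action pairs of an assumed 3-D space
rng = np.random.default_rng(1)
print(ald_sparsify(rng.uniform(-0.5, 0.5, size=(200, 3))).shape)
```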

4.1. The Necessity of Two-Phase Learning. The proof of Theorem 2 shows that properly determined learning rates of phases 1 and 2 guarantee condition (36). Here we propose a simulation to show how the learning rates of phases 1 and 2 affect the performance of the learning.

Consider a simple single homogeneous inverted pendulum system:

$$\dot{s}_1 = s_2, \qquad \dot{s}_2 = \frac{\big(M_p g\sin(s_1) - \tau\cos(s_1)\big)d}{M_p d^2}, \tag{38}$$

where $s_1$ and $s_2$ represent the angle and its speed, $M_p = 0.1\,\mathrm{kg}$, $g = 9.8\,\mathrm{m/s^2}$, and $d = 0.2\,\mathrm{m}$, respectively, and $\tau$ is a horizontal force acting on the pole.
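For reference, a minimal simulation sketch of (38) with the 0.02 s interval from the common setup; the Euler integration, the torque bound, and the random reset are our own simplifications.

```python
import numpy as np

M_P, G, D, DT = 0.1, 9.8, 0.2, 0.02     # pendulum mass, gravity, length, time step

def pendulum_step(s, tau):
    """One Euler step of the single inverted pendulum (38) (common factor d cancelled)."""
    s1, s2 = s
    ds1 = s2
    ds2 = (M_P * G * np.sin(s1) - tau * np.cos(s1)) / (M_P * D)
    return np.array([s1 + DT * ds1, s2 + DT * ds2])

# toy rollout with random forces in [-0.5, 0.5], resetting when out of bounds
rng = np.random.default_rng(0)
s = rng.uniform([-0.4, -0.5], [0.4, 0.5])
for _ in range(200):
    s = pendulum_step(s, rng.uniform(-0.5, 0.5))
    if abs(s[0]) > np.pi:                       # clamp/reset to keep the output bounded
        s = rng.uniform([-0.4, -0.5], [0.4, 0.5])
```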

We test the success rate under different learning rates. Since during the learning the action is always selected randomly, we have to verify the optimal policy after learning; that is, an independent policy test is carried out to test the actor network. Thus the success of one run of learning means that, in the independent test, the pole can be swung up and maintained within [−0.02, 0.02] rad for more than 200 iterations.

In the simulation, 60 state-action pairs are collected to serve as the sample set, and the learning rates are set to $\eta^2_t = 0.01/(1 + t)^{0.018}$ and $\eta^1_t = a/(1 + t^{0.018})$, where $a$ varies from 0 to 0.12. The learning is repeated 50 times in order to get the average performance, where the initial states of each run are randomly selected within the bounds [−0.4, 0.4] and [−0.5, 0.5] w.r.t. the dimensions of $s_1$ and $s_2$.


Initialize:
  θ: hyperparameters of the Gaussian kernel model
  X_L = {x_i | x_i = (s_i, a_i)}_{i=1}^{L}: sample set
  π(s_0): initial policy
  η^1_t, η^2_t: learning step sizes
Let t = 0
Loop
  for k = 1 to K do
    t = t + 1
    a_t = π(s_t)
    Get the reward r_t
    Observe the next state s_{t+1}
    Update α according to (12)
    Update the policy π according to optimum seeking
  end for
  Update θ according to (12)
Until the termination criterion is satisfied

Algorithm 1: Gaussian-kernel-based Approximate Dynamic Programming.
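A compact Python sketch of Algorithm 1 is given below (ours, under simplifying assumptions): the inner loop performs the α update of (12) under random exploration, while phase 1 is approximated by a single gradient step on the log-evidence (8) instead of the exact move toward f(α_t); constant learning rates and a generic env_step(s, a) interface returning (s_next, r) are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def k_vec(x, XL, theta):
    """Gaussian kernel vector k_theta(x, X_L); theta = [w_1 .. w_D, v1, v0], cf. (3)."""
    lam, v1 = theta[:-2], theta[-2]
    return v1 * np.exp(-0.5 * np.sum((XL - x) ** 2 / lam, axis=1))

def evidence_grad(theta, XL, alpha, eps=1e-6):
    """Finite-difference gradient of the log-evidence (8)-(9); used below as a
    practical surrogate for moving theta toward f(alpha) in phase 1 of (12)."""
    def KL_of(th):
        return np.array([k_vec(x, XL, th) for x in XL]) + th[-1] * np.eye(len(XL))
    KL = KL_of(theta)
    y = KL @ alpha                                   # consistent with alpha = K_L^{-1} y
    a = np.linalg.solve(KL, y)
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        th = theta.copy(); th[i] += eps
        dK = (KL_of(th) - KL) / eps
        g[i] = -0.5 * np.trace(np.linalg.solve(KL, dK)) + 0.5 * a @ dK @ a
    return g

def gk_adp(env_step, s0, XL, theta, eta1=0.01, eta2=0.02, beta=0.95, K=50,
           n_outer=100, actions=np.linspace(-0.5, 0.5, 11)):
    """Sketch of Algorithm 1: inner loop = phase 2 (alpha), outer loop = phase 1 (theta)."""
    alpha, s = np.zeros(len(XL)), np.asarray(s0, dtype=float)
    q = lambda st, u: k_vec(np.append(st, u), XL, theta) @ alpha
    for _ in range(n_outer):
        for _ in range(K):
            a = rng.uniform(actions[0], actions[-1])             # random exploration (common setup)
            s_next, r = env_step(s, a)
            kx = k_vec(np.append(s, a), XL, theta)
            td = r + beta * max(q(s_next, u) for u in actions) - kx @ alpha
            alpha = alpha + eta2 * (kx / max(kx @ kx, 1e-12)) * td   # phase 2: k_aleph = k^T/(k k^T)
            s = s_next
        theta = theta + eta1 * evidence_grad(theta, XL, alpha)       # phase 1 (surrogate step)
    policy = lambda st: max(actions, key=lambda u: q(st, u))         # optimum-seeking actor
    return alpha, theta, policy
```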

Figure 1: Success rate over 50 runs under different learning rates $\eta^1_t$ (x-axis: parameter $a$ in the learning rate $\eta^1_t$, taking values 0, 0.01, 0.02, 0.05, 0.08, 0.1, 0.12; success counts: 6, 17, 27, 42, 50, 47, 46, respectively).

The action space is bounded within [−0.5, 0.5]. And in each run the initial hyperparameters are set to $[e^{-2.5}, e^{0.5}, e^{0.9}, e^{-1}, e^{0.001}]$.

Figure 1 shows the resulting success rates. Clearly, with $\eta^2_t$ fixed, different $\eta^1_t$'s affect the performance significantly. In particular, without the learning of hyperparameters, that is, $\eta^1_t = 0$, there are only 6 successful runs out of 50. With the increase of $a$, the success rate increases, reaching 1 when $a = 0.08$. But as $a$ goes on increasing, the performance becomes worse.

Hence both phases are necessary if the hyperparameters are not initialized properly. It should be noted that the learning rates w.r.t. the two phases need to be regulated carefully in order to guarantee condition (36), which leads the learning process to the equilibrium in Remark 3 and not to some kind of boundary where the actor always executes the maximal or minimal actions.

Figure 2: The average evolution processes over 50 runs w.r.t. different $\eta^1_t$ (x-axis: iteration; y-axis: the sum of Q values of all sample state-actions; curves: no learning and $a$ = 0.01, 0.02, 0.05, 0.08, 0.1, 0.12).

If all samples' values at each iteration are summed up and all cumulative values are depicted in series over the iterations, then we have Figure 2. The evolution process w.r.t. the best parameter, $a = 0.08$, is marked by the darker diamonds. It is clear that, even with different $\eta^1_t$, the evolution processes of the samples' value learning are similar. That means that, due to the BIBO property, the samples' values must finally be bounded and convergent. However, as mentioned above, it does not mean that the learning process converges to the proper equilibrium.


Figure 3: One-dimensional inverted pendulum ($S_1$: pole angle; $S_3$: cart displacement).

4.2. Value Iteration versus Policy Iteration. As mentioned in Algorithm 1, the value iteration in critic network learning does not depend on the convergence of the actor network, and, compared with HDP or DHP, the direct optimum seeking for the actor network seems a little aimless. To test its performance and discuss its special learning characteristics, a comparison between GK-ADP and KHDP is carried out.

The control objective in the simulation is a one-dimensional inverted pendulum, where a single-stage inverted pendulum is mounted on a cart which is able to move linearly, just as Figure 3 shows.

The mathematical model is given as

$$\begin{aligned}
\dot{s}_1 &= s_2 + \varepsilon,\\
\dot{s}_2 &= \frac{g\sin(s_1) - M_p d\,s_2^2\cos(s_1)\sin(s_1)/M}{d\big(4/3 - M_p\cos^2(s_1)/M\big)} + \frac{\cos(s_1)/M}{d\big(4/3 - M_p\cos^2(s_1)/M\big)}\,u,\\
\dot{s}_3 &= s_4,\\
\dot{s}_4 &= \frac{u + M_p d\big(s_2^2\sin(s_1) - \dot{s}_2\cos(s_1)\big)}{M},
\end{aligned} \tag{39}$$

where $s_1$ to $s_4$ represent the angle, the angle speed, the linear displacement, and the linear speed of the cart, respectively, $u$ represents the force acting on the cart, $M = 0.2\,\mathrm{kg}$, $d = 0.5\,\mathrm{m}$, $\varepsilon \sim U(-0.02, 0.02)$, and the other denotations are the same as in (38).
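A brief simulation sketch of (39) follows, under our own Euler discretization with the 0.02 s interval; the noise $\varepsilon$ follows $U(-0.02, 0.02)$ as stated, and the force bound [−3, 3] matches item (ii) of the configuration below.

```python
import numpy as np

M, M_P, D, G, DT = 0.2, 0.1, 0.5, 9.8, 0.02     # cart-pole parameters of (38)-(39)
rng = np.random.default_rng(0)

def cartpole_step(s, u):
    """One Euler step of the cart-pole model (39); s = [angle, angle speed,
    displacement, linear speed], u = force on the cart."""
    s1, s2, s3, s4 = s
    eps = rng.uniform(-0.02, 0.02)
    den = D * (4.0 / 3.0 - M_P * np.cos(s1) ** 2 / M)
    ds1 = s2 + eps
    ds2 = (G * np.sin(s1) - M_P * D * s2 ** 2 * np.cos(s1) * np.sin(s1) / M) / den \
          + (np.cos(s1) / M) / den * u
    ds3 = s4
    ds4 = (u + M_P * D * (s2 ** 2 * np.sin(s1) - ds2 * np.cos(s1))) / M
    return s + DT * np.array([ds1, ds2, ds3, ds4])

# toy rollout under random forces within the bound [-3, 3]
s = np.array([0.2, 0.0, -0.2, 0.0])
for _ in range(250):
    s = cartpole_step(s, rng.uniform(-3.0, 3.0))
```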

Thus the state-action space is a 5D space, much larger than that in simulation 1. The configurations of both algorithms are listed as follows:

(i) A small sample set with 50 samples, determined by the ALD-based sparsification, is adopted to build the critic network.
(ii) For both algorithms, the state-action pair is limited to [−0.3, 0.3], [−1, 1], [−0.3, 0.3], [−2, 2], [−3, 3] w.r.t. $s_1$ to $s_4$ and $u$.
(iii) In GK-ADP, the learning rates are $\eta^1_t = 0.02/(1 + t^{0.01})$ and $\eta^2_t = 0.02/(1 + t)^{0.02}$.
(iv) The KHDP algorithm in [15] is chosen as the comparison, where the critic network uses Gaussian kernels. To get proper Gaussian parameters, a series of $\sigma$, namely 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4, is tested in the simulation. The elements of the weights $\alpha$ are initialized randomly in [−0.5, 0.5], the forgetting factor in RLS-TD(0) is set to $\mu = 1$, and the learning rate in KHDP is set to 0.3.
(v) All experiments run 100 times to get the average performance, and in each run there are 10000 iterations to learn the critic network.

Figure 4: The comparison of the average evolution processes over 100 runs (x-axis: iteration; left y-axis: the average of cumulated weights over 100 runs; curves: GK-ADP and KHDP with $\sigma$ = 0.9, 1.2, 1.5, 1.8, 2.1, 2.4).

Before further discussion, it should be noted that it makes no sense to argue about which algorithm is better in learning performance, because, besides the debate of policy iteration versus value iteration, there are too many parameters in the simulation configuration affecting the learning performance. So the aim of this simulation is to illustrate the learning characteristics of GK-ADP.

However, to make the comparison as fair as possible, we regulate the learning rates of both algorithms to get similar evolution processes. Figure 4 shows the evolution processes of GK-ADP and the KHDPs under different $\sigma$, where the y-axis on the left represents the cumulated weights $\sum_i\alpha_i$ of the critic network in kernel ACD and the other y-axis represents the cumulated values of the samples, $\sum_i Q(x_i)$, $x_i \in X_L$, in GK-ADP. It implies that, although with different units, the learning processes under both learning algorithms converge at nearly the same speed.

Then the success rates of all algorithms are depicted in Figure 5. The left six bars represent the success rates of KHDP with different $\sigma$. With the superiority argument left aside, we find that such a fixed $\sigma$ in KHDP is very similar to fixed hyperparameters $\theta$, which need to be set properly in order to get a higher success rate. But, unfortunately, there is no mechanism in KHDP to regulate $\sigma$ online.


Figure 5: The success rates w.r.t. all algorithms over 100 runs (x-axis: parameter $\sigma$ in the ALD-based sparsification; successes of KHDP with $\sigma$ = 0.9, 1.2, 1.5, 1.8, 2.1, 2.4: 67, 68, 76, 75, 58, 57; GK-ADP: 96).

On the contrary, the two-phase iteration introduces the update of the hyperparameters into the critic network, which is able to drive the hyperparameters to better values even with not-so-good initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with dynamic $\sigma$ when the Gaussian kernel is used.

To discuss the performance in depth, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures 6(a) and 6(b), respectively, where the start state is set to [0.2, 0, −0.2, 0].

Apparently, the transition time of GK-ADP is much smaller than that of KHDP. We think that, besides possibly better regulated parameters, an important reason is the nongradient learning of the actor network.

The critic learning only depends on the exploration-exploitation balance, not on the convergence of the actor learning. If the exploration-exploitation balance is designed without the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and then there are alternatives to gradient learning for actor network optimization, for example, the direct optimum-seeking Gaussian regression actor network in GK-ADP.

Such direct optimum seeking may result in a nearly nonsmooth actor network, just like the force output depicted in the second plot of Figure 6(a). To explain this phenomenon, we can find the clue in Figure 7, which illustrates the best actions w.r.t. all sample states according to the final actor network. It is reasonable that the force direction is always the same as the pole angle and contrary to the cart displacement. And, clearly, the transition interval from the negative limit −3 N to the positive 3 N is very short. Thus the actor network resulting from Gaussian kernels tends to control the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.

It is obvious that GK-ADP has high efficiency but also potential risks, such as the impact on actuators. Hence how to design the exploration-exploitation balance to satisfy Theorem 2, and how to design the actor network learning to balance efficiency and engineering applicability, are two important issues in future work.

Finally, we check the samples' values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of the immediate reward, the closer the states are to the equilibrium, the smaller the Q value is.

If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, just as Figure 9 shows, the shortage of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.

To solve this problem and improve the performance, besides optimizing the learning parameters, Remark 4 implies that another, and maybe more efficient, way is to refine the sample set based on the samples' values. We will investigate it in the following subsection.

4.3. Online Adjustment of Sample Set. Since the samples' values are learned, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the following expected utility [26]:

U(x) = ρ h(E[Q(x) | X_L]) + (β/2) log(var[Q(x) | X_L]),  (40)

where h(·) is a normalization function over the sample set and ρ = 1, β = 0.3 are coefficients.
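As a concrete reading of (40), the sketch below computes the expected utility of a candidate state-action pair from the GP posterior mean and variance of (6) and (7). This is a minimal sketch under assumptions made here: the helper names are hypothetical, K_L_inv is assumed to be the inverse of the regularized kernel matrix in (4), kernel(a, B) is assumed to return the vector of kernel values between a and the rows of B (e.g., the gaussian_kernel sketch above, partially applied with its hyperparameters), and the min-max normalization used for h(·) is a choice made here for readability, not necessarily the authors'.

import numpy as np

def gp_posterior(x_star, X_L, y, K_L_inv, kernel):
    # Posterior mean and variance of Q(x*), following (6) and (7).
    k_star = kernel(x_star, X_L)                       # K(x*, X_L), length L
    mean = k_star @ (K_L_inv @ y)
    var = kernel(x_star, x_star[None, :])[0] - k_star @ (K_L_inv @ k_star)
    return mean, max(float(var), 1e-12)                # guard against tiny negative values

def expected_utility(x_star, X_L, y, K_L_inv, kernel, mean_min, mean_max,
                     rho=1.0, beta=0.3):
    # U(x) = rho * h(E[Q(x) | X_L]) + (beta / 2) * log(var[Q(x) | X_L]).
    # mean_min / mean_max: min and max posterior means over the sample set,
    # precomputed so that h(.) is a min-max normalization.
    mean, var = gp_posterior(x_star, X_L, y, K_L_inv, kernel)
    h = (mean - mean_min) / (mean_max - mean_min + 1e-12)
    return rho * h + 0.5 * beta * np.log(var)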

Let N_add represent the number of samples added to the sample set, N_limit the size limit of the sample set, and T(x) = {x(0), x(1), ..., x(T_e)} the state-action pair trajectory during GK-ADP learning, where T_e denotes the end iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.

Since in simulation 2 we carried out 100 runs of the GK-ADP experiment, 100 sets of sample values with respect to the same sample states are obtained. Now let us apply Algorithm 2 to these resulting sample sets, where T(x) is picked up from the history trajectory of state-action pairs in each run and N_add = 10, N_limit = 60, in order to add 10 samples to the sample set.

We repeat all 100 runs of GK-ADP again with the same learning rates η1 and η2. Compared with the 96 successful runs in simulation 2, 94 runs are now successful in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution of the sample values over 100 runs, where the first 10000 iterations display the learning process in simulation 2 and the following 8000 iterations display the learning process after adding samples. Clearly, the cumulative Q value shows a sudden increase after the samples are added and remains almost steady afterwards. This implies that, even with more samples added, there is not too much left to learn for the sample values.

However, if we check the cart movement, we will find the enhancement brought by the change of the sample set. For the j-th run we have two actor networks resulting from GK-ADP, before and after adding samples. Using these actors, we carry out the control tests respectively and collect the absolute values of the cart displacement, denoted by S_3^b(j) and S_3^a(j), at the moment when the pole has been swung up for 100 instants.


Figure 6: The outputs of the actor networks resulting from GK-ADP and KHDP. (a) The resulting control performance using GK-ADP; (b) the resulting control performance using KHDP. Each panel plots the pole angle and speed (rad, rad/s) and the force (N) against time (s).

Figure 7: The final best policies with respect to all samples (best action versus pole angle (rad) and linear displacement (m)).

Figure 8: The Q values projected onto the dimensions of pole angle (s1) and linear displacement (s3) with respect to all samples.

Figure 9: The final cart displacement (m) and linear speed (m/s) in the tests of the successful policies.

We define the enhancement as

E(j) = (S_3^b(j) − S_3^a(j)) / S_3^b(j).  (41)
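In code, (41) is a one-line relative improvement; the array names below are placeholders for the recorded absolute cart displacements before (S_b) and after (S_a) the sample set is refined.

import numpy as np

def enhancement(S_b, S_a):
    # E(j) = (S_b(j) - S_a(j)) / S_b(j); positive values mean the refined
    # actor keeps the cart closer to the origin once the pole is swung up.
    S_b = np.asarray(S_b, dtype=float)
    S_a = np.asarray(S_a, dtype=float)
    return (S_b - S_a) / S_b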

Obviously, a positive enhancement indicates that the actor network after adding samples behaves better.


for l = 1 to the end of T(x)
    Calculate E[Q(x(l)) | X_L] and var[Q(x(l)) | X_L] using (6) and (7)
    Calculate U(x(l)) using (40)
end for
Sort all U(x(l)), l = 1, ..., T_e, in ascending order
Add the first N_add state-action pairs to the sample set to get the extended set X_{L+N_add}
if L + N_add > N_limit
    Calculate the hyperparameters based on X_{L+N_add}
    for l = 1 to L + N_add
        Calculate E[Q(x(l)) | X_{L+N_add}] and var[Q(x(l)) | X_{L+N_add}] using (6) and (7)
        Calculate U(x(l)) using (40)
    end for
    Sort all U(x(l)), l = 1, ..., L + N_add, in descending order
    Delete the first L + N_add − N_limit samples to get the refined set X_{N_limit}
end if

Algorithm 2: Refinement of the sample set.
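A minimal Python rendering of the add-then-prune bookkeeping in Algorithm 2 is sketched below, under assumptions made here: utility_fn(x, X_set) evaluates (40) as in the sketch given after (40), refit_utility(X_ext) re-estimates the hyperparameters on the extended set and returns the corresponding utility function, and trajectory is the list of visited state-action pairs T(x). It illustrates the selection logic only, not the authors' exact implementation.

import numpy as np

def refine_sample_set(X_L, trajectory, utility_fn, refit_utility,
                      N_add=10, N_limit=60):
    # Step 1: rank the visited state-action pairs by expected utility (40)
    # and add the N_add lowest-utility pairs to the sample set.
    u_traj = np.array([utility_fn(x, X_L) for x in trajectory])
    added = [trajectory[i] for i in np.argsort(u_traj)[:N_add]]
    X_ext = np.vstack([X_L, np.asarray(added)])        # extended set X_{L+N_add}

    # Step 2: if the extended set exceeds the size limit, re-estimate the
    # hyperparameters on it and delete the highest-utility samples.
    if len(X_ext) > N_limit:
        utility_fn = refit_utility(X_ext)              # new U(.) under new hyperparameters
        u_ext = np.array([utility_fn(x, X_ext) for x in X_ext])
        X_ext = X_ext[np.argsort(u_ext)[:N_limit]]     # keep the N_limit lowest-utility samples
    return X_ext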

Figure 10: Learning performance of GK-ADP when the sample set is refined by adding 10 samples. (a) The average evolution of the Q values before and after adding samples (sum of Q values of all sample state-actions versus iteration); (b) the enhancement of the control performance in terms of cart displacement after adding samples.

As all the enhancements with respect to the 100 runs are illustrated together in Figure 10(b), almost all cart displacements at the moment the pole is swung up are enhanced more or less, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.

Finally, let us check which state-action pairs are added to the sample set. We put all the added samples over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. If we consider only the relationship between the Q values and the pole angle, as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the boundary −0.4, due to the ratio between ρ and β.

5. Conclusions and Future Work

ADP methods are among the most promising research directions in RL for continuous environments to construct learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, the two-phase iteration makes the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations is carried out to verify the necessity of two-phase learning and to illustrate the properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to refine the sample set online in order to enhance the performance of the critic-actor architecture during operation.


Figure 11: The Q values with respect to all added samples over 100 runs. (a) Projection onto the dimensions of pole angle (rad) and cart displacement (m); (b) projection onto the dimension of pole angle (rad).

However, there are some issues to be addressed in the future. The first is how to guarantee condition (36) during learning, which is currently handled in an empirical way. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead the two-phase iteration to failure, not merely to slow convergence.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.

References

[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.

[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.

[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.

[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.

[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.

[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.

[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.

[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.

[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.

[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.

[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.

[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.

[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.

[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.

[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.

[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.

[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.

[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.

[21] T. Hofmann, B. Scholkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.

[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.

[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.

[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.

[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Scholkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.

[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.

[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.

[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.

[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.

[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.


of values In literatures ADP approaches are categorized asthe followingmain schemes heuristic dynamic programming(HDP) dual heuristic programming (DHP) globalized dualheuristic programming (GDHP) and their action-dependentversions [9 10]

ADP researches always adopt multilayer perceptron neu-ral networks (MLPNNs) as the critic design Vrabie and Lewisproposed an online approach to continuous-time directadaptive optimal control which made use of neural networksto parametrically represent the control policy and the per-formance of the control system [11] Liu et al solved theconstrained optimal control problem of unknown discrete-time nonlinear systems based on the iterative ADP algorithmvia GDHP technique with three neural networks [12] In factdifferent kinds of neural networks (NNs) play the importantroles in ADP algorithms such as radial basis function NNs[3] wavelet basis function NNs [13] and echo state network[14]

Besides the benefits brought by NNs ADP methodsalways suffer from some problems concerned in the design ofNNs On one hand the learning control performance greatlydepends on empirical design of critic networks especially themanual setting of the hidden layer or the basis functions Onthe other hand due to the local minima in neural networktraining how to improve the quality of the final policies isstill an open problem [15]

As we can see it is difficult to evaluate the effective-ness of the parametric model when the knowledge on themodelrsquos order or nonlinear characteristics of the system isnot enough Compared with parametric modeling methodsnonparametricmodelingmethods especially kernelmethods[16 17] do not need to set the model structure in advanceHence kernel machines have been popularly studied torealize nonlinear and nonparametric modeling Engel etal proposed the kernel recursive least-squares algorithmto construct minimum mean-squared-error solutions tononlinear least-squares problems [18] As popular kernelmachines support vector machines (SVMs) also have beenapplied to nonparametricmodeling problems Dietterich andWang combined linear programming with SVMs to findvalue function approximations [19] The similar research waspublished in [20] inwhich the least-squares SVMswere usedNevertheless they both focused on discrete stateaction spaceand lacked theoretical results on the policies obtained moreor less

In addition to SVMs Gaussian processes (GPs) havebecome an alternative generalization method GP modelsare powerful nonparametric tools for approximate Bayesianinference and learning In comparison with other popularnonlinear architectures such as multilayer perceptrons theirbehavior is conceptually simpler to understand and modelfitting can be achieved without resorting to nonconvexoptimization routines [21 22] In [23] Engel et al firstapplied GPs in temporal-difference (TD) learning for MDPswith stochastic rewards and deterministic transitions Theyderived a new GPTD algorithm in [24] that overcame thelimitation of deterministic transitions and extended GPTDto the estimation of state-action values GPTD algorithm justaddressed the value approximation so it should be combined

with actor-critic methods or other policy iteration methodsto solve learning control problems

An alternative approach employing GPs in RL is model-based value iteration or policy iteration method in whichGP model is used to model system dynamics and rep-resent the value function [25] In [26] an approximatedvalue-function based RL algorithm named Gaussian processdynamic programming (GPDP) was presented which builtdynamic transition model value function model and actionpolicy model respectively using GPs In this way the sampleset will be adjusted to such a reasonable shape with highsample densities near the equilibriums or the places wherevalue functions change dramatically Thus it is good at con-trolling nonlinear systems with complex dynamics A majorshortcoming even if the relatively high computation cost isendurable is that the states in sample set need to be revisitedagain and again in order to update their value functions Sincethis condition is unpractical in real implements it diminishesthe appeal of employing this method

Kernel-based method is also introduced to ADP In [15]a novel framework of ACDs with sparse kernel machines waspresented by integrating kernel methods into critic networkA sparsification method based on the approximately lineardependence (ALD) analysis was used to sparsify the kernelmachinesObviously thismethod can overcome the difficultyof presetting model structure in parametric models and real-ize actor-critic learning online [27] However the selection ofsamples based on the ALD analysis is an offline way withoutconsidering the distribution of the value functionThereforethe data samples cannot be adjusted online which makesthe method more suitable for control systems with smoothdynamics where value function changes gently

We think GPDP and ACDs with sparse kernel machinesare complementary As indicated in [28] it is known theprediction of GPs is viewed as a linear combination of thecovariance between the new points and the samples Henceit seems reasonable to introduce kernel machine with GPsto build critic network in ACDs if the values of the samplesare known or at least can be assessed numerically And thenthe sample set will be adjusted online during critic-actorlearning

The major problem here is how to realize the valuefunction learning and GP models updating simultaneouslyespecially under the condition that the samples of state-action space can hardly be revisited infinitely in order toapproximate their values To tackle this problem a two-phaseiteration is developed in order to get optimal control policyfor the system whose dynamics are unknown a priori

2 Description of the Problem

In general ADP is an actor-critic method which approx-imates the value functions and policies to encourage therealization of generalization in MDPs with large or continu-ous spaces The critic design plays the most important rolebecause it determines how the actor optimizes its actionHence we give a brief introduction on both kernel-basedACD and GPs in order to derive the clear description of thetheoretical problem

Mathematical Problems in Engineering 3

21 Kernel-Based ACDs Kernel-based ACDs mainly consistof a critic network a kernel-based feature learning module areward function an actor networkcontroller and a model ofthe plant The critic constructed by kernel machine is used toapproximate the value functions or their derivativesThen theoutput of the critic is used in the training process of the actorso that policy gradients can be computed As actor finallyconverges the optimal action policymapping states to actionsis described by this actor

Traditional neural network based on kernel machine andsamples serves as the model of value functions just as thefollowing equation shows and the recursive algorithm suchas KLSTD [29] serves as value function approximation

119876 (119909) =

119871

sum

119894=1120572119894119896 (119909 119909

119894) (1)

where 119909 and 119909119894represent state-action pairs (119904 119886) and (119904

119894 119886119894)

120572119894 119894 = 1 2 119871 are the weights and (119904

119894 119886119894) represents

selected state-action pairs in sample set and 119896(119909 119909119894) is a kernel

functionThe key of critic learning is the update of the weights vec-

tor120572The value functionmodeling in continuous stateactionspace is a regression problem From the view of the modelNN structure is a reproducing kernel space spanned bythe samples in which the value function is expressed as alinear regression function Obviously as the basis of Hilbertspace how to select samples determines the VC dimensionof identification as well as the performance of value functionapproximation

If only using ALD-based kernel sparsification it isonly independent of samples that are considered in sampleselection So it is hard to evaluate how good the sampleset is because the sample selection does not consider thedistribution of value function and the performance of ALDanalysis is affected seriously by the hyperparameters of kernelfunction which are predetermined empirically and fixedduring learning

If the hyperparameters can be optimized online and thevalue function wrt samples can be evaluated by iterationalgorithms the critic network will be optimized not only byvalue approximation but also by hyperparameter optimiza-tion Moreover with approximated sample values there is adirect way to evaluate the validity of sample set in orderto regulate the set online Thus in this paper we turn toGaussian processes to construct the criterion for samples andhyperparameters

22 GP-Based Value Function Model For an MDP the datasamples 119909

119894and the corresponding 119876 value can be collected

by observing theMDPHere 119909119894is the state-action pairs (119904

119894 119886119894)

119904 isin 119877119873 119886 isin 119877

119872 and 119876(119909) is the value function defined as

119876lowast(119909) = 119877 (119909) + 120573int119901 (119904

1015840| 119909)119881 (119904

1015840) 1198891199041015840 (2)

where 119881(119904) = max119886119876(119909)

Given a sample set collected from a continuous dynamicsystem 119883

119871 119910 where 119910 = 119876(119883

119871) + 120576 120576 sim 119873(0 V0) Gaussian

regression with covariance function shown in the followingequation is a well known model technology to infer the 119876

function

119896 (119909119901 119909119902) = V1 exp [minus

12(119909119901minus119909119902)119879

Λminus1

(119909119901minus119909119902)] (3)

where Λ = diag([1199081 1199082 119908119873+119872])Assuming additive independent identically distributed

Gaussian noise 120576 with variance V0 the prior on the noisyobservations becomes

119870119871= 119870 (119883

119871 119883119871) + V0119868119871 (4)

where119870(119883119871 119883119871)

=

[[[[[[[[

[

119896 (1199091 1199091) 119896 (1199091 1199092) sdot sdot sdot 119896 (1199091 119909119871)

119896 (1199092 1199091) d

d

119896 (119909119871 1199091) sdot sdot sdot sdot sdot sdot 119896 (119909

119871 119909119871)

]]]]]]]]

]

(5)

The parameters 119908119889 V0 and V1 are the hyperparam-

eters of the 119866119877119902and collected within the vector 120579 =

[1199081 sdot sdot sdot 119908119873+119872

V1 V0]119879

For an arbitrary input 119909lowast the predictive distribution of

the function value is Gaussian distributed with mean andvariance given by

119864 [119876 (119909lowast)] = 119870 (119909

lowast 119883119871)119870minus1119871

119910 (6)

var [119876 (119909)] = 119896 (119909lowast 119909lowast) minus119870 (119909

lowast 119883119871)119870minus1119871

119870(119883119871 119909lowast) (7)

Comparing (6) to (1) we find that if we let 120572 = 119870minus1119871

119910 theneural network is also regarded as the Gaussian regressionOr the critic network can be constructed based on Gaussiankernel machine if the following conditions are satisfied

Condition 1 The hyperparameters 120579 of Gaussian kernel areknown

Condition 2 The values 119910 wrt all samples states 119883119871are

known

With Gaussian-kernel-based critic network the samplestate-action pairs and corresponding values are known Andthen the criterion such as the comprehensive utility proposedin [26] can be set up in order to refine sample set online Atthe same time it is convenient to optimize hyperparametersin order to approximate 119876 value function more accuratelyThus besides the advantages brought by kernel machine thecritic based on Gaussian-kernel will be better in approximat-ing value functions

Consider Condition 1 If the values 119910 wrt119883119871are known

(note that it is indeed Condition 2) the common way to gethyperparameters is by evidencemaximization where the log-evidence is given by

119871 (120579) = minus12119910119879119870minus1119871

119910minus12log 1003816100381610038161003816119870119871

1003816100381610038161003816 minus119899

2log 2120587 (8)

4 Mathematical Problems in Engineering

It requires the calculation of the derivative of 119871(120579) wrt each120579119894 given by

120597119871 (120579)

120597120579119894

= minus12tr [119870minus1119871

120597119870119871

120597120579119894

]+12120572119879 120597119870119871

120597120579119894

120572 (9)

where tr[sdot] denotes the trace 120572 = 119870minus1119871

119910Consider Condition 2 For unknown system if 119870

119871is

known that is the hyperparameters are known (note thatit is indeed Condition 1) the update of critic network willbe transferred to the update of values wrt samples by usingvalue iteration

According to the analysis both conditions are interde-pendent That means the update of critic depends on knownhyperparameters and the optimization of hyperparametersdepends on accurate sample values

Hence we need a comprehensive iteration method torealize value approximation and optimization of hyperpa-rameters simultaneously A direct way is to update themalternately Unfortunately this way is not reasonable becausethese two processes are tightly coupled For example tem-poral differential errors drive value approximation but thechange of weights 120572will cause the change of hyperparameters120579 simultaneously then it is difficult to tell whether thistemporal differential error is induced by observation or byGaussian regression model changing

To solve this problem a kind of two-phase value iterationfor critic network is presented in the next section and theconditions of convergence are analyzed

3 Two-Phase Value Iteration for CriticNetwork Approximation

First a proposition is given to describe the relationshipbetween hyperparameters and the sample value function

Proposition 1 The hyperparameters are optimized by evi-dence maximization according to the samples and their Qvalues and the log-evidence is given by

120597119871 (120579)

120597120579119894

= minus12tr [119870minus1119871

120597119870119871

120597120579119894

]+12120572119879 120597119870119871

120597120579119894

120572119879= 0

119894 = 1 119863

(10)

where 120572 = 119870minus1119871

119910 It can be proved that for arbitrary hyperpa-rameters 120579 if 120572 = 0 (10) defines an implicit function or acontinuously differentiable function as follows

120579 = 119891 (120572) = [1198911 (120572) 1198912 (120572) sdot sdot sdot 119891119863 (120572)]

119879

(11)

Then the two-phase value iteration for critic network isdescribed as the following theorem

Theorem 2 Given the following conditions

(1) the system is boundary input and boundary output(BIBO) stable

(2) the immediate reward 119903 is bounded(3) for 119896 = 1 2 120578119896

119905rarr 0 suminfin

119905=1 120578119896

119905= infin

the following iteration process is convergent

120572119905+1 = 120572

119905

120579119905+1 = 120579

119905+ 1205781

119905(119891 (120572119905) minus 120579119905)

120572119905+2 = 120572

119905+1 + 1205782

119905119896alefsym

120579119905

(119909119883119871)

sdot [119903 + 120573119881 (120579119905 1199091015840) minus 119896120579119905

(119909119883119871) 120572119905+1]

120579119905+2 = 120579

119905+1

(12)

where 119881(120579119905 1199091015840) = max

119886119896120579119905

(1199091015840 119883119871)120572(119905) and 119896

alefsym

120579119905

(119909 119883119871)

is a kind of pseudoinversion of 119896120579119905

(119909 119883119871) that is

119896120579119905

(119909 119883119871)119896alefsym

120579119905

(119909 119883119871) = 1

Proof From (12) it is clear that the two phases include theupdate of hyperparameters in phase 1 which is viewed as theupdate of generalization model and the update of samplesrsquovalue in phase 2 which is viewed as the update of criticnetwork

The convergence of iterative algorithm is proved based onstochastic approximation Lyapunov method

Define that

119875119905119896120579119905

(119909119883119871) 120572119905= 119903 +120573119881 (120579

119905 1199091015840)

= 119903 +max119886

119896120579119905

(1199091015840 119883119871) 120572 (119905)

(13)

Equation (12) is rewritten as

120572119905+1 = 120572

119905

120579119905+1 = 120579

119905+ 120578

1119905(119891 (120572119905) minus 120579119905)

120572119905+2 = 120572

119905+1 + 1205782119905119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909119883119871) 120572119905+1 minus 119896

120579119905

(119909119883119871) 120572119905+1]

120579119905+2 = 120579

119905+1

(14)

Define approximation errors as

1205931119905+2 = 120593

1119905+1 = 120579

119905+1 minus119891 (120572119905+1) = 120579

119905minus119891 (120572

119905+1)

+ 1205781119905[119891 (120572119905) minus 120579119905] = (1minus 120578

1119905) 120593

1119905

(15)

1205932119905+2 = (120572

119905+2 minus120572lowast) = (120572

119905+1 minus120572lowast) + 120578

2119905119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909 119883119871) 120572119905+1

minus119875119905+1119896120579

119905

(119909119883119871) 120572lowast+119875119905+1119896120579

119905

(119909119883119871) 120572lowast

minus 119896120579119905

(119909119883119871) 120572119905+1] = 120593

1119905+1 + 120578

2119905119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909 119883119871) 120572119905+1

Mathematical Problems in Engineering 5

minus119875119905+1119896120579

119905

(119909119883119871) 120572lowast] + 120578

2119905119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909119883119871) 120572lowast

minus 119896120579119905

(119909119883119871) 120572119905+1]

(16)

Further define that

1205821 (119909 120572119905+1 120572lowast) = 119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909119883119871) 120572119905+1 minus119875

119905+1119896120579119905

(119909119883119871) 120572lowast]

1205822 (119909 120572119905+1 120572lowast) = 119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909119883119871) 120572lowastminus 119896120579119905

(119909119883119871) 120572119905+1]

(17)

Thus (16) is reexpressed as

1205932119905+2 = 120593

1119905+1 + 120578

21199051205821 (119909 120572119905+1 120572

lowast) + 120578

21199051205822 (119909 120572119905+1 120572

lowast) (18)

Let 120593119905= [(120593

1119905)119879

(1205932119905)119879]119879

and the two-phase iteration isin the shape of stochastic approximation that is

[1205931119905+1

1205932119905+1

] = [1205931119905

1205932119905

]+[minus120578

11199051205931119905

0]

[1205931119905+2

1205932119905+2

]

= [1205931119905+1

1205932119905+1

]

+[0

12057821199051205821 (119909 120572119905+1 120572

lowast) + 120578

21199051205822 (119909 120572119905+1 120572

lowast)]

(19)

Define119881 (120593119905)

= 120593119879

119905Λ120593119905

= [(1205931119905)119879

(1205932119905)119879

] [119868119863

0

0 119896119879

120579119905

(119909lowast 119883119871) 119896120579119905

(119909lowast 119883119871)] [

1205931119905

1205932119905

]

(20)

where 119863 is the scale of the hyperparameters and 119909lowast will

be defined later Let 119896lowast119905represent 119896119879

120579119905

(119909lowast 119883119871)119896120579119905

(119909lowast 119883119871) for

shortObviously (20) is a positive definitematrix and |119909

lowast| lt infin

|119896lowast

119905| lt infin Hence119881(120593

119905) is a Lyapunov functional At moment

119905 the conditional expectation of the positive definite matrixis

119864119905119881119905+2 (120593119905+2)

= 119864119905[120593119879

119905+2Λ120593119905+2]

= 119864119905[(120593

1119905+2)119879

(1205932119905+2)119879

] [119868119863

0

0 119896lowast

119905

][1205931119905+2

1205932119905+2

]

(21)

It is easy to compute the first-order Taylor expansion of(21) as

119864119905119881119905+2 (120593119905+2)

= 119864119905119881119905+1 (120593119905+1) + 120578

2119905119864119905[1198811015840

119905+1 (120593119905+1) (120593119905+2 minus120593119905+1)]

+ (1205782119905)21198622119864119905 [(120593119905+2 minus120593

119905+1)119879(120593119905+2 minus120593

119905+1)]

(22)

in which the first item on the right of (22) is

119864119905119881119905+1 (120593119905+1)

= 119881119905(120593119905) + 120578

1119905119864119905[1198811015840

119905(120593119905) (120593119905+1 minus120593

119905)]

+ (1205781119905)21198621119864119905 [(120593119905+1 minus120593

119905)119879(120593119905+1 minus120593

119905)]

(23)

Substituting (23) into (22) yields

119864119905119881119905+2 (120593119905+2) minus119881

119905(120593119905)

= 1205781119905119864119905[1198811015840

119905(120593119905) 119884

1119905] + 120578

2119905119864119905[1198811015840

119905+1 (120593119905+1) 1198842119905+1]

+ (1205781119905)21198621119864119905 [(119884

1119905)2] + (120578

1119905)21198622119864119905 [(119884

2119905+1)

2]

(24)

where 1198841119905= [ minus120593

1119905

0]

1198842119905+1 = [

0119875119905+1119896120579

119905

(119909119883119871) 120572119905+1 minus 119896

120579119905

(119909119883119871) 120572119905+1

]

= [0

1205821 (119909 120572119905+1 120572lowast) + 1205822 (119909 120572119905+1 120572

lowast)]

(25)

Consider the last two items of (24) firstly If the immediatereward is bounded and infinite discounted reward is appliedthe value function is bounded Hence |120572

119905| lt infin

If the system is BIBO stable existΣ0 Σ1 sub 119877119899+119898 119899119898 are the

dimensions of the state space and action space respectivelyforall1199090 isin Σ0 119909infin isin Σ1 the policy space is bounded

According to Proposition 1 when the policy space isbounded and |120572

119905| lt infin |120579

119905| lt infin there exists a constant 1198881

so that

119864100381610038161003816100381610038161198811015840

119905+1 (120593119905+1) 1198842119905+1

10038161003816100381610038161003816le 1198881 (26)

In addition

119864100381610038161003816100381610038161198842119905+1

10038161003816100381610038161003816

2= 119864

10038161003816100381610038161003816[1205821 (119909 120572119905+1 120572

lowast) + 1205822 (119909 120572119905+1 120572

lowast)]

210038161003816100381610038161003816

le 2119864 1003816100381610038161003816100381612058221 (119909 120572119905+1 120572

lowast)10038161003816100381610038161003816+ 2119864 10038161003816100381610038161003816

12058222 (119909 120572119905+1 120572

lowast)10038161003816100381610038161003816

le 11988821205932119905+1 + 1198883120593

2119905+1 le 1198884119881 (120593

119905+1)

(27)

From (26) and (27) we know that

119864100381610038161003816100381610038161198842119905+1

10038161003816100381610038161003816

2+119864

100381610038161003816100381610038161198811015840

119905+1 (120593119905+1) 1198842119905+1

10038161003816100381610038161003816le 1198623119881 (120593

119905+1) +1198623 (28)

where 1198623 = max1198881 1198884 Similarly

119864100381610038161003816100381610038161198841119905

10038161003816100381610038161003816

2+119864

100381610038161003816100381610038161198811015840

119905(120593119905) 119884

1119905

10038161003816100381610038161003816le 1198624119881 (120593

119905) +1198624 (29)

6 Mathematical Problems in Engineering

According to Lemma 541 in [30] 119864119881(120593119905) lt infin and

119864|1198841119905|2lt infin and 119864|119884

2119905+1|

2lt infin

Now we focus on the first two items on the right of (24)The first item 119864

119905[1198811015840

119905(120593119905)119884

1119905] is computed as

119864119905[1198811015840

119905(120593119905) 119884

1119905]

= 119864119905[(120593

1119905)119879

(1205932119905)119879

] [119868119863

0

0 119896lowast

119905

][minus120593

1119905

0]

= minus100381610038161003816100381610038161205931119905

10038161003816100381610038161003816

2

(30)

For the second item 119864119905[1198811015840

119905+1(120593119905+1)1198842119905+1] when the

state transition function is time invariant it is true that119875119905+1119870120579

119905

(119909 119883119871)120572119905+1 = 119875

119905119870120579119905

(119909 119883119871)120572119905+1 120572119905+1 = 120572

119905 Then we

have

1198641199051205821 (119909 120572119905+1 120572

lowast) = 119864119905[119896alefsym

120579119905

(119909119883119871) 119875119905119896120579119905

(119909119883119871) 120572119905

minus 119896alefsym

120579119905

(119909119883119871) 119875119905119896120579119905

(119909119883119871) 120572lowast] = 119896alefsym

120579119905

(119909119883119871)

sdot 119864 [119875119905119896120579119905

(119909119883119871) 120572119905

minus119875119905119896120579119905

(119909119883119871) 120572lowast] le 119896alefsym

120579119905

(119909119883119871)

sdot10038171003817100381710038171003817119864 [119875119905119896120579119905

(119909119883119871) 120572119905minus119875119905119896120579119905

(119909119883119871) 120572lowast]10038171003817100381710038171003817

(31)

This inequality holds because of positive 119896alefsym120579119905

(119909 119883) 119896alefsym120579119905119894gt

0 119894 = 1 119871 Define the norm Δ(120579 sdot) = max119909|Δ(120579 119909)|

where |Δ(sdot)| is the derivative matrix norm of the vector 1 and119909lowast= argmax

119909|Δ(120579 119909)| Then

10038171003817100381710038171003817119864 [119875119905119896120579119905

(119909119883119871) 120572119905minus119875119905119896120579119905

(119909119883119871) 120572lowast]10038171003817100381710038171003817

= max119909

119864119905

10038161003816100381610038161003816[119875119905119896120579119905

(119909119883119871) 120572119905minus119875119905119896120579119905

(119909119883119871) 120572lowast]10038161003816100381610038161003816

= max119909

10038161003816100381610038161003816100381610038161003816119864119905[(119903 + 120573max

1198861015840

119896120579119905

(1199091015840 119883119871) 120572119905)

minus(119903 + 120573max1198861015840

119896120579119905

(1199091015840 119883119871) 120572lowast)]

10038161003816100381610038161003816100381610038161003816= 120573

sdotmax119909

10038161003816100381610038161003816100381610038161003816119864119905[max1198861015840

119896120579119905

(1199091015840 119883119871) 120572119905

minusmax1198861015840

119896120579119905

(1199091015840 119883119871) 120572lowast]

10038161003816100381610038161003816100381610038161003816= 120573max119909

10038161003816100381610038161003816100381610038161003816int1199041015840

119901 (1199041015840| 119909)

sdot [max1198861015840

119896120579119905

(1199091015840 119883119871) 120572119905minusmax1198861015840

119896120579119905

(1199091015840 119883119871) 120572lowast] 1198891199041015840

10038161003816100381610038161003816100381610038161003816

le 120573max119909

int1199041015840

119901 (1199041015840| 119909)

sdot

10038161003816100381610038161003816100381610038161003816[max1198861015840

119896120579119905

(1199091015840 119883119871) 120572119905minusmax1198861015840

119896120579119905

(1199091015840 119883119871) 120572lowast]

100381610038161003816100381610038161003816100381610038161198891199041015840

le 120573max119909

int1199041015840

119901 (1199041015840| 119909)

sdotmax1198861015840

10038161003816100381610038161003816119896120579119905

(1199091015840 119883119871) 120572119905minus 119896120579119905

(1199091015840 119883119871) 120572lowast10038161003816100381610038161003816

1198891199041015840= 120573

sdotmax119909

int1199041015840

119901 (1199041015840| 119909)max

1198861015840

10038161003816100381610038161003816119896120579119905

(1199091015840 119883119871) (120572119905minus 120572lowast)100381610038161003816100381610038161198891199041015840

= 120573max119909

int1199041015840

119901 (1199041015840| 119909)

sdotmax1198861015840

10038161003816100381610038161003816119864119905[119875119905119896120579119905

(1199091015840 119883119871) 120572119905minus 119875119905119896120579119905

(1199091015840 119883119871) 120572lowast]100381610038161003816100381610038161198891199041015840

le 120573max119909

10038161003816100381610038161003816119896120579119905

(119909119883119871) (120572119905minus120572lowast)10038161003816100381610038161003816= 120573

10038161003816100381610038161003816119896120579119905

(119909lowast 119883119871)

sdot 1205932119905

10038161003816100381610038161003816

(32)

Hence

1198641199051205821 (119909 120572119905+1 120572

lowast) le 120573119896

alefsym

120579119905

(119909lowast 119883119871)10038161003816100381610038161003816119896120579119905

(119909lowast 119883119871) 120593

2119905

10038161003816100381610038161003816 (33)

On the other hand

1198641199051205822 (119909 120572119905 120572

lowast) = 119864119905[119896alefsym

120579119905

(119909119883119871) 119875119905119896120579119905

(119909119883119871) 120572lowast

minus 119896alefsym

120579119905

(119909119883119871) 119896120579119905

(119909119883119871) 120572119905]

= 119864119905[119896alefsym

120579119905

(119909119883119871) 119875119905119896120579119905

(119909119883119871) 120572lowast

minus 119896alefsym

120579119905

(119909119883119871) 119896120579119905

(119909119883119871) 120572lowast] + 119896alefsym

120579119905

(119909119883119871)

sdot 119896120579119905

(119909119883119871) 120572lowastminus 119896alefsym

120579119905

(119909119883119871) 119896120579119905

(119909119883119871) 120572119905

= minus 119896alefsym

120579119905

(119909119883119871) [119896120579119905

(119909119883119871) 120593

2119905minus1198641199051205823 (119909 120579119905 120579

lowast)]

(34)

where 1205823(119909 120579119905 120579lowast) = 119875

119905119896120579119905

(119909 119883119871)120572lowastminus 119896120579119905

(119909 119883119871)120572lowast is the

value function error caused by the estimated hyperparame-ters

Substituting (33) and (34) into 119864119905[1198811015840

119905+1(120593119905+1)1198842119905+1] yields

119864119905[1198811015840

119905+1 (120593119905+1) 1198842119905+1] = 119864

119905[119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

sdot 119896120579119905

(119909lowast 119883119871) [1205821 (119909 120572119905+1 120572

lowast) + 1205822 (119909 120572119905+1 120572

lowast)]

= [119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

119896120579119905

(119909lowast 119883119871)

sdot [1198641199051205821 (119909 120572119905+1 120572

lowast) + 1198641199051205822 (119909 120572119905+1 120572

lowast)]

le 120573 [119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

119896120579119905

(119909lowast 119883119871) 119896alefsym

120579119905

(119909lowast 119883119871)

sdot10038161003816100381610038161003816119896120579119905

(119909lowast 119883119871) 120593

2119905

10038161003816100381610038161003816minus [119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

119896120579119905

(119909lowast 119883119871)

sdot 119896alefsym

120579119905

(119909lowast 119883119871) [119896120579119905

(119909lowast 119883119871) 120593

2119905minus1198641199051205823 (119909lowast 120579119905 120579lowast)]

le minus (1minus120573)10038161003816100381610038161003816119896120579119905

(119909lowast 119883119871) 120593

2119905

10038161003816100381610038161003816

2+100381610038161003816100381610038161003816[119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

sdot 1198641199051205823 (119909lowast 120579119905 120579lowast)100381610038161003816100381610038161003816

(35)

Mathematical Problems in Engineering 7

Obviously if the following inequality is satisfied

1205782119905

100381610038161003816100381610038161003816[119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

1198641199051205823 (119909lowast 120579119905 120579lowast)100381610038161003816100381610038161003816

lt 1205781119905

100381610038161003816100381610038161205931119905

10038161003816100381610038161003816

2+ 120578

2119905(1minus120573)

10038161003816100381610038161003816119896120579119905

(119909lowast 119883119871) 120593

2119905

10038161003816100381610038161003816

2(36)

there exists a positive const 120575 such that the first two items onthe right of (24) satisfy

1205781119905119864119905[1198811015840

119905(120593119905) 119884

1119905] + 120578

2119905119864119905[1198811015840

119905+1 (120593119905+1) 1198842119905+1] le minus 120575 (37)

According to Theorem 542 in [30] the iterative process isconvergent namely 120593

119905

1199081199011997888997888997888997888rarr 0

Remark 3 Let us check the final convergence positionObviously 120593

119905

1199081199011997888997888997888997888rarr 0 [ 120572119905

120579119905

]1199081199011997888997888997888997888rarr [

120572lowast

119891(120572lowast)] This means the

equilibriumof the critic networkmeets119864119905[119875119905119870120579lowast(119909 119883

119871)120572lowast] =

119870120579lowast(119909 119883

119871)120572lowast where 120579

lowast= 119891(120572

lowast) And the equilibrium of

hyperparameters 120579lowast = 119891(120572lowast) is the solution of evidencemax-

imization that max120593(119871(120593)) = max

120593((12)119910

119879119870minus1120579lowast (119883119871 119883119871)119910 +

(12)log|119862| + (1198992)log2120587) where 119910 = 119870minus1120579lowast (119883119871 119883119871)120572

lowast

Remark 4 It is clear that the selection of the samples is oneof the key issues Since now all samples have values accordingto two-phase iteration according to the information-basedcriterion it is convenient to evaluate samples and refine theset by arranging relative more samples near the equilibriumorwith great gradient of the value function so that the sampleset is better to describe the distribution of value function

Remark 5 Since the two-phase iteration belongs to valueiteration the initial policy of the algorithm does not need tobe stable To ensure BIBO in practice we need a mechanismto clamp the output of system even though the system willnot be smooth any longer

Theorem 2 gives the iteration principle for critic networklearning Based on the critic network with proper objectivedefined such as minimizing the expected total discountedreward [15] HDP or DHP update is applied to optimizeactor network Since this paper focuses on ACD and moreimportantly the update process of actor network does notaffect the convergence of critic network though maybe itinduces premature or local optimization the gradient updateof actor is not necessary Hence a simple optimum seeking isapplied to get optimal actions wrt sample states and then anactor network based onGaussian kernel is generated based onthese optimal state-action pairs

Up to now we have built a critic-actor architecture whichis namedGaussian-kernel-based Approximate Dynamic Pro-gramming (GK-ADP for short) and shown in Algorithm 1 Aspan 119870 is introduced so that hyperparameters are updatedevery 119870 times of 120572 update If 119870 gt 1 two phases areasynchronous From the view of stochastic approximationthis asynchronous learning does not change the convergencecondition (36) but benefit computational cost of hyperparam-eters learning

4 Simulation and Discussion

In this section we propose some numerical simulationsabout continuous control to illustrate the special propertiesand the feasibility of the algorithm including the necessityof two phases learning the specifical properties comparingwith traditional kernel-based ACDs and the performanceenhancement resulting from online refinement of sample set

Before further discussion we firstly give common setupin all simulations

(i) The span 119870 = 50(ii) To make output bounded once a state is out of the

boundary the system is reset randomly(iii) The exploration-exploitation tradeoff is left out of

account here During learning process all actions areselected randomly within limited action space Thusthe behavior of the system is totally unordered duringlearning

(iv) The same ALD-based sparsification in [15] is used todetermine the initial sample set in which 119908

119894= 18

V0 = 0 and V1 = 05 empirically(v) The sampling time and control interval are set to

002 s

41 The Necessity of Two-Phase Learning The proof ofTheorem 2 shows that properly determined learning rates ofphases 1 and 2 guarantee condition (36) Here we propose asimulation to show how the learning rates in phases 1 and 2affect the performance of the learning

Consider a simple single homogeneous inverted pendu-lum system

1199041 = 1199042

1199042 =(119872119901119892 sin (1199041) minus 120591cos (1199041)) 119889

119872119901119889

(38)

where 1199041 and 1199042 represent the angle and its speed 119872119901

=

01 kg 119892 = 98ms2 119889 = 02m respectively and 120591 is ahorizontal force acting on the pole

We test the success rate under different learning ratesSince during the learning the action is always selectedrandomly we have to verify the optimal policy after learningthat is an independent policy test is carried out to test theactor network Thus the success of one time of learningmeans in the independent test the pole can be swung upand maintained within [minus002 002] rad for more than 200iterations

In the simulation 60 state-action pairs are collected toserve as the sample set and the learning rates are set to 120578

2119905=

001(1 + 119905)0018 1205781

119905= 119886(1 + 119905

0018) where 119886 is varying from 0

to 012The learning is repeated 50 times in order to get the

average performance where the initial states of each runare randomly selected within the bound [minus04 04] and[minus05 05] wrt the dimensions of 1199041 and 1199042 The action space

8 Mathematical Problems in Engineering

Initialize120579 hyperparameters of Gaussian kernel model119883119871= 119909119894| 119909119894= (119904119894 119886119894)119871

119894=1 sample set

120587(1199040) initial policy1205781t 120578

2t learning step size

Let 119905 = 0Loop

119891119900119903 k = 1 119905119900 119870 119889119900

119905 = t + 1119886119905= 120587(119904

119905)

Get the reward 119903119905

Observe next state 119904119905+1

Update 120572 according to (12)Update the policy 120587 according to optimum seeking

119890119899119889 119891119900119903

Update 120579 according to (12)Until the termination criterion is satisfied

Algorithm 1 Gaussian-kernel-based Approximate Dynamic Programming

0 001 002 005 008 01 0120

01

02

03

04

05

06

07

08

09

1

Succ

ess r

ate o

ver 5

0 ru

ns

6

17

27

42

5047 46

Parameter a in learning rate 1205781

Figure 1 Success times over 50 runs under different learning rate1205781119905

is bounded within [minus05 05] And in each run the initialhyperparameters are set to [119890

minus2511989005

11989009

119890minus1

1198900001

]Figure 1 shows the result of success rates Clearly with

1205782119905fixed different 1205781

119905rsquos affect the performance significantly In

particular without the learning of hyperparameters that is1205781119905= 0 there are only 6 successful runs over 50 runsWith the

increasing of 119886 the success rate increases till 1 when 119886 = 008But as 119886 goes on increasing the performance becomes worse

Hence both phases are necessary if the hyperparametersare not initialized properly It should be noted that thelearning rates wrt two phases need to be regulated carefullyin order to guarantee condition (36) which leads the learningprocess to the equilibrium in Remark 3 but not to some kindof boundary where the actor always executed the maximal orminimal actions

0 1000 2000 3000 4000 5000 6000 7000 8000500

1000

1500

2000

2500

3000

3500

4000

4500

5000

Iteration

The s

um o

f Q v

alue

s of a

ll sa

mpl

e sta

te-a

ctio

ns

No learninga = 001a = 002a = 005

a = 008a = 01a = 012

Figure 2 The average evolution processes over 50 runs wrtdifferent 1205781

119905

If all samplesrsquo values on each iteration are summed up andall cumulative values wrt iterations are depicted in seriesthen we have Figure 2 The evolution process wrt the bestparameter 119886 = 008 is marked by the darker diamond It isclear that even with different 1205781

119905 the evolution processes of

the samplesrsquo value learning are similar That means due toBIBO property the samplesrsquo values must be finally boundedand convergent However as mentioned above it does notmean that the learning process converges to the properequilibrium


Figure 3: One-dimensional inverted pendulum (the figure labels the pole angle s1 and the cart displacement s3).

4.2. Value Iteration versus Policy Iteration. As mentioned in Algorithm 1, the value iteration in critic network learning does not depend on the convergence of the actor network, and, compared with HDP or DHP, the direct optimum seeking for the actor network seems a little aimless. To test its performance and discuss its special learning characteristics, a comparison between GK-ADP and KHDP is carried out.

The control objective in the simulation is a one-dimensional inverted pendulum, where a single-stage inverted pendulum is mounted on a cart which is able to move linearly, just as Figure 3 shows.

The mathematical model is given as

    ṡ1 = s2 + ε,
    ṡ2 = [g sin(s1) - M_p d s2² cos(s1) sin(s1)/M] / [d(4/3 - M_p cos²(s1)/M)]
         + [cos(s1)/M] / [d(4/3 - M_p cos²(s1)/M)] · u,
    ṡ3 = s4,
    ṡ4 = [u + M_p d (s2² sin(s1) - ṡ2 cos(s1))] / M,                              (39)

where s1 to s4 represent the angle, angle speed, linear displacement, and linear speed of the cart, respectively, u represents the force acting on the cart, M = 0.2 kg, d = 0.5 m, ε ∼ U(-0.02, 0.02), and other denotations are the same as in (38).
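For readers who want to reproduce the plant, a hedged Euler-integration sketch of (39) at the stated 0.02 s sampling time is given below; the fraction structure follows the reconstruction printed above, M_p = 0.1 kg and g = 9.8 m/s² are carried over from (38), and the noise is assumed to enter the angle channel, so the details should be checked against the original typeset equations.

import numpy as np

M, Mp, d, g, DT = 0.2, 0.1, 0.5, 9.8, 0.02   # cart mass, pole mass (kg), pole length (m), gravity, step (s)
rng = np.random.default_rng()

def cart_pole_step(s, u):
    """One Euler step of model (39); s = [angle, angle speed, cart position, cart speed]."""
    s1, s2, s3, s4 = s
    eps = rng.uniform(-0.02, 0.02)
    denom = d * (4.0 / 3.0 - Mp * np.cos(s1) ** 2 / M)
    s2_dot = (g * np.sin(s1) - Mp * d * s2 ** 2 * np.cos(s1) * np.sin(s1) / M) / denom \
             + np.cos(s1) / M / denom * u
    s4_dot = (u + Mp * d * (s2 ** 2 * np.sin(s1) - s2_dot * np.cos(s1))) / M
    return np.array([s1 + DT * (s2 + eps), s2 + DT * s2_dot,
                     s3 + DT * s4,         s4 + DT * s4_dot])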

Thus the state-action space is a 5D space, much larger than that in simulation 1. The configurations of both algorithms are listed as follows.

(i) A small sample set with 50 samples, determined by ALD-based sparsification, is adopted to build the critic network.

(ii) For both algorithms the state-action pair is limited to [-0.3, 0.3], [-1, 1], [-0.3, 0.3], [-2, 2], and [-3, 3] w.r.t. s1 to s4 and u.

(iii) In GK-ADP the learning rates are η¹_t = 0.02/(1 + t^0.01) and η²_t = 0.02/(1 + t)^0.02.

Figure 4: The comparison of the average evolution processes over 100 runs (x-axis: iteration; left y-axis: the average of cumulated weights over 100 runs for the KHDPs with σ = 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4; right y-axis: the cumulated Q values for GK-ADP).

(iv) The KHDP algorithm in [15] is chosen as the comparison, where the critic network uses Gaussian kernels. To get proper Gaussian parameters, a series of σ values, 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4, are tested in the simulation. The elements of the weights α are initialized randomly in [-0.5, 0.5], the forgetting factor in RLS-TD(0) is set to μ = 1, and the learning rate in KHDP is set to 0.3.

(v) All experiments run 100 times to get the average performance, and in each run there are 10000 iterations to learn the critic network.

Before further discussion, it should be noted that it makes no sense to argue about which algorithm has better learning performance, because, besides the debate of policy iteration versus value iteration, there are too many parameters in the simulation configuration affecting learning performance. So the aim of this simulation is to illustrate the learning characteristics of GK-ADP.

However, to make the comparison as fair as possible, we regulate the learning rates of both algorithms to get similar evolution processes. Figure 4 shows the evolution processes of GK-ADP and the KHDPs under different σ, where the y-axis on the left represents the cumulated weights Σ_i α_i of the critic network in kernel ACD and the other y-axis represents the cumulated values of the samples, Σ_i Q(x_i), x_i ∈ X_L, in GK-ADP. It implies that, although with different units, the learning processes under both algorithms converge at nearly the same speed.

Then the success rates of all algorithms are depicted in Figure 5. The left six bars represent the success rates of KHDP with different σ. With the superiority argument left aside, we find that such a fixed σ in KHDP is very similar to the fixed hyperparameters θ, which need to be set properly in order to get a higher success rate. But unfortunately there is no mechanism in KHDP to regulate σ online.


Figure 5: The success rates w.r.t. all algorithms over 100 runs (x-axis: parameter σ in ALD-based sparsification; y-axis: success rate over 100 runs; KHDP success counts 67, 68, 76, 75, 58, and 57 for σ = 0.9 to 2.4; GK-ADP: 96).

On the contrary, the two-phase iteration introduces the update of hyperparameters into the critic network, which is able to drive the hyperparameters to better values even with not-so-good initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with dynamic σ when the Gaussian kernel is used.

To discuss the performance in depth, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures 6(a) and 6(b), respectively, where the start state is set to [0.2, 0, -0.2, 0].

Apparently the transition time of GK-ADP is much smaller than that of KHDP. We think that, besides possibly well-regulated parameters, an important reason is the nongradient learning of the actor network.

The critic learning only depends on the exploration-exploitation balance and not on the convergence of actor learning. If the exploration-exploitation balance is designed without the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and then there are alternatives to gradient learning for actor network optimization, for example, the direct optimum seeking Gaussian regression actor network in GK-ADP.

Such direct optimum seeking may result in a nearly nonsmooth actor network, just like the force output depicted in the second plot of Figure 6(a). To explain this phenomenon, we can find the clue in Figure 7, which illustrates the best actions w.r.t. all sample states according to the final actor network. It is reasonable that the force direction is always the same as the pole angle and contrary to the cart displacement. And clearly the transition interval from the negative limit -3 N to the positive 3 N is very short. Thus the output of the actor network resulting from Gaussian kernels intends to control the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.

It is obvious that GK-ADP comes with high efficiency but also with potential risks, such as the impact on actuators. Hence how to design the exploration-exploitation balance to satisfy Theorem 2 and how to design actor network learning to balance efficiency and engineering applicability are two important issues in future work.

Finally we check the samples' values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of the immediate reward, the closer the states are to the equilibrium, the smaller the Q value is.

If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, just as Figure 9 shows, a shortage of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.

To solve this problem and improve performance, besides optimizing learning parameters, Remark 4 implies that another and maybe more efficient way is to refine the sample set based on the samples' values. We will investigate it in the following subsection.

4.3. Online Adjustment of Sample Set. Due to learning the samples' values, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the following expected utility [26]:

    U(x) = ρ h(E[Q(x) | X_L]) + (β/2) log(var[Q(x) | X_L]),                      (40)

where h(·) is a normalization function over the sample set and ρ = 1, β = 0.3 are coefficients.
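A self-contained sketch of evaluating (40) from the GP posterior of (6) and (7) is given below; using min-max scaling for h(·) and adding small numerical floors are our assumptions, not details taken from the paper.

import numpy as np

def kern(a, b, theta):
    # Gaussian kernel of (3); theta = [w_1..w_D, v1, v0]
    w, v1 = theta[:-2], theta[-2]
    diff = (a - b) / w
    return v1 * np.exp(-0.5 * np.dot(diff, diff))

def expected_utilities(T_x, X_L, y, theta, rho=1.0, beta=0.3):
    """U(x) of (40) for every state-action pair in the trajectory T_x."""
    v0 = theta[-1]
    K_L = np.array([[kern(a, b, theta) for b in X_L] for a in X_L]) + v0 * np.eye(len(X_L))
    K_inv = np.linalg.inv(K_L)
    k_rows = np.array([[kern(x, xi, theta) for xi in X_L] for x in T_x])
    means = k_rows @ K_inv @ y                                        # posterior mean, (6)
    variances = np.array([kern(x, x, theta) for x in T_x]) \
                - np.einsum('ij,jk,ik->i', k_rows, K_inv, k_rows)     # posterior variance, (7)
    h = (means - means.min()) / (means.max() - means.min() + 1e-12)   # h(.) as min-max scaling
    return rho * h + 0.5 * beta * np.log(np.maximum(variances, 1e-12))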

Let N_add represent the number of samples added to the sample set, N_limit the size limit of the sample set, and T(x) = {x(0), x(1), ..., x(T_e)} the state-action pair trajectory during GK-ADP learning, where T_e denotes the end iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.

Since in simulation 2 we have carried out 100 runs of the experiment for GK-ADP, 100 sets of sample values w.r.t. the same sample states are obtained. Now let us apply Algorithm 2 to these resulting sample sets, where T(x) is picked up from the history trajectory of state-action pairs in each run, and N_add = 10, N_limit = 60, in order to add 10 samples into the sample set.

We repeat all 100 runs of GK-ADP again with the same learning rates η¹ and η². Compared with the 96 successful runs in simulation 2, 94 runs are successful in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution process of sample values over 100 runs, where the first 10000 iterations display the learning process in simulation 2 and the following 8000 iterations display the learning process after adding samples. Clearly the cumulative Q value exhibits a sudden increase after the samples were added and almost keeps steady afterwards. It implies that, even with more samples added, there is not too much left to learn for the sample values.

However, if we check the cart movement, we will find the enhancement brought by the change of the sample set. For the jth run we have two actor networks resulting from GK-ADP before and after adding samples. Using these actors, we carry out the control tests, respectively.


Figure 6: The outputs of the actor networks resulting from GK-ADP and KHDP. (a) The resulting control performance using GK-ADP; (b) the resulting control performance using KHDP (each panel shows the pole angle (rad) and speed (rad/s), and the force (N), versus time (s)).

Figure 7: The final best policies w.r.t. all samples (best action versus pole angle (rad) and linear displacement (m)).

Figure 8: The Q values projected onto the dimensions of pole angle (s1) and linear displacement (s3) w.r.t. all samples.

Figure 9: The final cart displacement (m) and linear speed (m/s) in the test of each successful policy (x-axis: run).

We collect the absolute values of cart displacement, denoted by S_3^b(j) and S_3^a(j), at the moment when the pole has been swung up for 100 instants, and we define the enhancement as

    E(j) = (S_3^b(j) - S_3^a(j)) / S_3^b(j).                                      (41)

Obviously, a positive enhancement indicates that the actor network after adding samples behaves better.


for l = 1 to the end of T(x)
    Calculate E[Q(x(l)) | X_L] and var[Q(x(l)) | X_L] using (6) and (7)
    Calculate U(x(l)) using (40)
end for
Sort all U(x(l)), l = 1, ..., T_e, in ascending order
Add the first N_add state-action pairs to the sample set to get the extended set X_{L+N_add}
if L + N_add > N_limit
    Calculate hyperparameters based on X_{L+N_add}
    for l = 1 to L + N_add
        Calculate E[Q(x(l)) | X_{L+N_add}] and var[Q(x(l)) | X_{L+N_add}] using (6) and (7)
        Calculate U(x(l)) using (40)
    end for
    Sort all U(x(l)), l = 1, ..., L + N_add, in descending order
    Delete the first L + N_add - N_limit samples to get the refined set X_{N_limit}
end if

Algorithm 2: Refinement of sample set.
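A hedged Python sketch of this refinement, reusing an expected-utility routine like the one above, might look as follows; `utilities`, `predict_mean`, and `reestimate_theta` are placeholders for (40), (6), and the hyperparameter re-optimization, and assigning GP-predicted means as the values of the newly added samples is our assumption rather than a detail stated in the paper. The ordering follows Algorithm 2: the lowest-utility visited pairs are added, and the highest-utility ones are pruned.

import numpy as np

def refine_sample_set(X_L, y, T_x, theta, utilities, predict_mean, reestimate_theta,
                      n_add=10, n_limit=60):
    """Sketch of Algorithm 2; X_L, T_x are arrays of state-action pairs, y their values."""
    # rank every visited state-action pair by expected utility (40), ascending,
    # and add the first n_add pairs to the sample set
    order = np.argsort(utilities(T_x, X_L, y, theta))
    X_new = T_x[order[:n_add]]
    X_ext = np.vstack([X_L, X_new])
    y_ext = np.concatenate([y, [predict_mean(x, X_L, y, theta) for x in X_new]])

    if len(X_ext) > n_limit:
        theta = reestimate_theta(X_ext, y_ext)               # recompute hyperparameters
        # re-rank on the extended set, descending, and delete the first (worst) samples
        order = np.argsort(utilities(X_ext, X_ext, y_ext, theta))[::-1]
        drop = set(order[:len(X_ext) - n_limit].tolist())
        keep = [i for i in range(len(X_ext)) if i not in drop]
        X_ext, y_ext = X_ext[keep], y_ext[keep]
    return X_ext, y_ext, theta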

Figure 10: Learning performance of GK-ADP when the sample set is refined by adding 10 samples. (a) The average evolution process of Q values before and after adding samples (the sum of Q values of all sample state-actions versus iteration); (b) the enhancement of the control performance regarding cart displacement after adding samples (enhancement versus run).

As Figure 10(b) shows, where all enhancements w.r.t. the 100 runs are illustrated together, almost all cart displacements at the moment the pole is swung up are enhanced more or less, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.

Finally, let us check which state-action pairs are added into the sample set. We put all added samples over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. If we only consider the relationship between the Q values and the pole angle, just as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the boundary -0.4, due to the ratio between ρ and β.

5. Conclusions and Future Work

ADP methods are among the most promising research directions on RL in continuous environments to construct learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, two-phase iteration is good at making the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations are carried out to verify the necessity of two-phase learning and to illustrate the properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to refine the sample set online, in order to enhance the performance of the critic-actor architecture during operation.


Figure 11: The Q values w.r.t. all added samples over the 100 runs. (a) Projection onto the dimensions of pole angle (rad) and cart displacement (m); (b) projection onto the dimension of pole angle (rad).

However, there are some issues that need to be addressed in the future. The first is how to guarantee condition (36) during learning, which is now determined empirically. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead two-phase iteration to failure, not only to slow convergence.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.

References

[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.

[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.

[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.

[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.

[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.

[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.

[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.

[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.

[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.

[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.

[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.

[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.

[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.

[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.

[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.

[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.

[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.

[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.

[21] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.

[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.

[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.

[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.

[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Schölkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.

[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.

[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.

[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.

[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.

[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.


Page 3: New Research Article Two-Phase Iteration for Value Function …downloads.hindawi.com/journals/mpe/2015/760459.pdf · 2019. 7. 31. · Research Article Two-Phase Iteration for Value

Mathematical Problems in Engineering 3

21 Kernel-Based ACDs Kernel-based ACDs mainly consistof a critic network a kernel-based feature learning module areward function an actor networkcontroller and a model ofthe plant The critic constructed by kernel machine is used toapproximate the value functions or their derivativesThen theoutput of the critic is used in the training process of the actorso that policy gradients can be computed As actor finallyconverges the optimal action policymapping states to actionsis described by this actor

Traditional neural network based on kernel machine andsamples serves as the model of value functions just as thefollowing equation shows and the recursive algorithm suchas KLSTD [29] serves as value function approximation

119876 (119909) =

119871

sum

119894=1120572119894119896 (119909 119909

119894) (1)

where 119909 and 119909119894represent state-action pairs (119904 119886) and (119904

119894 119886119894)

120572119894 119894 = 1 2 119871 are the weights and (119904

119894 119886119894) represents

selected state-action pairs in sample set and 119896(119909 119909119894) is a kernel

functionThe key of critic learning is the update of the weights vec-

tor120572The value functionmodeling in continuous stateactionspace is a regression problem From the view of the modelNN structure is a reproducing kernel space spanned bythe samples in which the value function is expressed as alinear regression function Obviously as the basis of Hilbertspace how to select samples determines the VC dimensionof identification as well as the performance of value functionapproximation

If only using ALD-based kernel sparsification it isonly independent of samples that are considered in sampleselection So it is hard to evaluate how good the sampleset is because the sample selection does not consider thedistribution of value function and the performance of ALDanalysis is affected seriously by the hyperparameters of kernelfunction which are predetermined empirically and fixedduring learning

If the hyperparameters can be optimized online and thevalue function wrt samples can be evaluated by iterationalgorithms the critic network will be optimized not only byvalue approximation but also by hyperparameter optimiza-tion Moreover with approximated sample values there is adirect way to evaluate the validity of sample set in orderto regulate the set online Thus in this paper we turn toGaussian processes to construct the criterion for samples andhyperparameters

22 GP-Based Value Function Model For an MDP the datasamples 119909

119894and the corresponding 119876 value can be collected

by observing theMDPHere 119909119894is the state-action pairs (119904

119894 119886119894)

119904 isin 119877119873 119886 isin 119877

119872 and 119876(119909) is the value function defined as

119876lowast(119909) = 119877 (119909) + 120573int119901 (119904

1015840| 119909)119881 (119904

1015840) 1198891199041015840 (2)

where 119881(119904) = max119886119876(119909)

Given a sample set collected from a continuous dynamicsystem 119883

119871 119910 where 119910 = 119876(119883

119871) + 120576 120576 sim 119873(0 V0) Gaussian

regression with covariance function shown in the followingequation is a well known model technology to infer the 119876

function

119896 (119909119901 119909119902) = V1 exp [minus

12(119909119901minus119909119902)119879

Λminus1

(119909119901minus119909119902)] (3)

where Λ = diag([1199081 1199082 119908119873+119872])Assuming additive independent identically distributed

Gaussian noise 120576 with variance V0 the prior on the noisyobservations becomes

119870119871= 119870 (119883

119871 119883119871) + V0119868119871 (4)

where119870(119883119871 119883119871)

=

[[[[[[[[

[

119896 (1199091 1199091) 119896 (1199091 1199092) sdot sdot sdot 119896 (1199091 119909119871)

119896 (1199092 1199091) d

d

119896 (119909119871 1199091) sdot sdot sdot sdot sdot sdot 119896 (119909

119871 119909119871)

]]]]]]]]

]

(5)

The parameters 119908119889 V0 and V1 are the hyperparam-

eters of the 119866119877119902and collected within the vector 120579 =

[1199081 sdot sdot sdot 119908119873+119872

V1 V0]119879

For an arbitrary input 119909lowast the predictive distribution of

the function value is Gaussian distributed with mean andvariance given by

119864 [119876 (119909lowast)] = 119870 (119909

lowast 119883119871)119870minus1119871

119910 (6)

var [119876 (119909)] = 119896 (119909lowast 119909lowast) minus119870 (119909

lowast 119883119871)119870minus1119871

119870(119883119871 119909lowast) (7)

Comparing (6) to (1) we find that if we let 120572 = 119870minus1119871

119910 theneural network is also regarded as the Gaussian regressionOr the critic network can be constructed based on Gaussiankernel machine if the following conditions are satisfied

Condition 1 The hyperparameters 120579 of Gaussian kernel areknown

Condition 2 The values 119910 wrt all samples states 119883119871are

known

With Gaussian-kernel-based critic network the samplestate-action pairs and corresponding values are known Andthen the criterion such as the comprehensive utility proposedin [26] can be set up in order to refine sample set online Atthe same time it is convenient to optimize hyperparametersin order to approximate 119876 value function more accuratelyThus besides the advantages brought by kernel machine thecritic based on Gaussian-kernel will be better in approximat-ing value functions

Consider Condition 1 If the values 119910 wrt119883119871are known

(note that it is indeed Condition 2) the common way to gethyperparameters is by evidencemaximization where the log-evidence is given by

119871 (120579) = minus12119910119879119870minus1119871

119910minus12log 1003816100381610038161003816119870119871

1003816100381610038161003816 minus119899

2log 2120587 (8)

4 Mathematical Problems in Engineering

It requires the calculation of the derivative of 119871(120579) wrt each120579119894 given by

120597119871 (120579)

120597120579119894

= minus12tr [119870minus1119871

120597119870119871

120597120579119894

]+12120572119879 120597119870119871

120597120579119894

120572 (9)

where tr[sdot] denotes the trace 120572 = 119870minus1119871

119910Consider Condition 2 For unknown system if 119870

119871is

known that is the hyperparameters are known (note thatit is indeed Condition 1) the update of critic network willbe transferred to the update of values wrt samples by usingvalue iteration

According to the analysis both conditions are interde-pendent That means the update of critic depends on knownhyperparameters and the optimization of hyperparametersdepends on accurate sample values

Hence we need a comprehensive iteration method torealize value approximation and optimization of hyperpa-rameters simultaneously A direct way is to update themalternately Unfortunately this way is not reasonable becausethese two processes are tightly coupled For example tem-poral differential errors drive value approximation but thechange of weights 120572will cause the change of hyperparameters120579 simultaneously then it is difficult to tell whether thistemporal differential error is induced by observation or byGaussian regression model changing

To solve this problem a kind of two-phase value iterationfor critic network is presented in the next section and theconditions of convergence are analyzed

3 Two-Phase Value Iteration for CriticNetwork Approximation

First a proposition is given to describe the relationshipbetween hyperparameters and the sample value function

Proposition 1 The hyperparameters are optimized by evi-dence maximization according to the samples and their Qvalues and the log-evidence is given by

120597119871 (120579)

120597120579119894

= minus12tr [119870minus1119871

120597119870119871

120597120579119894

]+12120572119879 120597119870119871

120597120579119894

120572119879= 0

119894 = 1 119863

(10)

where 120572 = 119870minus1119871

119910 It can be proved that for arbitrary hyperpa-rameters 120579 if 120572 = 0 (10) defines an implicit function or acontinuously differentiable function as follows

120579 = 119891 (120572) = [1198911 (120572) 1198912 (120572) sdot sdot sdot 119891119863 (120572)]

119879

(11)

Then the two-phase value iteration for critic network isdescribed as the following theorem

Theorem 2 Given the following conditions

(1) the system is boundary input and boundary output(BIBO) stable

(2) the immediate reward 119903 is bounded(3) for 119896 = 1 2 120578119896

119905rarr 0 suminfin

119905=1 120578119896

119905= infin

the following iteration process is convergent

120572119905+1 = 120572

119905

120579119905+1 = 120579

119905+ 1205781

119905(119891 (120572119905) minus 120579119905)

120572119905+2 = 120572

119905+1 + 1205782

119905119896alefsym

120579119905

(119909119883119871)

sdot [119903 + 120573119881 (120579119905 1199091015840) minus 119896120579119905

(119909119883119871) 120572119905+1]

120579119905+2 = 120579

119905+1

(12)

where 119881(120579119905 1199091015840) = max

119886119896120579119905

(1199091015840 119883119871)120572(119905) and 119896

alefsym

120579119905

(119909 119883119871)

is a kind of pseudoinversion of 119896120579119905

(119909 119883119871) that is

119896120579119905

(119909 119883119871)119896alefsym

120579119905

(119909 119883119871) = 1

Proof From (12) it is clear that the two phases include theupdate of hyperparameters in phase 1 which is viewed as theupdate of generalization model and the update of samplesrsquovalue in phase 2 which is viewed as the update of criticnetwork

The convergence of iterative algorithm is proved based onstochastic approximation Lyapunov method

Define that

119875119905119896120579119905

(119909119883119871) 120572119905= 119903 +120573119881 (120579

119905 1199091015840)

= 119903 +max119886

119896120579119905

(1199091015840 119883119871) 120572 (119905)

(13)

Equation (12) is rewritten as

120572119905+1 = 120572

119905

120579119905+1 = 120579

119905+ 120578

1119905(119891 (120572119905) minus 120579119905)

120572119905+2 = 120572

119905+1 + 1205782119905119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909119883119871) 120572119905+1 minus 119896

120579119905

(119909119883119871) 120572119905+1]

120579119905+2 = 120579

119905+1

(14)

Define approximation errors as

1205931119905+2 = 120593

1119905+1 = 120579

119905+1 minus119891 (120572119905+1) = 120579

119905minus119891 (120572

119905+1)

+ 1205781119905[119891 (120572119905) minus 120579119905] = (1minus 120578

1119905) 120593

1119905

(15)

1205932119905+2 = (120572

119905+2 minus120572lowast) = (120572

119905+1 minus120572lowast) + 120578

2119905119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909 119883119871) 120572119905+1

minus119875119905+1119896120579

119905

(119909119883119871) 120572lowast+119875119905+1119896120579

119905

(119909119883119871) 120572lowast

minus 119896120579119905

(119909119883119871) 120572119905+1] = 120593

1119905+1 + 120578

2119905119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909 119883119871) 120572119905+1

Mathematical Problems in Engineering 5

minus119875119905+1119896120579

119905

(119909119883119871) 120572lowast] + 120578

2119905119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909119883119871) 120572lowast

minus 119896120579119905

(119909119883119871) 120572119905+1]

(16)

Further define that

1205821 (119909 120572119905+1 120572lowast) = 119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909119883119871) 120572119905+1 minus119875

119905+1119896120579119905

(119909119883119871) 120572lowast]

1205822 (119909 120572119905+1 120572lowast) = 119896alefsym

120579119905

(119909119883119871)

sdot [119875119905+1119896120579

119905

(119909119883119871) 120572lowastminus 119896120579119905

(119909119883119871) 120572119905+1]

(17)

Thus (16) is reexpressed as

1205932119905+2 = 120593

1119905+1 + 120578

21199051205821 (119909 120572119905+1 120572

lowast) + 120578

21199051205822 (119909 120572119905+1 120572

lowast) (18)

Let 120593119905= [(120593

1119905)119879

(1205932119905)119879]119879

and the two-phase iteration isin the shape of stochastic approximation that is

[1205931119905+1

1205932119905+1

] = [1205931119905

1205932119905

]+[minus120578

11199051205931119905

0]

[1205931119905+2

1205932119905+2

]

= [1205931119905+1

1205932119905+1

]

+[0

12057821199051205821 (119909 120572119905+1 120572

lowast) + 120578

21199051205822 (119909 120572119905+1 120572

lowast)]

(19)

Define119881 (120593119905)

= 120593119879

119905Λ120593119905

= [(1205931119905)119879

(1205932119905)119879

] [119868119863

0

0 119896119879

120579119905

(119909lowast 119883119871) 119896120579119905

(119909lowast 119883119871)] [

1205931119905

1205932119905

]

(20)

where 119863 is the scale of the hyperparameters and 119909lowast will

be defined later Let 119896lowast119905represent 119896119879

120579119905

(119909lowast 119883119871)119896120579119905

(119909lowast 119883119871) for

shortObviously (20) is a positive definitematrix and |119909

lowast| lt infin

|119896lowast

119905| lt infin Hence119881(120593

119905) is a Lyapunov functional At moment

119905 the conditional expectation of the positive definite matrixis

119864119905119881119905+2 (120593119905+2)

= 119864119905[120593119879

119905+2Λ120593119905+2]

= 119864119905[(120593

1119905+2)119879

(1205932119905+2)119879

] [119868119863

0

0 119896lowast

119905

][1205931119905+2

1205932119905+2

]

(21)

It is easy to compute the first-order Taylor expansion of(21) as

119864119905119881119905+2 (120593119905+2)

= 119864119905119881119905+1 (120593119905+1) + 120578

2119905119864119905[1198811015840

119905+1 (120593119905+1) (120593119905+2 minus120593119905+1)]

+ (1205782119905)21198622119864119905 [(120593119905+2 minus120593

119905+1)119879(120593119905+2 minus120593

119905+1)]

(22)

in which the first item on the right of (22) is

119864119905119881119905+1 (120593119905+1)

= 119881119905(120593119905) + 120578

1119905119864119905[1198811015840

119905(120593119905) (120593119905+1 minus120593

119905)]

+ (1205781119905)21198621119864119905 [(120593119905+1 minus120593

119905)119879(120593119905+1 minus120593

119905)]

(23)

Substituting (23) into (22) yields

119864119905119881119905+2 (120593119905+2) minus119881

119905(120593119905)

= 1205781119905119864119905[1198811015840

119905(120593119905) 119884

1119905] + 120578

2119905119864119905[1198811015840

119905+1 (120593119905+1) 1198842119905+1]

+ (1205781119905)21198621119864119905 [(119884

1119905)2] + (120578

1119905)21198622119864119905 [(119884

2119905+1)

2]

(24)

where 1198841119905= [ minus120593

1119905

0]

1198842119905+1 = [

0119875119905+1119896120579

119905

(119909119883119871) 120572119905+1 minus 119896

120579119905

(119909119883119871) 120572119905+1

]

= [0

1205821 (119909 120572119905+1 120572lowast) + 1205822 (119909 120572119905+1 120572

lowast)]

(25)

Consider the last two items of (24) firstly If the immediatereward is bounded and infinite discounted reward is appliedthe value function is bounded Hence |120572

119905| lt infin

If the system is BIBO stable existΣ0 Σ1 sub 119877119899+119898 119899119898 are the

dimensions of the state space and action space respectivelyforall1199090 isin Σ0 119909infin isin Σ1 the policy space is bounded

According to Proposition 1 when the policy space isbounded and |120572

119905| lt infin |120579

119905| lt infin there exists a constant 1198881

so that

119864100381610038161003816100381610038161198811015840

119905+1 (120593119905+1) 1198842119905+1

10038161003816100381610038161003816le 1198881 (26)

In addition

119864100381610038161003816100381610038161198842119905+1

10038161003816100381610038161003816

2= 119864

10038161003816100381610038161003816[1205821 (119909 120572119905+1 120572

lowast) + 1205822 (119909 120572119905+1 120572

lowast)]

210038161003816100381610038161003816

le 2119864 1003816100381610038161003816100381612058221 (119909 120572119905+1 120572

lowast)10038161003816100381610038161003816+ 2119864 10038161003816100381610038161003816

12058222 (119909 120572119905+1 120572

lowast)10038161003816100381610038161003816

le 11988821205932119905+1 + 1198883120593

2119905+1 le 1198884119881 (120593

119905+1)

(27)

From (26) and (27) we know that

119864100381610038161003816100381610038161198842119905+1

10038161003816100381610038161003816

2+119864

100381610038161003816100381610038161198811015840

119905+1 (120593119905+1) 1198842119905+1

10038161003816100381610038161003816le 1198623119881 (120593

119905+1) +1198623 (28)

where 1198623 = max1198881 1198884 Similarly

119864100381610038161003816100381610038161198841119905

10038161003816100381610038161003816

2+119864

100381610038161003816100381610038161198811015840

119905(120593119905) 119884

1119905

10038161003816100381610038161003816le 1198624119881 (120593

119905) +1198624 (29)

6 Mathematical Problems in Engineering

According to Lemma 541 in [30] 119864119881(120593119905) lt infin and

119864|1198841119905|2lt infin and 119864|119884

2119905+1|

2lt infin

Now we focus on the first two items on the right of (24)The first item 119864

119905[1198811015840

119905(120593119905)119884

1119905] is computed as

119864119905[1198811015840

119905(120593119905) 119884

1119905]

= 119864119905[(120593

1119905)119879

(1205932119905)119879

] [119868119863

0

0 119896lowast

119905

][minus120593

1119905

0]

= minus100381610038161003816100381610038161205931119905

10038161003816100381610038161003816

2

(30)

For the second item 119864119905[1198811015840

119905+1(120593119905+1)1198842119905+1] when the

state transition function is time invariant it is true that119875119905+1119870120579

119905

(119909 119883119871)120572119905+1 = 119875

119905119870120579119905

(119909 119883119871)120572119905+1 120572119905+1 = 120572

119905 Then we

have

1198641199051205821 (119909 120572119905+1 120572

lowast) = 119864119905[119896alefsym

120579119905

(119909119883119871) 119875119905119896120579119905

(119909119883119871) 120572119905

minus 119896alefsym

120579119905

(119909119883119871) 119875119905119896120579119905

(119909119883119871) 120572lowast] = 119896alefsym

120579119905

(119909119883119871)

sdot 119864 [119875119905119896120579119905

(119909119883119871) 120572119905

minus119875119905119896120579119905

(119909119883119871) 120572lowast] le 119896alefsym

120579119905

(119909119883119871)

sdot10038171003817100381710038171003817119864 [119875119905119896120579119905

(119909119883119871) 120572119905minus119875119905119896120579119905

(119909119883119871) 120572lowast]10038171003817100381710038171003817

(31)

This inequality holds because of positive 119896alefsym120579119905

(119909 119883) 119896alefsym120579119905119894gt

0 119894 = 1 119871 Define the norm Δ(120579 sdot) = max119909|Δ(120579 119909)|

where |Δ(sdot)| is the derivative matrix norm of the vector 1 and119909lowast= argmax

119909|Δ(120579 119909)| Then

10038171003817100381710038171003817119864 [119875119905119896120579119905

(119909119883119871) 120572119905minus119875119905119896120579119905

(119909119883119871) 120572lowast]10038171003817100381710038171003817

= max119909

119864119905

10038161003816100381610038161003816[119875119905119896120579119905

(119909119883119871) 120572119905minus119875119905119896120579119905

(119909119883119871) 120572lowast]10038161003816100381610038161003816

= max119909

10038161003816100381610038161003816100381610038161003816119864119905[(119903 + 120573max

1198861015840

119896120579119905

(1199091015840 119883119871) 120572119905)

minus(119903 + 120573max1198861015840

119896120579119905

(1199091015840 119883119871) 120572lowast)]

10038161003816100381610038161003816100381610038161003816= 120573

sdotmax119909

10038161003816100381610038161003816100381610038161003816119864119905[max1198861015840

119896120579119905

(1199091015840 119883119871) 120572119905

minusmax1198861015840

119896120579119905

(1199091015840 119883119871) 120572lowast]

10038161003816100381610038161003816100381610038161003816= 120573max119909

10038161003816100381610038161003816100381610038161003816int1199041015840

119901 (1199041015840| 119909)

sdot [max1198861015840

119896120579119905

(1199091015840 119883119871) 120572119905minusmax1198861015840

119896120579119905

(1199091015840 119883119871) 120572lowast] 1198891199041015840

10038161003816100381610038161003816100381610038161003816

le 120573max119909

int1199041015840

119901 (1199041015840| 119909)

sdot

10038161003816100381610038161003816100381610038161003816[max1198861015840

119896120579119905

(1199091015840 119883119871) 120572119905minusmax1198861015840

119896120579119905

(1199091015840 119883119871) 120572lowast]

100381610038161003816100381610038161003816100381610038161198891199041015840

le 120573max119909

int1199041015840

119901 (1199041015840| 119909)

sdotmax1198861015840

10038161003816100381610038161003816119896120579119905

(1199091015840 119883119871) 120572119905minus 119896120579119905

(1199091015840 119883119871) 120572lowast10038161003816100381610038161003816

1198891199041015840= 120573

sdotmax119909

int1199041015840

119901 (1199041015840| 119909)max

1198861015840

10038161003816100381610038161003816119896120579119905

(1199091015840 119883119871) (120572119905minus 120572lowast)100381610038161003816100381610038161198891199041015840

= 120573max119909

int1199041015840

119901 (1199041015840| 119909)

sdotmax1198861015840

10038161003816100381610038161003816119864119905[119875119905119896120579119905

(1199091015840 119883119871) 120572119905minus 119875119905119896120579119905

(1199091015840 119883119871) 120572lowast]100381610038161003816100381610038161198891199041015840

le 120573max119909

10038161003816100381610038161003816119896120579119905

(119909119883119871) (120572119905minus120572lowast)10038161003816100381610038161003816= 120573

10038161003816100381610038161003816119896120579119905

(119909lowast 119883119871)

sdot 1205932119905

10038161003816100381610038161003816

(32)

Hence

1198641199051205821 (119909 120572119905+1 120572

lowast) le 120573119896

alefsym

120579119905

(119909lowast 119883119871)10038161003816100381610038161003816119896120579119905

(119909lowast 119883119871) 120593

2119905

10038161003816100381610038161003816 (33)

On the other hand

1198641199051205822 (119909 120572119905 120572

lowast) = 119864119905[119896alefsym

120579119905

(119909119883119871) 119875119905119896120579119905

(119909119883119871) 120572lowast

minus 119896alefsym

120579119905

(119909119883119871) 119896120579119905

(119909119883119871) 120572119905]

= 119864119905[119896alefsym

120579119905

(119909119883119871) 119875119905119896120579119905

(119909119883119871) 120572lowast

minus 119896alefsym

120579119905

(119909119883119871) 119896120579119905

(119909119883119871) 120572lowast] + 119896alefsym

120579119905

(119909119883119871)

sdot 119896120579119905

(119909119883119871) 120572lowastminus 119896alefsym

120579119905

(119909119883119871) 119896120579119905

(119909119883119871) 120572119905

= minus 119896alefsym

120579119905

(119909119883119871) [119896120579119905

(119909119883119871) 120593

2119905minus1198641199051205823 (119909 120579119905 120579

lowast)]

(34)

where 1205823(119909 120579119905 120579lowast) = 119875

119905119896120579119905

(119909 119883119871)120572lowastminus 119896120579119905

(119909 119883119871)120572lowast is the

value function error caused by the estimated hyperparame-ters

Substituting (33) and (34) into 119864119905[1198811015840

119905+1(120593119905+1)1198842119905+1] yields

119864119905[1198811015840

119905+1 (120593119905+1) 1198842119905+1] = 119864

119905[119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

sdot 119896120579119905

(119909lowast 119883119871) [1205821 (119909 120572119905+1 120572

lowast) + 1205822 (119909 120572119905+1 120572

lowast)]

= [119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

119896120579119905

(119909lowast 119883119871)

sdot [1198641199051205821 (119909 120572119905+1 120572

lowast) + 1198641199051205822 (119909 120572119905+1 120572

lowast)]

le 120573 [119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

119896120579119905

(119909lowast 119883119871) 119896alefsym

120579119905

(119909lowast 119883119871)

sdot10038161003816100381610038161003816119896120579119905

(119909lowast 119883119871) 120593

2119905

10038161003816100381610038161003816minus [119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

119896120579119905

(119909lowast 119883119871)

sdot 119896alefsym

120579119905

(119909lowast 119883119871) [119896120579119905

(119909lowast 119883119871) 120593

2119905minus1198641199051205823 (119909lowast 120579119905 120579lowast)]

le minus (1minus120573)10038161003816100381610038161003816119896120579119905

(119909lowast 119883119871) 120593

2119905

10038161003816100381610038161003816

2+100381610038161003816100381610038161003816[119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

sdot 1198641199051205823 (119909lowast 120579119905 120579lowast)100381610038161003816100381610038161003816

(35)

Mathematical Problems in Engineering 7

Obviously if the following inequality is satisfied

1205782119905

100381610038161003816100381610038161003816[119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

1198641199051205823 (119909lowast 120579119905 120579lowast)100381610038161003816100381610038161003816

lt 1205781119905

100381610038161003816100381610038161205931119905

10038161003816100381610038161003816

2+ 120578

2119905(1minus120573)

10038161003816100381610038161003816119896120579119905

(119909lowast 119883119871) 120593

2119905

10038161003816100381610038161003816

2(36)

there exists a positive const 120575 such that the first two items onthe right of (24) satisfy

1205781119905119864119905[1198811015840

119905(120593119905) 119884

1119905] + 120578

2119905119864119905[1198811015840

119905+1 (120593119905+1) 1198842119905+1] le minus 120575 (37)

According to Theorem 542 in [30] the iterative process isconvergent namely 120593

119905

1199081199011997888997888997888997888rarr 0

Remark 3 Let us check the final convergence positionObviously 120593

119905

1199081199011997888997888997888997888rarr 0 [ 120572119905

120579119905

]1199081199011997888997888997888997888rarr [

120572lowast

119891(120572lowast)] This means the

equilibriumof the critic networkmeets119864119905[119875119905119870120579lowast(119909 119883

119871)120572lowast] =

119870120579lowast(119909 119883

119871)120572lowast where 120579

lowast= 119891(120572

lowast) And the equilibrium of

hyperparameters 120579lowast = 119891(120572lowast) is the solution of evidencemax-

imization that max120593(119871(120593)) = max

120593((12)119910

119879119870minus1120579lowast (119883119871 119883119871)119910 +

(12)log|119862| + (1198992)log2120587) where 119910 = 119870minus1120579lowast (119883119871 119883119871)120572

lowast

Remark 4 It is clear that the selection of the samples is oneof the key issues Since now all samples have values accordingto two-phase iteration according to the information-basedcriterion it is convenient to evaluate samples and refine theset by arranging relative more samples near the equilibriumorwith great gradient of the value function so that the sampleset is better to describe the distribution of value function

Remark 5 Since the two-phase iteration belongs to valueiteration the initial policy of the algorithm does not need tobe stable To ensure BIBO in practice we need a mechanismto clamp the output of system even though the system willnot be smooth any longer

Theorem 2 gives the iteration principle for critic networklearning Based on the critic network with proper objectivedefined such as minimizing the expected total discountedreward [15] HDP or DHP update is applied to optimizeactor network Since this paper focuses on ACD and moreimportantly the update process of actor network does notaffect the convergence of critic network though maybe itinduces premature or local optimization the gradient updateof actor is not necessary Hence a simple optimum seeking isapplied to get optimal actions wrt sample states and then anactor network based onGaussian kernel is generated based onthese optimal state-action pairs

Up to now we have built a critic-actor architecture whichis namedGaussian-kernel-based Approximate Dynamic Pro-gramming (GK-ADP for short) and shown in Algorithm 1 Aspan 119870 is introduced so that hyperparameters are updatedevery 119870 times of 120572 update If 119870 gt 1 two phases areasynchronous From the view of stochastic approximationthis asynchronous learning does not change the convergencecondition (36) but benefit computational cost of hyperparam-eters learning

4 Simulation and Discussion

In this section we propose some numerical simulationsabout continuous control to illustrate the special propertiesand the feasibility of the algorithm including the necessityof two phases learning the specifical properties comparingwith traditional kernel-based ACDs and the performanceenhancement resulting from online refinement of sample set

Before further discussion we firstly give common setupin all simulations

(i) The span 119870 = 50(ii) To make output bounded once a state is out of the

boundary the system is reset randomly(iii) The exploration-exploitation tradeoff is left out of

account here During learning process all actions areselected randomly within limited action space Thusthe behavior of the system is totally unordered duringlearning

(iv) The same ALD-based sparsification in [15] is used todetermine the initial sample set in which 119908

119894= 18

V0 = 0 and V1 = 05 empirically(v) The sampling time and control interval are set to

002 s

41 The Necessity of Two-Phase Learning The proof ofTheorem 2 shows that properly determined learning rates ofphases 1 and 2 guarantee condition (36) Here we propose asimulation to show how the learning rates in phases 1 and 2affect the performance of the learning

Consider a simple single homogeneous inverted pendu-lum system

1199041 = 1199042

1199042 =(119872119901119892 sin (1199041) minus 120591cos (1199041)) 119889

119872119901119889

(38)

where 1199041 and 1199042 represent the angle and its speed 119872119901

=

01 kg 119892 = 98ms2 119889 = 02m respectively and 120591 is ahorizontal force acting on the pole

We test the success rate under different learning ratesSince during the learning the action is always selectedrandomly we have to verify the optimal policy after learningthat is an independent policy test is carried out to test theactor network Thus the success of one time of learningmeans in the independent test the pole can be swung upand maintained within [minus002 002] rad for more than 200iterations

In the simulation 60 state-action pairs are collected toserve as the sample set and the learning rates are set to 120578

2119905=

001(1 + 119905)0018 1205781

119905= 119886(1 + 119905

0018) where 119886 is varying from 0

to 012The learning is repeated 50 times in order to get the

average performance where the initial states of each runare randomly selected within the bound [minus04 04] and[minus05 05] wrt the dimensions of 1199041 and 1199042 The action space

8 Mathematical Problems in Engineering

Initialize120579 hyperparameters of Gaussian kernel model119883119871= 119909119894| 119909119894= (119904119894 119886119894)119871

119894=1 sample set

120587(1199040) initial policy1205781t 120578

2t learning step size

Let 119905 = 0Loop

119891119900119903 k = 1 119905119900 119870 119889119900

119905 = t + 1119886119905= 120587(119904

119905)

Get the reward 119903119905

Observe next state 119904119905+1

Update 120572 according to (12)Update the policy 120587 according to optimum seeking

119890119899119889 119891119900119903

Update 120579 according to (12)Until the termination criterion is satisfied

Algorithm 1 Gaussian-kernel-based Approximate Dynamic Programming

0 001 002 005 008 01 0120

01

02

03

04

05

06

07

08

09

1

Succ

ess r

ate o

ver 5

0 ru

ns

6

17

27

42

5047 46

Parameter a in learning rate 1205781

Figure 1 Success times over 50 runs under different learning rate1205781119905

is bounded within [minus05 05] And in each run the initialhyperparameters are set to [119890

minus2511989005

11989009

119890minus1

1198900001

]Figure 1 shows the result of success rates Clearly with

1205782119905fixed different 1205781

119905rsquos affect the performance significantly In

particular without the learning of hyperparameters that is1205781119905= 0 there are only 6 successful runs over 50 runsWith the

increasing of 119886 the success rate increases till 1 when 119886 = 008But as 119886 goes on increasing the performance becomes worse

Hence, both phases are necessary if the hyperparameters are not initialized properly. It should be noted that the learning rates w.r.t. the two phases need to be regulated carefully in order to guarantee condition (36), which leads the learning process to the equilibrium in Remark 3 and not to some kind of boundary where the actor always executes the maximal or minimal actions.

Figure 2: The average evolution processes of the sum of Q values of all sample state-actions over 50 runs w.r.t. different η^1_t (curves: no learning and a = 0.01, 0.02, 0.05, 0.08, 0.1, 0.12).

If all samples' values on each iteration are summed up and all cumulative values w.r.t. iterations are depicted in series, then we have Figure 2. The evolution process w.r.t. the best parameter a = 0.08 is marked by the darker diamonds. It is clear that, even with different η^1_t, the evolution processes of the samples' value learning are similar. That means that, due to the BIBO property, the samples' values must finally be bounded and convergent. However, as mentioned above, it does not mean that the learning process converges to the proper equilibrium.


Figure 3: One-dimensional inverted pendulum (the labels s_1 and s_3 mark the pole angle and the linear displacement of the cart).

4.2. Value Iteration versus Policy Iteration. As mentioned in Algorithm 1, the value iteration in critic network learning does not depend on the convergence of the actor network, and, compared with HDP or DHP, the direct optimum seeking for the actor network seems a little aimless. To test its performance and discuss its special learning characteristics, a comparison between GK-ADP and KHDP is carried out.

The control objective in the simulation is a one-dimensional inverted pendulum, where a single-stage inverted pendulum is mounted on a cart which is able to move linearly, just as Figure 3 shows.

The mathematical model is given as

$$
\begin{aligned}
\dot{s}_1 &= s_2 + \varepsilon,\\
\dot{s}_2 &= \frac{g\sin(s_1) - M_p d\, s_2^2 \cos(s_1)\sin(s_1)/M}{d\left(4/3 - M_p \cos^2(s_1)/M\right)}
           + \frac{\cos(s_1)/M}{d\left(4/3 - M_p \cos^2(s_1)/M\right)}\, u,\\
\dot{s}_3 &= s_4,\\
\dot{s}_4 &= \frac{u + M_p d\left(s_2^2 \sin(s_1) - \dot{s}_2 \cos(s_1)\right)}{M},
\end{aligned}
\tag{39}
$$

where s_1 to s_4 represent the pole angle, angular speed, linear displacement, and linear speed of the cart, respectively, u represents the force acting on the cart, M = 0.2 kg, d = 0.5 m, ε ∼ U(−0.02, 0.02), and the other denotations are the same as in (38).
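For reference, a minimal Euler-integration sketch of model (39), analogous to the one for (38), is given below; the integration scheme and all names are assumptions, with M_p and g carried over from (38).

```python
import numpy as np

M, M_P, D, G, DT = 0.2, 0.1, 0.5, 9.8, 0.02   # cart mass M, pole mass M_p, d, g, control interval

def cartpole_step(s, u):
    """One Euler step of model (39); s = [pole angle, angular speed,
    cart displacement, cart speed], u = force on the cart (N)."""
    s1, s2, s3, s4 = s
    denom = D * (4.0 / 3.0 - M_P * np.cos(s1) ** 2 / M)
    ds2 = (G * np.sin(s1) - M_P * D * s2 ** 2 * np.cos(s1) * np.sin(s1) / M) / denom \
          + np.cos(s1) / M / denom * u
    ds4 = (u + M_P * D * (s2 ** 2 * np.sin(s1) - ds2 * np.cos(s1))) / M
    ds1 = s2 + np.random.uniform(-0.02, 0.02)  # the disturbance term epsilon in (39)
    return np.array([s1 + DT * ds1, s2 + DT * ds2, s3 + DT * s4, s4 + DT * ds4])
```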

Thus, the state-action space is a 5-D space, much larger than that in simulation 1. The configurations of both algorithms are listed as follows.

(i) A small sample set with 50 samples, determined by ALD-based sparsification, is adopted to build the critic network.

(ii) For both algorithms, the state-action pair is limited to [−0.3, 0.3], [−1, 1], [−0.3, 0.3], [−2, 2], and [−3, 3] w.r.t. s_1 to s_4 and u.

(iii) In GK-ADP, the learning rates are η^1_t = 0.02/(1 + t^0.01) and η^2_t = 0.02/(1 + t)^0.02.

Figure 4: The comparison of the average evolution processes over 100 runs (left y-axis: the average of cumulated weights for KHDP with σ = 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4; right y-axis: the cumulated sample values for GK-ADP; x-axis: iteration).

(iv) The KHDP algorithm in [15] is chosen as the comparison, where the critic network uses Gaussian kernels. To get proper Gaussian parameters, a series of σ values, namely 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4, are tested in the simulation. The elements of the weights α are initialized randomly in [−0.5, 0.5], the forgetting factor in RLS-TD(0) is set to μ = 1, and the learning rate in KHDP is set to 0.3.

(v) All experiments run 100 times to get the average performance, and in each run there are 10000 iterations to learn the critic network.

Before further discussion, it should be noted that it makes no sense to argue over which algorithm has the better learning performance because, besides the debate of policy iteration versus value iteration, there are too many parameters in the simulation configuration affecting learning performance. So the aim of this simulation is to illustrate the learning characteristics of GK-ADP.

However, to make the comparison as fair as possible, we regulate the learning rates of both algorithms to get similar evolution processes. Figure 4 shows the evolution processes of GK-ADP and of KHDP under different σ, where the y-axis on the left represents the cumulated weights Σ_i α_i of the critic network in kernel ACD and the other y-axis represents the cumulated values of the samples, Σ_i Q(x_i), x_i ∈ X_L, in GK-ADP. It implies that, although with different units, the learning processes under both learning algorithms converge nearly at the same speed.

Then the success rates of all algorithms are depicted in Figure 5. The left six bars represent the success rates of KHDP with different σ. With the superiority argument left aside, we find that such a fixed σ in KHDP is very similar to the fixed hyperparameters θ, which need to be set properly in order to get a higher success rate. But unfortunately there is no mechanism in KHDP to regulate σ online.


Figure 5: The success rates w.r.t. all algorithms over 100 runs (KHDP with σ = 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4 succeeds 67, 68, 76, 75, 58, and 57 times, respectively; GK-ADP succeeds 96 times).

On the contrary, the two-phase iteration introduces the update of hyperparameters into the critic network, which is able to drive the hyperparameters to better values even with not-so-good initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with a dynamic σ when the Gaussian kernel is used.

To discuss the performance in depth, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures 6(a) and 6(b), respectively, where the start state is set to [0.2, 0, −0.2, 0].

Apparently, the transition time of GK-ADP is much smaller than that of KHDP. We think that, besides the possibly well-regulated parameters, an important reason is the nongradient learning for the actor network.

The critic learning only depends on the exploration-exploitation balance, not on the convergence of the actor learning. If the exploration-exploitation balance is designed without the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and then there are alternatives to gradient learning for actor network optimization, for example, the direct optimum-seeking Gaussian regression actor network in GK-ADP.
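A minimal sketch of such direct optimum seeking is given below: for every sample state, the bounded action space is searched on a grid for the action that maximizes the critic's Q estimate, and a Gaussian-kernel regressor over these state/best-action pairs then serves as the actor. The grid resolution, the kernel width, and the Nadaraya-Watson regression form are illustrative assumptions and not necessarily the authors' exact actor construction.

```python
import numpy as np

def build_actor(sample_states, critic_q, action_grid, sigma=1.0):
    """Direct optimum seeking over a discretized action space, followed by a
    Gaussian-kernel regression actor through the resulting state/action pairs.

    critic_q(s, a) -> scalar Q estimate from the learned critic (assumed given)
    """
    best_actions = np.array([action_grid[np.argmax([critic_q(s, a) for a in action_grid])]
                             for s in sample_states])

    def actor(s):
        # Gaussian-kernel (Nadaraya-Watson) regression over the sample states
        w = np.exp(-np.sum((sample_states - s) ** 2, axis=1) / (2.0 * sigma ** 2))
        return float(w @ best_actions / (w.sum() + 1e-12))

    return actor

# usage sketch: actor = build_actor(states, critic_q, np.linspace(-3.0, 3.0, 61))
```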

Such direct optimum seeking may result in a nearly nonsmooth actor network, just like the force output depicted in the second plot of Figure 6(a). To explain this phenomenon, we can find the clue in Figure 7, which illustrates the best actions w.r.t. all sample states according to the final actor network. It is reasonable that the force direction is always the same as the pole angle and contrary to the cart displacement. And clearly the transition interval from the negative limit −3 N to the positive 3 N is very short. Thus the output of the actor network resulting from Gaussian kernels tends to control the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.

It is obvious that GK-ADP has high efficiency but also potential risks, such as the impact on actuators. Hence, how to design the exploration-exploitation balance to satisfy Theorem 2 and how to design actor network learning to balance efficiency and engineering applicability are two important issues for future work.

Finally, we check the samples' values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of the immediate reward, the closer the states are to the equilibrium, the smaller the Q value is.

If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, just as Figure 9 shows, the shortage of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.

To solve this problem and improve performance, besides optimizing the learning parameters, Remark 4 implies that another, and maybe more efficient, way is to refine the sample set based on the samples' values. We will investigate this in the following subsection.

4.3. Online Adjustment of Sample Set. Due to learning the samples' values, it is possible to assess whether the samples are chosen reasonably. In this simulation, we adopt the following expected utility [26]:

$$
U(x) = \rho\, h\!\left(E\left[Q(x) \mid X_L\right]\right) + \frac{\beta}{2}\log\left(\operatorname{var}\left[Q(x) \mid X_L\right]\right),
\tag{40}
$$

where h(·) is a normalization function over the sample set and ρ = 1, β = 0.3 are coefficients.
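A sketch of evaluating (40) for one candidate state-action pair is shown below. It assumes the standard Gaussian-process posterior mean and variance in place of (6) and (7), a min-max normalization for h(·), and a user-supplied kernel callable, so it should be read as an approximation of the procedure rather than the authors' implementation.

```python
import numpy as np

def expected_utility(x, X_L, alpha, K_inv, kernel, q_min, q_max, rho=1.0, beta_c=0.3):
    """Expected utility U(x) of (40) for a candidate state-action pair x.

    kernel(x1, x2): assumed Gaussian kernel with the learned hyperparameters
    K_inv         : inverse of the kernel matrix over the sample set X_L
    [q_min, q_max]: range of the current sample values, used by the
                    min-max normalization standing in for h(.)
    """
    k = np.array([kernel(x, xi) for xi in X_L])     # k_theta(x, X_L)
    mean = k @ alpha                                # E[Q(x) | X_L]
    var = kernel(x, x) - k @ K_inv @ k              # var[Q(x) | X_L]
    h = (mean - q_min) / (q_max - q_min + 1e-12)    # normalization h(.)
    return rho * h + 0.5 * beta_c * np.log(max(var, 1e-12))
```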

Let N_add represent the number of samples added to the sample set, N_limit the size limit of the sample set, and T(x) = {x(0), x(1), ..., x(T_e)} the state-action pair trajectory during GK-ADP learning, where T_e denotes the end iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.

Since in simulation 2 we have carried out 100 runs of the experiment for GK-ADP, 100 sets of sample values w.r.t. the same sample states are obtained. Now let us apply Algorithm 2 to these resulting sample sets, where T(x) is picked up from the history trajectory of state-action pairs in each run and N_add = 10, N_limit = 60, in order to add 10 samples into the sample set.

We repeat all 100 runs of GK-ADP again with the same learning rates η^1 and η^2. Compared with the 96 successful runs in simulation 2, 94 runs are now successful in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution process of the sample values over 100 runs, where the first 10000 iterations display the learning process in simulation 2 and the following 8000 iterations display the learning process after adding samples. Clearly, the cumulative Q value shows a sudden increase after the samples are added and almost keeps steady afterwards. It implies that, even with more samples added, there is not too much to learn for the sample values.

However, if we check the cart movement, we will find the enhancement brought by the change of the sample set. For the jth run, we have two actor networks resulting from GK-ADP before and after adding samples. Using these actors, we carry out the control tests, respectively, and collect the absolute values of the cart displacement, denoted by S^b_3(j) and S^a_3(j), at the moment that the pole has been swung up for 100 instants.


Figure 6: The outputs of the actor networks resulting from GK-ADP and KHDP ((a) the resulting control performance using GK-ADP; (b) the resulting control performance using KHDP; each panel shows the pole angle (rad), the angular speed (rad/s), and the force (N) against time (s)).

Figure 7: The final best policies w.r.t. all samples (best action versus pole angle (rad) and linear displacement (m)).

Figure 8: The Q values projected onto the dimensions of pole angle (s_1) and linear displacement (s_3) w.r.t. all samples.

Figure 9: The final cart displacement (m) and linear speed (m/s) in the tests of the successful policies.

We define the enhancement as

$$
E(j) = \frac{S_3^{b}(j) - S_3^{a}(j)}{S_3^{b}(j)}.
\tag{41}
$$
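As a small worked example of (41) (not from the paper), the snippet below evaluates the enhancement elementwise for arrays of absolute cart displacements recorded before and after refinement; a positive value means the refined actor leaves the cart closer to the zero point.

```python
import numpy as np

def enhancement(S_b, S_a):
    """Elementwise enhancement E(j) of (41) for absolute cart displacements
    before (S_b) and after (S_a) adding samples."""
    S_b, S_a = np.asarray(S_b, dtype=float), np.asarray(S_a, dtype=float)
    return (S_b - S_a) / S_b

# e.g. enhancement([0.8, 0.5], [0.2, 0.6]) -> array([ 0.75, -0.2 ])
```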

Obviously, a positive enhancement indicates that the actor network after adding samples behaves better. All enhancements w.r.t. the 100 runs are illustrated together in Figure 10(b).


for l = 1 to the end of T(x)
    Calculate E[Q(x(l)) | X_L] and var[Q(x(l)) | X_L] using (6) and (7)
    Calculate U(x(l)) using (40)
end for
Sort all U(x(l)), l = 1, ..., T_e, in ascending order
Add the first N_add state-action pairs to the sample set to get the extended set X_{L+N_add}
if L + N_add > N_limit
    Calculate hyperparameters based on X_{L+N_add}
    for l = 1 to L + N_add
        Calculate E[Q(x(l)) | X_{L+N_add}] and var[Q(x(l)) | X_{L+N_add}] using (6) and (7)
        Calculate U(x(l)) using (40)
    end for
    Sort all U(x(l)), l = 1, ..., L + N_add, in descending order
    Delete the first L + N_add − N_limit samples to get the refined set X_{N_limit}
end if

Algorithm 2: Refinement of sample set.
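The following Python sketch mirrors the structure of Algorithm 2: every visited pair is scored with (40), the N_add pairs that come first in the ascending sort are added, and, if the size limit is exceeded, the pairs that come first in the descending sort are deleted. The utility callable and the array layout are assumptions, and the hyperparameter re-estimation for the enlarged set is omitted.

```python
import numpy as np

def refine_sample_set(X_L, trajectory, utility, n_add=10, n_limit=60):
    """Schematic refinement of the sample set (Algorithm 2).

    X_L, trajectory : arrays whose rows are state-action pairs
    utility(x, X)   : assumed callable returning the expected utility (40)
                      of pair x evaluated against sample set X
    """
    traj = np.asarray(trajectory)
    scores = np.array([utility(x, X_L) for x in traj])
    add_idx = np.argsort(scores)[:n_add]              # ascending order, take the first N_add
    X_new = np.vstack([X_L, traj[add_idx]])

    if len(X_new) > n_limit:                          # hyperparameters would be re-estimated here
        scores_new = np.array([utility(x, X_new) for x in X_new])
        drop_idx = np.argsort(scores_new)[::-1][:len(X_new) - n_limit]   # descending order
        X_new = np.delete(X_new, drop_idx, axis=0)
    return X_new
```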

Figure 10: Learning performance of GK-ADP when the sample set is refined by adding 10 samples ((a) the average evolution process of the sum of Q values before and after adding samples; (b) the enhancement of the control performance regarding cart displacement after adding samples).

As Figure 10(b) shows, almost all cart displacements at the moment the pole is swung up are enhanced more or less, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.

Finally, let us check which state-action pairs are added into the sample set. We put all samples added over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. If we only consider the relationship between the Q values and the pole angle, just as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the boundary −0.4, due to the ratio between ρ and β.

5. Conclusions and Future Work

ADP methods are among the most promising research works on RL in continuous environments to construct learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, two-phase iteration is good at making the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations are carried out to verify the necessity of two-phase learning and to illustrate the properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to refine the sample set online in order to enhance the performance of the critic-actor architecture during operation.


Figure 11: The Q values w.r.t. all added samples over 100 runs ((a) projection onto the dimensions of pole angle and cart displacement; (b) projection onto the dimension of pole angle).


However, there are some issues that need to be addressed in the future. The first is how to guarantee condition (36) during learning, which is now handled in an empirical way. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead two-phase iteration to failure, not only to slow convergence.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.

References

[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.
[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.
[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.
[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.
[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.
[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.
[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.
[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.
[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.
[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.
[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.
[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.
[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.
[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.
[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.
[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.
[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.
[21] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.
[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.
[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.
[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Schölkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.
[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.
[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.
[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs - multiple-step-ahead prediction," Tech. Rep., 2002.
[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.
[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.



(i) The span 119870 = 50(ii) To make output bounded once a state is out of the

boundary the system is reset randomly(iii) The exploration-exploitation tradeoff is left out of

account here During learning process all actions areselected randomly within limited action space Thusthe behavior of the system is totally unordered duringlearning

(iv) The same ALD-based sparsification in [15] is used todetermine the initial sample set in which 119908

119894= 18

V0 = 0 and V1 = 05 empirically(v) The sampling time and control interval are set to

002 s

41 The Necessity of Two-Phase Learning The proof ofTheorem 2 shows that properly determined learning rates ofphases 1 and 2 guarantee condition (36) Here we propose asimulation to show how the learning rates in phases 1 and 2affect the performance of the learning

Consider a simple single homogeneous inverted pendu-lum system

1199041 = 1199042

1199042 =(119872119901119892 sin (1199041) minus 120591cos (1199041)) 119889

119872119901119889

(38)

where 1199041 and 1199042 represent the angle and its speed 119872119901

=

01 kg 119892 = 98ms2 119889 = 02m respectively and 120591 is ahorizontal force acting on the pole

We test the success rate under different learning ratesSince during the learning the action is always selectedrandomly we have to verify the optimal policy after learningthat is an independent policy test is carried out to test theactor network Thus the success of one time of learningmeans in the independent test the pole can be swung upand maintained within [minus002 002] rad for more than 200iterations

In the simulation 60 state-action pairs are collected toserve as the sample set and the learning rates are set to 120578

2119905=

001(1 + 119905)0018 1205781

119905= 119886(1 + 119905

0018) where 119886 is varying from 0

to 012The learning is repeated 50 times in order to get the

average performance where the initial states of each runare randomly selected within the bound [minus04 04] and[minus05 05] wrt the dimensions of 1199041 and 1199042 The action space

8 Mathematical Problems in Engineering

Initialize120579 hyperparameters of Gaussian kernel model119883119871= 119909119894| 119909119894= (119904119894 119886119894)119871

119894=1 sample set

120587(1199040) initial policy1205781t 120578

2t learning step size

Let 119905 = 0Loop

119891119900119903 k = 1 119905119900 119870 119889119900

119905 = t + 1119886119905= 120587(119904

119905)

Get the reward 119903119905

Observe next state 119904119905+1

Update 120572 according to (12)Update the policy 120587 according to optimum seeking

119890119899119889 119891119900119903

Update 120579 according to (12)Until the termination criterion is satisfied

Algorithm 1 Gaussian-kernel-based Approximate Dynamic Programming

0 001 002 005 008 01 0120

01

02

03

04

05

06

07

08

09

1

Succ

ess r

ate o

ver 5

0 ru

ns

6

17

27

42

5047 46

Parameter a in learning rate 1205781

Figure 1 Success times over 50 runs under different learning rate1205781119905

is bounded within [minus05 05] And in each run the initialhyperparameters are set to [119890

minus2511989005

11989009

119890minus1

1198900001

]Figure 1 shows the result of success rates Clearly with

1205782119905fixed different 1205781

119905rsquos affect the performance significantly In

particular without the learning of hyperparameters that is1205781119905= 0 there are only 6 successful runs over 50 runsWith the

increasing of 119886 the success rate increases till 1 when 119886 = 008But as 119886 goes on increasing the performance becomes worse

Hence both phases are necessary if the hyperparametersare not initialized properly It should be noted that thelearning rates wrt two phases need to be regulated carefullyin order to guarantee condition (36) which leads the learningprocess to the equilibrium in Remark 3 but not to some kindof boundary where the actor always executed the maximal orminimal actions

0 1000 2000 3000 4000 5000 6000 7000 8000500

1000

1500

2000

2500

3000

3500

4000

4500

5000

Iteration

The s

um o

f Q v

alue

s of a

ll sa

mpl

e sta

te-a

ctio

ns

No learninga = 001a = 002a = 005

a = 008a = 01a = 012

Figure 2 The average evolution processes over 50 runs wrtdifferent 1205781

119905

If all samplesrsquo values on each iteration are summed up andall cumulative values wrt iterations are depicted in seriesthen we have Figure 2 The evolution process wrt the bestparameter 119886 = 008 is marked by the darker diamond It isclear that even with different 1205781

119905 the evolution processes of

the samplesrsquo value learning are similar That means due toBIBO property the samplesrsquo values must be finally boundedand convergent However as mentioned above it does notmean that the learning process converges to the properequilibrium

Mathematical Problems in Engineering 9

S1

S3

Figure 3 One-dimensional inverted pendulum

42 Value Iteration versus Policy Iteration As mentioned inAlgorithm 1 the value iteration in critic network learningdoes not depend on the convergence of the actor networkand compared with HDP or DHP the direct optimumseeking for actor network seems a little aimless To test itsperformance and discuss its special characters of learning acomparison between GK-ADP and KHDP is carried out

The control objective in the simulation is a one-dimensional inverted pendulum where a single-stageinverted pendulum is mounted on a cart which is able tomove linearly just as Figure 3 shows

The mathematic model is given as

1199041 = 1199042 + 120576

1199042 =119892 sin (1199041) minus 119872

119901119889119904

22 cos (1199041) sin (1199041) 119872

119889 (43 minus 119872119901cos2 (1199041) 119872)

+cos (1199041) 119872

119889 (43 minus 119872119901cos2 (1199041) 119872)

119906

1199043 = 1199044

1199044 =119906 + 119872

119901119889 (119904

22 sin (1199041) minus 1199042 cos (1199041))

119872

(39)

where 1199041 to 1199044 represent the state of angle angle speed lineardisplacement and linear speed of the cart respectively 119906represents the force acting on the cart119872 = 02 kg 119889 = 05m120576 sim 119880(minus002 002) and other denotations are the same as(38)

Thus the state-action space is 5D space much larger thanthat in simulation 1The configurations of the both algorithmsare listed as follows

(i) A small sample set with 50 samples is adopted to buildcritic network which is determined by ALD-basedsparsification

(ii) For both algorithms the state-action pair is limited to[minus03 03] [minus1 1] [minus03 03] [minus2 2] [minus3 3] wrt 1199041to 1199044 and 119906

(iii) In GK-ADP the learning rates are 1205781119905= 002(1+119905

001)

and 1205782119905= 002(1 + 119905)

002

0 2000 4000 6000 8000minus02

002040608

112141618

Iteration

The a

vera

ge o

f cum

ulat

ed w

eigh

ts ov

er 1

00 ru

ns

minus1000100200300400500600700800900

10000

GK-ADP120590 = 09

120590 = 12

120590 = 15

120590 = 18

120590 = 21

120590 = 24

Figure 4 The comparison of the average evolution processes over100 runs

(iv) The KHDP algorithm in [15] is chosen as the compar-ison where the critic network uses Gaussian kernelsTo get the proper Gaussian parameters a series of 120590 as09 12 15 18 21 and 24 are tested in the simulationThe elements of weights 120572 are initialized randomly in[minus05 05] the forgetting factor in RLS-TD(0) is set to120583 = 1 and the learning rate in KHDP is set to 03

(v) All experiments run 100 times to get the average per-formance And in each run there are 10000 iterationsto learn critic network

Before further discussion it should be noted that it makesno sense to figure out ourselves with which algorithm isbetter in learning performance because besides the debateof policy iteration versus value iteration there are too manyparameters in the simulation configuration affecting learningperformance So the aim of this simulation is to illustrate thelearning characters of GK-ADP

However, to make the comparison as fair as possible, we regulate the learning rates of both algorithms to get similar evolution processes. Figure 4 shows the evolution processes of GK-ADP and KHDPs under different $\sigma$, where the left $y$-axis represents the cumulated weights $\sum_i \alpha_i$ of the critic network in kernel ACD and the other $y$-axis represents the cumulated values of the samples, $\sum_i Q(x_i)$, $x_i \in X_L$, in GK-ADP. It implies that, although with different units, the learning processes under both learning algorithms converge at nearly the same speed.

Then the success rates of all algorithms are depicted in Figure 5. The left six bars represent the success rates of KHDP with different $\sigma$. With the superiority argument left aside, we find that such a fixed $\sigma$ in KHDP is very similar to fixed hyperparameters $\theta$, which need to be set properly in order to get a higher success rate. But unfortunately, there is no mechanism in KHDP to regulate $\sigma$ online.


Figure 5: The success rates w.r.t. all algorithms over 100 runs (KHDP with $\sigma$ = 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4: 67, 68, 76, 75, 58, and 57 successes, respectively; GK-ADP: 96).

On the contrary, the two-phase iteration introduces the update of hyperparameters into the critic network, which is able to drive the hyperparameters to better values even with not-so-good initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with dynamic $\sigma$ when the Gaussian kernel is used.

To discuss the performance in depth, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures 6(a) and 6(b), respectively, where the start state is set to $[0.2, 0, -0.2, 0]$.

Apparently, the transition time of GK-ADP is much smaller than that of KHDP. We think that, besides possibly well-regulated parameters, an important reason is the nongradient learning of the actor network.

The critic learning only depends on the exploration-exploitation balance, not on the convergence of actor learning. If the exploration-exploitation balance is designed without the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and then there are alternatives to gradient learning for actor network optimization, for example, the direct optimum-seeking Gaussian regression actor network in GK-ADP.

Such direct optimum seeking may result in a nearly nonsmooth actor network, just like the force output depicted in the second plot of Figure 6(a). To explain this phenomenon, we can find the clue in Figure 7, which illustrates the best actions w.r.t. all sample states according to the final actor network. It is reasonable that the force direction is always the same as the pole angle and contrary to the cart displacement. And clearly the transition interval from the negative limit $-3\,\mathrm{N}$ to the positive $3\,\mathrm{N}$ is very short. Thus the output of the actor network resulting from Gaussian kernels tends to control the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.

It is obvious that GK-ADP has high efficiency but also potential risks, such as the impact on actuators. Hence, how to design the exploration-exploitation balance to satisfy Theorem 2 and how to design actor network learning to balance efficiency and engineering applicability are two important issues for future work.

Finally, we check the samples' values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of immediate reward, the closer the states are to the equilibrium, the smaller the $Q$ value is.

If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, just as Figure 9 shows, the shortage of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.

To solve this problem and improve performance, besides optimizing learning parameters, Remark 4 implies that another and maybe more efficient way is refining the sample set based on the samples' values. We will investigate it in the following subsection.

4.3. Online Adjustment of Sample Set. Due to learning the samples' values, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the following expected utility [26]:

$$
U(x) = \rho\, h\!\left(E\left[Q(x) \mid X_L\right]\right) + \frac{\beta}{2}\log\left(\operatorname{var}\left[Q(x) \mid X_L\right]\right),
\tag{40}
$$
where $h(\cdot)$ is a normalization function over the sample set and $\rho = 1$, $\beta = 0.3$ are coefficients.
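A minimal sketch of (40) is given below. The posterior mean and variance are assumed to be supplied by the GP critic ((6) and (7), not reproduced here), and the normalization $h(\cdot)$ is assumed to be min-max scaling over the evaluated points, which the paper leaves unspecified.

```python
import numpy as np

def expected_utility(mean_q, var_q, rho=1.0, beta=0.3):
    """Expected utility (40) for arrays of posterior means and variances."""
    # h(.) is taken here to be min-max normalization over the evaluated points;
    # the paper only calls it "a normalization function over the sample set".
    h = (mean_q - mean_q.min()) / (mean_q.max() - mean_q.min() + 1e-12)
    return rho * h + 0.5 * beta * np.log(var_q)

# mean_q, var_q would come from the GP critic's posterior, i.e., (6) and (7).
```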

Let $N_{\mathrm{add}}$ represent the number of samples added to the sample set, $N_{\mathrm{limit}}$ the size limit of the sample set, and $T(x) = \{x(0), x(1), \ldots, x(T_e)\}$ the state-action pair trajectory during GK-ADP learning, where $T_e$ denotes the end iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.

Since in simulation 2 we have carried out 100 runs of the experiment for GK-ADP, 100 sets of sample values w.r.t. the same sample states are obtained. Now let us apply Algorithm 2 to these resulting sample sets, where $T(x)$ is picked up from the history trajectory of state-action pairs in each run and $N_{\mathrm{add}} = 10$, $N_{\mathrm{limit}} = 60$, in order to add 10 samples into the sample set.

We repeat all 100 runs of GK-ADP again with the same learning rates $\eta^1$ and $\eta^2$. Compared with the 96 successful runs in simulation 2, 94 runs are successful in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution process of sample values over 100 runs, where the first 10,000 iterations display the learning process in simulation 2 and the second 8,000 iterations display the learning process after adding samples. Clearly, the cumulative $Q$ value shows a sudden increase after the samples are added and almost keeps steady afterwards. It implies that, even with more samples added, there is not too much left to learn for the sample values.

However, if we check the cart movement, we will find the enhancement brought by the change of the sample set. For the $j$th run, we have two actor networks resulting from GK-ADP, before and after adding samples. Using these actors, we carry out the control tests, respectively.


Figure 6: The outputs of the actor networks resulting from GK-ADP and KHDP. (a) The resulting control performance using GK-ADP; (b) the resulting control performance using KHDP (pole angle (rad) and speed (rad/s), and force (N), versus time (s)).

Figure 7: The final best policies w.r.t. all samples (best action versus pole angle (rad) and linear displacement (m)).

Figure 8: The $Q$ values projected onto the dimensions of pole angle ($s_1$) and linear displacement ($s_3$) w.r.t. all samples.

Figure 9: The final cart displacement (m) and linear speed (m/s) in the tests of the successful policies, versus run.

We collect the absolute values of cart displacement, denoted by $S_3^b(j)$ and $S_3^a(j)$, at the moment that the pole has been swung up for 100 instants. We define the enhancement as
$$
E(j) = \frac{S_3^b(j) - S_3^a(j)}{S_3^b(j)}.
\tag{41}
$$

Obviously, a positive enhancement indicates that the actor network after adding samples behaves better.


for $l = 1$ to the end of $T(x)$
    Calculate $E[Q(x(l)) \mid X_L]$ and $\operatorname{var}[Q(x(l)) \mid X_L]$ using (6) and (7)
    Calculate $U(x(l))$ using (40)
end for
Sort all $U(x(l))$, $l = 1, \ldots, T_e$, in ascending order
Add the first $N_{\mathrm{add}}$ state-action pairs to the sample set to get the extended set $X_{L+N_{\mathrm{add}}}$
if $L + N_{\mathrm{add}} > N_{\mathrm{limit}}$
    Calculate the hyperparameters based on $X_{L+N_{\mathrm{add}}}$
    for $l = 1$ to $L + N_{\mathrm{add}}$
        Calculate $E[Q(x(l)) \mid X_{L+N_{\mathrm{add}}}]$ and $\operatorname{var}[Q(x(l)) \mid X_{L+N_{\mathrm{add}}}]$ using (6) and (7)
        Calculate $U(x(l))$ using (40)
    end for
    Sort all $U(x(l))$, $l = 1, \ldots, L + N_{\mathrm{add}}$, in descending order
    Delete the first $L + N_{\mathrm{add}} - N_{\mathrm{limit}}$ samples to get the refined set $X_{N_{\mathrm{limit}}}$
end if

Algorithm 2: Refinement of sample set.
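The control flow of Algorithm 2 can be sketched as follows. The helpers `utility` (standing in for (40) evaluated with the GP posterior conditioned on a given reference set) and `update_hyperparameters` (the hyperparameter phase of two-phase iteration) are hypothetical; this is an illustration of the procedure rather than the authors' implementation.

```python
import numpy as np

def refine_sample_set(X_L, trajectory, N_add, N_limit, utility, update_hyperparameters):
    # Score every visited state-action pair against the current sample set and
    # add the N_add pairs with the lowest expected utility (ascending sort).
    scores = np.array([utility(X_L, x) for x in trajectory])
    candidates = [trajectory[i] for i in np.argsort(scores)[:N_add]]
    X_ext = list(X_L) + candidates                     # extended set X_{L+N_add}

    if len(X_ext) > N_limit:
        update_hyperparameters(X_ext)                  # re-fit theta on the extended set
        # Re-score the extended set and drop its highest-utility members,
        # i.e., keep the N_limit samples with the lowest scores.
        ext_scores = np.array([utility(X_ext, x) for x in X_ext])
        X_ext = [X_ext[i] for i in np.argsort(ext_scores)[:N_limit]]
    return X_ext
```

With the settings of this subsection, the call would use $N_{\mathrm{add}} = 10$ and $N_{\mathrm{limit}} = 60$.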

Figure 10: Learning performance of GK-ADP if the sample set is refined by adding 10 samples. (a) The average evolution process of $Q$ values before and after adding samples (the sum of $Q$ values of all sample state-actions versus iteration); (b) the enhancement of the control performance regarding cart displacement after adding samples (enhancement versus run).

When all enhancements w.r.t. the 100 runs are illustrated together, just as Figure 10(b) shows, almost all cart displacements as the pole is swung up are enhanced more or less, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.

Finally, let us check which state-action pairs are added into the sample set. We put all added samples over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. If we only consider the relationship between $Q$ values and pole angle, just as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the boundary $-0.4$, due to the ratio between $\rho$ and $\beta$.

5. Conclusions and Future Work

ADP methods are among the most promising research directions on RL in continuous environments to construct learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, two-phase iteration is good at making the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations are carried out to verify the necessity of two-phase learning and illustrate the properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to refine the sample set online in order to enhance the performance of the critic-actor architecture during operation.


Figure 11: The $Q$ values w.r.t. all added samples over 100 runs. (a) Projection on the dimensions of pole angle and cart displacement; (b) projection on the dimension of pole angle.


However, there are some issues that need to be addressed in the future. The first is how to guarantee condition (36) during learning, which is now determined empirically. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead two-phase iteration to failure, not only to slow convergence.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.

References

[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.

[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.

[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.

[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.

[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.

[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.

[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.

[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.

[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.

[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.

[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.

[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.

[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.

[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.

[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.

[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.

[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.

[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.

[21] T. Hofmann, B. Scholkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.

[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.

[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.

[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.

[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Scholkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.

[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.

[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.

[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.

[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.

[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.



Mathematical Problems in Engineering 9

S1

S3

Figure 3 One-dimensional inverted pendulum

42 Value Iteration versus Policy Iteration As mentioned inAlgorithm 1 the value iteration in critic network learningdoes not depend on the convergence of the actor networkand compared with HDP or DHP the direct optimumseeking for actor network seems a little aimless To test itsperformance and discuss its special characters of learning acomparison between GK-ADP and KHDP is carried out

The control objective in the simulation is a one-dimensional inverted pendulum where a single-stageinverted pendulum is mounted on a cart which is able tomove linearly just as Figure 3 shows

The mathematic model is given as

1199041 = 1199042 + 120576

1199042 =119892 sin (1199041) minus 119872

119901119889119904

22 cos (1199041) sin (1199041) 119872

119889 (43 minus 119872119901cos2 (1199041) 119872)

+cos (1199041) 119872

119889 (43 minus 119872119901cos2 (1199041) 119872)

119906

1199043 = 1199044

1199044 =119906 + 119872

119901119889 (119904

22 sin (1199041) minus 1199042 cos (1199041))

119872

(39)

where 1199041 to 1199044 represent the state of angle angle speed lineardisplacement and linear speed of the cart respectively 119906represents the force acting on the cart119872 = 02 kg 119889 = 05m120576 sim 119880(minus002 002) and other denotations are the same as(38)

Thus the state-action space is 5D space much larger thanthat in simulation 1The configurations of the both algorithmsare listed as follows

(i) A small sample set with 50 samples is adopted to buildcritic network which is determined by ALD-basedsparsification

(ii) For both algorithms the state-action pair is limited to[minus03 03] [minus1 1] [minus03 03] [minus2 2] [minus3 3] wrt 1199041to 1199044 and 119906

(iii) In GK-ADP the learning rates are 1205781119905= 002(1+119905

001)

and 1205782119905= 002(1 + 119905)

002

0 2000 4000 6000 8000minus02

002040608

112141618

Iteration

The a

vera

ge o

f cum

ulat

ed w

eigh

ts ov

er 1

00 ru

ns

minus1000100200300400500600700800900

10000

GK-ADP120590 = 09

120590 = 12

120590 = 15

120590 = 18

120590 = 21

120590 = 24

Figure 4 The comparison of the average evolution processes over100 runs

(iv) The KHDP algorithm in [15] is chosen as the compar-ison where the critic network uses Gaussian kernelsTo get the proper Gaussian parameters a series of 120590 as09 12 15 18 21 and 24 are tested in the simulationThe elements of weights 120572 are initialized randomly in[minus05 05] the forgetting factor in RLS-TD(0) is set to120583 = 1 and the learning rate in KHDP is set to 03

(v) All experiments run 100 times to get the average per-formance And in each run there are 10000 iterationsto learn critic network

Before further discussion it should be noted that it makesno sense to figure out ourselves with which algorithm isbetter in learning performance because besides the debateof policy iteration versus value iteration there are too manyparameters in the simulation configuration affecting learningperformance So the aim of this simulation is to illustrate thelearning characters of GK-ADP

However to make the comparison as fair as possible weregulate the learning rates of both algorithms to get similarevolution processes Figure 4 shows the evolution processesof GK-ADP and KHDPs under different 120590 where the 119910-axisin the left represents the cumulatedweightssum

119894120572119894 of the critic

network in kernel ACD and the other 119910-axis represents thecumulated values of the samples sum

119894119876(119909119894) 119909119894isin 119883119871 in GK-

ADP It implies although with different units the learningprocesses under both learning algorithms converge nearly atthe same speed

Then the success rates of all algorithms are depictedin Figure 5 The left six bars represent the success rates ofKHDP with different 120590 With superiority argument left asidewe find that such fixed 120590 in KHDP is very similar to thefixed hyperparameters 120579 which needs to be set properlyin order to get higher success rate But unfortunately thereis no mechanism in KHDP to regulate 120590 online On the

10 Mathematical Problems in Engineering

09 12 15 18 21 240

01

02

03

04

05

06

07

08

09

1

Parameter 120590 in ALD-based sparsification

Succ

ess r

ate o

ver 1

00 ru

ns

67 6876 75

58 57

96

GK-ADP

Figure 5 The success rates wrt all algorithms over 100 runs

contrary the two-phase iteration introduces the update ofhyperparameters into critic network which is able to drivehyperparameters to better values even with the not so goodinitial values In fact this two-phase update can be viewed asa kind of kernel ACD with dynamic 120590 when Gaussian kernelis used

To discuss the performance in deep we plot the testtrajectories resulting from the actors which are optimized byGK-ADP and KHDP in Figures 6(a) and 6(b) respectivelywhere the start state is set to [02 0 minus02 0]

Apparently the transition time of GK-ADP is muchsmaller than KHDP We think besides the possible wellregulated parameters an important reason is nongradientlearning for actor network

The critic learning only depends on exploration-exploita-tion balance but not the convergence of actor learning Ifexploration-exploitation balance is designed without actornetwork output the learning processes of actor and criticnetworks are relatively independent of each other and thenthere are alternatives to gradient learning for actor networkoptimization for example the direct optimum seeking Gaus-sian regression actor network in GK-ADP

Such direct optimum seeking may result in nearly non-smooth actor network just like the force output depicted inthe second plot of Figure 6(a) To explain this phenomenonwe can find the clue from Figure 7 which illustrates the bestactions wrt all sample states according to the final actornetwork It is reasonable that the force direction is always thesame to the pole angle and contrary to the cart displacementAnd clearly the transition interval from negative limit minus3119873 topositive 3119873 is very short Thus the output of actor networkresulting from Gaussian kernels intends to control anglebias as quickly as possible even though such quick controlsometimes makes the output force a little nonsmooth

It is obvious that GK-ADP is with high efficiency butalso with potential risks such as the impact to actuatorsHence how to design exploration-exploitation balance tosatisfy Theorem 2 and how to design actor network learning

to balance efficiency and engineering applicability are twoimportant issues in future work

Finally we check the samples values which are depictedin Figure 8 where only dimensions of pole angle and lineardisplacement are depicted Due to the definition of immedi-ate reward the closer the states to the equilibrium are thesmaller the 119876 value is

If we execute 96 successful policies one by one and recordall final cart displacements and linear speed just as Figure 9shows the shortage of this sample set is uncovered that evenwith good convergence of pole angle the optimal policy canhardly drive cart back to zero point

To solve this problem and improve performance besidesoptimizing learning parameters Remark 4 implies thatanother and maybe more efficient way is refining sample setbased on the samplesrsquo values We will investigate it in thefollowing subsection

43 Online Adjustment of Sample Set Due to learning sam-plersquos value it is possible to assess whether the samples arechosen reasonable In this simulation we adopt the followingexpected utility [26]

119880 (119909) = 120588ℎ (119864 [119876 (119909) | 119883119871])

+120573

2log (var [119876 (119909) | 119883119871])

(40)

where ℎ(sdot) is a normalization function over sample set and120588 = 1 120573 = 03 are coefficients

Let 119873add represent the number of samples added to thesample set 119873limit the size limit of the sample set and 119879(119909) =

119909(0) 119909(1) 119909(119879119890) the state-action pair trajectory during

GK-ADP learning where 119879119890denotes the end iteration of

learning The algorithm to refine sample set is presented asAlgorithm 2

Since in simulation 2 we have carried out 100 runs ofexperiment for GK-ADP 100 sets of sample values wrtthe same sample states are obtained Now let us applyAlgorithm 2 to these resulted sample sets where 119879(119909) ispicked up from the history trajectory of state-action pairs ineach run and 119873add = 10 119873limit = 60 in order to add 10samples into sample set

We repeat all 100 runs of GK-ADP again with the samelearning rates 120578

1 and 1205782 Based on 96 successful runs in

simulation 2 94 runs are successful in obtaining controlpolicies swinging up pole Figure 10(a) shows the averageevolutional process of sample values over 100 runs wherethe first 10000 iterations display the learning process insimulation 2 and the second 8000 iterations display thelearning process after adding samples Clearly the cumulative119876 value behaves a sudden increase after sample was addedand almost keeps steady afterwards It implies even withmore samples added there is not toomuch to learn for samplevalues

However if we check the cart movement we will findthe enhancement brought by the change of sample set Forthe 119895th run we have two actor networks resulting from GK-ADP before and after adding samples Using these actors Wecarry out the control test respectively and collect the absolute

Mathematical Problems in Engineering 11

0 1 2 3 4 5minus15

minus1

minus05

0

05

Time (s)

0 1 2 3 4 5Time (s)

minus2

0

2

4

Forc

e (N

)

Angle displacementAngle speed

Pole

angl

e (ra

d) an

d sp

eed

(rad

s)

(a) The resulted control performance using GK-ADP

0 1 2 3 4 5Time (s)

0

0

1 2 3 4 5Time (s)

Angle displacementAngle speed

minus1

0

1

2

3

Forc

e (N

)

minus1

minus05

05

Pole

angl

e (ra

d) an

d sp

eed

(rad

s)

(b) The resulted control performance using KHDP

Figure 6 The outputs of the actor networks resulting from GK-ADP and KHDP

minus04 minus02 0 0204

minus04minus02

002

04minus4

minus2

0

2

4

Pole angle (rad)

Linear displacement (m)

Best

actio

n

Figure 7 The final best policies wrt all samples

minus04 minus020 02

04

minus04minus02

002

04minus10

0

10

20

30

40

Pole angle (rad)Linear displacement (m)

Q v

alue

Figure 8The119876 values projected on to dimensions of pole angle (1199041)and linear displacement (1199043) wrt all samples

0 20 40 60 80 100

0

05

1

15

2

Run

Cart

disp

lace

men

t (m

) and

lini

ear s

peed

(ms

)

Cart displacementLinear speed

Figure 9 The final cart displacement and linear speed in the test ofsuccessful policy

values of cart displacement denoted by 1198781198873(119895) and 119878119886

3(119895) at themoment that the pole has been swung up for 100 instants Wedefine the enhancement as

119864 (119895) =119878119887

3 (119895) minus 119878119886

3 (119895)

1198781198873 (119895) (41)

Obviously the positive enhancement indicates that theactor network after adding samples behaves better As allenhancements wrt 100 runs are illustrated together just

12 Mathematical Problems in Engineering

119891119900119903 l = 1 119905119900 119905ℎ119890 119890119899119889 119900119891 119879(119909)

Calculate 119864[119876(119909(119897)) | 119883119871] and var[119876(119909119897) | 119883

119871] using (6) and (7)

Calculate 119880(119909(119897)) using (40)119890119899119889 119891119900119903 119897

Sort all 119880(119909(119897)) 119897 = 1 119879119890in ascending order

Add the first119873add state-action pairs to sample set to get extended set119883119871+119873add

119894119891 L +119873add gt 119873limit

Calculate hyperparameters based on 119883119871+119873add

119891119900119903 l = 1 119905119900 119871 +119873add

Calculate 119864[119876(119909(119897)) | 119883119871+119873add

] and var[119876(119909119897) | 119883119871+119873add

] using (6) and (7)Calculate 119880(119909(119897)) using (40)

119890119899119889 119891119900119903 119897

Sort all 119880(119909(119897)) 119897 = 1 119871 + 119873add in descending orderDelete the first 119871 +119873add minus 119873limit samples to get refined set 119883

119873limit

119890119899119889 119894119891

Algorithm 2 Refinement of sample set

0 2000 4000 6000 8000 10000 12000 14000 16000 18000100

200

300

400

500

600

700

800

Iteration

The s

um o

f Q v

alue

s of a

ll sa

mpl

e sta

te-a

ctio

ns

(a) The average evolution process of 119876 values before and after addingsamples

0 20 40 60 80 100minus15

minus1

minus05

0

05

1

Run

Enha

ncem

ent (

)

(b) The enhancement of the control performance about cart displacementafter adding samples

Figure 10 Learning performance of GK-ADP if sample set is refined by adding 10 samples

as Figure 10(b) shows almost all cart displacements as thepole is swung up are enhanced more or less except for 8times of failure Hence with proper principle based on thesample values the sample set can be refined online in orderto enhance performance

Finally let us check which state-action pairs are addedinto sample set We put all added samples over 100 runstogether and depict their values in Figure 11(a) where thevalues are projected onto the dimensions of pole angleand cart displacement If only concerning the relationshipbetween 119876 values and pole angle just as Figure 11(b) showswe find that the refinement principle intends to select thesamples a little away from the equilibrium and near theboundary minus04 due to the ratio between 120588 and 120573

5. Conclusions and Future Work

ADP methods are among the most promising research directions for RL in continuous environments to construct learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, two-phase iteration makes the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations are carried out to verify the necessity of two-phase learning and to illustrate the properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to


[Figure 11: The Q values w.r.t. all added samples over 100 runs. (a) Projected onto the dimensions of pole angle (rad) and cart displacement (m). (b) Projected onto the dimension of pole angle (rad).]

refine the sample set online in order to enhance the performance of the critic-actor architecture during operation.

However, some issues need to be addressed in the future. The first is how to guarantee condition (36) during learning, which is currently handled in an empirical way. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead two-phase iteration not only to slow convergence but to outright failure.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.

References

[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.
[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32-50, 2009.
[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76-105, 2012.
[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490-1503, 2009.
[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46-56, 2014.
[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1-31, 2014.
[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943-949, 2008.
[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982-987, 2008.
[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39-47, 2009.
[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825-1832, 2012.
[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237-246, 2009.
[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331-342, 2013.
[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135-142, Honolulu, Hawaii, USA, April 2007.
[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010-1015, IEEE, Istanbul, Turkey, October 2010.
[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762-775, 2013.
[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.
[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275-2285, 2004.
[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491-1498, MIT Press, Cambridge, Mass, USA, 2002.
[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214-219, 2009.
[21] T. Hofmann, B. Scholkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171-1220, 2008.
[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85-91, 2014.
[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154-161, August 2003.
[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201-208, ACM, August 2005.
[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Scholkopf, Eds., pp. 751-759, MIT Press, Cambridge, Mass, USA, 2004.
[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7-9, pp. 1508-1524, 2009.
[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146-156, 2014.
[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.
[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54-63, 2005.
[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.






Mathematical Problems in Engineering 7

Obviously if the following inequality is satisfied

1205782119905

100381610038161003816100381610038161003816[119896120579119905

(119909lowast 119883119871) 120593

2119905]119879

1198641199051205823 (119909lowast 120579119905 120579lowast)100381610038161003816100381610038161003816

lt 1205781119905

100381610038161003816100381610038161205931119905

10038161003816100381610038161003816

2+ 120578

2119905(1minus120573)

10038161003816100381610038161003816119896120579119905

(119909lowast 119883119871) 120593

2119905

10038161003816100381610038161003816

2(36)

there exists a positive const 120575 such that the first two items onthe right of (24) satisfy

1205781119905119864119905[1198811015840

119905(120593119905) 119884

1119905] + 120578

2119905119864119905[1198811015840

119905+1 (120593119905+1) 1198842119905+1] le minus 120575 (37)

According to Theorem 542 in [30] the iterative process isconvergent namely 120593

119905

1199081199011997888997888997888997888rarr 0

Remark 3 Let us check the final convergence positionObviously 120593

119905

1199081199011997888997888997888997888rarr 0 [ 120572119905

120579119905

]1199081199011997888997888997888997888rarr [

120572lowast

119891(120572lowast)] This means the

equilibriumof the critic networkmeets119864119905[119875119905119870120579lowast(119909 119883

119871)120572lowast] =

119870120579lowast(119909 119883

119871)120572lowast where 120579

lowast= 119891(120572

lowast) And the equilibrium of

hyperparameters 120579lowast = 119891(120572lowast) is the solution of evidencemax-

imization that max120593(119871(120593)) = max

120593((12)119910

119879119870minus1120579lowast (119883119871 119883119871)119910 +

(12)log|119862| + (1198992)log2120587) where 119910 = 119870minus1120579lowast (119883119871 119883119871)120572

lowast

Remark 4 It is clear that the selection of the samples is oneof the key issues Since now all samples have values accordingto two-phase iteration according to the information-basedcriterion it is convenient to evaluate samples and refine theset by arranging relative more samples near the equilibriumorwith great gradient of the value function so that the sampleset is better to describe the distribution of value function

Remark 5 Since the two-phase iteration belongs to valueiteration the initial policy of the algorithm does not need tobe stable To ensure BIBO in practice we need a mechanismto clamp the output of system even though the system willnot be smooth any longer

Theorem 2 gives the iteration principle for critic networklearning Based on the critic network with proper objectivedefined such as minimizing the expected total discountedreward [15] HDP or DHP update is applied to optimizeactor network Since this paper focuses on ACD and moreimportantly the update process of actor network does notaffect the convergence of critic network though maybe itinduces premature or local optimization the gradient updateof actor is not necessary Hence a simple optimum seeking isapplied to get optimal actions wrt sample states and then anactor network based onGaussian kernel is generated based onthese optimal state-action pairs

Up to now we have built a critic-actor architecture whichis namedGaussian-kernel-based Approximate Dynamic Pro-gramming (GK-ADP for short) and shown in Algorithm 1 Aspan 119870 is introduced so that hyperparameters are updatedevery 119870 times of 120572 update If 119870 gt 1 two phases areasynchronous From the view of stochastic approximationthis asynchronous learning does not change the convergencecondition (36) but benefit computational cost of hyperparam-eters learning

4 Simulation and Discussion

In this section we propose some numerical simulationsabout continuous control to illustrate the special propertiesand the feasibility of the algorithm including the necessityof two phases learning the specifical properties comparingwith traditional kernel-based ACDs and the performanceenhancement resulting from online refinement of sample set

Before further discussion we firstly give common setupin all simulations

(i) The span 119870 = 50(ii) To make output bounded once a state is out of the

boundary the system is reset randomly(iii) The exploration-exploitation tradeoff is left out of

account here During learning process all actions areselected randomly within limited action space Thusthe behavior of the system is totally unordered duringlearning

(iv) The same ALD-based sparsification in [15] is used todetermine the initial sample set in which 119908

119894= 18

V0 = 0 and V1 = 05 empirically(v) The sampling time and control interval are set to

002 s

41 The Necessity of Two-Phase Learning The proof ofTheorem 2 shows that properly determined learning rates ofphases 1 and 2 guarantee condition (36) Here we propose asimulation to show how the learning rates in phases 1 and 2affect the performance of the learning

Consider a simple single homogeneous inverted pendu-lum system

1199041 = 1199042

1199042 =(119872119901119892 sin (1199041) minus 120591cos (1199041)) 119889

119872119901119889

(38)

where 1199041 and 1199042 represent the angle and its speed 119872119901

=

01 kg 119892 = 98ms2 119889 = 02m respectively and 120591 is ahorizontal force acting on the pole

We test the success rate under different learning ratesSince during the learning the action is always selectedrandomly we have to verify the optimal policy after learningthat is an independent policy test is carried out to test theactor network Thus the success of one time of learningmeans in the independent test the pole can be swung upand maintained within [minus002 002] rad for more than 200iterations

In the simulation 60 state-action pairs are collected toserve as the sample set and the learning rates are set to 120578

2119905=

001(1 + 119905)0018 1205781

119905= 119886(1 + 119905

0018) where 119886 is varying from 0

to 012The learning is repeated 50 times in order to get the

average performance where the initial states of each runare randomly selected within the bound [minus04 04] and[minus05 05] wrt the dimensions of 1199041 and 1199042 The action space


Initialize:
    θ: hyperparameters of the Gaussian kernel model
    X_L = {x_i | x_i = (s_i, a_i)}_{i=1}^{L}: sample set
    π(s_0): initial policy
    η^1_t, η^2_t: learning step sizes
Let t = 0
Loop
    for k = 1 to K do
        t = t + 1
        a_t = π(s_t)
        Get the reward r_t
        Observe next state s_{t+1}
        Update α according to (12)
        Update the policy π according to optimum seeking
    end for
    Update θ according to (12)
Until the termination criterion is satisfied

Algorithm 1: Gaussian-kernel-based Approximate Dynamic Programming.
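The control flow of Algorithm 1 can be summarized by the Python-style sketch below. It is only a structural outline: the concrete update rules of (12) are not reproduced here, so update_alpha, update_theta, build_actor, explore, and the environment interface env are assumed callables standing in for quantities defined earlier in the paper.

    def gk_adp(env, samples, alpha, theta, update_alpha, update_theta,
               build_actor, explore, K, eta1, eta2, max_iter):
        # Two-phase loop: the sample values (alpha) are updated at every control
        # step, while the kernel hyperparameters (theta) are updated only once per
        # span K, so the two phases run asynchronously when K > 1.
        t = 0
        s = env.reset()
        actor = build_actor(samples, alpha, theta)
        while t < max_iter:
            for _ in range(K):
                t += 1
                a = explore(s)                      # random action during learning (setup (iii))
                r, s_next = env.step(a)             # observe reward and next state
                alpha = update_alpha(alpha, theta, samples, (s, a, r, s_next), eta2(t))  # cf. (12)
                actor = build_actor(samples, alpha, theta)   # optimum-seeking actor
                s = s_next
            theta = update_theta(theta, alpha, samples, eta1(t))                         # cf. (12)
        return alpha, theta, actor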

Figure 1: Success rates over 50 runs under different learning rates η^1_t (x-axis: parameter a in learning rate η^1, taking values 0, 0.01, 0.02, 0.05, 0.08, 0.1, and 0.12; y-axis: success rate over 50 runs; the corresponding success counts are 6, 17, 27, 42, 50, 47, and 46).

The action space is bounded within [−0.5, 0.5], and in each run the initial hyperparameters are set to [e^{−2.5}, e^{0.5}, e^{0.9}, e^{−1}, e^{0.001}].

Figure 1 shows the resulting success rates. Clearly, with η^2_t fixed, different η^1_t's affect the performance significantly. In particular, without the learning of hyperparameters, that is, η^1_t = 0, there are only 6 successful runs out of 50. As a increases, the success rate rises until it reaches 1 at a = 0.08; but as a goes on increasing, the performance becomes worse.

Hence, both phases are necessary if the hyperparameters are not initialized properly. It should be noted that the learning rates of the two phases need to be regulated carefully in order to guarantee condition (36), which leads the learning process to the equilibrium in Remark 3 and not to some kind of boundary where the actor always executes the maximal or minimal actions.
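As an illustration, the two decaying step-size schedules used in this simulation can be written as small functions of the iteration counter; the exponent forms below follow the reconstruction given above and should be read as assumptions rather than a definitive restatement.

    def eta1(t, a=0.08):
        # hyperparameter-phase step size; a is the value swept from 0 to 0.12 in Figure 1
        return a / (1.0 + t ** 0.018)

    def eta2(t):
        # value-phase step size
        return 0.01 / (1.0 + t) ** 0.018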

Figure 2: The average evolution processes of the sum of Q values of all sample state-actions over 50 runs with respect to different η^1_t (curves: no learning and a = 0.01, 0.02, 0.05, 0.08, 0.1, and 0.12).

If all samples' values at each iteration are summed up and the cumulative values are plotted against iterations, then we obtain Figure 2. The evolution process with respect to the best parameter a = 0.08 is marked by the darker diamonds. It is clear that, even with different η^1_t, the evolution processes of the samples' value learning are similar. That means, due to the BIBO property, the samples' values must eventually be bounded and convergent. However, as mentioned above, this does not mean that the learning process converges to the proper equilibrium.


Figure 3: One-dimensional inverted pendulum (the labels s_1 and s_3 mark the pole angle and the cart displacement).

4.2. Value Iteration versus Policy Iteration. As mentioned in Algorithm 1, the value iteration in critic network learning does not depend on the convergence of the actor network, and, compared with HDP or DHP, the direct optimum seeking for the actor network seems a little aimless. To test its performance and discuss its special learning characteristics, a comparison between GK-ADP and KHDP is carried out.

The control object in the simulation is a one-dimensional inverted pendulum, where a single-stage inverted pendulum is mounted on a cart that is able to move linearly, just as Figure 3 shows.

The mathematical model is given as

    \dot{s}_1 = s_2 + \varepsilon,
    \dot{s}_2 = \frac{g \sin(s_1) - M_p d\, s_2^2 \cos(s_1) \sin(s_1)/M}{d\left(4/3 - M_p \cos^2(s_1)/M\right)}
              + \frac{\cos(s_1)/M}{d\left(4/3 - M_p \cos^2(s_1)/M\right)}\, u,
    \dot{s}_3 = s_4,
    \dot{s}_4 = \frac{u + M_p d\left(s_2^2 \sin(s_1) - \dot{s}_2 \cos(s_1)\right)}{M},    (39)

where s_1 to s_4 represent the pole angle, angular speed, linear displacement, and linear speed of the cart, respectively, u represents the force acting on the cart, M = 0.2 kg, d = 0.5 m, ε ∼ U(−0.02, 0.02), and the other denotations are the same as in (38).
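A sketch of one integration step of (39) is given below, assuming Euler integration with the 0.02 s sampling time from the common setup and the parameters stated above; the placement of the uniform disturbance on the angle rate follows the reconstructed first equation of (39), and the function name is illustrative.

    import math, random

    M, M_P, D, G, DT = 0.2, 0.1, 0.5, 9.8, 0.02

    def cartpole_step(s, u):
        # s = (s1, s2, s3, s4): pole angle, angular speed, cart position, cart speed
        s1, s2, s3, s4 = s
        eps = random.uniform(-0.02, 0.02)                     # disturbance on the angle rate
        denom = D * (4.0 / 3.0 - M_P * math.cos(s1) ** 2 / M)
        s2_dot = (G * math.sin(s1)
                  - M_P * D * s2 ** 2 * math.cos(s1) * math.sin(s1) / M) / denom \
                 + (math.cos(s1) / M) / denom * u
        s4_dot = (u + M_P * D * (s2 ** 2 * math.sin(s1) - s2_dot * math.cos(s1))) / M
        return (s1 + DT * (s2 + eps),
                s2 + DT * s2_dot,
                s3 + DT * s4,
                s4 + DT * s4_dot)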

Thus, the state-action space is a 5D space, much larger than that in simulation 1. The configurations of both algorithms are listed as follows.

(i) A small sample set with 50 samples, determined by ALD-based sparsification, is adopted to build the critic network.

(ii) For both algorithms the state-action pair is limited to [−0.3, 0.3], [−1, 1], [−0.3, 0.3], [−2, 2], and [−3, 3] with respect to s_1 to s_4 and u.

(iii) In GK-ADP the learning rates are η^1_t = 0.02/(1 + t^{0.01}) and η^2_t = 0.02/(1 + t)^{0.02}.

Figure 4: The comparison of the average evolution processes over 100 runs (left y-axis: the average of cumulated weights over 100 runs for the KHDP variants with σ = 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4; right y-axis: the cumulated sample values for GK-ADP; x-axis: iteration, 0 to 10000).

(iv) The KHDP algorithm in [15] is chosen as the comparison, where the critic network uses Gaussian kernels. To get proper Gaussian parameters, a series of σ values, 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4, are tested in the simulation. The elements of the weights α are initialized randomly in [−0.5, 0.5], the forgetting factor in RLS-TD(0) is set to μ = 1, and the learning rate in KHDP is set to 0.3.

(v) All experiments are run 100 times to get the average performance, and in each run there are 10000 iterations to learn the critic network.

Before further discussion, it should be noted that it makes no sense to argue over which algorithm has better learning performance, because, besides the debate of policy iteration versus value iteration, there are too many parameters in the simulation configuration affecting learning performance. So the aim of this simulation is to illustrate the learning characteristics of GK-ADP.

However, to make the comparison as fair as possible, we regulate the learning rates of both algorithms to get similar evolution processes. Figure 4 shows the evolution processes of GK-ADP and KHDPs under different σ, where the left y-axis represents the cumulated weights ∑_i α_i of the critic network in kernel ACD and the other y-axis represents the cumulated values of the samples, ∑_i Q(x_i), x_i ∈ X_L, in GK-ADP. It implies that, although with different units, the learning processes under both learning algorithms converge at nearly the same speed.

Then the success rates of all algorithms are depicted in Figure 5. The left six bars represent the success rates of KHDP with different σ. With the superiority argument left aside, we find that such a fixed σ in KHDP is very similar to fixed hyperparameters θ, which need to be set properly in order to get a higher success rate. But, unfortunately, there is no mechanism in KHDP to regulate σ online.


Figure 5: The success rates with respect to all algorithms over 100 runs (x-axis: parameter σ in ALD-based sparsification for KHDP; the success rates are 67, 68, 76, 75, 58, and 57 for σ = 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4, respectively, and 96 for GK-ADP).

On the contrary, the two-phase iteration introduces the update of hyperparameters into the critic network, which is able to drive the hyperparameters to better values even with not-so-good initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with dynamic σ when a Gaussian kernel is used.

To discuss the performance in depth, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures 6(a) and 6(b), respectively, where the start state is set to [0.2, 0, −0.2, 0].

Apparently, the transition time of GK-ADP is much smaller than that of KHDP. We think that, besides the possibly well-regulated parameters, an important reason is the nongradient learning of the actor network.

The critic learning only depends on the exploration-exploitation balance, not on the convergence of actor learning. If the exploration-exploitation balance is designed without the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and then there are alternatives to gradient learning for actor network optimization, for example, the direct optimum-seeking Gaussian regression actor network in GK-ADP.
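To make the idea of a direct optimum-seeking actor concrete, one possible sketch is given below under stated assumptions: the critic's posterior mean is available as a callable q_mean(s, a) (standing in for the estimate in (6)), the best action per sample state is found by a simple grid search over the bounded action space [−3, 3] from setup (ii), and the resulting state-action pairs are combined by Gaussian-kernel regression with a single illustrative width sigma. None of these names or choices come from the paper itself.

    import numpy as np

    def best_actions(q_mean, sample_states, a_low=-3.0, a_high=3.0, n_grid=61):
        # Grid-search the action minimizing the critic's Q estimate at each sample
        # state (Q is treated as a cost here: smaller is better near the equilibrium).
        grid = np.linspace(a_low, a_high, n_grid)
        return np.array([grid[np.argmin([q_mean(s, a) for a in grid])]
                         for s in sample_states])

    def gaussian_actor(sample_states, actions, sigma=1.0):
        # Kernel-regression actor: a Gaussian-weighted average of the samples' best actions.
        S = np.asarray(sample_states, dtype=float)
        def actor(s):
            w = np.exp(-np.sum((S - np.asarray(s, dtype=float)) ** 2, axis=1)
                       / (2.0 * sigma ** 2))
            return float(np.dot(w, actions) / (np.sum(w) + 1e-12))
        return actor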

Such direct optimum seeking may result in a nearly nonsmooth actor network, just like the force output depicted in the second plot of Figure 6(a). To explain this phenomenon, we can find the clue in Figure 7, which illustrates the best actions with respect to all sample states according to the final actor network. It is reasonable that the force direction is always the same as the pole angle and contrary to the cart displacement. And, clearly, the transition interval from the negative limit −3 N to the positive 3 N is very short. Thus, the output of the actor network resulting from Gaussian kernels intends to correct the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.

It is obvious that GK-ADP has high efficiency but also potential risks, such as the impact on actuators. Hence, how to design the exploration-exploitation balance to satisfy Theorem 2 and how to design actor network learning to balance efficiency and engineering applicability are two important issues for future work.

Finally, we check the samples' values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of the immediate reward, the closer the states are to the equilibrium, the smaller the Q value is.

If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, just as Figure 9 shows, the shortage of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.

To solve this problem and improve performance, besides optimizing learning parameters, Remark 4 implies that another and maybe more efficient way is to refine the sample set based on the samples' values. We will investigate this in the following subsection.

4.3. Online Adjustment of Sample Set. Since the samples' values are learned, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the following expected utility [26]:

    U(x) = \rho\, h\bigl(E[Q(x) \mid X_L]\bigr) + \frac{\beta}{2} \log\bigl(\operatorname{var}[Q(x) \mid X_L]\bigr),    (40)

where h(·) is a normalization function over the sample set and ρ = 1, β = 0.3 are coefficients.
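A small sketch of (40) follows, assuming the GP posterior mean and variance of Q over the current sample set are already available as arrays (computed from (6) and (7), which are not reproduced here) and that h(·) is a min-max normalization; the paper only states that h is a normalization function, so this particular choice is illustrative.

    import numpy as np

    def expected_utility(q_mean, q_var, rho=1.0, beta=0.3):
        # q_mean, q_var: posterior mean and variance of Q at the candidate points,
        # conditioned on the current sample set X_L.
        h = (q_mean - q_mean.min()) / (q_mean.max() - q_mean.min() + 1e-12)  # assumed normalization
        return rho * h + 0.5 * beta * np.log(q_var + 1e-12)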

Let N_add represent the number of samples added to the sample set, N_limit the size limit of the sample set, and T(x) = {x(0), x(1), ..., x(T_e)} the state-action pair trajectory during GK-ADP learning, where T_e denotes the end iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.

Since in simulation 2 we carried out 100 runs of the GK-ADP experiment, 100 sets of sample values with respect to the same sample states are obtained. Now let us apply Algorithm 2 to these resulting sample sets, where T(x) is picked up from the history trajectory of state-action pairs in each run and N_add = 10, N_limit = 60, in order to add 10 samples into the sample set.

We repeat all 100 runs of GK-ADP again with the same learning rates η^1 and η^2. Compared with 96 successful runs in simulation 2, 94 runs now succeed in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution process of sample values over 100 runs, where the first 10000 iterations display the learning process in simulation 2 and the following 8000 iterations display the learning process after adding samples. Clearly, the cumulative Q value shows a sudden increase after the samples are added and almost keeps steady afterwards. This implies that, even with more samples added, there is not too much left to learn for the sample values.

However, if we check the cart movement, we will find the enhancement brought by the change of the sample set. For the jth run we have two actor networks resulting from GK-ADP, one before and one after adding samples.


Figure 6: The outputs of the actor networks resulting from GK-ADP and KHDP. (a) The resulting control performance using GK-ADP; (b) the resulting control performance using KHDP (each panel plots the pole angle (rad) and speed (rad/s) and the force (N) against time (s)).

Figure 7: The final best policies with respect to all samples (best action (N) plotted against pole angle (rad) and linear displacement (m)).

Figure 8: The Q values projected onto the dimensions of pole angle (s_1) and linear displacement (s_3) with respect to all samples.

Figure 9: The final cart displacement (m) and linear speed (m/s) in the tests of the successful policies, plotted against the run index.

Using these actors, we carry out the control tests, respectively, and collect the absolute values of the cart displacement, denoted by S_3^b(j) and S_3^a(j), at the moment that the pole has been swung up, for 100 instants. We define the enhancement as

    E(j) = \frac{S_3^b(j) - S_3^a(j)}{S_3^b(j)}.    (41)

Obviously, a positive enhancement indicates that the actor network after adding samples behaves better.


for l = 1 to the end of T(x)
    Calculate E[Q(x(l)) | X_L] and var[Q(x(l)) | X_L] using (6) and (7)
    Calculate U(x(l)) using (40)
end for
Sort all U(x(l)), l = 1, ..., T_e, in ascending order
Add the first N_add state-action pairs to the sample set to get the extended set X_{L+N_add}
if L + N_add > N_limit
    Calculate hyperparameters based on X_{L+N_add}
    for l = 1 to L + N_add
        Calculate E[Q(x(l)) | X_{L+N_add}] and var[Q(x(l)) | X_{L+N_add}] using (6) and (7)
        Calculate U(x(l)) using (40)
    end for
    Sort all U(x(l)), l = 1, ..., L + N_add, in descending order
    Delete the first L + N_add − N_limit samples to get the refined set X_{N_limit}
end if

Algorithm 2: Refinement of sample set.
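A compact Python sketch of Algorithm 2 is given below. It reuses expected_utility from the sketch after (40) and assumes two placeholder callables, gp_posterior(points, sample_set) standing in for (6) and (7) and refit_hyperparameters for the hyperparameter recomputation; trajectory and sample_set are assumed to be lists of state-action pairs.

    import numpy as np

    def refine_sample_set(trajectory, sample_set, gp_posterior, refit_hyperparameters,
                          n_add=10, n_limit=60):
        # Step 1: score the visited state-action pairs and add the n_add lowest-utility ones.
        mean, var = gp_posterior(trajectory, sample_set)          # (6) and (7) on X_L
        order = np.argsort(expected_utility(mean, var))           # ascending utility
        extended = sample_set + [trajectory[i] for i in order[:n_add]]

        # Step 2: if the extended set exceeds the size limit, drop the highest-utility samples.
        if len(extended) > n_limit:
            refit_hyperparameters(extended)
            mean, var = gp_posterior(extended, extended)          # re-evaluate on X_{L+N_add}
            order = np.argsort(expected_utility(mean, var))[::-1] # descending utility
            drop = set(order[:len(extended) - n_limit].tolist())
            extended = [x for i, x in enumerate(extended) if i not in drop]
        return extended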

Figure 10: Learning performance of GK-ADP when the sample set is refined by adding 10 samples. (a) The average evolution process of the sum of Q values of all sample state-actions before and after adding samples; (b) the enhancement of the control performance regarding cart displacement after adding samples.

As all enhancements with respect to the 100 runs are illustrated together in Figure 10(b), almost all cart displacements at the moment the pole is swung up are enhanced more or less, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.

Finally, let us check which state-action pairs are added into the sample set. We put all added samples over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. If we only consider the relationship between Q values and pole angle, just as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the boundary −0.4, due to the ratio between ρ and β.

5. Conclusions and Future Work

ADP methods are among the most promising research directions for RL in continuous environments to construct learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, two-phase iteration is good at making the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations are carried out to verify the necessity of two-phase learning and to illustrate the properties of GK-ADP.


Figure 11: The Q values with respect to all added samples over 100 runs. (a) Projection onto the dimensions of pole angle and cart displacement; (b) projection onto the dimension of pole angle.

Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to refine the sample set online in order to enhance the performance of the critic-actor architecture during operation.

However, there are some issues to be addressed in the future. The first is how to guarantee condition (36) during learning, which is currently handled in an empirical way. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead two-phase iteration to failure, not only to slow convergence.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.

References

[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.

[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.

[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.

[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.

[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.

[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.

[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.

[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.

[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.

[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.

[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.

[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.

[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.

[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.

[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.

[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.

[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.

[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.

[21] T. Hofmann, B. Scholkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.

[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.

[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.

[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.

[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Scholkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.

[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.

[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.

[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.

[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.

[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: New Research Article Two-Phase Iteration for Value Function …downloads.hindawi.com/journals/mpe/2015/760459.pdf · 2019. 7. 31. · Research Article Two-Phase Iteration for Value

8 Mathematical Problems in Engineering

Initialize120579 hyperparameters of Gaussian kernel model119883119871= 119909119894| 119909119894= (119904119894 119886119894)119871

119894=1 sample set

120587(1199040) initial policy1205781t 120578

2t learning step size

Let 119905 = 0Loop

119891119900119903 k = 1 119905119900 119870 119889119900

119905 = t + 1119886119905= 120587(119904

119905)

Get the reward 119903119905

Observe next state 119904119905+1

Update 120572 according to (12)Update the policy 120587 according to optimum seeking

119890119899119889 119891119900119903

Update 120579 according to (12)Until the termination criterion is satisfied

Algorithm 1 Gaussian-kernel-based Approximate Dynamic Programming

0 001 002 005 008 01 0120

01

02

03

04

05

06

07

08

09

1

Succ

ess r

ate o

ver 5

0 ru

ns

6

17

27

42

5047 46

Parameter a in learning rate 1205781

Figure 1 Success times over 50 runs under different learning rate1205781119905

is bounded within [minus05 05] And in each run the initialhyperparameters are set to [119890

minus2511989005

11989009

119890minus1

1198900001

]Figure 1 shows the result of success rates Clearly with

1205782119905fixed different 1205781

119905rsquos affect the performance significantly In

particular without the learning of hyperparameters that is1205781119905= 0 there are only 6 successful runs over 50 runsWith the

increasing of 119886 the success rate increases till 1 when 119886 = 008But as 119886 goes on increasing the performance becomes worse

Hence both phases are necessary if the hyperparametersare not initialized properly It should be noted that thelearning rates wrt two phases need to be regulated carefullyin order to guarantee condition (36) which leads the learningprocess to the equilibrium in Remark 3 but not to some kindof boundary where the actor always executed the maximal orminimal actions

0 1000 2000 3000 4000 5000 6000 7000 8000500

1000

1500

2000

2500

3000

3500

4000

4500

5000

Iteration

The s

um o

f Q v

alue

s of a

ll sa

mpl

e sta

te-a

ctio

ns

No learninga = 001a = 002a = 005

a = 008a = 01a = 012

Figure 2 The average evolution processes over 50 runs wrtdifferent 1205781

119905

If all samplesrsquo values on each iteration are summed up andall cumulative values wrt iterations are depicted in seriesthen we have Figure 2 The evolution process wrt the bestparameter 119886 = 008 is marked by the darker diamond It isclear that even with different 1205781

119905 the evolution processes of

the samplesrsquo value learning are similar That means due toBIBO property the samplesrsquo values must be finally boundedand convergent However as mentioned above it does notmean that the learning process converges to the properequilibrium

Mathematical Problems in Engineering 9

S1

S3

Figure 3 One-dimensional inverted pendulum

42 Value Iteration versus Policy Iteration As mentioned inAlgorithm 1 the value iteration in critic network learningdoes not depend on the convergence of the actor networkand compared with HDP or DHP the direct optimumseeking for actor network seems a little aimless To test itsperformance and discuss its special characters of learning acomparison between GK-ADP and KHDP is carried out

The control objective in the simulation is a one-dimensional inverted pendulum where a single-stageinverted pendulum is mounted on a cart which is able tomove linearly just as Figure 3 shows

The mathematic model is given as

1199041 = 1199042 + 120576

1199042 =119892 sin (1199041) minus 119872

119901119889119904

22 cos (1199041) sin (1199041) 119872

119889 (43 minus 119872119901cos2 (1199041) 119872)

+cos (1199041) 119872

119889 (43 minus 119872119901cos2 (1199041) 119872)

119906

1199043 = 1199044

1199044 =119906 + 119872

119901119889 (119904

22 sin (1199041) minus 1199042 cos (1199041))

119872

(39)

where 1199041 to 1199044 represent the state of angle angle speed lineardisplacement and linear speed of the cart respectively 119906represents the force acting on the cart119872 = 02 kg 119889 = 05m120576 sim 119880(minus002 002) and other denotations are the same as(38)

Thus the state-action space is 5D space much larger thanthat in simulation 1The configurations of the both algorithmsare listed as follows

(i) A small sample set with 50 samples is adopted to buildcritic network which is determined by ALD-basedsparsification

(ii) For both algorithms the state-action pair is limited to[minus03 03] [minus1 1] [minus03 03] [minus2 2] [minus3 3] wrt 1199041to 1199044 and 119906

(iii) In GK-ADP the learning rates are 1205781119905= 002(1+119905

001)

and 1205782119905= 002(1 + 119905)

002

0 2000 4000 6000 8000minus02

002040608

112141618

Iteration

The a

vera

ge o

f cum

ulat

ed w

eigh

ts ov

er 1

00 ru

ns

minus1000100200300400500600700800900

10000

GK-ADP120590 = 09

120590 = 12

120590 = 15

120590 = 18

120590 = 21

120590 = 24

Figure 4 The comparison of the average evolution processes over100 runs

(iv) The KHDP algorithm in [15] is chosen as the compar-ison where the critic network uses Gaussian kernelsTo get the proper Gaussian parameters a series of 120590 as09 12 15 18 21 and 24 are tested in the simulationThe elements of weights 120572 are initialized randomly in[minus05 05] the forgetting factor in RLS-TD(0) is set to120583 = 1 and the learning rate in KHDP is set to 03

(v) All experiments run 100 times to get the average per-formance And in each run there are 10000 iterationsto learn critic network

Before further discussion it should be noted that it makesno sense to figure out ourselves with which algorithm isbetter in learning performance because besides the debateof policy iteration versus value iteration there are too manyparameters in the simulation configuration affecting learningperformance So the aim of this simulation is to illustrate thelearning characters of GK-ADP

However to make the comparison as fair as possible weregulate the learning rates of both algorithms to get similarevolution processes Figure 4 shows the evolution processesof GK-ADP and KHDPs under different 120590 where the 119910-axisin the left represents the cumulatedweightssum

119894120572119894 of the critic

network in kernel ACD and the other 119910-axis represents thecumulated values of the samples sum

119894119876(119909119894) 119909119894isin 119883119871 in GK-

ADP It implies although with different units the learningprocesses under both learning algorithms converge nearly atthe same speed

Then the success rates of all algorithms are depictedin Figure 5 The left six bars represent the success rates ofKHDP with different 120590 With superiority argument left asidewe find that such fixed 120590 in KHDP is very similar to thefixed hyperparameters 120579 which needs to be set properlyin order to get higher success rate But unfortunately thereis no mechanism in KHDP to regulate 120590 online On the

10 Mathematical Problems in Engineering

09 12 15 18 21 240

01

02

03

04

05

06

07

08

09

1

Parameter 120590 in ALD-based sparsification

Succ

ess r

ate o

ver 1

00 ru

ns

67 6876 75

58 57

96

GK-ADP

Figure 5 The success rates wrt all algorithms over 100 runs

contrary the two-phase iteration introduces the update ofhyperparameters into critic network which is able to drivehyperparameters to better values even with the not so goodinitial values In fact this two-phase update can be viewed asa kind of kernel ACD with dynamic 120590 when Gaussian kernelis used

To discuss the performance in deep we plot the testtrajectories resulting from the actors which are optimized byGK-ADP and KHDP in Figures 6(a) and 6(b) respectivelywhere the start state is set to [02 0 minus02 0]

Apparently the transition time of GK-ADP is muchsmaller than KHDP We think besides the possible wellregulated parameters an important reason is nongradientlearning for actor network

The critic learning only depends on exploration-exploita-tion balance but not the convergence of actor learning Ifexploration-exploitation balance is designed without actornetwork output the learning processes of actor and criticnetworks are relatively independent of each other and thenthere are alternatives to gradient learning for actor networkoptimization for example the direct optimum seeking Gaus-sian regression actor network in GK-ADP

Such direct optimum seeking may result in nearly non-smooth actor network just like the force output depicted inthe second plot of Figure 6(a) To explain this phenomenonwe can find the clue from Figure 7 which illustrates the bestactions wrt all sample states according to the final actornetwork It is reasonable that the force direction is always thesame to the pole angle and contrary to the cart displacementAnd clearly the transition interval from negative limit minus3119873 topositive 3119873 is very short Thus the output of actor networkresulting from Gaussian kernels intends to control anglebias as quickly as possible even though such quick controlsometimes makes the output force a little nonsmooth

It is obvious that GK-ADP is with high efficiency butalso with potential risks such as the impact to actuatorsHence how to design exploration-exploitation balance tosatisfy Theorem 2 and how to design actor network learning

to balance efficiency and engineering applicability are twoimportant issues in future work

Finally we check the samples values which are depictedin Figure 8 where only dimensions of pole angle and lineardisplacement are depicted Due to the definition of immedi-ate reward the closer the states to the equilibrium are thesmaller the 119876 value is

If we execute 96 successful policies one by one and recordall final cart displacements and linear speed just as Figure 9shows the shortage of this sample set is uncovered that evenwith good convergence of pole angle the optimal policy canhardly drive cart back to zero point

To solve this problem and improve performance besidesoptimizing learning parameters Remark 4 implies thatanother and maybe more efficient way is refining sample setbased on the samplesrsquo values We will investigate it in thefollowing subsection

43 Online Adjustment of Sample Set Due to learning sam-plersquos value it is possible to assess whether the samples arechosen reasonable In this simulation we adopt the followingexpected utility [26]

119880 (119909) = 120588ℎ (119864 [119876 (119909) | 119883119871])

+120573

2log (var [119876 (119909) | 119883119871])

(40)

where ℎ(sdot) is a normalization function over sample set and120588 = 1 120573 = 03 are coefficients

Let 119873add represent the number of samples added to thesample set 119873limit the size limit of the sample set and 119879(119909) =

119909(0) 119909(1) 119909(119879119890) the state-action pair trajectory during

GK-ADP learning where 119879119890denotes the end iteration of

learning The algorithm to refine sample set is presented asAlgorithm 2

Since in simulation 2 we have carried out 100 runs ofexperiment for GK-ADP 100 sets of sample values wrtthe same sample states are obtained Now let us applyAlgorithm 2 to these resulted sample sets where 119879(119909) ispicked up from the history trajectory of state-action pairs ineach run and 119873add = 10 119873limit = 60 in order to add 10samples into sample set

We repeat all 100 runs of GK-ADP again with the samelearning rates 120578

1 and 1205782 Based on 96 successful runs in

simulation 2 94 runs are successful in obtaining controlpolicies swinging up pole Figure 10(a) shows the averageevolutional process of sample values over 100 runs wherethe first 10000 iterations display the learning process insimulation 2 and the second 8000 iterations display thelearning process after adding samples Clearly the cumulative119876 value behaves a sudden increase after sample was addedand almost keeps steady afterwards It implies even withmore samples added there is not toomuch to learn for samplevalues

However if we check the cart movement we will findthe enhancement brought by the change of sample set Forthe 119895th run we have two actor networks resulting from GK-ADP before and after adding samples Using these actors Wecarry out the control test respectively and collect the absolute

Mathematical Problems in Engineering 11

0 1 2 3 4 5minus15

minus1

minus05

0

05

Time (s)

0 1 2 3 4 5Time (s)

minus2

0

2

4

Forc

e (N

)

Angle displacementAngle speed

Pole

angl

e (ra

d) an

d sp

eed

(rad

s)

(a) The resulted control performance using GK-ADP

0 1 2 3 4 5Time (s)

0

0

1 2 3 4 5Time (s)

Angle displacementAngle speed

minus1

0

1

2

3

Forc

e (N

)

minus1

minus05

05

Pole

angl

e (ra

d) an

d sp

eed

(rad

s)

(b) The resulted control performance using KHDP

Figure 6 The outputs of the actor networks resulting from GK-ADP and KHDP

minus04 minus02 0 0204

minus04minus02

002

04minus4

minus2

0

2

4

Pole angle (rad)

Linear displacement (m)

Best

actio

n

Figure 7 The final best policies wrt all samples

minus04 minus020 02

04

minus04minus02

002

04minus10

0

10

20

30

40

Pole angle (rad)Linear displacement (m)

Q v

alue

Figure 8The119876 values projected on to dimensions of pole angle (1199041)and linear displacement (1199043) wrt all samples

0 20 40 60 80 100

0

05

1

15

2

Run

Cart

disp

lace

men

t (m

) and

lini

ear s

peed

(ms

)

Cart displacementLinear speed

Figure 9 The final cart displacement and linear speed in the test ofsuccessful policy

values of cart displacement denoted by 1198781198873(119895) and 119878119886

3(119895) at themoment that the pole has been swung up for 100 instants Wedefine the enhancement as

119864 (119895) =119878119887

3 (119895) minus 119878119886

3 (119895)

1198781198873 (119895) (41)

Obviously the positive enhancement indicates that theactor network after adding samples behaves better As allenhancements wrt 100 runs are illustrated together just

12 Mathematical Problems in Engineering

119891119900119903 l = 1 119905119900 119905ℎ119890 119890119899119889 119900119891 119879(119909)

Calculate 119864[119876(119909(119897)) | 119883119871] and var[119876(119909119897) | 119883

119871] using (6) and (7)

Calculate 119880(119909(119897)) using (40)119890119899119889 119891119900119903 119897

Sort all 119880(119909(119897)) 119897 = 1 119879119890in ascending order

Add the first119873add state-action pairs to sample set to get extended set119883119871+119873add

119894119891 L +119873add gt 119873limit

Calculate hyperparameters based on 119883119871+119873add

119891119900119903 l = 1 119905119900 119871 +119873add

Calculate 119864[119876(119909(119897)) | 119883119871+119873add

] and var[119876(119909119897) | 119883119871+119873add

] using (6) and (7)Calculate 119880(119909(119897)) using (40)

119890119899119889 119891119900119903 119897

Sort all 119880(119909(119897)) 119897 = 1 119871 + 119873add in descending orderDelete the first 119871 +119873add minus 119873limit samples to get refined set 119883

119873limit

119890119899119889 119894119891

Algorithm 2 Refinement of sample set

0 2000 4000 6000 8000 10000 12000 14000 16000 18000100

200

300

400

500

600

700

800

Iteration

The s

um o

f Q v

alue

s of a

ll sa

mpl

e sta

te-a

ctio

ns

(a) The average evolution process of 119876 values before and after addingsamples

0 20 40 60 80 100minus15

minus1

minus05

0

05

1

Run

Enha

ncem

ent (

)

(b) The enhancement of the control performance about cart displacementafter adding samples

Figure 10 Learning performance of GK-ADP if sample set is refined by adding 10 samples

as Figure 10(b) shows almost all cart displacements as thepole is swung up are enhanced more or less except for 8times of failure Hence with proper principle based on thesample values the sample set can be refined online in orderto enhance performance

Finally let us check which state-action pairs are addedinto sample set We put all added samples over 100 runstogether and depict their values in Figure 11(a) where thevalues are projected onto the dimensions of pole angleand cart displacement If only concerning the relationshipbetween 119876 values and pole angle just as Figure 11(b) showswe find that the refinement principle intends to select thesamples a little away from the equilibrium and near theboundary minus04 due to the ratio between 120588 and 120573

5 Conclusions and Future Work

ADPmethods are among the most promising research workson RL in continuous environment to construct learningsystems for nonlinear optimal control This paper presentsGK-ADP with two-phase value iteration which combines theadvantages of kernel ACDs and GP-based value iteration

The theoretical analysis reveals that with proper learningrates two-phase iteration is good atmakingGaussian-kernel-based critic network converge to the structure with optimalhyperparameters and approximate all samplesrsquo values

A series of simulations are carried out to verify thenecessity of two-phase learning and illustrate properties ofGK-ADP Finally the numerical tests support the viewpointthat the assessment of samplesrsquo values provides the way to

Mathematical Problems in Engineering 13

minus04minus02

002

04

minus04minus0200204minus10

0

10

20

30

40

Pole angle (rad)

Q v

alue

s

Cart

displac

emen

t (m)

(a) Projecting on the dimensions of pole angle and cart displacement

minus04 minus03 minus02 minus01 0 01 02 03 04minus5

0

5

10

15

20

25

30

35

40

Pole angle (rad)

Q v

alue

s

(b) Projecting on the dimension of pole angle

Figure 11 The 119876 values wrt all added samples over 100 runs

refine sample set online in order to enhance the performanceof critic-actor architecture during operation

However there are some issues needed to be concernedin future The first is how to guarantee condition (36) duringlearning which is now determined by empirical ways Thesecond is the balance between exploration and exploitationwhich is always an opening question but seemsmore notablehere because bad exploration-exploitation principle will leadtwo-phase iteration to failure not only to slow convergence

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

Thiswork is supported by theNational Natural Science Foun-dation of China under Grants 61473316 and 61202340 andthe International Postdoctoral Exchange Fellowship Programunder Grant no 20140011

References

[1] R Sutton and A Barto Reinforcement Learning An Introduc-tion Adaptive Computation andMachine LearningMIT PressCambridge Mass USA 1998

[2] F L Lewis andD Vrabie ldquoReinforcement learning and adaptivedynamic programming for feedback controlrdquo IEEE Circuits andSystems Magazine vol 9 no 3 pp 32ndash50 2009

[3] F L Lewis D Vrabie and K Vamvoudakis ldquoReinforcementlearning and feedback control using natural decision methodsto design optimal adaptive controllersrdquo IEEE Control SystemsMagazine vol 32 no 6 pp 76ndash105 2012

[4] H Zhang Y Luo and D Liu ldquoNeural-network-based near-optimal control for a class of discrete-time affine nonlinearsystems with control constraintsrdquo IEEE Transactions on NeuralNetworks vol 20 no 9 pp 1490ndash1503 2009

[5] Y Huang and D Liu ldquoNeural-network-based optimal trackingcontrol scheme for a class of unknown discrete-time nonlinear

systems using iterative ADP algorithmrdquo Neurocomputing vol125 pp 46ndash56 2014

[6] X Xu L Zuo and Z Huang ldquoReinforcement learning algo-rithms with function approximation recent advances andapplicationsrdquo Information Sciences vol 261 pp 1ndash31 2014

[7] A Al-Tamimi F L Lewis and M Abu-Khalaf ldquoDiscrete-timenonlinear HJB solution using approximate dynamic program-ming convergence proofrdquo IEEE Transactions on Systems Manand Cybernetics Part B Cybernetics vol 38 no 4 pp 943ndash9492008

[8] S Ferrari J E Steck andRChandramohan ldquoAdaptive feedbackcontrol by constrained approximate dynamic programmingrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 38 no 4 pp 982ndash987 2008

[9] F-Y Wang H Zhang and D Liu ldquoAdaptive dynamic pro-gramming an introductionrdquo IEEE Computational IntelligenceMagazine vol 4 no 2 pp 39ndash47 2009

[10] D Wang D Liu Q Wei D Zhao and N Jin ldquoOptimal controlof unknown nonaffine nonlinear discrete-time systems basedon adaptive dynamic programmingrdquo Automatica vol 48 no 8pp 1825ndash1832 2012

[11] D Vrabie and F Lewis ldquoNeural network approach tocontinuous-time direct adaptive optimal control for partiallyunknown nonlinear systemsrdquo Neural Networks vol 22 no 3pp 237ndash246 2009

[12] D Liu D Wang and X Yang ldquoAn iterative adaptive dynamicprogramming algorithm for optimal control of unknowndiscrete-time nonlinear systemswith constrained inputsrdquo Infor-mation Sciences vol 220 pp 331ndash342 2013

[13] N Jin D Liu T Huang and Z Pang ldquoDiscrete-time adaptivedynamic programming using wavelet basis function neural net-worksrdquo in Proceedings of the IEEE Symposium on ApproximateDynamic Programming and Reinforcement Learning pp 135ndash142 Honolulu Hawaii USA April 2007

[14] P Koprinkova-Hristova M Oubbati and G Palm ldquoAdaptivecritic designwith echo state networkrdquo in Proceedings of the IEEEInternational Conference on SystemsMan andCybernetics (SMCrsquo10) pp 1010ndash1015 IEEE Istanbul Turkey October 2010

[15] X Xu Z Hou C Lian and H He ldquoOnline learning controlusing adaptive critic designs with sparse kernelmachinesrdquo IEEE

14 Mathematical Problems in Engineering

Transactions on Neural Networks and Learning Systems vol 24no 5 pp 762ndash775 2013

[16] V Vapnik Statistical Learning Theory Wiley New York NYUSA 1998

[17] C E Rasmussen and C K Williams Gaussian Processesfor Machine Learning Adaptive Computation and MachineLearning MIT Press Cambridge Mass USA 2006

[18] Y Engel S Mannor and R Meir ldquoThe kernel recursive least-squares algorithmrdquo IEEE Transactions on Signal Processing vol52 no 8 pp 2275ndash2285 2004

[19] T G Dietterich and X Wang ldquoBatch value function approxi-mation via support vectorsrdquo in Advances in Neural InformationProcessing Systems pp 1491ndash1498 MIT Press Cambridge MassUSA 2002

[20] X Wang X Tian Y Cheng and J Yi ldquoQ-learning systembased on cooperative least squares support vector machinerdquoActa Automatica Sinica vol 35 no 2 pp 214ndash219 2009

[21] T Hofmann B Scholkopf and A J Smola ldquoKernel methodsin machine learningrdquoThe Annals of Statistics vol 36 no 3 pp1171ndash1220 2008

[22] M F Huber ldquoRecursive Gaussian process on-line regressionand learningrdquo Pattern Recognition Letters vol 45 no 1 pp 85ndash91 2014

[23] Y Engel S Mannor and R Meir ldquoBayes meets bellman theGaussian process approach to temporal difference learningrdquoin Proceedings of the 20th International Conference on MachineLearning pp 154ndash161 August 2003

[24] Y Engel S Mannor and R Meir ldquoReinforcement learning withGaussian processesrdquo in Proceedings of the 22nd InternationalConference on Machine Learning pp 201ndash208 ACM August2005

[25] C E Rasmussen andM Kuss ldquoGaussian processes in reinforce-ment learningrdquo in Advances in Neural Information ProcessingSystems S Thrun L K Saul and B Scholkopf Eds pp 751ndash759 MIT Press Cambridge Mass USA 2004

[26] M P Deisenroth C E Rasmussen and J Peters ldquoGaussianprocess dynamic programmingrdquo Neurocomputing vol 72 no7ndash9 pp 1508ndash1524 2009

[27] X Xu C Lian L Zuo and H He ldquoKernel-based approximatedynamic programming for real-time online learning controlan experimental studyrdquo IEEE Transactions on Control SystemsTechnology vol 22 no 1 pp 146ndash156 2014

[28] A Girard C E Rasmussen and R Murray-Smith ldquoGaussianprocess priors with uncertain inputs-multiple-step-ahead pre-dictionrdquo Tech Rep 2002

[29] X Xu T Xie D Hu and X Lu ldquoKernel least-squares temporaldifference learningrdquo International Journal of Information Tech-nology vol 11 no 9 pp 54ndash63 2005

[30] H J Kushner and G G Yin Stochastic Approximation andRecursive Algorithms and Applications vol 35 of Applications ofMathematics Springer Berlin Germany 2nd edition 2003

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 9: New Research Article Two-Phase Iteration for Value Function …downloads.hindawi.com/journals/mpe/2015/760459.pdf · 2019. 7. 31. · Research Article Two-Phase Iteration for Value

Mathematical Problems in Engineering 9

S1

S3

Figure 3 One-dimensional inverted pendulum

42 Value Iteration versus Policy Iteration As mentioned inAlgorithm 1 the value iteration in critic network learningdoes not depend on the convergence of the actor networkand compared with HDP or DHP the direct optimumseeking for actor network seems a little aimless To test itsperformance and discuss its special characters of learning acomparison between GK-ADP and KHDP is carried out

The control objective in the simulation is a one-dimensional inverted pendulum where a single-stageinverted pendulum is mounted on a cart which is able tomove linearly just as Figure 3 shows

The mathematic model is given as

1199041 = 1199042 + 120576

1199042 =119892 sin (1199041) minus 119872

119901119889119904

22 cos (1199041) sin (1199041) 119872

119889 (43 minus 119872119901cos2 (1199041) 119872)

+cos (1199041) 119872

119889 (43 minus 119872119901cos2 (1199041) 119872)

119906

1199043 = 1199044

1199044 =119906 + 119872

119901119889 (119904

22 sin (1199041) minus 1199042 cos (1199041))

119872

(39)

where 1199041 to 1199044 represent the state of angle angle speed lineardisplacement and linear speed of the cart respectively 119906represents the force acting on the cart119872 = 02 kg 119889 = 05m120576 sim 119880(minus002 002) and other denotations are the same as(38)

Thus the state-action space is 5D space much larger thanthat in simulation 1The configurations of the both algorithmsare listed as follows

(i) A small sample set with 50 samples is adopted to buildcritic network which is determined by ALD-basedsparsification

(ii) For both algorithms the state-action pair is limited to[minus03 03] [minus1 1] [minus03 03] [minus2 2] [minus3 3] wrt 1199041to 1199044 and 119906

(iii) In GK-ADP the learning rates are 1205781119905= 002(1+119905

001)

and 1205782119905= 002(1 + 119905)

002

0 2000 4000 6000 8000minus02

002040608

112141618

Iteration

The a

vera

ge o

f cum

ulat

ed w

eigh

ts ov

er 1

00 ru

ns

minus1000100200300400500600700800900

10000

GK-ADP120590 = 09

120590 = 12

120590 = 15

120590 = 18

120590 = 21

120590 = 24

Figure 4 The comparison of the average evolution processes over100 runs

(iv) The KHDP algorithm in [15] is chosen as the compar-ison where the critic network uses Gaussian kernelsTo get the proper Gaussian parameters a series of 120590 as09 12 15 18 21 and 24 are tested in the simulationThe elements of weights 120572 are initialized randomly in[minus05 05] the forgetting factor in RLS-TD(0) is set to120583 = 1 and the learning rate in KHDP is set to 03

(v) All experiments run 100 times to get the average per-formance And in each run there are 10000 iterationsto learn critic network

Before further discussion it should be noted that it makesno sense to figure out ourselves with which algorithm isbetter in learning performance because besides the debateof policy iteration versus value iteration there are too manyparameters in the simulation configuration affecting learningperformance So the aim of this simulation is to illustrate thelearning characters of GK-ADP

However to make the comparison as fair as possible weregulate the learning rates of both algorithms to get similarevolution processes Figure 4 shows the evolution processesof GK-ADP and KHDPs under different 120590 where the 119910-axisin the left represents the cumulatedweightssum

119894120572119894 of the critic

network in kernel ACD and the other 119910-axis represents thecumulated values of the samples sum

119894119876(119909119894) 119909119894isin 119883119871 in GK-

ADP It implies although with different units the learningprocesses under both learning algorithms converge nearly atthe same speed

Then the success rates of all algorithms are depictedin Figure 5 The left six bars represent the success rates ofKHDP with different 120590 With superiority argument left asidewe find that such fixed 120590 in KHDP is very similar to thefixed hyperparameters 120579 which needs to be set properlyin order to get higher success rate But unfortunately thereis no mechanism in KHDP to regulate 120590 online On the

10 Mathematical Problems in Engineering

Figure 5: The success rates w.r.t. all algorithms over 100 runs (x-axis: parameter σ in ALD-based sparsification; y-axis: success rate over 100 runs; KHDP with σ = 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4 achieves 67%, 68%, 76%, 75%, 58%, and 57%, respectively, while GK-ADP achieves 96%).

On the contrary, the two-phase iteration introduces the update of hyperparameters into the critic network, which is able to drive the hyperparameters to better values even with not-so-good initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with dynamic σ when the Gaussian kernel is used.

To discuss the performance in depth, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures 6(a) and 6(b), respectively, where the start state is set to [0.2, 0, −0.2, 0].

Apparently the transition time of GK-ADP is much smaller than that of KHDP. We think that, besides the possibly well-regulated parameters, an important reason is the nongradient learning for the actor network.

The critic learning only depends on the exploration-exploitation balance, not on the convergence of the actor learning. If the exploration-exploitation balance is designed without the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and then there are alternatives to gradient learning for actor network optimization, for example, the direct optimum-seeking Gaussian regression actor network in GK-ADP.

Such direct optimum seeking may result in a nearly nonsmooth actor network, just like the force output depicted in the second plot of Figure 6(a). To explain this phenomenon, we can find the clue in Figure 7, which illustrates the best actions w.r.t. all sample states according to the final actor network. It is reasonable that the force direction is always the same as the pole angle and contrary to the cart displacement. And clearly the transition interval from the negative limit −3 N to the positive 3 N is very short. Thus the actor network resulting from Gaussian kernels intends to control the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.

It is obvious that GK-ADP has high efficiency but also potential risks, such as the impact on actuators. Hence how to design the exploration-exploitation balance to satisfy Theorem 2 and how to design the actor network learning to balance efficiency and engineering applicability are two important issues in future work.

Finally, we check the samples' values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of the immediate reward, the closer the states are to the equilibrium, the smaller the Q value is.

If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, just as Figure 9 shows, a shortage of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.

To solve this problem and improve performance, besides optimizing the learning parameters, Remark 4 implies that another and maybe more efficient way is refining the sample set based on the samples' values. We will investigate it in the following subsection.

4.3. Online Adjustment of Sample Set. Due to learning the sample's value, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the following expected utility [26]:

U(x) = \rho h\left(E[Q(x) \mid X_L]\right) + \frac{\beta}{2} \log\left(\operatorname{var}[Q(x) \mid X_L]\right),    (40)

where h(\cdot) is a normalization function over the sample set and \rho = 1, \beta = 0.3 are coefficients.
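As a minimal sketch of how (40) could be evaluated, the snippet below scores candidate state-action pairs from a GP posterior. The callables gp_mean and gp_var stand in for the posterior mean and variance in (6) and (7), and min-max scaling is assumed for the normalization h(·); these are illustrative assumptions rather than the paper's exact implementation.

import numpy as np

def expected_utility(candidates, gp_mean, gp_var, rho=1.0, beta=0.3):
    # U(x) = rho * h(E[Q(x)|X_L]) + (beta/2) * log(var[Q(x)|X_L]), cf. (40)
    means = np.array([gp_mean(x) for x in candidates])
    variances = np.array([gp_var(x) for x in candidates])
    # h(.): assumed to be a min-max normalization of the posterior means over the candidates
    span = means.max() - means.min()
    h = (means - means.min()) / span if span > 0 else np.zeros_like(means)
    return rho * h + 0.5 * beta * np.log(variances)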

Let N_add represent the number of samples added to the sample set, N_limit the size limit of the sample set, and T(x) = {x(0), x(1), ..., x(T_e)} the state-action pair trajectory during GK-ADP learning, where T_e denotes the end iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.

Since in simulation 2 we have carried out 100 runs of the experiment for GK-ADP, 100 sets of sample values w.r.t. the same sample states are obtained. Now let us apply Algorithm 2 to these resulting sample sets, where T(x) is picked up from the history trajectory of state-action pairs in each run and N_add = 10, N_limit = 60, in order to add 10 samples into the sample set.

We repeat all 100 runs of GK-ADP again with the same learning rates η_1 and η_2. Compared with the 96 successful runs in simulation 2, 94 runs are successful in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution process of the sample values over 100 runs, where the first 10000 iterations display the learning process in simulation 2 and the subsequent 8000 iterations display the learning process after adding samples. Clearly the cumulative Q value exhibits a sudden increase after the samples were added and almost keeps steady afterwards. It implies that, even with more samples added, there is not too much left to learn for the sample values.

However, if we check the cart movement, we will find the enhancement brought by the change of the sample set. For the jth run we have two actor networks resulting from GK-ADP, before and after adding samples.


Figure 6: The outputs of the actor networks resulting from GK-ADP and KHDP. (a) The resulting control performance using GK-ADP; (b) the resulting control performance using KHDP. (Each panel plots pole angle (rad) and speed (rad/s) and the force (N) versus time (s).)

Figure 7: The final best policies w.r.t. all samples (best action versus pole angle (rad) and linear displacement (m)).

Figure 8: The Q values projected onto the dimensions of pole angle (s_1) and linear displacement (s_3) w.r.t. all samples.

Figure 9: The final cart displacement (m) and linear speed (m/s) in the tests of the successful policies.

Using these actors, we carry out the control tests, respectively, and collect the absolute values of cart displacement, denoted by S_3^b(j) and S_3^a(j), at the moment that the pole has been swung up, for 100 instants. We define the enhancement as

E(j) = \frac{S_3^b(j) - S_3^a(j)}{S_3^b(j)}.    (41)

Obviously, a positive enhancement indicates that the actor network after adding samples behaves better.


for l = 1 to the end of T(x)
    Calculate E[Q(x(l)) | X_L] and var[Q(x(l)) | X_L] using (6) and (7)
    Calculate U(x(l)) using (40)
end for
Sort all U(x(l)), l = 1, ..., T_e, in ascending order
Add the first N_add state-action pairs to the sample set to get the extended set X_{L+N_add}
if L + N_add > N_limit
    Calculate hyperparameters based on X_{L+N_add}
    for l = 1 to L + N_add
        Calculate E[Q(x(l)) | X_{L+N_add}] and var[Q(x(l)) | X_{L+N_add}] using (6) and (7)
        Calculate U(x(l)) using (40)
    end for
    Sort all U(x(l)), l = 1, ..., L + N_add, in descending order
    Delete the first L + N_add − N_limit samples to get the refined set X_{N_limit}
end if

Algorithm 2: Refinement of sample set.
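The following is a minimal Python sketch of this refinement step under stated assumptions: samples are stored as a list of state-action arrays, utility is the scoring function sketched after (40) with the GP posterior already bound to it (e.g., via functools.partial), and refit_hyperparameters stands in for the hyperparameter update of the two-phase iteration. All of these names are illustrative placeholders, not the paper's implementation.

import numpy as np

def refine_sample_set(samples, trajectory, utility, refit_hyperparameters,
                      n_add=10, n_limit=60):
    # utility(list_of_pairs) -> array of U(x) scores, cf. (40);
    # refit_hyperparameters(list_of_pairs) updates theta on the extended set.
    scores = utility(trajectory)
    order = np.argsort(scores)                      # ascending order of U
    extended = list(samples) + [trajectory[i] for i in order[:n_add]]

    if len(extended) > n_limit:
        refit_hyperparameters(extended)
        scores = utility(extended)
        order = np.argsort(scores)[::-1]            # descending order of U
        drop = set(order[:len(extended) - n_limit].tolist())
        extended = [x for i, x in enumerate(extended) if i not in drop]
    return extended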

Figure 10: Learning performance of GK-ADP if the sample set is refined by adding 10 samples. (a) The average evolution process of Q values before and after adding samples (the sum of Q values of all sample state-actions versus iteration, 0 to 18000); (b) the enhancement of the control performance in cart displacement after adding samples (enhancement versus run, 1 to 100).

All the enhancements w.r.t. the 100 runs are illustrated together in Figure 10(b); as it shows, almost all cart displacements as the pole is swung up are enhanced more or less, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.
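As a small worked example of (41), the snippet below computes the enhancement for each run from two arrays of absolute cart displacements; the array names are hypothetical.

import numpy as np

def enhancement(s3_before, s3_after):
    # E(j) = (S3^b(j) - S3^a(j)) / S3^b(j); positive values mean the refined actor
    # drives the cart closer to the zero point than the original one
    s3_before = np.asarray(s3_before, dtype=float)
    s3_after = np.asarray(s3_after, dtype=float)
    return (s3_before - s3_after) / s3_before

# e.g., a displacement of 0.8 m before and 0.5 m after refinement gives E = 0.375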

Finally, let us check which state-action pairs are added into the sample set. We put all added samples over 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. If we only consider the relationship between the Q values and the pole angle, just as Figure 11(b) shows, we find that the refinement principle intends to select samples a little away from the equilibrium and near the boundary −0.4, due to the ratio between ρ and β.

5. Conclusions and Future Work

ADP methods are among the most promising research works on RL in continuous environments to construct learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, two-phase iteration is good at making the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations are carried out to verify the necessity of two-phase learning and illustrate the properties of GK-ADP.


Figure 11: The Q values w.r.t. all added samples over 100 runs. (a) Projection on the dimensions of pole angle (rad) and cart displacement (m); (b) projection on the dimension of pole angle (rad).

Finally, the numerical tests support the viewpoint that the assessment of samples' values provides the way to refine the sample set online in order to enhance the performance of the critic-actor architecture during operation.

However, there are some issues that need to be considered in the future. The first is how to guarantee condition (36) during learning, which is now handled in empirical ways. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead two-phase iteration to failure, not only to slow convergence.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.

References

[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 1998.

[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[3] F. L. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.

[4] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.

[5] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, 2014.

[6] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: recent advances and applications," Information Sciences, vol. 261, pp. 1–31, 2014.

[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 943–949, 2008.

[8] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 982–987, 2008.

[9] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39–47, 2009.

[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.

[11] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237–246, 2009.

[12] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Information Sciences, vol. 220, pp. 331–342, 2013.

[13] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 135–142, Honolulu, Hawaii, USA, April 2007.

[14] P. Koprinkova-Hristova, M. Oubbati, and G. Palm, "Adaptive critic design with echo state network," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '10), pp. 1010–1015, IEEE, Istanbul, Turkey, October 2010.

[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 762–775, 2013.

[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.

[17] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.

[18] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[19] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems, pp. 1491–1498, MIT Press, Cambridge, Mass, USA, 2002.

[20] X. Wang, X. Tian, Y. Cheng, and J. Yi, "Q-learning system based on cooperative least squares support vector machine," Acta Automatica Sinica, vol. 35, no. 2, pp. 214–219, 2009.

[21] T. Hofmann, B. Scholkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.

[22] M. F. Huber, "Recursive Gaussian process: on-line regression and learning," Pattern Recognition Letters, vol. 45, no. 1, pp. 85–91, 2014.

[23] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: the Gaussian process approach to temporal difference learning," in Proceedings of the 20th International Conference on Machine Learning, pp. 154–161, August 2003.

[24] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, ACM, August 2005.

[25] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Scholkopf, Eds., pp. 751–759, MIT Press, Cambridge, Mass, USA, 2004.

[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no. 7–9, pp. 1508–1524, 2009.

[27] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: an experimental study," IEEE Transactions on Control Systems Technology, vol. 22, no. 1, pp. 146–156, 2014.

[28] A. Girard, C. E. Rasmussen, and R. Murray-Smith, "Gaussian process priors with uncertain inputs: multiple-step-ahead prediction," Tech. Rep., 2002.

[29] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.

[30] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 of Applications of Mathematics, Springer, Berlin, Germany, 2nd edition, 2003.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 10: New Research Article Two-Phase Iteration for Value Function …downloads.hindawi.com/journals/mpe/2015/760459.pdf · 2019. 7. 31. · Research Article Two-Phase Iteration for Value

10 Mathematical Problems in Engineering

09 12 15 18 21 240

01

02

03

04

05

06

07

08

09

1

Parameter 120590 in ALD-based sparsification

Succ

ess r

ate o

ver 1

00 ru

ns

67 6876 75

58 57

96

GK-ADP

Figure 5 The success rates wrt all algorithms over 100 runs

contrary the two-phase iteration introduces the update ofhyperparameters into critic network which is able to drivehyperparameters to better values even with the not so goodinitial values In fact this two-phase update can be viewed asa kind of kernel ACD with dynamic 120590 when Gaussian kernelis used

To discuss the performance in deep we plot the testtrajectories resulting from the actors which are optimized byGK-ADP and KHDP in Figures 6(a) and 6(b) respectivelywhere the start state is set to [02 0 minus02 0]

Apparently the transition time of GK-ADP is muchsmaller than KHDP We think besides the possible wellregulated parameters an important reason is nongradientlearning for actor network

The critic learning only depends on exploration-exploita-tion balance but not the convergence of actor learning Ifexploration-exploitation balance is designed without actornetwork output the learning processes of actor and criticnetworks are relatively independent of each other and thenthere are alternatives to gradient learning for actor networkoptimization for example the direct optimum seeking Gaus-sian regression actor network in GK-ADP

Such direct optimum seeking may result in nearly non-smooth actor network just like the force output depicted inthe second plot of Figure 6(a) To explain this phenomenonwe can find the clue from Figure 7 which illustrates the bestactions wrt all sample states according to the final actornetwork It is reasonable that the force direction is always thesame to the pole angle and contrary to the cart displacementAnd clearly the transition interval from negative limit minus3119873 topositive 3119873 is very short Thus the output of actor networkresulting from Gaussian kernels intends to control anglebias as quickly as possible even though such quick controlsometimes makes the output force a little nonsmooth

It is obvious that GK-ADP is with high efficiency butalso with potential risks such as the impact to actuatorsHence how to design exploration-exploitation balance tosatisfy Theorem 2 and how to design actor network learning

to balance efficiency and engineering applicability are twoimportant issues in future work

Finally we check the samples values which are depictedin Figure 8 where only dimensions of pole angle and lineardisplacement are depicted Due to the definition of immedi-ate reward the closer the states to the equilibrium are thesmaller the 119876 value is

If we execute 96 successful policies one by one and recordall final cart displacements and linear speed just as Figure 9shows the shortage of this sample set is uncovered that evenwith good convergence of pole angle the optimal policy canhardly drive cart back to zero point

To solve this problem and improve performance besidesoptimizing learning parameters Remark 4 implies thatanother and maybe more efficient way is refining sample setbased on the samplesrsquo values We will investigate it in thefollowing subsection

43 Online Adjustment of Sample Set Due to learning sam-plersquos value it is possible to assess whether the samples arechosen reasonable In this simulation we adopt the followingexpected utility [26]

119880 (119909) = 120588ℎ (119864 [119876 (119909) | 119883119871])

+120573

2log (var [119876 (119909) | 119883119871])

(40)

where ℎ(sdot) is a normalization function over sample set and120588 = 1 120573 = 03 are coefficients

Let 119873add represent the number of samples added to thesample set 119873limit the size limit of the sample set and 119879(119909) =

119909(0) 119909(1) 119909(119879119890) the state-action pair trajectory during

GK-ADP learning where 119879119890denotes the end iteration of

learning The algorithm to refine sample set is presented asAlgorithm 2

Since in simulation 2 we have carried out 100 runs ofexperiment for GK-ADP 100 sets of sample values wrtthe same sample states are obtained Now let us applyAlgorithm 2 to these resulted sample sets where 119879(119909) ispicked up from the history trajectory of state-action pairs ineach run and 119873add = 10 119873limit = 60 in order to add 10samples into sample set

We repeat all 100 runs of GK-ADP again with the samelearning rates 120578

1 and 1205782 Based on 96 successful runs in

simulation 2 94 runs are successful in obtaining controlpolicies swinging up pole Figure 10(a) shows the averageevolutional process of sample values over 100 runs wherethe first 10000 iterations display the learning process insimulation 2 and the second 8000 iterations display thelearning process after adding samples Clearly the cumulative119876 value behaves a sudden increase after sample was addedand almost keeps steady afterwards It implies even withmore samples added there is not toomuch to learn for samplevalues

However if we check the cart movement we will findthe enhancement brought by the change of sample set Forthe 119895th run we have two actor networks resulting from GK-ADP before and after adding samples Using these actors Wecarry out the control test respectively and collect the absolute

Mathematical Problems in Engineering 11

0 1 2 3 4 5minus15

minus1

minus05

0

05

Time (s)

0 1 2 3 4 5Time (s)

minus2

0

2

4

Forc

e (N

)

Angle displacementAngle speed

Pole

angl

e (ra

d) an

d sp

eed

(rad

s)

(a) The resulted control performance using GK-ADP

0 1 2 3 4 5Time (s)

0

0

1 2 3 4 5Time (s)

Angle displacementAngle speed

minus1

0

1

2

3

Forc

e (N

)

minus1

minus05

05

Pole

angl

e (ra

d) an

d sp

eed

(rad

s)

(b) The resulted control performance using KHDP

Figure 6 The outputs of the actor networks resulting from GK-ADP and KHDP

minus04 minus02 0 0204

minus04minus02

002

04minus4

minus2

0

2

4

Pole angle (rad)

Linear displacement (m)

Best

actio

n

Figure 7 The final best policies wrt all samples

minus04 minus020 02

04

minus04minus02

002

04minus10

0

10

20

30

40

Pole angle (rad)Linear displacement (m)

Q v

alue

Figure 8The119876 values projected on to dimensions of pole angle (1199041)and linear displacement (1199043) wrt all samples

0 20 40 60 80 100

0

05

1

15

2

Run

Cart

disp

lace

men

t (m

) and

lini

ear s

peed

(ms

)

Cart displacementLinear speed

Figure 9 The final cart displacement and linear speed in the test ofsuccessful policy

values of cart displacement denoted by 1198781198873(119895) and 119878119886

3(119895) at themoment that the pole has been swung up for 100 instants Wedefine the enhancement as

119864 (119895) =119878119887

3 (119895) minus 119878119886

3 (119895)

1198781198873 (119895) (41)

Obviously the positive enhancement indicates that theactor network after adding samples behaves better As allenhancements wrt 100 runs are illustrated together just

12 Mathematical Problems in Engineering

119891119900119903 l = 1 119905119900 119905ℎ119890 119890119899119889 119900119891 119879(119909)

Calculate 119864[119876(119909(119897)) | 119883119871] and var[119876(119909119897) | 119883

119871] using (6) and (7)

Calculate 119880(119909(119897)) using (40)119890119899119889 119891119900119903 119897

Sort all 119880(119909(119897)) 119897 = 1 119879119890in ascending order

Add the first119873add state-action pairs to sample set to get extended set119883119871+119873add

119894119891 L +119873add gt 119873limit

Calculate hyperparameters based on 119883119871+119873add

119891119900119903 l = 1 119905119900 119871 +119873add

Calculate 119864[119876(119909(119897)) | 119883119871+119873add

] and var[119876(119909119897) | 119883119871+119873add

] using (6) and (7)Calculate 119880(119909(119897)) using (40)

119890119899119889 119891119900119903 119897

Sort all 119880(119909(119897)) 119897 = 1 119871 + 119873add in descending orderDelete the first 119871 +119873add minus 119873limit samples to get refined set 119883

119873limit

119890119899119889 119894119891

Algorithm 2 Refinement of sample set

0 2000 4000 6000 8000 10000 12000 14000 16000 18000100

200

300

400

500

600

700

800

Iteration

The s

um o

f Q v

alue

s of a

ll sa

mpl

e sta

te-a

ctio

ns

(a) The average evolution process of 119876 values before and after addingsamples

0 20 40 60 80 100minus15

minus1

minus05

0

05

1

Run

Enha

ncem

ent (

)

(b) The enhancement of the control performance about cart displacementafter adding samples

Figure 10 Learning performance of GK-ADP if sample set is refined by adding 10 samples

as Figure 10(b) shows almost all cart displacements as thepole is swung up are enhanced more or less except for 8times of failure Hence with proper principle based on thesample values the sample set can be refined online in orderto enhance performance

Finally let us check which state-action pairs are addedinto sample set We put all added samples over 100 runstogether and depict their values in Figure 11(a) where thevalues are projected onto the dimensions of pole angleand cart displacement If only concerning the relationshipbetween 119876 values and pole angle just as Figure 11(b) showswe find that the refinement principle intends to select thesamples a little away from the equilibrium and near theboundary minus04 due to the ratio between 120588 and 120573

5 Conclusions and Future Work

ADPmethods are among the most promising research workson RL in continuous environment to construct learningsystems for nonlinear optimal control This paper presentsGK-ADP with two-phase value iteration which combines theadvantages of kernel ACDs and GP-based value iteration

The theoretical analysis reveals that with proper learningrates two-phase iteration is good atmakingGaussian-kernel-based critic network converge to the structure with optimalhyperparameters and approximate all samplesrsquo values

A series of simulations are carried out to verify thenecessity of two-phase learning and illustrate properties ofGK-ADP Finally the numerical tests support the viewpointthat the assessment of samplesrsquo values provides the way to

Mathematical Problems in Engineering 13

minus04minus02

002

04

minus04minus0200204minus10

0

10

20

30

40

Pole angle (rad)

Q v

alue

s

Cart

displac

emen

t (m)

(a) Projecting on the dimensions of pole angle and cart displacement

minus04 minus03 minus02 minus01 0 01 02 03 04minus5

0

5

10

15

20

25

30

35

40

Pole angle (rad)

Q v

alue

s

(b) Projecting on the dimension of pole angle

Figure 11 The 119876 values wrt all added samples over 100 runs

refine sample set online in order to enhance the performanceof critic-actor architecture during operation

However there are some issues needed to be concernedin future The first is how to guarantee condition (36) duringlearning which is now determined by empirical ways Thesecond is the balance between exploration and exploitationwhich is always an opening question but seemsmore notablehere because bad exploration-exploitation principle will leadtwo-phase iteration to failure not only to slow convergence

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

Thiswork is supported by theNational Natural Science Foun-dation of China under Grants 61473316 and 61202340 andthe International Postdoctoral Exchange Fellowship Programunder Grant no 20140011

References

[1] R Sutton and A Barto Reinforcement Learning An Introduc-tion Adaptive Computation andMachine LearningMIT PressCambridge Mass USA 1998

[2] F L Lewis andD Vrabie ldquoReinforcement learning and adaptivedynamic programming for feedback controlrdquo IEEE Circuits andSystems Magazine vol 9 no 3 pp 32ndash50 2009

[3] F L Lewis D Vrabie and K Vamvoudakis ldquoReinforcementlearning and feedback control using natural decision methodsto design optimal adaptive controllersrdquo IEEE Control SystemsMagazine vol 32 no 6 pp 76ndash105 2012

[4] H Zhang Y Luo and D Liu ldquoNeural-network-based near-optimal control for a class of discrete-time affine nonlinearsystems with control constraintsrdquo IEEE Transactions on NeuralNetworks vol 20 no 9 pp 1490ndash1503 2009

[5] Y Huang and D Liu ldquoNeural-network-based optimal trackingcontrol scheme for a class of unknown discrete-time nonlinear

systems using iterative ADP algorithmrdquo Neurocomputing vol125 pp 46ndash56 2014

[6] X Xu L Zuo and Z Huang ldquoReinforcement learning algo-rithms with function approximation recent advances andapplicationsrdquo Information Sciences vol 261 pp 1ndash31 2014

[7] A Al-Tamimi F L Lewis and M Abu-Khalaf ldquoDiscrete-timenonlinear HJB solution using approximate dynamic program-ming convergence proofrdquo IEEE Transactions on Systems Manand Cybernetics Part B Cybernetics vol 38 no 4 pp 943ndash9492008

[8] S Ferrari J E Steck andRChandramohan ldquoAdaptive feedbackcontrol by constrained approximate dynamic programmingrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 38 no 4 pp 982ndash987 2008

[9] F-Y Wang H Zhang and D Liu ldquoAdaptive dynamic pro-gramming an introductionrdquo IEEE Computational IntelligenceMagazine vol 4 no 2 pp 39ndash47 2009

[10] D Wang D Liu Q Wei D Zhao and N Jin ldquoOptimal controlof unknown nonaffine nonlinear discrete-time systems basedon adaptive dynamic programmingrdquo Automatica vol 48 no 8pp 1825ndash1832 2012

[11] D Vrabie and F Lewis ldquoNeural network approach tocontinuous-time direct adaptive optimal control for partiallyunknown nonlinear systemsrdquo Neural Networks vol 22 no 3pp 237ndash246 2009

[12] D Liu D Wang and X Yang ldquoAn iterative adaptive dynamicprogramming algorithm for optimal control of unknowndiscrete-time nonlinear systemswith constrained inputsrdquo Infor-mation Sciences vol 220 pp 331ndash342 2013

[13] N Jin D Liu T Huang and Z Pang ldquoDiscrete-time adaptivedynamic programming using wavelet basis function neural net-worksrdquo in Proceedings of the IEEE Symposium on ApproximateDynamic Programming and Reinforcement Learning pp 135ndash142 Honolulu Hawaii USA April 2007

[14] P Koprinkova-Hristova M Oubbati and G Palm ldquoAdaptivecritic designwith echo state networkrdquo in Proceedings of the IEEEInternational Conference on SystemsMan andCybernetics (SMCrsquo10) pp 1010ndash1015 IEEE Istanbul Turkey October 2010

[15] X Xu Z Hou C Lian and H He ldquoOnline learning controlusing adaptive critic designs with sparse kernelmachinesrdquo IEEE

14 Mathematical Problems in Engineering

Transactions on Neural Networks and Learning Systems vol 24no 5 pp 762ndash775 2013

[16] V Vapnik Statistical Learning Theory Wiley New York NYUSA 1998

[17] C E Rasmussen and C K Williams Gaussian Processesfor Machine Learning Adaptive Computation and MachineLearning MIT Press Cambridge Mass USA 2006

[18] Y Engel S Mannor and R Meir ldquoThe kernel recursive least-squares algorithmrdquo IEEE Transactions on Signal Processing vol52 no 8 pp 2275ndash2285 2004

[19] T G Dietterich and X Wang ldquoBatch value function approxi-mation via support vectorsrdquo in Advances in Neural InformationProcessing Systems pp 1491ndash1498 MIT Press Cambridge MassUSA 2002

[20] X Wang X Tian Y Cheng and J Yi ldquoQ-learning systembased on cooperative least squares support vector machinerdquoActa Automatica Sinica vol 35 no 2 pp 214ndash219 2009

[21] T Hofmann B Scholkopf and A J Smola ldquoKernel methodsin machine learningrdquoThe Annals of Statistics vol 36 no 3 pp1171ndash1220 2008

[22] M F Huber ldquoRecursive Gaussian process on-line regressionand learningrdquo Pattern Recognition Letters vol 45 no 1 pp 85ndash91 2014

[23] Y Engel S Mannor and R Meir ldquoBayes meets bellman theGaussian process approach to temporal difference learningrdquoin Proceedings of the 20th International Conference on MachineLearning pp 154ndash161 August 2003

[24] Y Engel S Mannor and R Meir ldquoReinforcement learning withGaussian processesrdquo in Proceedings of the 22nd InternationalConference on Machine Learning pp 201ndash208 ACM August2005

[25] C E Rasmussen andM Kuss ldquoGaussian processes in reinforce-ment learningrdquo in Advances in Neural Information ProcessingSystems S Thrun L K Saul and B Scholkopf Eds pp 751ndash759 MIT Press Cambridge Mass USA 2004

[26] M P Deisenroth C E Rasmussen and J Peters ldquoGaussianprocess dynamic programmingrdquo Neurocomputing vol 72 no7ndash9 pp 1508ndash1524 2009

[27] X Xu C Lian L Zuo and H He ldquoKernel-based approximatedynamic programming for real-time online learning controlan experimental studyrdquo IEEE Transactions on Control SystemsTechnology vol 22 no 1 pp 146ndash156 2014

[28] A Girard C E Rasmussen and R Murray-Smith ldquoGaussianprocess priors with uncertain inputs-multiple-step-ahead pre-dictionrdquo Tech Rep 2002

[29] X Xu T Xie D Hu and X Lu ldquoKernel least-squares temporaldifference learningrdquo International Journal of Information Tech-nology vol 11 no 9 pp 54ndash63 2005

[30] H J Kushner and G G Yin Stochastic Approximation andRecursive Algorithms and Applications vol 35 of Applications ofMathematics Springer Berlin Germany 2nd edition 2003

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 11: New Research Article Two-Phase Iteration for Value Function …downloads.hindawi.com/journals/mpe/2015/760459.pdf · 2019. 7. 31. · Research Article Two-Phase Iteration for Value

Mathematical Problems in Engineering 11

0 1 2 3 4 5minus15

minus1

minus05

0

05

Time (s)

0 1 2 3 4 5Time (s)

minus2

0

2

4

Forc

e (N

)

Angle displacementAngle speed

Pole

angl

e (ra

d) an

d sp

eed

(rad

s)

(a) The resulted control performance using GK-ADP

0 1 2 3 4 5Time (s)

0

0

1 2 3 4 5Time (s)

Angle displacementAngle speed

minus1

0

1

2

3

Forc

e (N

)

minus1

minus05

05

Pole

angl

e (ra

d) an

d sp

eed

(rad

s)

(b) The resulted control performance using KHDP

Figure 6 The outputs of the actor networks resulting from GK-ADP and KHDP

minus04 minus02 0 0204

minus04minus02

002

04minus4

minus2

0

2

4

Pole angle (rad)

Linear displacement (m)

Best

actio

n

Figure 7 The final best policies wrt all samples

minus04 minus020 02

04

minus04minus02

002

04minus10

0

10

20

30

40

Pole angle (rad)Linear displacement (m)

Q v

alue

Figure 8The119876 values projected on to dimensions of pole angle (1199041)and linear displacement (1199043) wrt all samples

0 20 40 60 80 100

0

05

1

15

2

Run

Cart

disp

lace

men

t (m

) and

lini

ear s

peed

(ms

)

Cart displacementLinear speed

Figure 9 The final cart displacement and linear speed in the test ofsuccessful policy

values of cart displacement denoted by 1198781198873(119895) and 119878119886

3(119895) at themoment that the pole has been swung up for 100 instants Wedefine the enhancement as

119864 (119895) =119878119887

3 (119895) minus 119878119886

3 (119895)

1198781198873 (119895) (41)

Obviously the positive enhancement indicates that theactor network after adding samples behaves better As allenhancements wrt 100 runs are illustrated together just

12 Mathematical Problems in Engineering

119891119900119903 l = 1 119905119900 119905ℎ119890 119890119899119889 119900119891 119879(119909)

Calculate 119864[119876(119909(119897)) | 119883119871] and var[119876(119909119897) | 119883

119871] using (6) and (7)

Calculate 119880(119909(119897)) using (40)119890119899119889 119891119900119903 119897

Sort all 119880(119909(119897)) 119897 = 1 119879119890in ascending order

Add the first119873add state-action pairs to sample set to get extended set119883119871+119873add

119894119891 L +119873add gt 119873limit

Calculate hyperparameters based on 119883119871+119873add

119891119900119903 l = 1 119905119900 119871 +119873add

Calculate 119864[119876(119909(119897)) | 119883119871+119873add

] and var[119876(119909119897) | 119883119871+119873add

] using (6) and (7)Calculate 119880(119909(119897)) using (40)

119890119899119889 119891119900119903 119897

Sort all 119880(119909(119897)) 119897 = 1 119871 + 119873add in descending orderDelete the first 119871 +119873add minus 119873limit samples to get refined set 119883

119873limit

119890119899119889 119894119891

Algorithm 2 Refinement of sample set

0 2000 4000 6000 8000 10000 12000 14000 16000 18000100

200

300

400

500

600

700

800

Iteration

The s

um o

f Q v

alue

s of a

ll sa

mpl

e sta

te-a

ctio

ns

(a) The average evolution process of 119876 values before and after addingsamples

0 20 40 60 80 100minus15

minus1

minus05

0

05

1

Run

Enha

ncem

ent (

)

(b) The enhancement of the control performance about cart displacementafter adding samples

Figure 10 Learning performance of GK-ADP if sample set is refined by adding 10 samples

as Figure 10(b) shows almost all cart displacements as thepole is swung up are enhanced more or less except for 8times of failure Hence with proper principle based on thesample values the sample set can be refined online in orderto enhance performance

Finally let us check which state-action pairs are addedinto sample set We put all added samples over 100 runstogether and depict their values in Figure 11(a) where thevalues are projected onto the dimensions of pole angleand cart displacement If only concerning the relationshipbetween 119876 values and pole angle just as Figure 11(b) showswe find that the refinement principle intends to select thesamples a little away from the equilibrium and near theboundary minus04 due to the ratio between 120588 and 120573

5 Conclusions and Future Work

ADPmethods are among the most promising research workson RL in continuous environment to construct learningsystems for nonlinear optimal control This paper presentsGK-ADP with two-phase value iteration which combines theadvantages of kernel ACDs and GP-based value iteration

The theoretical analysis reveals that with proper learningrates two-phase iteration is good atmakingGaussian-kernel-based critic network converge to the structure with optimalhyperparameters and approximate all samplesrsquo values

A series of simulations are carried out to verify thenecessity of two-phase learning and illustrate properties ofGK-ADP Finally the numerical tests support the viewpointthat the assessment of samplesrsquo values provides the way to

Mathematical Problems in Engineering 13

minus04minus02

002

04

minus04minus0200204minus10

0

10

20

30

40

Pole angle (rad)

Q v

alue

s

Cart

displac

emen

t (m)

(a) Projecting on the dimensions of pole angle and cart displacement

minus04 minus03 minus02 minus01 0 01 02 03 04minus5

0

5

10

15

20

25

30

35

40

Pole angle (rad)

Q v

alue

s

(b) Projecting on the dimension of pole angle

Figure 11 The 119876 values wrt all added samples over 100 runs

refine sample set online in order to enhance the performanceof critic-actor architecture during operation

However there are some issues needed to be concernedin future The first is how to guarantee condition (36) duringlearning which is now determined by empirical ways Thesecond is the balance between exploration and exploitationwhich is always an opening question but seemsmore notablehere because bad exploration-exploitation principle will leadtwo-phase iteration to failure not only to slow convergence

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

Thiswork is supported by theNational Natural Science Foun-dation of China under Grants 61473316 and 61202340 andthe International Postdoctoral Exchange Fellowship Programunder Grant no 20140011

References

[1] R Sutton and A Barto Reinforcement Learning An Introduc-tion Adaptive Computation andMachine LearningMIT PressCambridge Mass USA 1998

[2] F L Lewis andD Vrabie ldquoReinforcement learning and adaptivedynamic programming for feedback controlrdquo IEEE Circuits andSystems Magazine vol 9 no 3 pp 32ndash50 2009

[3] F L Lewis D Vrabie and K Vamvoudakis ldquoReinforcementlearning and feedback control using natural decision methodsto design optimal adaptive controllersrdquo IEEE Control SystemsMagazine vol 32 no 6 pp 76ndash105 2012

[4] H Zhang Y Luo and D Liu ldquoNeural-network-based near-optimal control for a class of discrete-time affine nonlinearsystems with control constraintsrdquo IEEE Transactions on NeuralNetworks vol 20 no 9 pp 1490ndash1503 2009

[5] Y Huang and D Liu ldquoNeural-network-based optimal trackingcontrol scheme for a class of unknown discrete-time nonlinear

systems using iterative ADP algorithmrdquo Neurocomputing vol125 pp 46ndash56 2014

[6] X Xu L Zuo and Z Huang ldquoReinforcement learning algo-rithms with function approximation recent advances andapplicationsrdquo Information Sciences vol 261 pp 1ndash31 2014

[7] A Al-Tamimi F L Lewis and M Abu-Khalaf ldquoDiscrete-timenonlinear HJB solution using approximate dynamic program-ming convergence proofrdquo IEEE Transactions on Systems Manand Cybernetics Part B Cybernetics vol 38 no 4 pp 943ndash9492008

[8] S Ferrari J E Steck andRChandramohan ldquoAdaptive feedbackcontrol by constrained approximate dynamic programmingrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 38 no 4 pp 982ndash987 2008

[9] F-Y Wang H Zhang and D Liu ldquoAdaptive dynamic pro-gramming an introductionrdquo IEEE Computational IntelligenceMagazine vol 4 no 2 pp 39ndash47 2009

[10] D Wang D Liu Q Wei D Zhao and N Jin ldquoOptimal controlof unknown nonaffine nonlinear discrete-time systems basedon adaptive dynamic programmingrdquo Automatica vol 48 no 8pp 1825ndash1832 2012

[11] D Vrabie and F Lewis ldquoNeural network approach tocontinuous-time direct adaptive optimal control for partiallyunknown nonlinear systemsrdquo Neural Networks vol 22 no 3pp 237ndash246 2009

[12] D Liu D Wang and X Yang ldquoAn iterative adaptive dynamicprogramming algorithm for optimal control of unknowndiscrete-time nonlinear systemswith constrained inputsrdquo Infor-mation Sciences vol 220 pp 331ndash342 2013

[13] N Jin D Liu T Huang and Z Pang ldquoDiscrete-time adaptivedynamic programming using wavelet basis function neural net-worksrdquo in Proceedings of the IEEE Symposium on ApproximateDynamic Programming and Reinforcement Learning pp 135ndash142 Honolulu Hawaii USA April 2007

[14] P Koprinkova-Hristova M Oubbati and G Palm ldquoAdaptivecritic designwith echo state networkrdquo in Proceedings of the IEEEInternational Conference on SystemsMan andCybernetics (SMCrsquo10) pp 1010ndash1015 IEEE Istanbul Turkey October 2010

[15] X Xu Z Hou C Lian and H He ldquoOnline learning controlusing adaptive critic designs with sparse kernelmachinesrdquo IEEE

14 Mathematical Problems in Engineering

Transactions on Neural Networks and Learning Systems vol 24no 5 pp 762ndash775 2013

[16] V Vapnik Statistical Learning Theory Wiley New York NYUSA 1998

[17] C E Rasmussen and C K Williams Gaussian Processesfor Machine Learning Adaptive Computation and MachineLearning MIT Press Cambridge Mass USA 2006

[18] Y Engel S Mannor and R Meir ldquoThe kernel recursive least-squares algorithmrdquo IEEE Transactions on Signal Processing vol52 no 8 pp 2275ndash2285 2004

[19] T G Dietterich and X Wang ldquoBatch value function approxi-mation via support vectorsrdquo in Advances in Neural InformationProcessing Systems pp 1491ndash1498 MIT Press Cambridge MassUSA 2002

[20] X Wang X Tian Y Cheng and J Yi ldquoQ-learning systembased on cooperative least squares support vector machinerdquoActa Automatica Sinica vol 35 no 2 pp 214ndash219 2009

[21] T Hofmann B Scholkopf and A J Smola ldquoKernel methodsin machine learningrdquoThe Annals of Statistics vol 36 no 3 pp1171ndash1220 2008

[22] M F Huber ldquoRecursive Gaussian process on-line regressionand learningrdquo Pattern Recognition Letters vol 45 no 1 pp 85ndash91 2014

[23] Y Engel S Mannor and R Meir ldquoBayes meets bellman theGaussian process approach to temporal difference learningrdquoin Proceedings of the 20th International Conference on MachineLearning pp 154ndash161 August 2003

[24] Y Engel S Mannor and R Meir ldquoReinforcement learning withGaussian processesrdquo in Proceedings of the 22nd InternationalConference on Machine Learning pp 201ndash208 ACM August2005

[25] C E Rasmussen andM Kuss ldquoGaussian processes in reinforce-ment learningrdquo in Advances in Neural Information ProcessingSystems S Thrun L K Saul and B Scholkopf Eds pp 751ndash759 MIT Press Cambridge Mass USA 2004

[26] M P Deisenroth C E Rasmussen and J Peters ldquoGaussianprocess dynamic programmingrdquo Neurocomputing vol 72 no7ndash9 pp 1508ndash1524 2009

[27] X Xu C Lian L Zuo and H He ldquoKernel-based approximatedynamic programming for real-time online learning controlan experimental studyrdquo IEEE Transactions on Control SystemsTechnology vol 22 no 1 pp 146ndash156 2014

[28] A Girard C E Rasmussen and R Murray-Smith ldquoGaussianprocess priors with uncertain inputs-multiple-step-ahead pre-dictionrdquo Tech Rep 2002

[29] X Xu T Xie D Hu and X Lu ldquoKernel least-squares temporaldifference learningrdquo International Journal of Information Tech-nology vol 11 no 9 pp 54ndash63 2005

[30] H J Kushner and G G Yin Stochastic Approximation andRecursive Algorithms and Applications vol 35 of Applications ofMathematics Springer Berlin Germany 2nd edition 2003

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 12: New Research Article Two-Phase Iteration for Value Function …downloads.hindawi.com/journals/mpe/2015/760459.pdf · 2019. 7. 31. · Research Article Two-Phase Iteration for Value

12 Mathematical Problems in Engineering

119891119900119903 l = 1 119905119900 119905ℎ119890 119890119899119889 119900119891 119879(119909)

Calculate 119864[119876(119909(119897)) | 119883119871] and var[119876(119909119897) | 119883

119871] using (6) and (7)

Calculate 119880(119909(119897)) using (40)119890119899119889 119891119900119903 119897

Sort all 119880(119909(119897)) 119897 = 1 119879119890in ascending order

Add the first119873add state-action pairs to sample set to get extended set119883119871+119873add

119894119891 L +119873add gt 119873limit

Calculate hyperparameters based on 119883119871+119873add

119891119900119903 l = 1 119905119900 119871 +119873add

Calculate 119864[119876(119909(119897)) | 119883119871+119873add

] and var[119876(119909119897) | 119883119871+119873add

] using (6) and (7)Calculate 119880(119909(119897)) using (40)

119890119899119889 119891119900119903 119897

Sort all 119880(119909(119897)) 119897 = 1 119871 + 119873add in descending orderDelete the first 119871 +119873add minus 119873limit samples to get refined set 119883

119873limit

119890119899119889 119894119891

Algorithm 2 Refinement of sample set

0 2000 4000 6000 8000 10000 12000 14000 16000 18000100

200

300

400

500

600

700

800

Iteration

The s

um o

f Q v

alue

s of a

ll sa

mpl

e sta

te-a

ctio

ns

(a) The average evolution process of 119876 values before and after addingsamples

0 20 40 60 80 100minus15

minus1

minus05

0

05

1

Run

Enha

ncem

ent (

)

(b) The enhancement of the control performance about cart displacementafter adding samples

Figure 10 Learning performance of GK-ADP if sample set is refined by adding 10 samples

as Figure 10(b) shows almost all cart displacements as thepole is swung up are enhanced more or less except for 8times of failure Hence with proper principle based on thesample values the sample set can be refined online in orderto enhance performance

Finally let us check which state-action pairs are addedinto sample set We put all added samples over 100 runstogether and depict their values in Figure 11(a) where thevalues are projected onto the dimensions of pole angleand cart displacement If only concerning the relationshipbetween 119876 values and pole angle just as Figure 11(b) showswe find that the refinement principle intends to select thesamples a little away from the equilibrium and near theboundary minus04 due to the ratio between 120588 and 120573

5 Conclusions and Future Work

ADPmethods are among the most promising research workson RL in continuous environment to construct learningsystems for nonlinear optimal control This paper presentsGK-ADP with two-phase value iteration which combines theadvantages of kernel ACDs and GP-based value iteration

The theoretical analysis reveals that with proper learningrates two-phase iteration is good atmakingGaussian-kernel-based critic network converge to the structure with optimalhyperparameters and approximate all samplesrsquo values

A series of simulations are carried out to verify thenecessity of two-phase learning and illustrate properties ofGK-ADP Finally the numerical tests support the viewpointthat the assessment of samplesrsquo values provides the way to

Mathematical Problems in Engineering 13

minus04minus02

002

04

minus04minus0200204minus10

0

10

20

30

40

Pole angle (rad)

Q v

alue

s

Cart

displac

emen

t (m)

(a) Projecting on the dimensions of pole angle and cart displacement

minus04 minus03 minus02 minus01 0 01 02 03 04minus5

0

5

10

15

20

25

30

35

40

Pole angle (rad)

Q v

alue

s

(b) Projecting on the dimension of pole angle

Figure 11 The 119876 values wrt all added samples over 100 runs

refine sample set online in order to enhance the performanceof critic-actor architecture during operation

However there are some issues needed to be concernedin future The first is how to guarantee condition (36) duringlearning which is now determined by empirical ways Thesecond is the balance between exploration and exploitationwhich is always an opening question but seemsmore notablehere because bad exploration-exploitation principle will leadtwo-phase iteration to failure not only to slow convergence

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

Thiswork is supported by theNational Natural Science Foun-dation of China under Grants 61473316 and 61202340 andthe International Postdoctoral Exchange Fellowship Programunder Grant no 20140011

