
Feature Selection for Neuro-Dynamic Programming∗

Dayu Huang Wei Chen Prashant Mehta Sean Meyn Amit Surana†

October 4, 2011

Abstract

Neuro-Dynamic Programming encompasses techniques from both reinforcement learning and approximate dynamic programming. Feature selection refers to the choice of basis that defines the function class that is required in the application of these techniques.

This chapter reviews two popular approaches to neuro-dynamic programming, TD-learning and Q-learning. The main goal of the chapter is to demonstrate how insight from idealized models can be used as a guide for feature selection for these algorithms. Several approaches are surveyed, including fluid and diffusion models, and the application of idealized models arising from mean-field game approximations. The theory is illustrated with several examples.

Keywords: Optimal Control, Stochastic Control, Approximate Dynamic Programming, Reinforcement Learning

2000 AMS Subject Classification: 49L20, 93E20, 93E35, 60J10

1 Introduction

If you have taken a course that mentioned a Riccati equation, then you have already been exposed to approximate dynamic programming: No physical system is linear! The model is assumed to be linear, and the cost function is assumed quadratic, so that a closed-form expression for the optimal feedback law can be computed.

In many cases, a linear approximation is not easily justified. In particular, in applications from operations research or chemical engineering, the state space is usually constrained. To obtain an effective feedback law for control, we must find alternative approaches to approximation.

Techniques from approximate dynamic programming and reinforcement learning are designed to obtain such approximations [1, 18]. In particular, in TD-learning and related approaches, there is no attempt to approximate the system or cost function. Instead, the solution to a dynamic programming equation is approximated directly, within a prescribed finite-dimensional function class. A key determinant of the success of these techniques is the selection of this function class, also known as the basis.

∗Acknowledgements: Financial support from UTRC, AFOSR grant FA9550-09-1-0190, ITMANET DARPA RK 2006-07284, and NSF grant CPS 0931416 is gratefully acknowledged. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
†Amit Surana is with UTRC. The remaining authors are with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. Email: [email protected], [email protected], [email protected], [email protected] and [email protected]

The goal of this chapter is to describe techniques to construct a basis for approximation. All of these techniques return to approximate modeling: We first construct a highly idealized approximate model, and then compute or approximate the solution to a dynamic programming equation for this idealization. These approximations lead naturally to a basis for approximation in the original stochastic system.

Several specific approaches are surveyed. The simplest and most easily applied is through consideration of a fluid model in which variability is disregarded. Diffusion models, and models obtained from the solution of mean-field games, are also considered. The main message is that ADP and RL techniques provide a general approach to applying insight gained from naive models of real-life systems.

The remainder of the chapter is organized as follows. Section 2 contains a brief review of optimal control theory for both stochastic and deterministic models. Here we demonstrate the similarity between the HJB equation seen in a first-year graduate control course, and the average-cost equations appearing in average-cost optimal control for both finite-state MDPs and controlled diffusions. We argue that the continuous-time dynamic programming equations are frequently more tractable than their discrete-time counterparts. Using Taylor-series approximations, we also explain why the solution to the dynamic programming equations for a deterministic or diffusion model might provide a good approximation for the MDP model. Section 3 contains a brief survey of techniques from Neuro-Dynamic Programming; the term is used here to cover both reinforcement learning and approximate dynamic programming.

TD-learning, Q-learning, SARSA, and approximate linear-programming techniques are all designed to obtain an approximate solution to a dynamic programming equation, within a prescribed finite-dimensional function class. The question of how to select an appropriate function class is discussed based on the idealized models introduced in the first part of the chapter. The application of “fluid models” (or ODE models) is explored in greater depth in Section 4, followed by diffusion approximations in Section 5. There are many other approaches for approximation. We briefly discuss the value of mean-field games in Section 6. Conclusions and directions for future work are contained in Section 7.

2 Optimality equations

Here we briefly review optimality equations for dynamic control systems; we consider both stochastic and deterministic models. The state space is denoted X, and the space of inputs is denoted U. It is assumed that a cost function c : X × U → R+ is given, and the goal is to minimize either average or discounted cost.

Throughout the chapter we assume full state measurements are available. The input at time t is assumed to be a function of present and past state values.

To unify the presentation, we introduce a generator for each class of models. This is standard practice in the theory of control and performance evaluation for diffusions, but less common in other settings.

We begin with the deterministic model in continuous time, and recall the HJB equations for the associated optimal control problems. The total cost criterion (3) is highlighted because this is what is typically seen in control courses. Its most natural cousin in the stochastic setting is average-cost optimal control.


Theory to explain the solidarity between solutions of the various optimality equations is contained in [12, 8, 15], and the references therein.

Deterministic model We have a state process x evolving on X = R^ℓ. Given the initial condition x(0), and the input u evolving on U = R^{ℓu}, the output is assumed to be uniquely specified by the ODE,

    ẋ = f(x, u)    (1)

This is the deterministic, non-linear state space model.

The controlled generator D^F maps functions to functions: If h : X → R is differentiable, then the generator defines a new function on X × U,

    D^F_u h (x) = (d/dt) h(x(t)),   evaluated at t = 0, with x(0) = x and u(0) = u.

By the chain rule of calculus, we have

    D^F_u h (x) = f(x, u) · ∇h (x)    (2)

The generator appears in dynamic programming equations for the deterministic model. Consider the total-cost criterion,

    J∗(x) = inf_U ∫₀^∞ c(x(t), u(t)) dt ,   x(0) = x    (3)

Under general conditions, the Hamilton-Jacobi-Bellman (HJB) equation holds,

    min_u { c(x, u) + D^F_u J∗ (x) } = 0    (4)

and an optimal policy is defined as a minimizer: u∗(t) is a minimizer of (4) when x∗(t) = x.

Diffusion model The state space and action space remain the same, but we now introduce noise,

    dX = f(X, U) dt + b(X, U) dN    (5)

where N is an ℓn-dimensional Brownian motion, and b(x, u) is an ℓ × ℓn matrix for each x and u. We use upper case letters to emphasize that (X, U, N) are stochastic processes.

The controlled generator is again defined as a derivative,

    D^D_u h (x) = (d/dt) E[h(X(t))],   evaluated at t = 0, with X(0) = x and U(0) = u,    (6)

whenever the derivative exists; this will require smoothness and growth conditions on h [2]. Under these conditions, the generator can be expressed as a second-order differential operator,

    D^D_u h (x) = f(x, u) · ∇h (x) + ½ trace( b(x, u)ᵀ ∇²h (x) b(x, u) )    (7)

A common control criterion is average cost. Its minimum over all policies is denoted,

    η∗ = inf_U { lim sup_{T→∞} (1/T) ∫₀^T c(X(t), U(t)) dt }    (8)

To appear in RL and ADP for Feedback Control, IEEE Press Computational Intelligence Series, 2012

Feature Selection for NDP Pre-publication copy October 4, 2011 3

Subject to conditions on the model and the cost function, the average cost η∗ is a deterministic quantity, independent of the initial condition [3]. Under further assumptions, an optimal policy achieving this minimal average cost can be obtained by solving the Average Cost Optimality Equation (ACOE):

    min_u { c(x, u) + D^D_u h∗ (x) } = η∗    (9)

The minimizer again defines an optimal policy. Observe the close similarity between (9) and the HJB equation (4).

The ACOE is a fixed point equation in the relative value function h∗ and the optimal cost η∗. Its solution is not unique, because we can always add a constant to h∗ to obtain another solution. Fix any point x• ∈ X, and normalize h∗ so that h∗(x•) = η∗. Then the ACOE (9) becomes a fixed point equation in h∗ alone:

    min_u { c(x, u) + D^D_u h∗ (x) } = h∗(x•)    (10)

Models in discrete time Controlled models in discrete time will be expressed in the recursive form,

    X(t+1) − X(t) = f(X(t), U(t), N(t+1))    (11)

where N is i.i.d.. This is known as a Markov Decision Process (MDP).

The controlled generator is defined as in the previous cases, except that now we consider differences rather than derivatives,

    D_u h (x) = E[h(X(t+1)) − h(X(t)) | X(t) = x, U(t) = u]
              = E[h(x + f(x, u, N))] − h(x)    (12)

The minimal average cost is defined as in the diffusion model (8), but as a discrete-time average. The ACOE is expressed as in the case of diffusions,

    min_u { c(x, u) + D_u h∗ (x) } = η∗    (13)

and h∗ is again called the relative value function.

In this chapter, most of our attention is focused on the MDP model (11). A running example is the controlled queue,

    X(t+1) = X(t) − U(t) + A(t+1)    (14)

in which X, U, and A are non-negative scalar-valued stochastic processes. The state X(t) ∈ X = R+ is interpreted as the queue length, U(t) ∈ R+ the quantity of service, and A(t) the number of new “customers” arriving at time t. It is assumed that A is i.i.d., so that (14) defines an MDP model. When U(t) is constrained to the binary set {0, 1}, this is a special case of the controlled random walk (CRW) queue [15].
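To make the running example concrete, here is a minimal simulation sketch of the CRW queue (14), assuming Bernoulli arrivals and the non-idling policy U(t) = I{X(t) ≥ 1}; the arrival law and policy are illustrative choices, not fixed by the chapter.

```python
import numpy as np

def simulate_crw_queue(alpha=0.4, T=100_000, seed=0):
    """Simulate the CRW queue X(t+1) = X(t) - U(t) + A(t+1)
    under the non-idling policy U(t) = 1{X(t) >= 1}.
    Arrivals A(t) are Bernoulli(alpha) here (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    X = np.zeros(T + 1, dtype=int)
    for t in range(T):
        U = 1 if X[t] >= 1 else 0          # non-idling policy
        A = rng.binomial(1, alpha)         # arrival at time t+1
        X[t + 1] = X[t] - U + A
    return X

X = simulate_crw_queue()
print("empirical average queue length:", X.mean())
```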


Approximations The various models lead to very similar dynamic programming equations. The models in continuous time are frequently attractive because we can apply calculus for computation or approximation.

One way to approximate the MDP model with a fluid or diffusion model is through Taylor series. The mean drift for (11) is defined by,

    f(x, u) = E[X(t+1) − X(t) | X(t) = x, U(t) = u]    (15)

and the associated fluid model,

    (d/dt) x(t) = f(x(t), u(t)),   x(0) ∈ X.    (16)

A first-order Taylor series approximation can be used to obtain the following approximation, subject to conditions on h,

    D_u h (x) = E[h(X(t+1)) − h(X(t)) | X(t) = x, U(t) = u] ≈ f(x, u) · ∇h (x)

The right hand side defines the generator for the fluid model (16).

The diffusion model is a refinement of the fluid model that takes volatility into account. It is motivated similarly, using a second-order Taylor series expansion.

To bring randomness into the fluid model, it is useful to first introduce a fluid model in discrete time. For any t ≥ 0, the random variable ∆(t+1) = f(X(t), U(t), W(t+1)) − f(X(t), U(t)) has zero mean. The evolution of X can be expressed as a discrete-time nonlinear system, plus “white noise”,

    X(t+1) = X(t) + f(X(t), U(t)) + ∆(t+1),   t ≥ 0.    (17)

The fluid model in discrete time is obtained by ignoring the noise,

    x(t+1) = x(t) + f(x(t), u(t))    (18)

This is the discrete-time counterpart of (16). Observe that it can be justified exactly as in the continuous-time analysis: If h is a smooth function of x, then we may approximate the generator using a first-order Taylor series approximation as follows:

    D_u h (x) := E[h(X(t+1)) − h(X(t)) | X(t) = x, U(t) = u]
               ≈ h(x + f(x, u)) − h(x) + E[∇h(x + f(x, u)) · ∆(1) | X(0) = x, U(0) = u]
               = h(x + f(x, u)) − h(x)    (19)

The right hand side is the generator applied to h for the discrete-time fluid model (18). Consideration of the model (18) may give a better approximation for value functions in some cases, but we lose the simplicity of the differential equations that characterize value functions in the continuous-time model (16).
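As a quick numeric illustration of the first-order approximation (19), the sketch below compares the exact discrete-time generator D_u h with the fluid-generator approximation h(x + f(x, u)) − h(x) for the CRW queue; the test function h(x) = x² and the Bernoulli arrival law are illustrative assumptions.

```python
import numpy as np

alpha = 0.4                      # mean arrival rate (illustrative)
h = lambda x: x ** 2             # a smooth test function (illustrative)

def D_exact(x, u):
    """Exact generator D_u h(x) = E[h(x - u + A)] - h(x), Bernoulli(alpha) arrivals."""
    return (1 - alpha) * h(x - u) + alpha * h(x - u + 1) - h(x)

def D_fluid(x, u):
    """Fluid-generator approximation h(x + f(x,u)) - h(x), with mean drift f(x,u) = -u + alpha."""
    return h(x - u + alpha) - h(x)

for x in [1, 5, 20]:
    print(x, D_exact(x, 1), D_fluid(x, 1))   # the two values differ by roughly 0.5*Var(A)*h''(x)
```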


The representation (17) also motivates the diffusion model. Denote the conditional covariance matrix by

    Σ_f(x, u) = E[∆(t+1) ∆(t+1)ᵀ | X(t) = x, U(t) = u] ,

and let b denote an ℓ × ℓ “square root”, b(x, u) b(x, u)ᵀ = Σ_f(x, u) for each x, u. The diffusion model is of the form (5),

    dX(t) = f(X(t), U(t)) dt + b(X(t), U(t)) dN(t)    (20)

where the process N is a standard Brownian motion on R^ℓ.

To justify its form we consider a second-order Taylor series approximation. If h : X → R is a C² function, and at time t we have X(t) = x, U(t) = u, then the standard second-order Taylor series about x⁺ = x + f(x, u) gives,

    h(X(t+1)) − h(X(t)) ≈ ∇h (x⁺) · f(x, u) + ½ ∆(t+1)ᵀ ∇²h (x⁺) ∆(t+1)

Suppose we can justify a further approximation, obtained by replacing x⁺ with x in this equation. Then, on taking expectations of each side, conditioned on X(t) = x, U(t) = u, we approximate the generator for X by the second-order ordinary differential operator,

    D^D_u h (x) := ∇h (x) · f(x, u) + ½ trace( Σ_f(x, u) ∇²h (x) ).    (21)

This is precisely the differential generator for (20) (see (7)).

For the controlled queue (14), the associated fluid and diffusion models take similar forms:

    q̇(t) = −u(t) + α    Fluid model    (22)

    dQ(t) = (−U(t) + α) dt + dI(t) + σ_A dN(t)    Diffusion model    (23)

where α denotes the mean of A(t), and σ²_A its variance.

We relax integer constraints in these models: x and X evolve on R+, and are assumed to be continuous functions of time. Consequently, we must also relax integer constraints on the inputs. If U(t) is constrained to {0, 1} in (14), then u(t) and U(t) are constrained to the closed interval [0, 1] in (22) and (23).

The diffusion model (23) is expressed in the more abstract form to emphasize that this is a stochastic differential equation. The idleness process I is interpreted as a projection of Q to R+, so that,

    ∫₀^∞ I{Q(t) > 0} dI(t) = 0    (24)

Moreover, I is a non-decreasing function of t, with I(0) = 0. If in addition we have U(t) ≡ 1, then Q is called a Reflected Brownian Motion (RBM). On an interval of time for which dI(t) = 0, this corresponds to draining the queue at maximal rate.

In applications found in operations research, it is usually taken for granted that c(x, u) ≡ x (or Σ xᵢ in network models). Theorem 4.1 combined with (61) demonstrates that the solutions to the dynamic programming equations for the CRW queue, and for the corresponding diffusion and fluid models, are each quadratic, and very similar in form. Similar results hold for network models under general conditions [15].


3 Neuro-dynamic Algorithms

The algorithms considered in this chapter are designed to construct approximations to dynamic programming equations. There are several approaches, based on a few different approximation criteria. We begin with some further discussion on the MDP model in discrete time.

MDP Model The controlled stochastic model (11) is the general MDP model considered in this chapter. The controlled transition law is denoted,

    P_u(x, A) := P{x + f(x, u, W(1)) ∈ A},   A ∈ B(X).

Hence the controlled generator defined in (12) is simply P_u − I. The state space X will be taken to be R^ℓ, or a subset. The set of inputs (or action space) U is taken to be a subset of R^{ℓu}. The constraints are sometimes state dependent: We let U(x) denote the set of feasible inputs u for U(t) when X(t) takes the value x ∈ X.

A cost function c : X × U → R+ is given, and our goal is to find an optimal control based on this cost function. We focus on the average cost problem, with associated Average Cost Optimality Equation (ACOE) (13).

Without loss of generality, we restrict to inputs that are defined by a stationary policy. This means that the input is defined by state feedback, U(t) = φ(X(t)), t ≥ 0, where φ : X → U. Sometimes we allow randomization,

    U(t) = φ(X(t), Z(t)),   t ≥ 0,    (25)

where Z is i.i.d., and independent of N in (11). This amounts to a deterministic stationary policy for the extended state process (X, Z). If X is controlled using a stationary policy φ, then it evolves as a time-homogeneous Markov chain, with transition law denoted P_φ. For a deterministic policy, P_φ(x, dy) = P_{φ(x)}(x, dy) is the resulting transition law for the chain. If the policy is randomized, then P_φ(x, dy) = ∫ P_{φ(x,z)}(x, dy) µ_Z(dz), where µ_Z is the marginal distribution of Z. We let c_φ denote the resulting cost function on X alone: c_φ(x) = c(x, φ(x)) for deterministic policies, and c_φ(x) = ∫ c(x, φ(x, z)) µ_Z(dz) for the general randomized stationary policy. The average cost is denoted η_φ.

Key to the theory of average-cost optimal control is Poisson’s equation, which is a degenerate version of the ACOE: If X is a Markov chain with transition kernel P, and generator D = P − I, then Poisson’s equation is the fixed point equation,

    c + Dh = η    (26)

where η is the steady-state mean of c. To make sense of “steady state”, and to ensure the existence of a solution, it is necessary to impose both irreducibility conditions and stability conditions on X [17, 15].

Two common approaches to computation of a solution to the ACOE are policy iteration and value iteration. The first approach is usually avoided because of higher computational cost, but it is in fact the most convenient approach for integration with the learning techniques described in this section.


In the Policy Iteration Algorithm (PIA), a sequence of deterministic stationary policies is obtained, with increasingly improved performance, in the sense that the corresponding average costs are decreasing. The algorithm is initialized with a policy φ^0, and then the following operations are performed in the kth stage of the algorithm:

• Given the policy φ^k, find the solution h^k to Poisson’s equation P_{φ^k} h^k = h^k − c^k + η^k, where c^k(x) = c(x, φ^k(x)), and η^k is the average cost.

• Update the policy via

    φ^{k+1}(x) ∈ arg min_u { c(x, u) + P_u h^k (x) } ,   x ∈ X.    (27)

Complexity in the PIA may come from one or two sources:

(i) The solution to Poisson’s equation can be cast as a matrix inversion problem, whose dimension coincides with the number of elements in X.

(ii) The update step (27) may require considerable computation if U or X is large, especially if the minimization cannot be expressed as a convex program.

TD-learning is designed to address (i), and both SARSA and Q-learning are methods designed to address (ii).
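To make the PIA steps concrete, here is a minimal sketch for a finite MDP, assuming the transition matrices P_u and the cost c(x, u) are available as arrays; Poisson's equation is solved by the matrix inversion mentioned in (i), with the normalization h(0) = 0. The setup is illustrative, not the chapter's implementation.

```python
import numpy as np

def policy_iteration(P, c, n_iter=50):
    """Average-cost policy iteration for a finite MDP.
    P[u] is the |X| x |X| transition matrix for action u; c[x, u] is the cost."""
    nx, nu = c.shape
    phi = np.zeros(nx, dtype=int)                  # initial policy
    for _ in range(n_iter):
        # Policy evaluation: solve Poisson's equation P_phi h = h - c_phi + eta,
        # with the normalization h(0) = 0 (assumes the policy-induced chain is unichain).
        P_phi = np.array([P[phi[x]][x] for x in range(nx)])
        c_phi = np.array([c[x, phi[x]] for x in range(nx)])
        A = np.eye(nx) - P_phi
        A[:, 0] = 1.0                              # column of ones: eta replaces h(0) as unknown
        sol = np.linalg.solve(A, c_phi)
        eta, h = sol[0], sol.copy()
        h[0] = 0.0
        # Policy improvement (27)
        Q = c + np.stack([P[u] @ h for u in range(nu)], axis=1)
        new_phi = Q.argmin(axis=1)
        if np.array_equal(new_phi, phi):
            break
        phi = new_phi
    return phi, h, eta
```

For the CRW queue truncated to {0, ..., N} with U = {0, 1}, the matrices P[u] can be filled in directly from (14).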

TD-Learning TD-learning is a technique for approximating value functions of MDPs within a parameterized class. Letting h denote a value function, and {h^r} a family of approximations, the goal is to find r∗ that solves,

    min_r ‖h − h^r‖²    (28)

If {h^r} is a linearly parameterized family of functions, and the norm is defined by an “inner product”, then this is a least-squares problem. It is simplest to begin with an abstract least-squares problem that mirrors TD-learning, following [19, 20, 15].

Least Squares Let 𝒳 denote a vector space of functions, let R : 𝒳 → 𝒳, and for c ∈ 𝒳 given, denote h = Rc. We are also given a d-dimensional subspace L ⊂ 𝒳, with basis {ψ_i : 1 ≤ i ≤ d} ⊂ 𝒳. We let h^r = Σ r_i ψ_i denote a typical element of L, and seek the best r to approximate h.

For approximation, we assume that there is an inner product on 𝒳 that defines a norm. Our goal then is to solve the quadratic optimization problem,

    min_r ‖h^r − h‖² = min_r 〈h^r − h, h^r − h〉    (29)

That is, we seek the projection of h onto L. This is a least-squares problem that has an explicit solution: Let M denote the d × d matrix with entries M_{i,j} = 〈ψ_i, ψ_j〉, and b the vector in R^d defined by b_i = 〈ψ_i, h〉. If M is invertible, so that L is d-dimensional, then the unique minimizer in (29) is

    r∗ = M⁻¹ b    (30)
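A minimal numpy sketch of the projection (29)-(30), assuming a finite state space so that functions are vectors and the inner product is weighted by a probability vector π; all names here are illustrative.

```python
import numpy as np

def project(h, psi, pi):
    """Weighted least-squares projection of h onto span{psi_1, ..., psi_d}.
    h: (n,) target function; psi: (n, d) basis, one column per basis function;
    pi: (n,) probability weights defining <f, g> = sum_x pi[x] f[x] g[x]."""
    W = np.diag(pi)
    M = psi.T @ W @ psi          # M_{ij} = <psi_i, psi_j>
    b = psi.T @ W @ h            # b_i    = <psi_i, h>
    r_star = np.linalg.solve(M, b)
    return r_star, psi @ r_star  # optimal coefficients and fitted function
```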


The problem with this representation is that b depends on h = Rc, the function we are attempting to approximate. This issue is solved by introducing the adjoint R†, defined so that

    b_i = 〈ψ_i, Rc〉 = 〈R†ψ_i, c〉    (31)

The adjoint has an elegant form in the specific MDP setting, provided the norm is chosen consistently with the Markov model.

Markovian Setting The simulation algorithm used to compute the optimal parameter r∗ given in (30) is called Least-Squares TD Learning (LSTD). To obtain the adjoint which leads to the LSTD algorithm, we must first define the vector space 𝒳, and we must then define R in this setting.

Let P denote the transition kernel for the Markov chain, and assume that there is an invariant probability distribution π. In the context of this chapter, P will be obtained by fixing a stationary policy φ, so that P = P_φ. We let 𝒳 = L₂(π), the usual Hilbert space of square-integrable functions, with inner product 〈f, g〉 = E[f(X(0)) g(X(0))], f, g ∈ L₂(π), where the expectation is in steady state, so that X(0) ∼ π. It is assumed that a cost function c ∈ L₂(π) is given.

For the class of Markov chains considered in this chapter, and more generally in [17], it is known that the solution to Poisson’s equation can be expressed as h = Rc, where R is the fundamental kernel. Under the assumptions we have imposed, the function h exists (as an a.e. finite function on X). Stronger conditions are required to guarantee that h ∈ L₂(π).

Suppose that there is a fixed state ϑ ∈ X satisfying π(ϑ) > 0. Let τ_ϑ denote the first return time to this state, τ_ϑ = min{t ≥ 1 : X(t) = ϑ}. In this case, the fundamental kernel can be expressed as the total relative cost until reaching ϑ:

    Rc (x) = E[ Σ_{t=0}^{τ_ϑ − 1} [c(X(t)) − η] | X(0) = x ]

The existence of a so-called ‘atom’ ϑ is not at all necessary, but this assumption saves us from introducing additional technical machinery [17].

Given this representation, it is not difficult to arrive at a representation for the adjoint of R: For any f ∈ L₂(π) with mean η we have,

    R†f (x) = E[ Σ_{τ⁻_ϑ < t ≤ 0} [f(X(t)) − η] | X(0) = x ]    (32)

provided the expectation exists and is finite-valued.
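The regenerative representation of Rc suggests a simple Monte Carlo estimator: simulate the chain from x until the first return to ϑ and average the accumulated relative cost. The sketch below does this for the CRW queue under the non-idling policy, with ϑ = 0 and Bernoulli arrivals; these modelling choices are illustrative, and η must be supplied (for example from Theorem 4.1, or estimated from a long run).

```python
import numpy as np

def estimate_Rc(x0, eta, alpha=0.4, n_paths=20_000, seed=1, theta=0):
    """Monte Carlo estimate of Rc(x0) = E[ sum_{t=0}^{tau_theta - 1} (c(X(t)) - eta) | X(0)=x0 ]
    for the CRW queue with c(x) = x, non-idling policy, Bernoulli(alpha) arrivals."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_paths):
        x, s = x0, 0.0
        while True:
            s += x - eta                    # accumulate c(X(t)) - eta
            x = x - (1 if x >= 1 else 0) + rng.binomial(1, alpha)
            if x == theta:                  # first return to the atom
                break
        total += s
    return total / n_paths
```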

LSTD Algorithm Let ψ : X → R^d denote the vector of basis functions, and h^r = rᵀψ, exactly as in the abstract least-squares problem. LSTD takes the mean-square error criterion,

    ‖h − h^r‖² = E_π[(h(X(0)) − h^r(X(0)))²] := ∫ (h(x) − h^r(x))² π(dx).


Hence the optimal parameter satisfies the fixed point equation,

    E_π[(h(X(0)) − h^r(X(0))) ψ_i(X(0))] = 0 ,   1 ≤ i ≤ d.    (33)

This is the least-squares solution described previously: In this Markovian setting, the optimal parameter (30) can be expressed as follows,

    r∗ = M⁻¹ b   with   M = E_π[ψ(X(0)) ψ(X(0))ᵀ] ,   b = E_π[ψ(X(0)) h(X(0))].    (34)

LSTD Algorithm The algorithm is defined by the following steps, which recursively estimate M, b, and the steady-state mean of c. For each t ≥ 0,

(i) The steady-state average cost η_φ is estimated via,

    η_φ(t+1) = η_φ(t) + (1/(t+1)) ( c(X(t+1)) − η_φ(t) )

(ii) The eligibility vector is updated as follows,

    ψ_ϑ(t+1) = I{X(t) ≠ ϑ} ψ_ϑ(t) + ψ(X(t+1)) ,   with ψ_ϑ(0) = ψ(X(0))

(iii) Estimates of the matrix M and vector b are updated as follows,

    M(t+1) = M(t) + (1/(t+1)) ( ψ(X(t)) ψᵀ(X(t)) − M(t) )    (35)

    b(t+1) = b(t) + (1/(t+1)) ( [c(X(t)) − η_φ(t)] ψ_ϑ(t) − b(t) )    (36)

(iv) The estimate of r∗ is r_t = M_t⁻¹ b_t.

Observe that several of these steps are simply formulations of standard Monte Carlo. In particular, M_t is a running average of ψ(X(t)) ψᵀ(X(t)), and b_t approximates the running average of [c(X(t)) − η] ψ_ϑ(t). The algorithm is consistent, r_t → r∗ as t → ∞, provided ψ and h are square integrable, M is invertible, and X is stationary. The stationarity assumption can be relaxed under further assumptions.
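A minimal sketch of the LSTD recursion above, written for the CRW queue under the non-idling policy with Bernoulli arrivals; the chain, the basis {x, x²}, and the atom ϑ = 0 are illustrative choices, not prescribed by the chapter.

```python
import numpy as np

def lstd(sample_next, psi, c, x0, theta=0, T=200_000, seed=2):
    """LSTD estimate of r* for a fixed policy.
    sample_next(x, rng) draws X(t+1) given X(t)=x; psi(x) returns the basis vector;
    c(x) is the (policy-induced) cost; theta is the regeneration state."""
    rng = np.random.default_rng(seed)
    d = len(psi(x0))
    eta, M, b = 0.0, np.zeros((d, d)), np.zeros(d)
    x, elig = x0, psi(x0)                       # eligibility vector, psi_theta(0) = psi(X(0))
    for t in range(T):
        M += (np.outer(psi(x), psi(x)) - M) / (t + 1)
        b += ((c(x) - eta) * elig - b) / (t + 1)
        x_next = sample_next(x, rng)
        eta += (c(x_next) - eta) / (t + 1)      # step (i)
        elig = (elig if x != theta else 0.0) + psi(x_next)   # step (ii)
        x = x_next
    return np.linalg.solve(M, b)

# Example: CRW queue, non-idling policy, Bernoulli(alpha) arrivals, basis {x, x^2}
alpha = 0.4
step = lambda x, rng: x - (1 if x >= 1 else 0) + rng.binomial(1, alpha)
r = lstd(step, psi=lambda x: np.array([x, x ** 2]), c=lambda x: float(x), x0=0)
print("LSTD coefficients:", r)
```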

Consideration of the algorithm in steady state explains the introduction of the eligibility vectors. Suppose that {X(t), ψ_ϑ(t) : −∞ < t < ∞} is a stationary realization of the coupled Markov model and LSTD algorithm, in which η(t) ≡ η is the mean of c under π. By the law of large numbers for stationary processes we obtain,

    lim_{t→∞} b(t) = lim_{T→∞} (1/T) Σ_{t=1}^{T} [ψ_ϑ(t)(c(X(t)) − η)] = E[ψ_ϑ(0)(c(X(0)) − η)]

By the recursion that defines the eligibility vectors, the right hand side can be expressed

    E[ψ_ϑ(0)(c(X(0)) − η)] = E[ Σ_{τ⁻_ϑ < t ≤ 0} ψ(X(t)) [c(X(0)) − η] ]

which is precisely the representation of b given in (31), (32).


SARSA Recall that to apply policy iteration, a potential difficulty is the policy update formula (27): The computation of P_u h itself may be difficult if the state space is large.

Consider the following approach to avoid integration: Let H denote the function of two variables,

    H(x, u) = c(x, u) + P_u h (x)    (37)

If we can estimate H directly, then the policy update can be obtained by minimizing H(x, u) over u, for each state x.

To place the analysis within the prior setting, assume that a policy φ has been specified, and that h is the solution to Poisson’s equation for P_φ, with cost c_φ. Then H and h satisfy the following equation,

    H_φ(x) := H(x, φ(x)) = c_φ(x) + P_φ h (x) = h(x) + η    (38)

Substituting this back into the definition of H (replacing h by H_φ − η), we obtain the fixed point equation,

    η + H = c + P_u H_φ    (39)

We conclude that H solves Poisson’s equation for the state-control process:

Proposition 3.1. Suppose that X is controlled using a randomized stationary policy φ, and that h is the solution to Poisson’s equation using c_φ. Then,

(i) The state-control process Φ(t) = (X(t), U(t)) is also Markov.

(ii) The function H is a solution to Poisson’s equation for the Markov chain Φ(t), and cost function c(x, u).

Proof. Part (i) is obvious: The process Φ evolves according to a controlled stochastic model of the recursive form (11).

To see (ii) we reinterpret (39):

    η + H(x, u) = c(x, u) + E[H(Φ(t+1)) | Φ(t) = (x, u)]

which is indeed Poisson’s equation for Φ. □

A natural parameterization is of the form H^r = c + rᵀψ, where ψ : X × U → R^d. Given a basis {ψ⁰_i : 1 ≤ i ≤ d} intended for application in TD-learning, we might choose ψ_i(x, u) = P_u ψ⁰_i (x), x ∈ X, u ∈ U. If the integration is difficult to compute, then we might seek an approximation using a fluid or diffusion model. Note that P_u = I + D_u. So, for a fluid approximation with generator D^F, we might justify the approximation P_u ≈ I + D^F_u, and define a basis using ψ_i(x, u) = ψ⁰_i(x) + D^F_u ψ⁰_i(x), x ∈ X, u ∈ U.
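As a small illustration of this construction for the queueing example, where the fluid drift is f(x, u) = −u + α, the sketch below builds the two-variable basis ψ_i(x, u) = ψ⁰_i(x) + D^F_u ψ⁰_i(x) from one-variable basis functions ψ⁰_i; the particular ψ⁰_i are illustrative.

```python
import numpy as np

alpha = 0.4   # mean arrival rate (illustrative)

# One-variable basis intended for TD-learning, with hand-coded derivatives.
psi0  = [lambda x: x, lambda x: x ** 2]
dpsi0 = [lambda x: 1.0, lambda x: 2.0 * x]

def fluid_generator(dphi, x, u):
    """D^F_u phi(x) = f(x,u) * phi'(x), with scalar fluid drift f(x,u) = -u + alpha."""
    return (-u + alpha) * dphi(x)

def psi(x, u):
    """Two-variable SARSA basis psi_i(x,u) = psi0_i(x) + D^F_u psi0_i(x)."""
    return np.array([p(x) + fluid_generator(dp, x, u) for p, dp in zip(psi0, dpsi0)])

print(psi(5.0, 1.0))   # e.g. [5 - 0.6, 25 + 2*5*(-0.6)] = [4.4, 19.0]
```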


Q-Learning The Q-learning algorithm is designed to skip the policy improvement step entirely. Rather than attempt to minimize the mean-square error criterion (28), for a given approximation we consider the error in the optimality equations.

Q-learning is more difficult than TD-learning because the ACOE cannot be interpreted as a linear fixed-point equation, and hence least-squares concepts are not directly applicable. The successful algorithms do not attempt to approximate h∗ directly, but instead consider the so-called Bellman error. If h^r is an approximation to h∗, then the Bellman error measures the error in the ACOE (10):

    ℰ^r(x) := −h^r(x•) + min_u { c(x, u) + D_u h^r (x) }    (40)

If ℰ^r(x) is identically zero, then the ACOE is solved, with η∗ = h^r(x•).

The successful approaches to this problem consider not h∗, but the function of two variables,

    H∗(x, u) = c(x, u) + P_u h∗ (x)    (41)

Watkins introduced a reinforcement learning technique for computation of H∗ in his thesis [22], and a complete convergence proof appeared later in [23]. An elementary proof based on an associated ‘fluid limit model’ is contained in [5], and this idea forms one foundation of the monograph [4]. Unfortunately, this approach depends critically on a finite state space, a finite action space, and a complete parameterization that includes all possible Markov models whose state space has a given cardinality. Consequently, complexity grows with the size of the state space.

Progress has been more positive for special classes of models. For deterministic linear systems with quadratic cost (the LQR problem), a variant of Q-learning combined with an adaptive version of policy iteration is known to be convergent; see [7] for an analysis in discrete time, and [21] for a similar approach in continuous time. A complete solution to the deterministic problem can be found in [14]. There is also a complete theory for optimal stopping problems [20, 25].

H∗(x) = h∗(x) + η∗

Substituting h∗ = H∗ − η into the definition of H∗ then gives,

H∗(x, u) = c(x, u) + PuH∗ (x)− η∗ (42)

Based on this expression, the Bellman error (40) must be extended for approximation of H∗

among functions on X × U. If Hr is an estimate of H∗, we denote Hr(x) := minuHr(x, u),

x ∈ X, and define the Bellman error as the function of two variables,

Er(x, u) :=−Hr(x•, u•) +{c(x, u) + PuH

r(x)−Hr(x, u)}

(43)

where the pair (x•, u•) ∈ X × U is again arbitrary. Once again, if Er(x, u) is zero, then theACOE is solved, where η∗ = Hr(x•, u•).


One version of Q-learning can be obtained as steepest descent. For simplicity consider a linear parameterization,

    H^r(x, u) = c(x, u) + Σ_{i=1}^{d} r_i ψ_i(x, u) ,   r ∈ R^d,    (44)

where {ψ_i : 1 ≤ i ≤ d} ⊂ 𝒳, and 𝒳 now consists of real-valued functions on X × U. Let ‖ · ‖ denote any norm on 𝒳, and consider the steepest descent algorithm,

    (d/dt) r(t) = −∇‖ℰ^r‖² |_{r=r(t)}    (45)

An RL or ADP algorithm can be constructed by mimicking this ODE. For the norm, we choose an ergodic norm as in TD-learning. A major difference is that in this case we choose a randomized stationary policy, designed to explore the state-action space X × U. For any function g ∈ 𝒳 we define,

    ‖g‖² = E_π[g(X, U)²]

where X ∼ π, the steady-state distribution obtained with the given policy φ, and U is a randomized function of X: given X = x, we have U = φ(x, Z) with Z ∼ µ_Z (see the definition below (25)). A descent direction is expressed,

    −∇‖ℰ^r‖² = −2 E[ℰ^r(X, U) ∇ℰ^r(X, U)]    (46)

where, in general, “∇” denotes a subgradient.

From the definition of the Bellman error (43) we have,

    ∇ℰ^r(x, u) = −∇H^r(x, u) + ∇{ P_u H^r(x) }

In the finite state space case this can be expressed,

    ∇ℰ^r(x, u) = −∇H^r(x, u) + Σ_{x′∈X} P_u(x, x′) ∇H^r(x′)

A similar expression can be obtained in general, subject to integrability constraints. For each x′, the function H^r(x′) is concave as a function of r. A subgradient is

    ∇H^r(x′) = ψ(x′, u∗_{x′,r}) ∈ R^d ,

where u∗_{x′,r} solves min_u H^r(x′, u). The gradient of H^r is obviously

    ∇H^r(x, u) = ψ(x, u) ∈ R^d .

So, a subgradient of the Bellman error is given by,

    ∇ℰ^r(x, u) = −ψ(x, u) + E[ ψ(X(t+1), U∗_r(t+1)) | X(t) = x, U(t) = u ]    (47)

where U∗_r(t+1) = u∗_{x′,r} when X(t+1) = x′.


In conclusion, the descent direction (46) can be expressed,

    −∇‖ℰ^r‖² = −2 E[ℰ^r_t ∇ℰ^r_t] ,    (48)

    ℰ^r_t = c(X(t), U(t)) + E[H^r(X(t+1)) | X(t), U(t)] − H^r(X(t), U(t)) − H^r(x•, u•)

    ∇ℰ^r_t = −ψ(X(t), U(t)) + E[ψ(X(t+1), U∗_r(t+1)) | X(t), U(t)]

The representation (48) is amenable to the construction of a simulation-based algorithm for approximate dynamic programming. For each t, having obtained values (x, u) = (X(t), U(t)), we obtain two values of the next state, denoted X(t+1) and X̃(t+1). These random variables are each distributed according to P_φ(x, ·), but conditionally independent, given X(t) = x. That is, we are taking two draws from P_φ. We can thus express,

    E[ℰ^r_t ∇ℰ^r_t] = E[B_{t+1}(r) Φ_{t+1}(r)] ,   where,

    B_{t+1}(r) = −H^r(x•, u•) + { c(X(t), U(t)) + H^r(X̃(t+1)) − H^r(X(t), U(t)) }

    Φ_{t+1}(r) = −ψ(X(t), U(t)) + ψ(X(t+1), U∗_r(t+1))

Given these representations, a stochastic approximation of the ODE (45) is given by the recursion,

    r_{t+1} = r_t − a_t B_{t+1}(r_t) Φ_{t+1}(r_t) ,   where {a_t} is the step size.
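A minimal simulation sketch of this recursion for a finite MDP, assuming access to a sampler for P_φ so that the two conditionally independent next-state draws are available; the model, basis, uniform exploration policy, and step-size sequence are all illustrative choices.

```python
import numpy as np

def q_learning(P, c, psi, x_star=0, u_star=0, T=200_000, seed=3):
    """Stochastic-approximation Q-learning sketch based on the recursion above.
    P[u][x] is the distribution of X(t+1) given (X(t)=x, U(t)=u); c[x, u] is the cost;
    psi(x, u) returns the d-dimensional basis vector. Exploration is uniform over actions."""
    rng = np.random.default_rng(seed)
    nx, nu = c.shape
    d = len(psi(0, 0))
    r = np.zeros(d)
    x = 0
    for t in range(T):
        u = rng.integers(nu)                                # randomized exploration policy
        x1 = rng.choice(nx, p=P[u][x])                      # draw used in Phi_{t+1}
        x2 = rng.choice(nx, p=P[u][x])                      # independent draw used in B_{t+1}
        H = lambda xx, uu, rr=r: c[xx, uu] + rr @ psi(xx, uu)
        Hmin = lambda xx: min(H(xx, uu) for uu in range(nu))
        u_opt = min(range(nu), key=lambda uu: H(x1, uu))    # U*_r(t+1) at X(t+1) = x1
        B = -H(x_star, u_star) + c[x, u] + Hmin(x2) - H(x, u)
        Phi = -psi(x, u) + psi(x1, u_opt)
        a = 1.0 / (t + 1) ** 0.85                           # step size (illustrative)
        r = r - a * B * Phi
        x = x1
    return r
```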

Architecture These techniques require a basis for approximation of a value function, or some generalization. Another approach not considered in this chapter is based on linear programming techniques: It is well known that dynamic programming equations can be expressed as linear programs, and this leads to approaches to approximate dynamic programming [11], and to reinforcement learning [6]. These methods require a basis, exactly as in Q-learning or TD-learning.

The remainder of this chapter concerns the question of basis construction. There are many ways to approach this: Taylor or Fourier series, linearization, approximations using various asymptotics (such as congestion, i.e. “heavy traffic” in networks, or large-state asymptotics in general models [15]). All of the approaches considered in the following sections rely on approximate modeling.

4 Fluid models

The CRW queue The ACOE for this model is given by (13), with generator D_u h∗ (x) = E[h∗(x − u + A(1))] − h∗(x). The input u is constrained to {0, 1}.

With the cost function c(x, u) ≡ x, it is not hard to guess that the optimal policy for the stochastic model is non-idling: U∗(t) = I{X∗(t) ≥ 1}. Hence, the ACOE coincides with Poisson’s equation,

    D h∗ (x) = −x + η∗ ,   x ∈ Z+,    (49)


where D without the subscript “u” denotes the generator (12) evaluated under the non-idling policy. We can compute h∗ in this case, and we will see that it is a quadratic function of x.

Consider first the fluid model approximation. Under the optimal policy that sets u(t) = 1 when q(t) > 0, the resulting state trajectory is given by q(t) = max(q(0) − (1 − α)t, 0). We denote T∗ = x/(1 − α), which is the first time that q(t) reaches zero.

The value function J∗ defined in (3) can be interpreted as the area of a right triangle with height x = q(0) and width T∗. Consequently,

    J∗(x) = ½ x T∗ = ½ x²/(1 − α)

It is easy to see that the HJB equation (4) is satisfied,

    min_u { c(x, u) + D^F_u J∗ (x) } = min_u { x + (−u + α) · ∇J∗ (x) } = 0

Theorem 4.1 is taken from [15, Theorem 3.0.1]. The identity (51) establishes that h∗, the solution to Poisson’s equation for the CRW model, is a perturbation of the fluid value function J∗. The formula (50) for the steady-state mean of Q is a version of the Pollaczek-Khintchine formula.

Theorem 4.1. Consider the CRW queueing model (14) satisfying α := E[A(t)] < 1, σ²_A := E[(A(t) − α)²] < ∞. Then,

(i) Q is positive recurrent, with steady-state mean,

    η := E_π[Q(0)] = ½ σ²/(1 − α)    (50)

where σ² = σ²_A + α(1 − α).

(ii) A solution to Poisson’s equation (49) is,

    h∗(x) = J∗(x) + ½ ( ((1 − α)² − α²)/(1 − α) ) x ,   x ∈ Z+.    (51)
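A quick Monte Carlo sanity check of (50): simulate the CRW queue under the non-idling policy and compare the empirical mean with ½σ²/(1 − α); Bernoulli arrivals are used purely for illustration.

```python
import numpy as np

alpha = 0.4
sigma2_A = alpha * (1 - alpha)          # variance of Bernoulli(alpha) arrivals
rng = np.random.default_rng(4)

x, total, T = 0, 0.0, 1_000_000
for t in range(T):
    total += x
    x = x - (1 if x >= 1 else 0) + rng.binomial(1, alpha)

sigma2 = sigma2_A + alpha * (1 - alpha)
print("empirical mean:", total / T)
print("formula (50):  ", 0.5 * sigma2 / (1 - alpha))
```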

Speed-scaling model Dynamic speed scaling is an increasingly common approach to power management in computer system design. The goal is to control the processing speed so as to optimally balance energy and delay costs: reducing (increasing) the speed in times when the workload is small (large). Speed scaling has traditionally been used in processor design; however, in recent years speed-scaling approaches have been applied in a variety of other contexts, including disks and wireless transmission.

The primary model is described in discrete time as the controlled queue (14). The cost function is chosen to balance the cost of delay with the energy cost associated with the processing speed. A special case is quadratic,

    c(x, u) = x + βu² ,    (52)


with β > 0, where u² models the power required as a function of the speed u.

Applying the definition of f in (15), the fluid model corresponding to the speed-scaling model is again given by (22). To obtain the total-cost value function for the fluid model we modify the cost function so that it vanishes when x = 0 and u = α (its equilibrium value). A simple modification is the following,

    c^F(x, u) = x + β ([u − α]₊)²    (53)

where [u − α]₊ = max(0, u − α). With this perturbation of the cost, the total cost J∗ defined in (3) and the optimal policy φ^F (the minimizer of (4)) are computable in closed form. In particular, J∗(x) = ⅓(2x)^{3/2} when β = ½.

Rather than modify the cost function, another approach is to modify the objective function in the control problem for the fluid model. Consider,

    K∗(x) = inf_u ∫₀^{T₀} c(x(t), u(t)) dt ,

where T₀ is the first hitting time to the origin, and x(0) = x ∈ R+. This function is also computable in closed form. When β = ½ we obtain,

    K∗(x) = αx + ⅓ [(2x + α²)^{3/2} − α³]    (54)

This solves the TCOE (4) using the cost function c(x, u) = x + ½u².

We can also obtain an expression for the total relative cost: For any η > 0, and any policy for the fluid model, denote T_η = min{t : c(x(t), u(t)) ≤ η}. Let K∗_η denote the minimal total relative cost,

    K∗_η(x) = inf ∫₀^{T_η} ( c(x(t), u(t)) − η ) dt    (55)

As in the definition of J∗ and K∗, the infimum is over all policies. For x > η, the value function K∗_η solves the dynamic programming equation,

    min_u { c(x, u) + D^F_u K∗_η (x) } = η

An approximate solution is given by,

    h_η(x) := K∗(x) − 2η( √(2x + q²) − q ),    (56)

where q > 0 is a constant. It can be shown that sup_x |h_η(x) − K∗_η(x)| < ∞, regardless of the values of the non-negative scalars η and q.
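As a numeric sanity check on the closed-form expression (54), the sketch below verifies that K∗ satisfies the fluid dynamic programming equation min_u { x + ½u² + (−u + α)·K∗′(x) } = 0 at a few sample states; the value of α and the grid of test points are arbitrary, and the minimization is done by brute-force search.

```python
import numpy as np

alpha = 0.4
K  = lambda x: alpha * x + ((2 * x + alpha ** 2) ** 1.5 - alpha ** 3) / 3.0
dK = lambda x: alpha + np.sqrt(2 * x + alpha ** 2)     # K*'(x)

def bellman_value(x, n=200_001):
    """min over u >= 0 of c(x,u) + (-u + alpha) * K*'(x), with c(x,u) = x + u^2/2,
    evaluated by brute-force search over a fine grid of u values."""
    u = np.linspace(0.0, 10.0 * dK(x), n)
    return np.min(x + 0.5 * u ** 2 + (-u + alpha) * dK(x))

for x in [0.5, 2.0, 10.0]:
    print(x, bellman_value(x))        # each value should be close to zero
```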

Numerical Results In the results that follow, the arrival process A is taken to be a scaled geometric distribution on {0, 1, . . . } with parameter p_A. The mean and variance of A(t) are given by, respectively,

    α = p_A/(1 − p_A) ,   σ²_A = p_A/(1 − p_A)².    (57)


[Figure 1: The optimal policy φ∗ compared to the (c, K∗)-myopic policy φ_{K∗} for the quadratic cost function c(x, u) = x + ½u² (plots of φ∗, φ_{K∗}, and their difference versus the state x).]

The solution to the ACOE (13) was computed for this model, with the cost function c(x, u) = x + ½u², using value iteration. This required approximation: The input was restricted to the non-negative integers, and the state space X was restricted to a finite set of the form {0, 1, . . . , N} by truncating arrivals.
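A minimal sketch of the value-iteration computation just described, using relative value iteration on the truncated state space {0, ..., N} with truncated geometric arrivals; the truncation level, p_A, and iteration count are illustrative, not those used for the figures.

```python
import numpy as np

def speed_scaling_rvi(p_A=0.3, N=60, A_max=20, n_iter=2000):
    """Relative value iteration for the speed-scaling MDP X(t+1) = X(t) - U(t) + A(t+1),
    cost c(x,u) = x + 0.5*u^2, integer inputs 0 <= u <= x, truncated geometric arrivals."""
    a = np.arange(A_max + 1)
    p = (1 - p_A) * p_A ** a
    p[-1] += 1.0 - p.sum()                        # fold the tail mass into A_max
    h = np.zeros(N + 1)
    for _ in range(n_iter):
        T = np.empty(N + 1)
        for x in range(N + 1):
            vals = []
            for u in range(x + 1):
                nxt = np.minimum(x - u + a, N)    # truncate the state space at N
                vals.append(x + 0.5 * u ** 2 + p @ h[nxt])
            T[x] = min(vals)
        eta, h = T[0], T - T[0]                   # normalization h(0) = 0
    return h, eta                                 # relative value function and average cost

h, eta = speed_scaling_rvi()
print("estimated average cost:", eta)
```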

Shown in Figure 1 is a comparison of the optimal policy, computed numerically using value iteration, and the (c, K∗)-myopic policy defined in analogy with the optimal policy,

    φ_{K∗}(x) = arg min_{0≤u≤x} { c(x, u) + P_u K∗ (x) } .    (58)

Shown in Figure 2 is a comparison of the fluid value function K∗, the relative value function h∗, and the output of the LSTD algorithm using the basis ψ₁(x) ≡ x and ψ₂(x) ≡ K∗(x) (defined in (54)). The policy (58) was used in the application of the TD-learning algorithm.

[Figure 2: Value functions for the dynamic speed-scaling model with quadratic cost. The x-axis denotes the state value, and the three plots compare the final approximation h^{r∗} obtained using LSTD-learning with the fluid value function K∗(x) and the relative value function h∗ (computed using value iteration). The histogram is an estimate of the steady-state distribution π obtained using the optimal policy.]


5 Diffusion Models

In the two examples that follow we will see that the fluid value function is a very good approximation to the relative value functions for both the CRW model and its diffusion approximation. This solidarity holds in part because ∇²J∗ is not large. In other network applications, the diffusion model might result in a significant refinement of the fluid model [24, 15, 9]. We omit details here due to lack of space.

The CRW queue The generator (6) for the RBM model (23) is,

    D^D_u h (x) = −(u − α) h′(x) + ½ σ²_A h′′(x)

For the non-idling process, the queue evolves as a reflected Brownian motion on R+. The differential generator is given with u ≡ 1,

    D^D h (x) = −δ h′(x) + ½ σ²_A h′′(x) ,    (59)

where δ = 1 − α. The interpretation of the generator requires Itô’s formula.

Itô’s formula provides a representation of h(Q(t)) for any twice continuously differentiable (C²) function h : R → R,

    h(Q(t)) = h(x) + ∫₀ᵗ [−δ h′(Q(s)) + ½ σ²_A h′′(Q(s))] ds
                    + ∫₀ᵗ h′(Q(s)) σ_A dN(s) + ∫₀ᵗ h′(Q(s)) dI(s).    (60)

The idleness process can only increase when Q is zero, giving,

    ∫₀ᵗ h′(Q(s)) dI(s) = ∫₀ᵗ h′(0) dI(s) = h′(0) I(t).

Suppose that the function h solves Poisson’s equation, c + D^D h = η. Then, Itô’s formula (60) implies the representation,

    ∫₀ᵗ [c(Q(s)) − η] ds = h(x) − h(Q(t)) + h′(0) I(t) + ∫₀ᵗ h′(Q(s)) σ_A dN(s).

Subject to growth conditions on h, we have

    lim_{t→∞} (1/t) ∫₀ᵗ h′(Q(s)) σ_A dN(s) = lim_{t→∞} (1/t) h(Q(t)) = 0.

Consequently, we can conclude that the steady-state mean of c is equal to η, provided h′(0) = 0. A function satisfying this condition is the fluid value function, J∗(x) := ½ δ⁻¹ x². Applying the generator (59) we obtain,

    x + D^D J∗ (x) = ½ σ²_A δ⁻¹.    (61)

Hence J∗ solves Poisson’s equation provided δ > 0. That is, α < 1, which is the necessary and sufficient condition for stability. Under this condition, the average cost is in fact given by η∗ = ½ (1 − α)⁻¹ σ²_A, a formula similar to (50), and h∗ = J∗ solves the ACOE (9).


Speed-scaling model The diffusion model is again given by (23), but with U(t) ∈ R+, rather than constrained to a bounded interval.

Looking back at the analysis of the fluid model in Section 4, we see that h_η is in the domain of the generator, in the sense that (d/dx) h_η (0) = 0, provided we choose q = η/α (see (56)). The function h_η is strictly convex, so we have h_η(x) > 0, for all x > 0, when this derivative condition holds.

Let ℰ_η denote the Bellman error for the approximation h_η, defined via (40), with the differential generator:

    ℰ_η(x) := min_u { c(x, u) + D^D_u h_η (x) }

The function h_η approximately solves the ACOE for the diffusion model, in the sense that the Bellman error is bounded: sup_x |ℰ_η(x)| < ∞ for any η.

6 Mean field games

Q-learning is a natural candidate for applications in distributed control and games (see [10, 16] for closely related messages). We illustrate this with results obtained for the large-population cost-coupled LQG problem introduced in [13].

The model consists of n non-homogeneous autonomous agents. The ith agent is modeled as a deterministic, scalar linear system, in continuous time,

    (d/dt) x_i = a_i x_i + b_i u_i,    (62)

where x_i and u_i denote the state and the control of the ith agent, respectively. The agents are coupled through their respective quadratic cost functions: For each 1 ≤ i ≤ n, c_i(x_i, u_i) = (x_i − z)² + u_i², where z is the mean, z = n⁻¹(x_1 + · · · + x_n). The optimal control problem of [13] is the discounted-cost, linear-quadratic control problem. The HJB equation is modified as follows,

    min_u { c(x, u) + D^F_u J∗ (x) } = γ J∗(x)    (63)

where γ > 0 is the discount rate. As in the definition (42), the “Q-function” will remain the function appearing in the brackets in (63). Because the cost is assumed to be quadratic, c(x, u) = ½ xᵀMx + ½ uᵀRu, we take the parameterization,

    H^r(x, u) = c(x, u) + xᵀE^r x + xᵀF^r u    (64)

where {E^r, F^r} depend linearly on r. In this notation, the corresponding policy obtained by minimizing H^r over u is given by,

    φ^r(x) = −R⁻¹ (F^r)ᵀ x    (65)

Hence u = φ^r(x) is linear state feedback for any r.

An approximate model is proposed in [13] (see their Eqns. (4.6)-(4.9)): Each agent solves the optimal control problem with a two-dimensional approximate state, given by (x_i, z)ᵀ for the ith agent.


[Figure 3: Sample paths of the Q-learning estimates of the gains (k^i_x, k^i_z) (individual-state and ensemble-state gains) for agents i = 4 and 5. The dashed lines show the asymptotically optimal values obtained in [13].]

The conclusions of this prior work suggest the following architecture for Q-learning. The Q-function for the LQ problem is defined according to the matrix parameterization (64). Each agent has three parameters (E^r) that are coefficients of the basis functions {x_i², z², x_i z}, and two parameters (F^r) that are coefficients of the basis functions {x_i u_i, z u_i}.
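A small numpy sketch of this five-parameter architecture for one agent, showing how the coefficients define H^r in (64) for the approximate state (x_i, z) and how the linear gains in (65) are read off; the value of R and the sample parameter vector are illustrative, not taken from the experiments.

```python
import numpy as np

R = np.array([[1.0]])      # control weight in the quadratic cost (illustrative)

def Hr_matrices(r):
    """Map the five parameters r = (r1,...,r5) to (E^r, F^r) for the basis
    {x_i^2, z^2, x_i z} and {x_i u_i, z u_i}, with approximate state (x_i, z)."""
    E = np.array([[r[0],      r[2] / 2],
                  [r[2] / 2,  r[1]]])          # x' E x = r1 x_i^2 + r2 z^2 + r3 x_i z
    F = np.array([[r[3]],
                  [r[4]]])                     # x' F u = r4 x_i u_i + r5 z u_i
    return E, F

def gains(r):
    """Feedback gains (k_x, k_z) in u_i = -k_x x_i - k_z z, from phi^r(x) = -R^{-1} (F^r)' x."""
    _, F = Hr_matrices(r)
    K = np.linalg.solve(R, F.T)                # 1 x 2 gain matrix
    return K[0, 0], K[0, 1]

print(gains(np.array([1.0, 0.2, -0.4, 0.8, 0.1])))   # example parameter vector
```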

The following numerical results are based on an example with five agents described by (62). Each of the five inputs u_i was taken to be sinusoidal, with irrationally related frequencies. Five applications of the Q-learning algorithm were run in parallel, one for each agent. For details see [14].

Figure 3 depicts the evolution of estimates of the two components of the local optimal gain (65) for two of the five agents (i = 4 and 5), expressed as u_i = −k^i_x x_i − k^i_z z. Also shown are the gains introduced in [13] that were found to be asymptotically optimal for large n. The limiting values of the estimates of (k^i_x, k^i_z) were close to those predicted in [13] in all cases. The first plot shows typical behavior of the algorithm.

The sequence of gains shown in the second plot appears more volatile, and less consistent. However, note that the vertical scale is bounded by ±0.1, so that the gains are nearly zero.

7 Conclusions

We have shown through theory and examples that insight from a naive model can be extremely valuable in the creation of an architecture for reinforcement learning or approximate dynamic programming. We have focused on fluid and diffusion value functions to approximate the stochastic value function, but other approximations may be useful, depending on the application. When the approximate model cannot be solved exactly, then further approximation can be applied, as we have seen in Section 4 and Section 5.

In this chapter we have largely ignored the subject of computing error bounds, and applying these bounds for performance approximation. We refer the reader to [8, 15, 19, 20] for some results in this direction.

One open problem in this area concerns the formulation of a Q-learning algorithm for reinforcement learning, in a parameterized setting, in which the system dynamics are not given a priori. One possible approach to parameterized reinforcement learning might emerge via the approximate LP approaches.


Acknowledgments Financial support from UTRC, AFOSR grant FA9550-09-1-0190, ITMANET DARPA RK 2006-07284, and NSF grant CPS 0931416 is gratefully acknowledged. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

[1] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Cambridge, MA, 1996.

[2] V. S. Borkar. Optimal control of diffusion processes, volume 203 of Pitman Research Notes in Mathematics Series. Longman Scientific & Technical, Harlow, 1989.

[3] V. S. Borkar. Convex analytic methods in Markov decision processes. In Handbook of Markov decision processes, volume 40 of Internat. Ser. Oper. Res. Management Sci., pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002.

[4] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.

[5] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. (Also presented at the IEEE CDC, December 1998.)

[6] V. S. Borkar, J. Pinto, and T. Prabhu. A new learning algorithm for optimal stopping. Discrete Event Dynamic Systems: Theory and Applications, 19:91–113, March 2009.

[7] S. J. Bradtke, B. E. Ydstie, and A. G. Barto. Adaptive linear quadratic control using policy iteration. In Proceedings of the 1994 American Control Conference, volume 3, pages 3475–3479, 1994.

[8] Wei Chen, Dayu Huang, Ankur A. Kulkarni, Jayakrishnan Unnikrishnan, Quanyan Zhu, Prashant Mehta, Sean Meyn, and Adam Wierman. Approximate dynamic programming using fluid and diffusion approximations with applications to power management. In Proc. of the 48th IEEE Conf. on Dec. and Control, pages 3575–3580, 2009.

[9] I.-K. Cho and S. P. Meyn. A dynamic newsboy model for optimal reserve management in electricity markets. Submitted for publication, SIAM J. Control and Optimization, 2009.

[10] R. Cogill, M. Rotkowitz, B. Van Roy, and S. Lall. An approximate dynamic programming approach to decentralized control of stochastic systems. In Control of Uncertain Systems: Modelling, Approximation, and Design, pages 243–256. Springer, 2006.

[11] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Res., 51(6):850–865, 2003.


[12] S. G. Henderson, S. P. Meyn, and V. B. Tadic. Performance evaluation and policy selection in multiclass networks. Discrete Event Dynamic Systems: Theory and Applications, 13(1-2):149–189, 2003. Special issue on learning, optimization and decision making (invited).

[13] M. Huang, P. E. Caines, and R. P. Malhame. Large-population cost-coupled LQG problems with nonuniform agents: Individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans. Automat. Control, 52(9):1560–1571, 2007.

[14] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In Proc. of the 48th IEEE Conf. on Dec. and Control, held jointly with the 2009 28th Chinese Control Conference, pages 3598–3605, Dec. 2009.

[15] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, Cambridge, 2007. Pre-publication edition available online.

[16] S. P. Meyn and G. Mathew. Shannon meets Bellman: Feature based Markovian models for detection and optimization. In Proc. 47th IEEE CDC, pages 5558–5564, 2008.

[17] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. 1993 edition online.

[18] C. Szepesvari. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.

[19] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.

[20] J. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.

[21] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45(2):477–484, 2009.

[22] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK, 1989.

[23] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[24] L. M. Wein. Dynamic scheduling of a multiclass make-to-stock queue. Operations Res., 40(4):724–735, 1992.

[25] H. Yu and D. P. Bertsekas. Q-learning algorithms for optimal stopping based on least squares. In Proc. European Control Conference (ECC), July 2007.
