Convergence Proofs of Least Squares Policy Iteration Algorithm for High-Dimensional Infinite Horizon
Markov Decision Process Problems
Jun Ma and Warren B. Powell
Department of Operations Research and Financial Engineering
Princeton University, Princeton, NJ 08544
December 17, 2008
Abstract
Most of the current theory for dynamic programming algorithms focuses on finite state,
finite action Markov decision problems, with a paucity of theory for the convergence of
approximation algorithms with continuous states. In this paper we propose a policy iteration
algorithm for infinite-horizon Markov decision problems where the state and action spaces are
continuous and the expectation cannot be computed exactly. We show that an appropriately
designed least squares (LS) or recursive least squares (RLS) method is provably convergent
under certain problem structure assumptions on value functions. In addition, we show
that the LS/RLS approximate policy iteration algorithm converges in the mean, meaning
that the mean error between the approximate policy value function and the optimal value
function shrinks to zero as successive approximations become more accurate. Furthermore,
the convergence results are extended to the more general case of unknown basis functions
with orthogonal polynomials.
1 Introduction
The core concept of dynamic programming for solving a Markov decision process (MDP) is
Bellman's equation, which is often written in the standard form (Puterman (1994))
V_t(S_t) = max_{x_t} {C(S_t, x_t) + γ ∑_{s′∈S} P(s′ | S_t, x_t) V_{t+1}(s′)}. (1)
It is more convenient for our purposes, and mathematically equivalent, to write Bellman's
equation (1) in the expectation form

V_t(s) = max_{x_t} {C(S_t, x_t) + γ E[V_{t+1}(S_{t+1}) | S_t = s]}, (2)
where St+1 = SM(St, xt,Wt+1) and the expectation is taken over the random information
variable Wt+1. If the state variable St and decision variable xt are discrete scalars and the
transition matrix P is known, the value function Vt(St) can be computed by enumerating all
the states backward through time, which is a method often referred to as backward dynamic
programming. Moreover, there is a mature and elegant convergence theory supporting algo-
rithms that handle problems with finite state and action spaces and computable expectation
(Puterman (1994)). However, discrete representation of the problem often suffers from the
well-known “curses of dimensionality”: 1) If the state St is a vector, the state space grows
exponentially with the number of dimensions. 2) A multi-dimensional information variable
Wt makes the computation of the expectation intractable. 3) When the decision variable
xt is a high-dimensional vector, we have to rely on mathematical programming to solve
the more complicated optimization problem in the Bellman’s equation. In addition, there
are a large number of real world applications with continuous state and action spaces, to
which direct application of algorithms developed for discrete problems is not appropriate.
Hence, continuous value function approximations are suggested to handle high-dimensional
and continuous applications.
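The backward dynamic programming recursion just described can be sketched in a few lines; the problem data below (states, actions, rewards C and transition matrix P) are invented purely for illustration:

```python
import numpy as np

# Hypothetical toy problem: 3 states, 2 actions, horizon T = 4, discount 0.9.
# C(s, x) and P(s'|s, x) are randomly generated stand-ins for real model data.
n_states, n_actions, T, gamma = 3, 2, 4, 0.9
rng = np.random.default_rng(0)
C = rng.uniform(0, 1, size=(n_states, n_actions))                 # C(s, x)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s'|s, x)

V = np.zeros((T + 1, n_states))              # terminal condition V_T = 0
policy = np.zeros((T, n_states), dtype=int)
for t in range(T - 1, -1, -1):               # enumerate states backward in time
    # Q[s, x] = C(s, x) + gamma * sum_{s'} P(s'|s, x) * V_{t+1}(s')
    Q = C + gamma * P @ V[t + 1]
    V[t] = Q.max(axis=1)                     # Bellman's equation (1)
    policy[t] = Q.argmax(axis=1)
```

Each backward step costs O(|S|²|A|), which is exactly the enumeration that becomes intractable when the state is a vector, the first of the curses of dimensionality above.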
This paper describes a provably convergent and implementable approximate policy-
iteration algorithm that uses linear function approximation to handle infinite-horizon Markov
decision process problems where state, action and information variables are all continuous
vectors (possibly of high dimensionality) and the expectation cannot be computed exactly.
Policy iteration is not a novel idea. However, when the state and action spaces are not finite,
policy iteration runs into such difficulties as (1) the feasibility of obtaining accurate policy
value functions in a computationally implementable way and (2) the existence of a sequence
of policies generated by the algorithm (Bertsekas & Shreve (1978)). Least squares or
recursive least squares (LS/RLS) updating methods are proposed to address the first
difficulty. To overcome the second, we impose the significant assumption that the true value
functions for all policies in the policy space are continuous and linearly parameterized with
finitely many known basis functions. This assumption leads to a sound convergence theory
for the algorithm applied to problems with continuous state and action spaces.
The main contributions of this paper include: a) a thorough and up-to-date literature
review on continuous function approximations, b) extension of the least squares temporal
difference learning algorithm (LSTD(0)) by Bradtke & Barto (1996) to the continuous case, c)
almost sure convergence of the exact policy iteration algorithm using the LS/RLS updating,
d) convergence in the mean of the corresponding approximate policy iteration algorithm,
and e) extension of the convergence results to unknown basis functions.
The rest of the paper is organized as follows. In section 2, we review the literature on
continuous function approximations applied to Markov decision process problems and their
asymptotic properties. In section 3, we summarize important mathematical foundations
to establish convergence and illustrate the details of a least squares/recursive least squares
approximate policy iteration (LS/RLSAPI) algorithm. Section 4 presents the convergence
results for a fixed policy. In section 5, by applying the results in section 4, we show almost sure
convergence of the exact policy iteration algorithm, which requires exact policy evaluation,
to the optimal policy. Section 6 proves mean convergence of approximate policy iteration,
in which case policies are evaluated approximately. In section 7, we extend the convergence
result in section 6 to unknown basis functions using orthogonal polynomials. The last section
concludes and discusses future research directions.
2 Literature Review
In this section, we review the literature on continuous function approximations and related
convergence theories. We break this section into three parts. The first part focuses on con-
tinuous approximation algorithms for discrete MDP problems. The second part deals with
function approximation algorithms applied to problems with a continuous state space. The
last part describes some divergence examples of value iteration type algorithms using para-
metric function approximators. The table in figure 1 is a brief summary of the algorithms
categorized by their characteristics such as whether the state and action spaces are discrete or
continuous (D/C), whether the contribution/reward function is quadratic or general (Q/G),
whether the expectation can be computed exactly (Y/N), whether the problem is determinis-
tic or stochastic (Gaussian noise) (D/S(G)), the type of algorithms including value iteration
(VI), fixed policy (FP), exact or approximate policy iteration (EPI/API), and whether there
is a convergence guarantee for the algorithm (Y/N) or performance bound (B). Details of
the algorithms and their convergence properties are discussed in the following subsections.
2.1 Continuous approximations of discrete problems
Since the inception of the field of dynamic programming, researchers have devoted consid-
erable effort to explore the use of compact representations in value function approximation
in order to overcome the curses of dimensionality in large-scale stochastic dynamic pro-
gramming problems (Bellman & Dreyfus (1959), Bellman et al. (1963), Reetz (1977), Whitt
(1978), Schweitzer & Seidmann (1985)). However, most of these approaches are heuristic,
and there is no formal proof guaranteeing convergence of the algorithms. More recently,
provably convergent algorithms have been proposed for different approximation techniques
such as feature-based function approximation, temporal difference learning, fitted temporal
difference learning with contraction or expansion mappings, and residual gradient. These
are reviewed below.
                                State  Action  Reward  Exp.  Noise  Policy  Conv.
Bradtke et al. (1994)             C      C       Q      N     D      API     Y
Gordon (1995)                    NA     NA       G      Y     S      VI      Y
Baird (1995)                      D      D       G      N     D      VI      Y
Tsitsiklis & Van Roy (1996)       D      D       G      Y     S      VI      Y
Bradtke & Barto (1996)            D      D       G      N     S      FP      Y
Landelius & Knutsson (1996)       C      C       Q      N     D      VI/API  Y
Tsitsiklis & Van Roy (1997)       D      D       G      Y     S      FP      Y
Meyn (1997)                       C      D       G      Y     S      EPI     Y
Papavassiliou & Russell (1999)    D      D       G      N     S      FP      Y
Gordon (2001)                     D      D       G      N     S      VI      Y
Ormoneit & Sen (2002)             C      D       G      N     S      VI      Y
Lagoudakis & Parr (2003)          D      D       G      N     S      API     N
Perkins & Precup (2003)           D      D       G      N     S      API     Y
Engel et al. (2005)               C      C       G      N     S      API     N
Munos & Szepesvari (2005)         C      D       G      N     S      VI      B
Melo et al. (2007)                C      D       G      N     S      FP      Y
Szita (2007)                      C      C       Q      N     S(G)   VI      Y
Antos et al. (2007a, 2008)        C      D       G      N     S      API     B
Antos et al. (2007b)              C      C       G      N     S      API     B
Deisenroth et al. (2008)          D      D       G      Y     S      VI      N

Figure 1: Table of some continuous function approximation algorithms and related convergence results
2.1.1 Feature-based function approximation
The first step to set up a rigorous framework combining dynamic programming and compact
representations is taken in Tsitsiklis & Van Roy (1996). Two types of feature-based value
iteration algorithms are proposed. One is a variant of the value iteration algorithm that
uses a look-up table in feature space rather than in state space (similar to aggregation of
state variables). The other value iteration algorithm employs feature extraction and linear
approximations with a fixed set of basis functions including radial basis functions, neural
networks, wavelet networks, and polynomials. Under rather strict technical assumptions
on the feature mapping, Tsitsiklis & Van Roy (1996) proves the convergence of the value
iteration algorithm (not necessarily to the optimal value function unless it is spanned by
the basis functions) and provides a bound on the quality of the resulting approximations
compared with the optimal value function.
2.1.2 Temporal difference learning
Tsitsiklis & Van Roy (1997) proves convergence of online temporal difference learning TD(λ)
algorithm using a linear-in-the-parameters model and continuous basis functions. The con-
vergence proof assumes a fixed policy, in which case the problem is a Markov chain. All the
results are established on the assumption of a discrete state Markov chain. It is claimed that
the proofs can be easily carried over to the continuous case, but no details for this extension
are provided. Papavassiliou & Russell (1999) describes the Bridge algorithm for temporal
difference learning applied to a fixed policy. It shows that the algorithm converges to an
approximate global optimum for any “agnostically learnable” hypothesis class other than the
class of linear combination of fixed basis functions and provides approximation error bounds.
Bradtke & Barto (1996) and Boyan (1999) prove almost sure convergence of the Least-
Squares TD (LSTD) algorithm when it is used with linear function approximation for a
constant policy. Motivated by the LSTD algorithm, Lagoudakis & Parr (2003) proposes the
least-squares policy iteration (LSPI) algorithm, which combines value-function approximation
with linear architectures and approximate policy iteration. Mahadevan & Maggioni (2007)
extends the LSPI algorithm within the representation policy iteration (RPI) framework. By
representing a finite sample of state transitions induced by the MDP as an undirected graph,
RPI constructs an orthonormal set of basis functions with the graph Laplacian operator.
With some technical assumptions on the policy improvement operator, Perkins & Precup
(2003) proves convergence of the approximate policy iteration algorithm to a unique solution
from any initial policy when SARSA updates (see Sutton & Barto (1998)) are used with linear
state-action value function approximation for policy evaluation.
Gordon (1995) provides a convergence proof for fitted temporal difference learning (value
iteration) algorithm with function approximations that are contraction or expansion map-
pings, such as k-nearest-neighbor, linear interpolation, some types of splines, and local
weighted average. Interestingly, linear regression and neural networks do not fall into this
class due to the property that small local changes can lead to large global shifts of the
approximation, so they can in fact diverge. Gordon (2001) proves a weaker convergence result
for two linear approximation algorithms, SARSA(0) (a Q-learning-type algorithm presented
in Rummery & Niranjan (1994)) and V(0) (a value-iteration-type algorithm introduced by
Tesauro (1994)): both converge to a bounded region almost surely.
2.1.3 Residual gradient algorithm
To overcome the instability of Q-learning or value iteration when implemented directly with a
general function approximation, residual gradient algorithms, which perform gradient descent
on the mean squared Bellman residual rather than the value function or Q-function, are
proposed in Baird (1995). Convergence of the algorithms to a local minimum of the Bellman
residual is guaranteed only for the deterministic case.
2.2 Approximations of continuous problems
There are comparatively fewer papers that treat function approximation algorithms directly
applied to continuous problems. Convergence results are mostly found for the problem class
of linear quadratic regulation (LQR), batch reinforcement learning, and the non-parametric
approach with kernel regression. Another non-parametric method using a Gaussian process
model has been successfully applied to complex nonlinear control problems but lacks conver-
gence guarantees. Other convergence results are established for exact policy iteration and
constant-policy Q-learning that assume a finite or countable action space and ergodicity of
the underlying process.
2.2.1 Linear quadratic regulation
In the 1990s, convergence proofs of DP/ADP-type algorithms were established for a special
problem class with continuous states called Linear Quadratic Regulation in control theory.
One nice feature of LQR type problems is that one can find analytical solutions if the
exact reward/contribution and transition functions are known. When the exact form of the
functions is unknown and only sample observations are made, dynamic programming-type
algorithms are proposed to find the optimal control. Bradtke et al. (1994) proposes two
algorithms based on Q-learning and applied to deterministic infinite-horizon LQR problems.
Approximate policy iteration is provably convergent to the optimal control policy (Bradtke
(1993)), but an example shows that value iteration only converges locally. In Landelius
& Knutsson (1996), convergence proofs are presented for several adaptive-critic algorithms
applied to deterministic LQR problems including Heuristic Dynamic Programming (HDP,
another name for Approximate Dynamic Programming), Dual Heuristic Programming (DHP,
which works with derivatives of value function), Action Dependent HDP (ADHDP, another
name for Q-learning) and Action Dependent DHP (ADDHP). The key difference between
the RL algorithm proposed in Bradtke et al. (1994) and HDP is the parameter updating
formula. Bradtke et al. (1994) uses recursive least squares, while Landelius & Knutsson
(1996) uses a stochastic gradient algorithm that may introduce a scaling problem. It seems
reasonable to extend convergence proofs from the deterministic case to the stochastic case,
but we are only aware of a convergence proof for the stochastic gradient algorithm by Szita
(2007) under the assumption of Gaussian noises.
2.2.2 Batch reinforcement learning
More recently, a series of papers (Munos & Szepesvari (2005), Antos et al. (2007a,b, 2008))
derive finite-sample Probably Approximately Correct (PAC) bounds for batch reinforce-
ment learning problems. These performance bounds ensure that the algorithms produce
a near-optimal policy with high probability. They depend on the mixing rate of the sample
trajectory, the smoothness properties of the underlying MDP, the approximation power and
capacity of the function approximation method used, the iteration counter of the policy
improvement step, and the sample size for policy evaluation. More specifically, Munos &
Szepesvari (2005) considers sampling-based fitted value iteration algorithm for MDP prob-
lems with large or possibly infinite state space but finite action space with a known generative
model of the environment. In a model-free setting, Antos et al. (2007b, 2008) propose off-
policy fitted policy iteration algorithms that are based on a single trajectory of some fixed
behavior policy to handle problems in continuous state space and finite action space. An-
tos et al. (2007a) extends previous consistency results to problems with continuous action
spaces.
2.2.3 Kernel-based reinforcement learning
A kernel-based approach to reinforcement learning that adopts a non-parametric perspective
is presented in Ormoneit & Sen (2002) to overcome the stability problems of temporal-
difference learning in problems with a continuous state space but finite action space. The
approximate value iteration algorithm is provably convergent to a unique solution of an
approximate Bellman’s equation regardless of its initial values for a finite training data set
of historical transitions. As the size of the transition data set goes to infinity, the approximate
value function and approximate policy converge in probability to the optimal value function
and optimal policy respectively.
2.2.4 Gaussian process models
Another non-parametric approach to optimal control problems is the Gaussian process model,
in which value functions are modeled with Gaussian processes. Both the Gaussian process
temporal difference (GPTD) algorithm (policy iteration) in Engel et al. (2005) and the
Gaussian process dynamic programming (GPDP) algorithm (value iteration) in Deisenroth
et al. (2008) have successful applications in complex nonlinear control problems without
convergence guarantees.
2.2.5 Other convergence results
Meyn (1997) proves convergence of policy iteration algorithms for average cost optimal con-
trol problems with unbounded cost and general state space. The algorithm assumes countable
action space and requires exact computation of the expectation. With further assumptions
of a c-regular (a strong stability condition where c represents the cost function) initial policy
and irreducibility of the state space, the algorithm generates a sequence of c-regular policies
that converge to the optimal average cost policy. One extension of the temporal difference
algorithm to continuous domain can be found in Melo et al. (2007), which proves the conver-
gence of a Q-learning algorithm with linear function approximation under a fixed learning
policy for MDP problems with continuous state space but finite action space.
2.3 Examples of divergence
Value iteration-type algorithms with linear function approximation frequently fail to con-
verge. Bertsekas (1995) describes a counter-example for the TD(0) algorithm with stochastic
gradient updating for the parameters. It is shown that the algorithm converges but can
generate a poor approximation of the optimal value function in terms of Euclidean distance.
Tsitsiklis & Van Roy (1996) presents a similar counter-example as in Bertsekas (1995) but
with a least squares updating rule for the parameter estimates, in which case divergence
happens even when the optimal value function can be perfectly represented by the linear
approximator. Boyan & Moore (1995) illustrates that divergent behavior occurs for value
iteration algorithms with a variety of function approximation techniques such as polynomial
regression, back-propagation and local weighted regression when the algorithm is applied to
simple nonlinear problems.
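The divergence phenomenon with least-squares updating can be reproduced in a few lines using the classic two-state construction (the specific numbers here are our own illustrative choice, not taken from the cited papers): zero rewards, features φ(s1) = 1 and φ(s2) = 2, and both states moving to s2. The true value function is V* = 0, which the approximator represents exactly with θ = 0, yet least-squares fitted value iteration multiplies θ by 6γ/5 per step and diverges whenever γ > 5/6:

```python
import numpy as np

# Two-state chain: both states transition to state 2, all rewards are zero,
# so V* = 0 is exactly representable by phi(s) * theta with theta = 0.
phi = np.array([1.0, 2.0])        # phi(s1) = 1, phi(s2) = 2
gamma = 0.95                      # any gamma > 5/6 produces divergence
theta = 1.0                       # arbitrary nonzero starting parameter

for _ in range(50):
    # Fit phi(s) * theta to the backed-up targets gamma * phi(s2) * theta_k.
    targets = gamma * phi[1] * theta * np.ones(2)
    theta = (phi @ targets) / (phi @ phi)   # closed-form least-squares step
# Each step gives theta_{k+1} = (6/5) * gamma * theta_k, so |theta| blows up.
```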
3 Preliminaries and the algorithm
We consider a class of infinite horizon Markov decision processes with continuous state
and action spaces. The following subsections discuss several important preliminary con-
cepts including Markov decision processes, contraction operators, continuous correspondence,
continuous-state Markov chain and post-decision state variable. These basics are necessary
in the convergence proofs in sections 4, 5 and 6. The last subsection illustrates the details
of an approximate policy iteration algorithm with recursive least squares updating.
3.1 Markov decision processes
We start with a brief review of Markov decision processes. A Markov decision process is a
sequential optimization problem where the goal is to find a policy that maximizes (for our
application) the expected infinite-horizon discounted rewards. Let St be the state of the
system at time t, xt be a vector-valued continuous decision (control) vector, Xπ(St) be a
decision function corresponding to a policy π ∈ Π where Π is the stationary deterministic
policy space, C(St, xt) be a contribution/reward function, and γ be a discount factor between
0 and 1. We shall, by convenient abuse of notation, denote π to be the policy and use it
interchangeably with the corresponding decision function Xπ. The system evolves according
to the following state transition function
St+1 = SM(St, xt, Wt+1), (3)
where Wt+1 represents the exogenous information that arrives during the time interval from
t to t + 1. The problem is to find the policy that solves
sup_{π∈Π} E{∑_{t=0}^∞ γ^t C(S_t, X^π_t(S_t))}. (4)
Since solving the objective function (4) directly is computationally intractable, Bellman’s
equation is introduced so that the optimal control can be computed recursively
V(s) = max_{x∈X} {C(s, x) + γ E[V(S^M(s, x, W)) | s]}, (5)
where V (s) is the value function representing the value of being in state s by following the
optimal policy onward. It is worth noting that the contribution function in (5) can also be
stochastic. Then, Bellman’s equation becomes
V(s) = max_{x∈X} {E[C(s, x, W) + γ V(S^M(s, x, W)) | s]}. (6)
To solve problems with continuous states, we list the following assumptions and definitions
for future reference in later sections. Assume that the state space S is a Borel-measurable,
convex and compact subset of R^m, the action space X is a compact subset of R^n, the
outcome space W is a compact subset of R^l, and Q : S × X × W → R is a continuous
probability transition function. Let Cb(S) denote the space of all bounded continuous
functions from S to R. It is well known that Cb(S) is a complete metric space.
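The objective (4) for a fixed policy can always be estimated by truncated Monte Carlo rollout. The sketch below uses an invented one-dimensional example (compact state space [0, 1], clipped linear dynamics, quadratic reward, and a hand-picked decision function); only the rollout logic is the point:

```python
import numpy as np

# Invented example: S^M(s, x, w) = clip(s + x + w, 0, 1), C(s, x) = -(s - 0.5)^2,
# fixed decision function X^pi(s) = 0.5 - s, and W ~ N(0, 0.1).
gamma, T, n_paths = 0.9, 200, 1000      # gamma^T is negligible beyond T
rng = np.random.default_rng(1)

returns = np.zeros(n_paths)
for i in range(n_paths):
    s = 0.0
    for t in range(T):
        x = 0.5 - s                               # X^pi(s)
        returns[i] += gamma**t * -((s - 0.5)**2)  # discounted contribution
        w = rng.normal(0.0, 0.1)                  # exogenous information W_{t+1}
        s = np.clip(s + x + w, 0.0, 1.0)          # transition S^M(s, x, w)
estimate = returns.mean()   # Monte Carlo estimate of the objective (4)
```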
3.2 Contraction operators
In this section, we describe the contraction operators associated with Markov decision
processes. Their contraction property is crucial in the convergence proof of our algorithm.
Definition 3.1 (Max operator) Let M be the max operator such that for all s ∈ S and
V ∈ Cb(S),
MV(s) = sup_{x∈X} {C(s, x) + γ ∫_W Q(s, x, dw) V(S^M(s, x, w))},

where Q has the Feller property (that is, M maps Cb(S) into itself) and S^M : S × X × W → S
is the continuous state transition function.
Definition 3.2 (Operator for a fixed policy π) Let Mπ be the operator for a fixed policy
π such that for all s ∈ S
M_π V(s) = C(s, X^π(s)) + γ ∫_W Q(s, X^π(s), dw) V(S^M(s, X^π(s), w)),

where Q and S^M have the same properties as in definition 3.1.
There are a few elementary properties of the operators M and Mπ (Bertsekas & Shreve
(1978)) that will play an important role in the subsequent sections.
Proposition 3.1 (Monotonicity of M and M_π) For any V_1, V_2 ∈ Cb(S), if V_1(s) ≤ V_2(s)
for all s ∈ S, then for all k ∈ N and s ∈ S

M^k V_1(s) ≤ M^k V_2(s),
M^k_π V_1(s) ≤ M^k_π V_2(s).
Proposition 3.2 (Fixed point of M and M_π) For any V ∈ Cb(S), lim_{k→∞} M^k V = V*,
where V* is the optimal value function, and V* is the only solution to the equation V = MV.
Similarly, for any V ∈ Cb(S), lim_{k→∞} M^k_π V = V^π, where V^π is the value function obtained
by following policy π, and V^π is the only solution to the equation V = M_π V.
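Proposition 3.2 can be observed numerically. The sketch below uses a small finite-state stand-in for S with invented rewards and transitions; each application of M shrinks the sup-norm change by at least a factor of γ, so the iterates converge to the unique fixed point V* regardless of the starting V:

```python
import numpy as np

# Invented finite problem standing in for (S, X, C, Q).
rng = np.random.default_rng(2)
nS, nA, gamma = 4, 3, 0.9
C = rng.uniform(0, 1, size=(nS, nA))
P = rng.dirichlet(np.ones(nS), size=(nS, nA))

def M(V):
    """Max operator of Definition 3.1 for the finite stand-in problem."""
    return (C + gamma * P @ V).max(axis=1)

V = rng.normal(size=nS)                   # arbitrary starting V
gaps = []
for _ in range(100):
    V_new = M(V)
    gaps.append(np.abs(V_new - V).max())  # sup-norm change per iteration
    V = V_new
# gaps[k+1] <= gamma * gaps[k]: geometric convergence to V* = MV*.
```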
3.3 Continuous correspondence
The correspondence Γ formally defined below describes the feasible set of actions that ensures
the max operator M takes the function space Cb(S) into itself. A correspondence is said to
be compact-valued if the set Γ(s) is compact for every s ∈ S.
Definition 3.3 (Correspondence) A correspondence Γ : S → X is a relation that assigns
a feasible action set Γ(s) ⊂ X for each s ∈ S.
Definition 3.4 (Lower hemi-continuity of correspondence) A correspondence Γ : S → X
is lower hemi-continuous (l.h.c.) at s if Γ(s) is nonempty and if, for every x ∈ Γ(s) and
every sequence s_n → s, there exist N ≥ 1 and a sequence {x_n}_{n=N}^∞ such that x_n → x and
x_n ∈ Γ(s_n) for all n ≥ N.
Definition 3.5 (Upper hemi-continuity of correspondence) A compact-valued corre-
spondence Γ : S → X is upper hemi-continuous (u.h.c.) at s if Γ(s) is nonempty and if, for
every sequence sn → s and every sequence {xn} such that xn ∈ Γ(sn) for all n, there exists
a convergent subsequence of {xn} with limit point x ∈ Γ(s).
Definition 3.6 (Continuity of correspondence) A correspondence Γ : S → X is con-
tinuous at s ∈ S if it is both u.h.c. and l.h.c. at s.
3.4 Markov chains with continuous state space
To work with Markov chains with a general state space, we present the following definitions of
irreducibility, invariant measure, recurrence and positivity that all have familiar counterparts
in discrete chains. These properties are related to the stability of a Markov chain, which
is of great importance in proving the convergence of value function estimates. In addition,
the continuity property of the transition kernel is helpful in defining behavior of chains with
desirable topological structure of the state space (Meyn & Tweedie (1993)). Hence, we
introduce the concepts of Feller chains, petite sets and T -chains, which will be used later to
classify positive Harris chains.
Definition 3.7 (ψ-Irreducibility for general space chains) We call a Markov chain Φ
on state space S ϕ-irreducible if there exists a measure ϕ on B(S) such that whenever ϕ(A) >
0 for A ∈ B(S), we have
Ps{Φ ever enters A} > 0, ∀s ∈ S
where Ps denotes the conditional probability on the event that the chain starts in state s. Let
ψ be the maximal irreducibility measure among such measures.
Definition 3.8 (Harris recurrence) The set A ∈ B(S) is called Harris recurrent if
Ps{Φ ∈ A infinitely often} = 1, ∀s ∈ S.
A chain Φ is called Harris (recurrent) if it is ψ-irreducible and every set in
B+(S) = {A ∈ B(S) : ψ(A) > 0}
is Harris recurrent.
Definition 3.9 (Invariant measure) Let P (·, ·) be the transition kernel of a chain Φ on
the state space S. A σ-finite measure µ on B(S) with the property
µ(A) = ∫_S µ(ds) P(s, A), ∀A ∈ B(S)
will be called invariant.
Definition 3.10 (Positive chains) Suppose a chain Φ is ψ-irreducible and admits an in-
variant probability measure µ. Then Φ is called a positive chain.
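For a finite-state chain, Definition 3.9 reduces to the familiar linear-algebra condition µP = µ; the brief sketch below (with a made-up kernel) recovers the invariant probability measure as the leading left eigenvector:

```python
import numpy as np

# Made-up 3-state transition kernel P(s, s'); rows sum to one.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Invariance mu(A) = int mu(ds) P(s, A) becomes mu P = mu: mu is the left
# eigenvector of P for eigenvalue 1 (all entries of P are positive, so the
# chain is irreducible and this mu is unique and strictly positive).
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
mu = mu / mu.sum()                        # normalize to a probability measure
```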
Definition 3.11 (Weak Feller chains) If a chain Φ has a transition kernel P such that
P (·, O) is a lower semi-continuous function for any open set O ∈ B(S), then Φ is called a
(weak) Feller chain.
It is worth noting that weak Feller property is often defined by assuming that the tran-
sition kernel P maps the set of all continuous functions C(S) into itself.
Definition 3.12 (Petite set) A set C ∈ B(S) is called petite if for some measure ν on
B(S) and δ > 0,
K(s, A) ≥ δν(A), s ∈ C, A ∈ B(S)
where K is the resolvent kernel defined by
K(s, A) = ∑_{n=0}^∞ (1/2)^{n+1} P^n(s, A).
Definition 3.13 (T-chains) If every compact set in B(S) is petite, then Φ is called
a T-chain. (For a more detailed definition, see Meyn & Tweedie (1993), chapter 6.)
3.5 Post-decision state variable
Computing the expectation within the max operator M is often intractable when the un-
derlying distribution of the evolution of the stochastic system is unknown or the decision
x is a vector. However, we can circumvent the difficulty by introducing the notion of the
post-decision state variable (Van Roy et al. (1997), Judd (1998), Powell (2007)). To illustrate
the idea, suppose we can break the original transition function (3) into the two steps
S^x_t = S^{M,x}(S_t, x_t), (7)
S_{t+1} = S^{M,W}(S^x_t, W_{t+1}). (8)
Then, let V^x(S^x_t) be the value of being in state S^x_t immediately after we make a decision.
There is a simple relationship between the pre-decision value function V(S_t) and the post-decision
value function V^x(S^x_t) that is summarized as

V(S_t) = max_{x_t∈X} {C(S_t, x_t) + γ V^x(S^x_t)}, (9)
V^x(S^x_t) = E{V(S_{t+1}) | S^x_t}. (10)
By substituting (9) into (10), we have Bellman's equation for the post-decision value function

V^x(S^x_t) = E{max_{x_{t+1}∈X} {C(S_{t+1}, x_{t+1}) + γ V^x(S^x_{t+1})} | S^x_t}. (11)
In our algorithm, we work with the post-decision value functions resulting from following
a constant policy. By our assumptions of linear pre-decision policy value functions and the
Step 0: Initialization:
  Step 0a: Set the initial values of the value function parameters θ^0.
  Step 0b: Set the initial policy
           π^0(s) = arg max_{x∈Γ(s)} {C(s, x) + γ φ(s^x)′ θ^0}.
  Step 0c: Set the iteration counter n = 1.
  Step 0d: Set the initial state S^0_0.
Step 1: Do for n = 1, ..., N:
Step 2:   Do for m = 1, ..., M:
Step 3:     Initialize v_m = 0.
Step 4:     Choose a one-step sample realization ω.
Step 5:     Do the following:
  Step 5a: Set x^n_m = π^{n−1}(S^n_m).
  Step 5b: Compute S^{n,x}_m = S^{M,x}(S^n_m, x^n_m), S^n_{m+1} = S^M(S^{n,x}_m, W_{m+1}(ω)) and
           S^{n,x}_{m+1} = S^{M,x}(S^n_{m+1}, x^n_{m+1}).
  Step 5c: Compute the corresponding basis function value in the general form
           φ(s^{n,x}_m) − γ φ(s^{n,x}_{m+1}).
Step 6:     Do the following:
  Step 6a: Compute v_m = C(S^{n,x}_m, S^{n,x}_{m+1}).
  Step 6b: Update the parameters θ^{n,m} with the LS/RLS method.
Step 7:   Update the parameter and the policy:
           θ^n = θ^{n,M},
           π^n(s) = arg max_{x∈Γ(s)} {C(s, x) + γ φ(s^x)′ θ^n}.
Step 8: Return the policy π^N and parameters θ^N.

Figure 2: Infinite-horizon approximate policy iteration algorithm with recursive least squares method
assumption of a continuous state transition function, relation (10) implies that the post-decision
value functions are continuous and of the linear form V^x(s^x|θ) = φ(s^x)^T θ, where
φ(s^x) is the vector of basis functions and θ is the vector of parameters. It is further assumed
that the spanning set of the basis functions is known, so it is enough to estimate only the
linear parameters.
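A toy discrete sketch of relations (9) and (10) follows (the transition and reward data are invented): the post-decision state is taken to be the decision itself, the noise W is uniform on {0, 1}, and the point is that the expectation in (10) involves only W, not the maximization:

```python
import numpy as np

# Invented toy model: S^{M,x}(s, x) = x (the decision is the post-decision
# state) and S^{M,W}(s^x, w) = (s^x + w) mod nS with w uniform on {0, 1}.
nS, gamma = 4, 0.9
rng = np.random.default_rng(3)
C = rng.uniform(0, 1, size=(nS, nS))       # C(s, x), x in {0, ..., nS-1}

V = np.zeros(nS)                           # pre-decision values V(s)
for _ in range(200):
    # (10): V^x(s^x) = E[V(S_{t+1}) | s^x], averaging over w in {0, 1}
    Vx = 0.5 * (V + V[(np.arange(nS) + 1) % nS])
    # (9):  V(s) = max_x {C(s, x) + gamma * V^x(s^x)} with s^x = x
    V = (C + gamma * Vx[None, :]).max(axis=1)
# V and Vx now satisfy (9)-(10) to numerical precision.
```

Because the expectation sits outside the max in (10), V^x can be estimated from samples of W alone, without knowing the transition probabilities inside the maximization.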
3.6 Algorithm details
The recursive least squares approximate policy iteration algorithm (RLSAPI) is summa-
rized in Figure 2. It is worth making a remark on the arg max function at the end of the
algorithm. This step is usually a multi-variate global optimization problem that updates
the policy (exactly or approximately) from the post-decision value function of the previous
inner iteration. The updated policy function feeds back a decision given any fixed input of
the state variable. We assume that there is a tie-breaking rule that determines a unique
solution to the arg max function for all f ∈ Cb(S), e.g., via a nonlinear proximal point algorithm
(Luque (1987)). As a result, the returned policies are well-defined single-valued functions.
It is worth noting that determining the unique solution to the arg max function may not
be an easy job in practice. However, the computational difficulty is significantly reduced if
we have special problem structures such as strict concavity and differentiability of the value
functions.
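Step 6b of Figure 2 invokes a generic LS/RLS update. As a sketch only (not the paper's exact updating formulas), a standard recursive least squares step for a linear model v ≈ z′θ, using the Sherman-Morrison identity so that no matrix is inverted per observation, looks like this:

```python
import numpy as np

def rls_update(theta, B, z, v):
    """One recursive least squares step for the model v = z @ theta + noise.

    B approximates (sum_i z_i z_i')^{-1}, downdated via Sherman-Morrison.
    """
    Bz = B @ z
    gain = Bz / (1.0 + z @ Bz)              # Kalman-style gain vector
    theta = theta + gain * (v - z @ theta)  # correct theta by the residual
    B = B - np.outer(gain, Bz)              # rank-one Sherman-Morrison downdate
    return theta, B

# Illustrative usage with simulated data from a known theta*.
F = 3                                       # number of basis functions
theta, B = np.zeros(F), 1e3 * np.eye(F)     # large initial B ~ diffuse prior
theta_star = np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(4)
for _ in range(500):
    z = rng.normal(size=F)                  # regressor, e.g. phi_m - gamma*phi_{m+1}
    v = z @ theta_star + 0.01 * rng.normal()
    theta, B = rls_update(theta, B, z, v)
```

Per-sample cost is O(F²) rather than the O(F³) of refitting a batch least squares problem, which is why the RLS form is attractive inside the inner loop.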
4 Convergence results for a fixed policy
The following theorems and lemmas are extensions of the proof for the convergence of the
LSTD algorithm by Bradtke & Barto (1996) applied to a Markov chain with continuous
state space. We apply the parameter convergence results to prove the convergence of
the inner loop in our policy iteration algorithm. Bradtke & Barto (1996) argues that the LSTD
algorithm is superior to the TD algorithm in terms of convergence properties for the following
three reasons: (1) Choosing step-size parameters is not necessary in LSTD (it is well known
that a poor choice of step sizes can significantly slow down convergence). (2) LSTD uses
samples more efficiently, so it produces a faster convergence rate. (3) TD is not robust to the
initial value of the parameter estimates and the choice of basis functions, but LSTD is.
For a fixed policy π, the transition steps (7) and (8) become
S^π_t = S^{M,π}(S_t, π(S_t)),
S_{t+1} = S^{M,W}(S^π_t, W_{t+1}).
As a result, the Markov decision problem reduces to a Markov chain on post-decision states if the exogenous information $W_{t+1}$ depends only on the post-decision state $S^\pi_t$ in the previous time period. Bellman's equation (11) for the post-decision state becomes
\[
V^\pi(s) = \int_{\mathcal{S}^\pi} P(s, ds')\big(C^\pi(s, s') + \gamma V^\pi(s')\big), \tag{12}
\]
where $P(\cdot,\cdot)$ is the transition probability function of the chain, $C^\pi(\cdot,\cdot)$ is the stochastic contribution/reward function with $C^\pi(S^\pi_t, S^\pi_{t+1}) = C(S_{t+1}, \pi(S_{t+1}))$, and $\mathcal{S}^\pi$ is the post-decision state space under policy $\pi$. It is worth noting that $\mathcal{S}^\pi$ is compact since $\mathcal{S}$ and $\mathcal{X}$ are compact and the transition function is continuous. In addition, $s$ in (12) is the post-decision state variable, and we drop the superscript $x$ for simplicity. Suppose the true value function for following the fixed policy $\pi$ is $V^\pi(s) = \phi(s)^T\theta^*$, where $\phi(s) = [\cdots, \phi_f(s), \cdots]$ is the vector of basis functions of dimension $F = |\mathcal{F}|$ (the number of basis functions) and $f \in \mathcal{F}$ ($\mathcal{F}$ denotes the set of features). Bellman's equation (12) gives us
\[
\phi(s)^T\theta^* = \int_{\mathcal{S}^\pi} P(s, ds')\big[C^\pi(s, s') + \gamma\,\phi(s')^T\theta^*\big].
\]
We can rewrite the recursion as
\[
C^\pi(s, s') = \Big(\phi(s) - \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\Big)^T \theta^* + C^\pi(s, s') - \int_{\mathcal{S}^\pi} P(s, ds')\,C^\pi(s, s').
\]
Remark: It is worth noting that $\phi(s)$ and $\phi(s')$ are vectors. The integral of $\phi(s')$ is taken componentwise, so it returns a vector. Similarly, the integral of a matrix is taken componentwise.
Here we have a general linear model with both observation errors and input noise, known as the errors-in-variables model (Young (1984)). $C^\pi(s, s')$ is the observation, and $C^\pi(s, s') - \int_{\mathcal{S}^\pi} P(s, ds')\,C^\pi(s, s')$ is the observation error. $\phi(s) - \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')$ can be viewed as the input variable. Since the transition probability function may be unknown or not computable, at iteration $m$ we cannot observe the exact input variable and instead observe only an unbiased sample estimate $\phi_m - \gamma\phi_{m+1}$, where $\phi_m$ is shorthand for $\phi(s_m)$. Therefore, the standard linear regression formula for estimating $\theta^*$ does not apply, since the errors in the input variables introduce bias. To circumvent this difficulty, we need to introduce an instrumental variable $\rho$ that is correlated with the true input variable but uncorrelated with both the input error term and the observation error term. It is further required that the correlation matrix between the instrumental variable and the input variable be nonsingular and finite. Then, the $m$-th estimate of $\theta^*$ becomes
\[
\theta_m = \Big[\frac{1}{m}\sum_{i=1}^m \rho_i(\phi_i - \gamma\phi_{i+1})^T\Big]^{-1}\Big[\frac{1}{m}\sum_{i=1}^m \rho_i C_i\Big],
\]
where $C_i = C^\pi(s_i, s_{i+1})$ is the $i$-th observation of the contribution.
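As a concrete (and entirely synthetic) illustration, the sketch below computes this estimate with the instrument taken to be $\rho_i = \phi_i$, the choice adopted later in the text; the basis $\phi(s) = (1, s)$, the trajectory, and the noise-free contributions are our own constructions so that $\theta^*$ is known exactly:

```python
import numpy as np

def iv_estimate(phis, costs, gamma):
    """Instrumental-variable estimate theta_m from a sampled trajectory.

    phis:  (m+1, F) array of basis vectors phi(s_1), ..., phi(s_{m+1})
    costs: (m,) array of observed contributions C_1, ..., C_m
    Uses rho_i = phi_i as the instrument.
    """
    m = len(costs)
    phi, phi_next = phis[:-1], phis[1:]
    A = phi.T @ (phi - gamma * phi_next) / m  # (1/m) sum_i rho_i (phi_i - gamma phi_{i+1})^T
    b = phi.T @ costs / m                     # (1/m) sum_i rho_i C_i
    return np.linalg.solve(A, b)

# noise-free sanity check with a hypothetical basis phi(s) = (1, s)
rng = np.random.default_rng(0)
gamma, theta_star = 0.9, np.array([1.0, -2.0])
s = rng.uniform(-1.0, 1.0, size=501)
phis = np.column_stack([np.ones_like(s), s])
costs = (phis[:-1] - gamma * phis[1:]) @ theta_star  # exact, so theta* is recovered
theta = iv_estimate(phis, costs, gamma)
```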
Convergence of the parameter estimates requires stability of the chain. This can be interpreted as saying that the chain eventually settles down to a stable regime independent of its starting point (Meyn & Tweedie (1993)). Positive Harris chains, defined in section 3, meet this stability requirement precisely, and the invariant measure $\mu$ describes the stationary distribution of the chain. Young (1984) proves convergence of the parameter estimates computed from samples drawn from a stationary distribution in the following lemma.
Lemma 4.1 (Young, 1984) Let $y$ be the input variable, $\xi$ the input noise, $\varepsilon$ the observation noise, and $\rho$ the instrumental variable in the errors-in-variables model. If the correlation matrix $\mathrm{Cor}(\rho, y)$ is nonsingular and finite, and $\mathrm{Cor}(\rho, \xi) = 0$ and $\mathrm{Cor}(\rho, \varepsilon) = 0$, then $\theta_m \to \theta^*$ in probability.
Proof:
See Young (1984).
The task is to find a candidate for $\rho$ in our algorithm. Following Bradtke & Barto (1996), we choose $\phi(s)$ as the instrumental variable in the continuous case as well. Under the assumption that the Markov chain is a stationary process with invariant measure $\mu$, the following lemma shows that $\phi(s)$ satisfies the requirement of being uncorrelated with both the input and observation error terms.
Lemma 4.2 Assume the Markov chain $\Phi$ has an invariant probability measure $\mu$, and let $\rho = \phi(s)$, $\varepsilon = C^\pi(s, s') - \int_{\mathcal{S}^\pi} P(s, ds')\,C^\pi(s, s')$ and $\xi = \gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s') - \gamma\phi(s')$. We have (1) $E(\varepsilon) = 0$, (2) $\mathrm{Cor}(\rho, \varepsilon) = 0$, and (3) $\mathrm{Cor}(\rho, \xi) = 0$.
Proof:
Let $\bar{C}_s = \int_{\mathcal{S}^\pi} P(s, ds')\,C^\pi(s, s')$. (1) is immediate by definition. By (1),
\begin{align*}
\mathrm{Cor}(\rho, \varepsilon) &= E(\phi(s)\varepsilon) \\
&= \int_{\mathcal{S}^\pi} \mu(ds) \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s)\big(C^\pi(s, s') - \bar{C}_s\big) \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s) \int_{\mathcal{S}^\pi} P(s, ds')\big(C^\pi(s, s') - \bar{C}_s\big) \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\big(\bar{C}_s - \bar{C}_s\big) \\
&= 0.
\end{align*}
Hence, (2) holds.
\begin{align*}
\mathrm{Cor}(\rho, \xi) &= E(\phi(s)\xi^T) \\
&= \int_{\mathcal{S}^\pi} \mu(ds) \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s)\Big(\gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s') - \gamma\phi(s')\Big)^T \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s) \int_{\mathcal{S}^\pi} P(s, ds')\Big(\gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s') - \gamma\phi(s')\Big)^T \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\Big(\gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s') - \gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\Big)^T \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\,0^T \\
&= 0.
\end{align*}
Therefore, (3) holds.
For problems with a finite state space, Bradtke & Barto (1996) assume that the number of linearly independent basis functions equals the number of states, so that the correlation matrix between the input and instrumental variables is invertible. However, this assumption defeats the purpose of using continuous function approximation to overcome the curse of dimensionality, since the complexity of computing the parameter estimates is then the same as estimating a look-up table value function directly. By imposing stronger assumptions on the basis functions and the discount factor $\gamma$, the following lemma shows that invertibility of the correlation matrix is guaranteed in the continuous case.
Lemma 4.3 (Non-singularity of the correlation matrix) Suppose we have orthonormal basis functions with respect to the invariant measure $\mu$ of the Markov chain and $\gamma < \frac{1}{F}$, where $F$ is the number of basis functions. Then the correlation matrix
\[
\int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\Big(\phi(s) - \gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\Big)^T
\]
is invertible.
Proof:
As shorthand, write $\int_{\mathcal{S}^\pi}\mu(ds)\,\phi_i(s) = \mu\phi_i$ and $\int_{\mathcal{S}^\pi} P(s, ds')\,\phi_i(s') = P\phi_i(s)$. It is worth noting that $\mu\phi_i$ is a constant while $P\phi_i(s)$ is a function of $s$. Then we can write the correlation matrix explicitly as the $F \times F$ matrix
\[
C = \begin{pmatrix}
\mu\phi_1^2 - \gamma\mu\phi_1 P\phi_1 & \mu\phi_1\phi_2 - \gamma\mu\phi_1 P\phi_2 & \cdots & \mu\phi_1\phi_F - \gamma\mu\phi_1 P\phi_F \\
\mu\phi_2\phi_1 - \gamma\mu\phi_2 P\phi_1 & \mu\phi_2^2 - \gamma\mu\phi_2 P\phi_2 & \cdots & \mu\phi_2\phi_F - \gamma\mu\phi_2 P\phi_F \\
\vdots & & \ddots & \vdots \\
\mu\phi_F\phi_1 - \gamma\mu\phi_F P\phi_1 & \cdots & & \mu\phi_F^2 - \gamma\mu\phi_F P\phi_F
\end{pmatrix}.
\]
Since $\phi(s)$ is a vector of orthonormal basis functions with respect to the invariant measure $\mu$, we have
\[
C = \begin{pmatrix}
1 - \gamma\mu\phi_1 P\phi_1 & -\gamma\mu\phi_1 P\phi_2 & \cdots & -\gamma\mu\phi_1 P\phi_F \\
-\gamma\mu\phi_2 P\phi_1 & 1 - \gamma\mu\phi_2 P\phi_2 & \cdots & -\gamma\mu\phi_2 P\phi_F \\
\vdots & & \ddots & \vdots \\
-\gamma\mu\phi_F P\phi_1 & \cdots & & 1 - \gamma\mu\phi_F P\phi_F
\end{pmatrix}.
\]
We know that
\[
\mu(P\phi_i)^2 \le \mu P\phi_i^2 = \mu\phi_i^2 = 1,
\]
where the inequality is Jensen's and the first equality uses the invariance of $\mu$. Then, for all $1 \le i, j \le F$ we have
\[
\mu\phi_i P\phi_j \le \frac{\mu\phi_i^2 + \mu(P\phi_j)^2}{2} \le 1.
\]
Similarly, $\mu\phi_i P\phi_j \ge -1$ for all $1 \le i, j \le F$. As a result, $C = I - \gamma A$, where $A$ is a matrix with entries $A_{i,j} \in [-1, 1]$. $C$ is invertible iff $|C| \ne 0$, so it suffices to show $|A - \frac{1}{\gamma}I| \ne 0$; in other words, we need to show that $\frac{1}{\gamma}$ is not a real eigenvalue of $A$. Suppose $\lambda$ is a real eigenvalue of $A$ with eigenvector $v$, let $\bar{v} = \max_{1\le i\le F}|v_i| > 0$, and pick $i$ with $|v_i| = \bar{v}$. Since $A_{i,j} \in [-1, 1]$, $|\lambda|\bar{v} = |\sum_j A_{i,j}v_j| \le F\bar{v}$. Hence, all real eigenvalues of $A$ are bounded between $-F$ and $F$. This implies that $C$ is invertible if $\frac{1}{\gamma} > F$; equivalently, $C$ is nonsingular if $\gamma < \frac{1}{F}$.
In general, the discount factor $\gamma$ may not be smaller than $\frac{1}{F}$, which can be quite small as the number of basis functions increases. However, without loss of generality, we can assume $\gamma < \frac{1}{F}$. Since $\gamma \in (0, 1)$, there exists $k \in \mathbb{N}$ such that $\gamma^{k-1} < \frac{1}{F}$, and the following recursion must be satisfied if we substitute Bellman's equation (12) back into itself $k - 2$ times:
\[
V^\pi(s_1) = \int_{\mathcal{S}^\pi\times\cdots\times\mathcal{S}^\pi} \prod_{i=1}^{k-1} P(s_i, ds_{i+1}) \Big\{\sum_{i=1}^{k-1}\gamma^{i-1}C^\pi(s_i, s_{i+1}) + \gamma^{k-1}V^\pi(s_k)\Big\}.
\]
Let $\bar{P}(s_1, ds_k) = \prod_{i=1}^{k-1} P(s_i, ds_{i+1})$, $\bar{C}(s_1, s_k) = \sum_{i=1}^{k-1}\gamma^{i-1}C^\pi(s_i, s_{i+1})$ and $\bar{\gamma} = \gamma^{k-1}$. It is easy to see that $\mu$ is still an invariant measure for $\bar{P}$. As a result, we have
\[
V^\pi(s_1) = \int_{\mathcal{S}^\pi} \bar{P}(s_1, ds_k)\big(\bar{C}(s_1, s_k) + \bar{\gamma}V^\pi(s_k)\big).
\]
This has the same form as (12), so we can make a minor modification to the algorithm that guarantees invertibility of the correlation matrix by collapsing $k - 1$ transitions ($s_1 \to s_k$) into one transition.
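Under the assumptions here, the required order $k$ can be computed directly; a trivial sketch (ours):

```python
def collapse_order(gamma, F):
    """Smallest k with gamma**(k-1) < 1/F, i.e. the number of transitions
    (k - 1 of them collapsed into one) making the effective discount
    gamma**(k-1) satisfy the condition of Lemma 4.3."""
    k = 1
    while gamma ** (k - 1) >= 1.0 / F:
        k += 1
    return k
```

For example, with $\gamma = 0.9$ and $F = 10$ basis functions, $k - 1 = 22$ transitions must be collapsed.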
Finally, we have the parameter estimates
\[
\theta_m = \Big[\frac{1}{m}\sum_{i=1}^m \phi_i(\phi_i - \gamma\phi_{i+1})^T\Big]^{-1}\Big[\frac{1}{m}\sum_{i=1}^m \phi_i C_i\Big].
\]
It is worth noting that the matrix $\frac{1}{m}\sum_{i=1}^m \phi_i(\phi_i - \gamma\phi_{i+1})^T$ could be singular for small $m$. Common remedies for this numerical problem include the pseudo-inverse, a regularized matrix, and additional samples (Choi & Van Roy (2006)).
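A minimal sketch (our own names and thresholds) of the regularized-matrix remedy, falling back to a Tikhonov-style shift when the sampled matrix is singular or badly conditioned:

```python
import numpy as np

def solve_theta(A, b, reg=1e-6, cond_max=1e12):
    """Solve A theta = b; if A is singular or ill-conditioned, solve the
    regularized system (A + reg * I) theta = b instead.
    (np.linalg.pinv(A) @ b would be the pseudo-inverse remedy.)"""
    c = np.linalg.cond(A)
    if np.isfinite(c) and c < cond_max:
        return np.linalg.solve(A, b)
    return np.linalg.solve(A + reg * np.eye(len(A)), b)
```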
When we only have known linearly independent basis functions, the correlation matrix of the input and instrumental variables is not guaranteed to be invertible. However, it is reasonable to assume non-singularity of the correlation matrix. Since singular matrices can be viewed as the roots of the polynomial function given by the determinant, the set of singular $n \times n$ matrices is a null subset (with Lebesgue measure zero) of $\mathbb{R}^{n\times n}$ over the field of real numbers. In other words, if we pick a square matrix over the real numbers at random, the probability that it is singular is zero. The following lemma states that the situation where the correlation matrix is singular is extremely rare.
Lemma 4.4 Suppose we have linearly independent basis functions $\phi = (\phi_1, \ldots, \phi_F)$ that are not zero almost everywhere with respect to the invariant probability measure $\mu$. Then the correlation matrix
\[
C_F = \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\Big(\phi(s) - \gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\Big)^T
\]
is invertible for all but at most finitely many $\gamma \in (0, 1)$.
Proof:
The determinant of the correlation matrix $C_F$ is an $F$-th order polynomial in $\gamma$, so it is either $0$ for all $\gamma$ or has at most $F$ real roots. When $\gamma = 0$, we have $C_F = \mu\phi\phi^T$. It is easy to see that $C_F$ is positive semi-definite. By induction, we show that $C_F$ is in fact positive definite.
1. Base step: If $F = 2$, then $C_2 = \begin{pmatrix}\mu\phi_1^2 & \mu\phi_1\phi_2 \\ \mu\phi_1\phi_2 & \mu\phi_2^2\end{pmatrix}$. By the Cauchy-Schwarz inequality,
\[
\det C_2 = \mu\phi_1^2\,\mu\phi_2^2 - (\mu\phi_1\phi_2)^2 \ge 0.
\]
Equality holds only if $\phi_1$ and $\phi_2$ are linearly dependent. Hence, $\det C_2 > 0$.
2. Induction step: Suppose $C_F$ is positive definite. We need to show that $C_{F+1}$ is positive definite. Let $a$ be any non-zero vector in $\mathbb{R}^{F+1}$. Then,
\[
a^T C_{F+1} a = \mu\Big(\sum_{i=1}^F a_i\phi_i + a_{F+1}\phi_{F+1}\Big)^2.
\]
If $a_{F+1} = 0$ and $a_{-(F+1)}$ (the vector in $\mathbb{R}^F$ obtained by excluding $a_{F+1}$) is nonzero, by the induction hypothesis we have
\[
a^T C_{F+1} a = a_{-(F+1)}^T C_F\, a_{-(F+1)} > 0.
\]
If $a_{F+1} \ne 0$ and $a_{-(F+1)} = 0$,
\[
a^T C_{F+1} a = a_{F+1}^2\,\mu\phi_{F+1}^2 > 0,
\]
since $\phi_{F+1} \ne 0$ almost everywhere. Otherwise, $a^T C_{F+1} a = 0$ only if $\phi_{F+1} = -\frac{1}{a_{F+1}}\sum_{i=1}^F a_i\phi_i$, but this violates the assumed linear independence of the basis functions. Hence, we conclude that $C_F$ is positive definite.
As a result, the determinant of the correlation matrix $C_F$ is not identically $0$ in $\gamma$, and it has at most $F$ real roots in $(0, 1)$. In other words, $C_F$ fails to be invertible for at most $F$ values of $\gamma$.
In practice, one may still run into non-invertible or nearly non-invertible correlation matrices, and we can use remedial approaches such as perturbing $\gamma$ or collapsing transitions.
Lemma 4.5 (Law of large numbers for positive Harris chains) If $\Phi$ is a positive Harris chain (see definitions 3.8 and 3.10) with invariant probability measure $\mu$, then for each $f \in L^1(\mathcal{S}, \mathcal{B}(\mathcal{S}), \mu)$,
\[
\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n f(s_i) = \int_{\mathcal{S}} \mu(ds)\,f(s)
\]
almost surely.
Proof:
See the proof of Theorem 17.1.7 on page 416 of Meyn & Tweedie (1993).
With the above lemma, we are ready to prove that the parameter estimates converge to
the true value almost surely.
Theorem 4.1 Suppose the Markov chain in the aforementioned errors-in-variables model is a positive Harris chain on a continuous state space with transition kernel $P(x, dy)$ and invariant probability measure $\mu$. Further assume the orthonormal basis functions are bounded and continuous. Then $\theta_m \to \theta^*$ almost surely.
Proof:
First, we note that the following must be satisfied by $\theta^*$ for all $s$:
\[
\bar{C}_s = \int_{\mathcal{S}^\pi} P(s, ds')\,C^\pi(s, s') = \Big(\phi(s) - \gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\Big)^T\theta^*.
\]
Then, recall that the $m$-th estimate of $\theta^*$ is defined to be
\[
\theta_m = \Big[\frac{1}{m}\sum_{i=1}^m \phi_i(\phi_i - \gamma\phi_{i+1})^T\Big]^{-1}\Big[\frac{1}{m}\sum_{i=1}^m C_i\phi_i\Big].
\]
Since the basis functions are bounded, by lemmas 4.3 and 4.5 we have, with probability 1,
\begin{align*}
\lim_{m\to\infty}\theta_m
&= \Big[\lim_{m\to\infty}\frac{1}{m}\sum_{i=1}^m \phi_i(\phi_i - \gamma\phi_{i+1})^T\Big]^{-1}\Big[\lim_{m\to\infty}\frac{1}{m}\sum_{i=1}^m C_i\phi_i\Big] \\
&= \Big[\int_{\mathcal{S}^\pi}\mu(ds)\,\phi(s)\Big(\phi(s) - \gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\Big)^T\Big]^{-1}\Big[\int_{\mathcal{S}^\pi}\mu(ds)\,\phi(s)\int_{\mathcal{S}^\pi} P(s, ds')\,C^\pi(s, s')\Big] \\
&= \Big[\int_{\mathcal{S}^\pi}\mu(ds)\,\phi(s)\Big(\phi(s) - \gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\Big)^T\Big]^{-1}\Big[\int_{\mathcal{S}^\pi}\mu(ds)\,\phi(s)\,\bar{C}_s\Big] \\
&= \Big[\int_{\mathcal{S}^\pi}\mu(ds)\,\phi(s)\Big(\phi(s) - \gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\Big)^T\Big]^{-1}\Big[\int_{\mathcal{S}^\pi}\mu(ds)\,\phi(s)\Big(\phi(s) - \gamma\int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\Big)^T\Big]\theta^* \\
&= \theta^*.
\end{align*}
In practice, estimating $\theta$ by least squares, which requires a matrix inversion at each iteration, is computationally expensive. Instead, we can use recursive least squares, which yields the well-known updating formulas (Bradtke & Barto (1996)):
\begin{align}
\varepsilon_m &= C_m - (\phi_m - \gamma\phi_{m+1})^T\theta_{m-1}, \tag{13}\\
B_m &= B_{m-1} - \frac{B_{m-1}\phi_m(\phi_m - \gamma\phi_{m+1})^T B_{m-1}}{1 + (\phi_m - \gamma\phi_{m+1})^T B_{m-1}\phi_m}, \tag{14}\\
\theta_m &= \theta_{m-1} + \frac{\varepsilon_m B_{m-1}\phi_m}{1 + (\phi_m - \gamma\phi_{m+1})^T B_{m-1}\phi_m}. \tag{15}
\end{align}
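The recursions (13)-(15) can be sketched as follows (our own variable names; the synthetic noise-free run and the large initial $B_0$, chosen here to keep the prior bias negligible, are implementation choices of this sketch, not prescriptions of the text):

```python
import numpy as np

def rls_update(theta, B, phi, phi_next, C, gamma):
    """One recursive least squares step, implementing (13)-(15)."""
    d = phi - gamma * phi_next                       # sampled input variable
    denom = 1.0 + d @ B @ phi
    eps = C - d @ theta                              # residual, (13)
    B_new = B - np.outer(B @ phi, d @ B) / denom     # Sherman-Morrison, (14)
    theta_new = theta + eps * (B @ phi) / denom      # parameter update, (15)
    return theta_new, B_new

# noise-free sanity run with a hypothetical basis phi(s) = (1, s)
rng = np.random.default_rng(1)
gamma, theta_star = 0.9, np.array([0.5, 2.0])
s = rng.uniform(-1.0, 1.0, size=401)
phis = np.column_stack([np.ones_like(s), s])
theta, B = np.zeros(2), 1e5 * np.eye(2)
for i in range(400):
    C = (phis[i] - gamma * phis[i + 1]) @ theta_star
    theta, B = rls_update(theta, B, phis[i], phis[i + 1], C, gamma)
```

By the Sherman-Morrison identity, these recursions reproduce the batch instrumental-variable solution without inverting a matrix at each step.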
It is worth noting that we must specify $\theta_0$ and $B_0$: $\theta_0$ can be any finite vector, while $B_0$ is usually chosen to be $\beta I$ for some small positive constant $\beta$. The following corollary shows that the RLS estimate of $\theta^*$ also converges to the true parameters. The proof requires only a simple calculation and is virtually the same as the proof of theorem 4.1.
Corollary 4.1 Suppose we have the same assumptions as in theorem 4.1 and further assume that $1 + (\phi_m - \gamma\phi_{m+1})^T B_{m-1}\phi_m \ne 0$ for all $m$. Then $\theta_m \to \theta^*$ almost surely using the recursion formulas (13), (14) and (15).
5 Convergence of exact policy iteration
Before getting to the convergence theorem, we present the following preliminary convergence results. Proposition 5.1 proves that the value functions of the generated monotone sequence of policies converge to the optimal value function, while lemma 5.1 is used to prove convergence of a sequence of policies to the optimal policy given convergence of the corresponding value functions. Lemma 5.2 shows that integration preserves properties of the integrand, such as boundedness and continuity, that are required in the proof of theorem 5.1.
We first make a notational clarification. Following convention, if $f, g, h$ are functions with the same domain $D$, we write $f = g \le h$ to mean $f(x) = g(x) \le h(x)$ for all $x \in D$.
Proposition 5.1 Let $(\pi_n)_{n=0}^\infty$ be a sequence of policies generated recursively as follows: given an initial policy $\pi_0$, for $n \ge 0$,
\[
\pi_{n+1} = \arg\max_\pi M V^{\pi_n}.
\]
Then $V^{\pi_n} \to V^*$ uniformly.
Proof:
We note that for all $n \in \mathbb{N}$, $V^{\pi_n} = M_{\pi_n}V^{\pi_n} \le MV^{\pi_n}$. For any $k \in \mathbb{N}$, by monotonicity of $M_{\pi_{n+1}}$,
\[
V^{\pi_n} = M_{\pi_n}V^{\pi_n} \le MV^{\pi_n} = M_{\pi_{n+1}}V^{\pi_n} \le M_{\pi_{n+1}}MV^{\pi_n} = M^2_{\pi_{n+1}}V^{\pi_n} \le \cdots \le M^k_{\pi_{n+1}}V^{\pi_n}.
\]
Then $V^{\pi_n} \le \lim_{k\to\infty} M^k_{\pi_{n+1}}V^{\pi_n} = V^{\pi_{n+1}}$, so $\{V^{\pi_n}\}$ is a non-decreasing sequence.
Now we show by induction that $M^n V^{\pi_0} \le V^{\pi_n}$ for all $n \in \mathbb{N}$.
Base step: $n = 1$. Since $V^{\pi_0} \le V^{\pi_1}$, by monotonicity of $M_{\pi_1}$,
\[
MV^{\pi_0} = M_{\pi_1}V^{\pi_0} \le M_{\pi_1}V^{\pi_1} = V^{\pi_1}.
\]
Induction step: Suppose $M^n V^{\pi_0} \le V^{\pi_n}$. Again by monotonicity of $M$ and $M_{\pi_{n+1}}$,
\[
M^{n+1}V^{\pi_0} \le MV^{\pi_n} = M_{\pi_{n+1}}V^{\pi_n} \le M_{\pi_{n+1}}V^{\pi_{n+1}} = V^{\pi_{n+1}}.
\]
By the contraction mapping theorem (Royden (1968)), $M^n V^{\pi_0} \nearrow V^*$ uniformly as $n \to \infty$. We also note that $V^* = \sup_\pi V^\pi$. Therefore, $V^{\pi_n} \nearrow V^*$ uniformly.
Lemma 5.1 Let $\mathcal{S}, \mathcal{X}, \Gamma$ be as defined in section 3, and assume $\Gamma$ is nonempty, compact- and convex-valued, and continuous for each $s \in \mathcal{S}$. Let $f_n : \mathcal{S}\times\mathcal{X} \to \mathbb{R}$ be a sequence of continuous functions. Assume $f$ has the same properties and $f_n \to f$ uniformly. Define $\pi_n$ and $\pi$ by
\[
\pi_n(s) = \arg\max_{x\in\Gamma(s)} f_n(s, x)
\quad\text{and}\quad
\pi(s) = \arg\max_{x\in\Gamma(s)} f(s, x).
\]
Then $\pi_n \to \pi$ pointwise. If $\mathcal{S}$ is compact, $\pi_n \to \pi$ uniformly.
Proof:
See lemma 3.7 and theorem 3.8 on pages 64-65 of Stokey et al. (1989).
Lemma 5.2 Let $\mathcal{S}, \mathcal{X}, \mathcal{W}$ and $Q$ be defined as in lemma 5.1. If $g : \mathcal{S}\times\mathcal{X}\times\mathcal{W} \to \mathbb{R}$ is bounded and continuous, then $Tg(s, x) = \int_{\mathcal{W}} Q(s, x, dw)\,g(s, x, w)$ is also bounded and continuous.
Proof:
See the proof of lemma 9.5 on page 261 of Stokey et al. (1989).
Theorem 5.1 (Convergence of exact policy iteration algorithm) Assume that for all $\pi \in \Pi$ the infinite-horizon problem can be reduced to a positive Harris chain and the corresponding policy value function is continuous and linear in the parameters. Further assume that the contribution function is bounded and continuous. Let $\mathcal{S}, \mathcal{X}, \mathcal{W}$ and $Q$ be defined as in lemma 5.1. For $M, N$ defined in figure 2, as $M \to \infty$ and then $N \to \infty$, the exact policy iteration algorithm converges; that is, $\pi_n \to \pi^*$ uniformly.
Proof:
By assumption, in the inner loop the system evolves according to a stationary Markov process given a fixed policy. By theorem 4.1 or corollary 4.1, as $M \to \infty$, $\theta_{n,M} \to \theta^*_n$ almost surely. In other words, as $M \to \infty$, by following the fixed policy $\pi_n$ we obtain the exact post-decision value function in the linearly parameterized form $\phi(s^x)^T\theta^*_n$. Then define
\[
f_n(s, x) = C(s, x) + \gamma\int_{\mathcal{W}} Q(s, x, dw)\,V^{\pi_n}(S^M(s, x, w)) = C(s, x) + \gamma\,\phi(s^x)^T\theta^*_n,
\]
and
\[
f(s, x) = C(s, x) + \gamma\int_{\mathcal{W}} Q(s, x, dw)\,V^*(S^M(s, x, w)).
\]
Let $V_n = M^n V^{\pi_0}$ and define
\[
\underline{f}_n(s, x) = C(s, x) + \gamma\int_{\mathcal{W}} Q(s, x, dw)\,V_n(S^M(s, x, w)).
\]
By proposition 5.1, $V^{\pi_n} \nearrow V^*$ uniformly, so $f_n \nearrow f$ pointwise, and $\underline{f}_n \nearrow f$ uniformly by theorem 9.9 on page 266 of Stokey et al. (1989). From the proof of proposition 5.1 we have $V_n \le V^{\pi_n}$, so $\underline{f}_n \le f_n$. In turn, we conclude that $f_n \nearrow f$ uniformly. Since the contribution function and the value functions of fixed policies are assumed bounded and continuous, by lemma 5.2 both $f_n$ and $f$ are bounded and continuous. By lemma 5.1, $\pi_n \to \pi^*$ pointwise. Since $\mathcal{S}$ is compact, the convergence is uniform.
In theorem 5.1, we make the strong assumption that the Markov chain obtained by following any fixed policy in the policy space $\Pi$ is a positive Harris chain. This is similar to the condition of having an ergodic Markov chain or an absorbing chain in the discrete case. Recall that a positive Harris chain has the highly desirable property that the Strong Law of Large Numbers (SLLN) holds for the chain independent of its initial state, so that we can show convergence of the parameterized policy value functions. There are sufficient conditions for positive Harris recurrence, such as irreducibility, aperiodicity or "minorization" hypotheses on the transition kernel (Meyn & Tweedie (1993), Nummelin (1984)), but they are hard to verify. Owing to our assumptions of compact $\mathcal{S}$ and $\mathcal{X}$ and a continuous transition probability function $Q$, the following proposition states that the only global requirement for each controlled Markov chain $\Phi^\pi$ to be positive Harris is an irreducibility condition on the post-decision state space $\mathcal{S}^\pi$.
Proposition 5.2 Suppose that for any policy $\pi \in \Pi$ the controlled Markov chain $\Phi^\pi$ is $\psi$-irreducible and the support of $\psi$ has non-empty interior. Then $\Phi^\pi$ is positive Harris.
Proof:
$\Phi^\pi$ is a weak Feller chain (see definition 3.11), since the transition function is continuous and the transition probability function $Q$ has the Feller property. By the irreducibility hypothesis on $\mathcal{S}^\pi$ and theorem 6.2.9 of Meyn & Tweedie (1993), $\Phi^\pi$ is a T-chain (see definition 3.13). By theorem 6.2.5 of Meyn & Tweedie (1993), every compact set in $\mathcal{B}(\mathcal{S}^\pi)$ is petite (see definition 3.12); as a result, $\mathcal{S}^\pi$ is petite since it is compact. By proposition 9.1.7 of Meyn & Tweedie (1993), $\Phi^\pi$ is Harris recurrent. By theorem 10.4.4 of Meyn & Tweedie (1993), $\Phi^\pi$ admits a unique (up to a constant multiple) invariant measure. Finally, by theorem 10.4.10 of Meyn & Tweedie (1993), $\Phi^\pi$ is positive Harris.
Hernandez-Lerma & Lasserre (2001) provide other sufficient conditions for a Markov chain on a non-compact general state space to be positive Harris, which are easy to check in some applications (see theorem 1.3 of Hernandez-Lerma & Lasserre (2001)). For example, the conditions are easily checked for additive-noise chains in $\mathbb{R}^n$ with
\[
S_{t+1} = f(S_t) + \xi_{t+1},
\]
where $f : \mathbb{R}^n \to \mathbb{R}^n$ and the $\xi_{t+1}$ are i.i.d. random vectors with $\xi_1$ absolutely continuous with respect to Lebesgue measure and having a strictly positive density. In our setting, consider an LQR problem with additive linear control and i.i.d. noise, i.e.
\[
S_{t+1} = A S_t + x_t + W_{t+1}
\]
with $x_t = B^\pi S_t + b^\pi$ for some policy $\pi \in \Pi$. Then the transition of the post-decision state variable is
\[
S^x_{t+1} = (A + B^\pi)S^x_t + b^\pi + (A + B^\pi)W_{t+1}.
\]
Since the transition follows an additive-noise system, for each linear policy $\pi$ the problem reduces to a positive Harris chain as long as $(A + B^\pi)W_1$ is absolutely continuous with a strictly positive density.
The condition that each inner loop produce a positive Harris chain may not be guaranteed by the properties of the problem. For example, the chain may be non-irreducible but admit a Harris decomposition; that is, the chain can be decomposed into one transient set and a countable disjoint family of absorbing Harris sets with respective ergodic stationary measures. If a sample realization of the chain is trapped in some absorbing set (i.e. an atom of the chain) whose order is less than the number of basis functions, there is insufficient support to identify the least squares parameter estimates. Hence, a properly designed exploration step, such as adding a random exploration component to the policy function, is necessary in an actual implementation of the algorithm.
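A minimal sketch of such an exploration component (our own; the epsilon-greedy form is one simple choice among many):

```python
import numpy as np

def explore_policy(pi, action_low, action_high, eps=0.1, rng=None):
    """Wrap a deterministic policy pi with a random exploration step:
    with probability eps, draw a uniform action from the compact action set."""
    if rng is None:
        rng = np.random.default_rng(0)
    def pi_explore(s):
        if rng.random() < eps:
            return rng.uniform(action_low, action_high)  # explore
        return pi(s)                                     # exploit
    return pi_explore
```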
Another situation is when the chain admits an invariant probability measure (not necessarily unique) on a rich enough full subset of the state space, meaning the subset has measure 1 and suffices to identify the least squares parameter estimates. This is a rather weak condition, since there are no regularity or even irreducibility assumptions on the chain. In this case, if we can initialize the chain according to the invariant probability measure rather than from some fixed state, the chain becomes a stationary process and we can apply the Strong Law of Large Numbers for stationary processes (Doob (1953)) to achieve convergence of the parameter estimates. In practice, we need to add a sampling stage to estimate the invariant probability measure in the inner loop of the algorithm and then initialize the chain according to the empirical probability measure obtained.
6 Convergence of approximate policy iteration
The exact policy iteration algorithm is only conceptual, since both the inner loop and the outer loop must go to infinity to ensure convergence. In an actual implementation, we need to specify stopping rules by setting large enough iteration counters $N$ and $M$. As a result, we introduce the approximate policy iteration algorithm, in which the inner loop stops in finite time.
In our approximate policy iteration algorithm, the estimated value function of the approximate policy is random, since it depends on the sample path of the chain and on the iteration count of the inner loop. That is to say, for fixed $s \in \mathcal{S}$, $\bar{V}^{\pi_n}(s)$ is a random variable. The following theorem and corollary state that, with the state space $\mathcal{S}^\pi$ compact and the norm being the sup norm $\|\cdot\|_\infty$ for continuous functions, the approximate policy iteration algorithm converges in the mean. In other words, the mean of the norm of the difference between the optimal value function and the estimated policy value function under approximate policy iteration converges to 0 as the successive approximations become more accurate. The proof follows the same lines as the proof of error bounds for approximate policy iteration with discrete and deterministic (or almost sure) value function approximations in Bertsekas & Tsitsiklis (1996).
Lemma 6.1 Let $\pi$ be some policy and $V$ a value function satisfying $\|V - V^\pi\|_\infty \le \varepsilon$ for some $\varepsilon \ge 0$. Let $\bar{\pi}$ be a policy that satisfies
\[
M_{\bar{\pi}}V(s) \ge MV(s) - \delta
\]
for all $s \in \mathcal{S}$ and some $\delta \ge 0$. Then, for all $s \in \mathcal{S}$,
\[
V^{\bar{\pi}}(s) \ge V^\pi(s) - \frac{\delta + 2\gamma\varepsilon}{1 - \gamma}.
\]
Proof:
Compactness of the state space implies that the sup norm of any continuous function is bounded. Let $\beta = \|V^\pi - V^{\bar{\pi}}\|_\infty$. Then, for all $s \in \mathcal{S}$, $V^{\bar{\pi}}(s) \ge V^\pi(s) - \beta$. By monotonicity of $M_{\bar{\pi}}$, we have $V^{\bar{\pi}} = M_{\bar{\pi}}V^{\bar{\pi}} \ge M_{\bar{\pi}}(V^\pi - \beta) = M_{\bar{\pi}}V^\pi - \gamma\beta$. By hypothesis and the definition of $M$, we have $M_\pi V - M_{\bar{\pi}}V - \delta \le MV - M_{\bar{\pi}}V - \delta \le 0$. Then,
\begin{align*}
V^\pi - V^{\bar{\pi}} &\le V^\pi - M_{\bar{\pi}}V^\pi + \gamma\beta \\
&\le V^\pi - M_{\bar{\pi}}V^\pi + \gamma\beta + M_{\bar{\pi}}V - M_\pi V + \delta \\
&= M_\pi V^\pi - M_\pi V + M_{\bar{\pi}}V - M_{\bar{\pi}}V^\pi + \gamma\beta + \delta \\
&\le 2\gamma\|V^\pi - V\|_\infty + \gamma\beta + \delta \\
&\le 2\gamma\varepsilon + \gamma\beta + \delta,
\end{align*}
where the second-to-last step uses the $\gamma$-contraction of $M_\pi$ and $M_{\bar{\pi}}$. Since $\beta \le 2\gamma\varepsilon + \gamma\beta + \delta$ implies $\beta \le \frac{2\gamma\varepsilon + \delta}{1 - \gamma}$, we obtain $V^{\bar{\pi}}(s) \ge V^\pi(s) - \beta \ge V^\pi(s) - \frac{\delta + 2\gamma\varepsilon}{1 - \gamma}$.
The next theorem proves mean convergence of the approximate policy iteration algorithm when both policy evaluations and policy updates are performed within error tolerances that converge to 0 at a suitable rate.
Theorem 6.1 (Mean convergence of approximate policy iteration) Let $\pi_0, \pi_1, \ldots, \pi_n$ be the sequence of policies generated by an approximate policy iteration algorithm and let $\bar{V}^{\pi_0}, \bar{V}^{\pi_1}, \ldots, \bar{V}^{\pi_n}$ be the corresponding approximate value functions. Further assume that, for each fixed policy $\pi_n$, the MDP reduces to a Markov chain that admits an invariant probability measure $\mu^{\pi_n}$. Let $\{\varepsilon_n\}$ and $\{\delta_n\}$ be positive scalars that bound the mean errors in the approximations of value functions and policies (over all iterations) respectively, that is, for all $n \in \mathbb{N}$,
\[
E_{\mu^{\pi_n}}\|\bar{V}^{\pi_n} - V^{\pi_n}\|_\infty \le \varepsilon_n, \tag{16}
\]
and
\[
E_{\mu^{\pi_n}}\|M_{\pi_{n+1}}\bar{V}^{\pi_n} - M\bar{V}^{\pi_n}\|_\infty \le \delta_n. \tag{17}
\]
Suppose the sequences $\{\varepsilon_n\}$ and $\{\delta_n\}$ converge to 0 and
\[
\lim_{n\to\infty}\sum_{i=0}^{n-1}\gamma^{n-1-i}\varepsilon_i = \lim_{n\to\infty}\sum_{i=0}^{n-1}\gamma^{n-1-i}\delta_i = 0,
\]
e.g. $\varepsilon_i = \delta_i = \gamma^i$. Then this sequence eventually produces policies whose performance converges to the optimal performance in the mean:
\[
\lim_{n\to\infty} E_{\mu^{\pi_n}}\|\bar{V}^{\pi_n} - V^*\|_\infty = 0.
\]
Proof:
Let $\bar{\beta}_n = \|V^{\pi_{n+1}} - V^{\pi_n}\|_\infty$, $\bar{\varepsilon}_n = \|\bar{V}^{\pi_n} - V^{\pi_n}\|_\infty$ and $\bar{\delta}_n = \|M_{\pi_{n+1}}\bar{V}^{\pi_n} - M\bar{V}^{\pi_n}\|_\infty$. It is worth noting that $\bar{\varepsilon}_n$ and $\bar{\delta}_n$ are random with $E_{\mu^{\pi_n}}\bar{\varepsilon}_n \le \varepsilon_n$ and $E_{\mu^{\pi_n}}\bar{\delta}_n \le \delta_n$. By lemma 6.1 (applied with $\pi = \pi_n$, $\bar{\pi} = \pi_{n+1}$ and $V = \bar{V}^{\pi_n}$), we have $\bar{\beta}_n \le \frac{\bar{\delta}_n + 2\gamma\bar{\varepsilon}_n}{1-\gamma}$. Taking expectations, $\beta_n \le \frac{\delta_n + 2\gamma\varepsilon_n}{1-\gamma}$, where $\beta_n$ denotes the expectation of $\bar{\beta}_n$.
Similarly, define $\bar{\alpha}_n = \|V^{\pi_n} - V^*\|_\infty$. Then $V^{\pi_n} \ge V^* - \bar{\alpha}_n$, and by monotonicity of $M$,
\[
MV^{\pi_n} \ge M(V^* - \bar{\alpha}_n) = MV^* - \gamma\bar{\alpha}_n.
\]
Then,
\begin{align*}
M_{\pi_{n+1}}V^{\pi_n} &\ge M_{\pi_{n+1}}(\bar{V}^{\pi_n} - \bar{\varepsilon}_n) \\
&= M_{\pi_{n+1}}\bar{V}^{\pi_n} - \gamma\bar{\varepsilon}_n \\
&\ge M\bar{V}^{\pi_n} - \bar{\delta}_n - \gamma\bar{\varepsilon}_n \\
&\ge M(V^{\pi_n} - \bar{\varepsilon}_n) - \bar{\delta}_n - \gamma\bar{\varepsilon}_n \\
&= MV^{\pi_n} - \bar{\delta}_n - 2\gamma\bar{\varepsilon}_n \\
&\ge M(V^* - \bar{\alpha}_n) - \bar{\delta}_n - 2\gamma\bar{\varepsilon}_n \\
&= V^* - \bar{\delta}_n - 2\gamma\bar{\varepsilon}_n - \gamma\bar{\alpha}_n.
\end{align*}
Since $V^{\pi_{n+1}} \ge M_{\pi_{n+1}}V^{\pi_n} - \gamma\bar{\beta}_n$, we have $V^* - V^{\pi_{n+1}} \le \bar{\delta}_n + 2\gamma\bar{\varepsilon}_n + \gamma\bar{\beta}_n + \gamma\bar{\alpha}_n$. Taking expectations of both sides, with $\alpha_n$ denoting the expectation of $\bar{\alpha}_n$, we get
\[
\alpha_{n+1} \le \delta_n + 2\gamma\varepsilon_n + \gamma\beta_n + \gamma\alpha_n \le \frac{\delta_n + 2\gamma\varepsilon_n}{1-\gamma} + \gamma\alpha_n.
\]
Unrolling this recursive inequality, we obtain
\[
\alpha_{n+1} \le \gamma^n\alpha_1 + \frac{1}{1-\gamma}\sum_{i=0}^{n-1}\gamma^{n-1-i}\delta_i + \frac{2\gamma}{1-\gamma}\sum_{i=0}^{n-1}\gamma^{n-1-i}\varepsilon_i.
\]
Hence, we conclude $\lim_{n\to\infty}\alpha_n = 0$. Finally, we have
\begin{align*}
\lim_{n\to\infty} E_{\mu^{\pi_n}}\|\bar{V}^{\pi_n} - V^*\|_\infty
&\le \limsup_{n\to\infty} E_{\mu^{\pi_n}}\big[\|\bar{V}^{\pi_n} - V^{\pi_n}\|_\infty + \|V^{\pi_n} - V^*\|_\infty\big] \\
&\le \lim_{n\to\infty}[\varepsilon_n + \alpha_n] = 0,
\end{align*}
which completes the proof.
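The theorem's side condition can be checked numerically for the example rates $\varepsilon_i = \delta_i = \gamma^i$; the sketch below (ours) evaluates the right-hand side of the final recursion and confirms that it decays:

```python
def alpha_bound(gamma, n, alpha1=1.0):
    """Evaluate the bound
    gamma**n * alpha1 + (1/(1-gamma)) * sum_i gamma**(n-1-i) * delta_i
                      + (2*gamma/(1-gamma)) * sum_i gamma**(n-1-i) * eps_i
    with eps_i = delta_i = gamma**i, in which case each sum equals n * gamma**(n-1)."""
    s = sum(gamma ** (n - 1 - i) * gamma ** i for i in range(n))
    return gamma ** n * alpha1 + (1.0 + 2.0 * gamma) / (1.0 - gamma) * s
```

Since $n\gamma^{n-1} \to 0$, the bound vanishes, matching the theorem's conclusion.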
The above theorem can be directly applied to show that the LS/RLSAPI algorithm
converges in the mean.
Corollary 6.1 (Mean convergence of LS/RLSAPI) Theorem 6.1 holds for the LS/RLS
approximate policy iteration algorithm.
Proof:
We only need to check that conditions (16) and (17), governing the error tolerances for evaluating and updating policies, are satisfied. If we can perform policy updates exactly, we can take $\delta_n = 0$ for all $n$. If the policy update is done inexactly, say with an approximate nonlinear proximal point algorithm, we can force the procedure to stay within an error tolerance satisfying the condition on $\delta_n$ in theorem 6.1. It then suffices to show that the mean error between the approximate and the true value function of the approximate policy in each inner loop can be made arbitrarily small. By Jensen's inequality, we have for each $s \in \mathcal{S}$,
\[
E_{\mu^{\pi_n}}|\bar{V}^{\pi_n}(s) - V^{\pi_n}(s)| \le \sqrt{E_{\mu^{\pi_n}}|\bar{V}^{\pi_n}(s) - V^{\pi_n}(s)|^2}.
\]
By the Cauchy-Schwarz inequality and the simple relations (9) and (10) between pre- and post-decision value functions, we have for all $s \in \mathcal{S}$,
\[
|\bar{V}^{\pi_n}(s) - V^{\pi_n}(s)|^2 = \gamma^2|\phi(s^x)^T(\theta_{n,M} - \theta^*_n)|^2 \le \gamma^2\|\phi(s^x)\|_2^2\,\|\theta_{n,M} - \theta^*_n\|_2^2.
\]
Compactness of $\mathcal{S}^\pi$ implies that $\|\phi(s^x)\|_2 \le c$ for some finite positive constant $c$ and all $s^x$. Let $\varepsilon_n > 0$. Since $\theta_{n,M} \to \theta^*_n$ almost surely as $M \to \infty$, the variance of the parameter estimates shrinks to 0. Recall that for fixed $M$, $\theta_{n,M}$ is a random vector of dimension $F$; let $\theta_{n,M,i}$ be its $i$-th component, $i \in \{1, \ldots, F\}$. Hence, there exists $M_n \in \mathbb{N}$ such that $\max_{i=1,\ldots,F}\mathrm{Var}(\theta_{n,M_n,i}) \le \frac{\varepsilon_n^2}{\gamma^2 F c^2}$. Then,
\[
E_{\mu^{\pi_n}}\|\theta_{n,M_n} - \theta^*_n\|_2^2 = \sum_{i=1}^F \mathrm{Var}(\theta_{n,M_n,i}) \le F\,\frac{\varepsilon_n^2}{\gamma^2 F c^2} = \frac{\varepsilon_n^2}{\gamma^2 c^2}.
\]
Therefore, in the inner loop of each iteration we can uniformly bound the mean difference between the approximate value function and the true value function of the approximate policy:
\[
E_{\mu^{\pi_n}}\|\bar{V}^{\pi_n} - V^{\pi_n}\|_\infty \le \sqrt{\gamma^2 c^2\, E_{\mu^{\pi_n}}\|\theta_{n,M_n} - \theta^*_n\|_2^2} \le \varepsilon_n.
\]
It is worth noting that the subscript $n$ of $M$ indicates that it is not fixed but depends on the outer loop iteration counter $n$, since the chain changes as the policy gets updated.
Hence, we conclude that theorem 6.1 applies to the LS/RLSAPI algorithm.
Remark: It is worth pointing out that the variances of the parameter estimates depend on the true variance of the samples, which may be unknown. In an actual implementation of the algorithm, we use a sample estimate of the variance instead.
7 Extension to unknown basis functions
To apply the mean convergence results of section 6, we made the strong assumption that the value functions of all policies in the policy space are spanned by a finite set of orthonormal basis functions. By instead assuming that the value functions of all policies lie in some specific function space, we can extend the convergence results to unknown basis functions. For simplicity, we restrict ourselves to a 1-dimensional state space, a closed interval $[a, b]$, and only consider the function space $C^k[a, b]$, the set of all continuous functions with derivatives up to $k$-th order.
7.1 Orthogonal polynomials
We first introduce the idea of orthogonal polynomials by defining the inner product with respect to a weighting function $w$ on $C^k[a, b]$:
\[
\langle f, g\rangle_w = \int_a^b f(s)g(s)w(s)\,ds.
\]
This inner product defines the quadratic semi-norm $\|f\|_w^2 = \langle f, f\rangle_w$. Let $G^w = \{g^w_n\}_{n=1}^\infty$ be a set of orthogonal basis functions with respect to $w$ and $G^w_N = \{g^w_n\}_{n=1}^N$ its finite subset of order $N$. Let $\mathcal{G}^w$ and $\mathcal{G}^w_N$ denote the function spaces spanned by $G^w$ and $G^w_N$ respectively. Given any $f$, the best least-squares approximation of $f$ with respect to $w$ onto $\mathcal{G}^w_N$ is the solution to the optimization problem
\[
\inf_{g\in\mathcal{G}^w_N}\int_a^b (f(s) - g(s))^2 w(s)\,ds,
\]
and the solution is
\[
f^w_N = \sum_{n=1}^N \frac{\langle f, g^w_n\rangle_w}{\|g^w_n\|_w^2}\,g^w_n.
\]
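The projection coefficients $\langle f, g_n\rangle_w / \|g_n\|_w^2$ can be computed by simple quadrature; a minimal sketch (our own helper names), checked on the weight $w \equiv 1$ with the basis $\{1, s\}$:

```python
import numpy as np

def _trapezoid(y, s):
    """Composite trapezoidal rule for samples y on the grid s."""
    return float((((y[1:] + y[:-1]) / 2.0) * np.diff(s)).sum())

def ls_coefficients(f, basis, w, a, b, n_quad=2001):
    """Coefficients <f, g_n>_w / ||g_n||_w^2 of the best weighted
    least-squares approximation of f onto span{g_n}."""
    s = np.linspace(a, b, n_quad)
    ws, fs = w(s), f(s)
    return np.array([_trapezoid(fs * g(s) * ws, s) / _trapezoid(g(s) ** 2 * ws, s)
                     for g in basis])

# f(s) = 3 + 2s projected onto {1, s} with w = 1 recovers [3, 2]
coeffs = ls_coefficients(lambda s: 3 + 2 * s,
                         [lambda s: np.ones_like(s), lambda s: s],
                         lambda s: np.ones_like(s), -1.0, 1.0)
```

For a weighting function that is singular at the endpoints, such as the Chebyshev weight, a Gauss-Chebyshev rule would be preferable to plain trapezoidal quadrature.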
7.2 Chebyshev polynomial approximation
We focus on one specific weighting function: the Chebyshev weighting function $c(s) = \frac{1}{\sqrt{1-s^2}}$ on $[-1, 1]$. Chebyshev approximations are well suited to non-periodic smooth functions because they have the desirable uniform convergence property. It is worth noting that the weighting function can easily be modified to extend the result to an arbitrary closed interval $[a, b]$; e.g. the general Chebyshev weighting function is $c_g(s) = \big(1 - (\frac{2s-a-b}{b-a})^2\big)^{-\frac{1}{2}}$.
The family of Chebyshev polynomials $T = \{t_n\}_{n=0}^\infty$ is defined by $t_n(s) = \cos(n\cos^{-1}s)$ (Judd (1998)). It can also be defined recursively as
\begin{align*}
t_0(s) &= 1, \\
t_1(s) &= s, \\
t_{n+1}(s) &= 2s\,t_n(s) - t_{n-1}(s), \quad n \ge 1.
\end{align*}
We normalize by letting $\bar{t}_0 = \frac{t_0}{\sqrt{\pi}}$ and $\bar{t}_n = \sqrt{\frac{2}{\pi}}\,t_n$ for all $n \ge 1$. Let $\mathcal{C}_N$ denote the function space spanned by the finite orthonormal basis set $\bar{T}_N = \{\bar{t}_n\}_{n=0}^N$.
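The recurrence can be checked against the closed form $t_n(s) = \cos(n\cos^{-1}s)$; a small sketch (ours):

```python
import numpy as np

def chebyshev(n, s):
    """Evaluate the Chebyshev polynomial t_n(s) by the three-term
    recurrence t_{n+1}(s) = 2 s t_n(s) - t_{n-1}(s)."""
    s = np.asarray(s, dtype=float)
    t_prev, t = np.ones_like(s), s.copy()
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2.0 * s * t - t_prev
    return t
```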
The following lemma shows that the Chebyshev least squares approximation is not much worse than the best polynomial approximation in the sup norm; uniform convergence of the Chebyshev polynomial approximation follows.
Lemma 7.1 (Uniform convergence of Chebyshev approximation) For all $k$ and all $f \in C^k[a, b]$,
\[
\|f - C_n\|_\infty \le \Big(4 + \frac{4}{\pi^2}\ln n\Big)\frac{(n-k)!}{n!}\Big(\frac{\pi}{2}\Big)^k\Big(\frac{b-a}{2}\Big)^k\|f^{(k)}\|_\infty,
\]
where $C_n$ is the $n$-th degree Chebyshev least squares approximation of $f$; hence $C_n \to f$ uniformly.
Proof:
See page 214 of Judd (1998).
Let $\mu^\pi$ be the invariant measure of the Markov chain obtained by following a fixed policy $\pi$. Suppose $V^\pi \in C^k[-1, 1]$ for all $\pi \in \Pi$. Assume the invariant probability measure $\mu^\pi$ has a continuous density $f^\pi$, i.e. $\mu^\pi(ds) = f^\pi(s)\,ds$, and $f^\pi$ is strictly positive on $[-1, 1]$ with derivatives up to $k$-th order. Let $\tilde{V}^\pi = V^\pi\sqrt{\frac{f^\pi}{c}}$. We consider the following Chebyshev least squares approximation problem:
\[
\inf_{g\in\mathcal{C}_N}\int_{-1}^1 (\tilde{V}^\pi(s) - g(s))^2 c(s)\,ds.
\]
The solution to this problem is the $N$-th degree Chebyshev least squares approximation
\[
g^*(s) = C_N(s) = \sum_{j=0}^N c_j\bar{t}_j(s),
\]
where $c_j = \int_{-1}^1 \tilde{V}^\pi(s)\bar{t}_j(s)c(s)\,ds$. We are now ready to extend the mean convergence result developed in theorem 6.1 to Chebyshev basis functions.
Theorem 7.1 (Mean convergence of LS/RLSAPI with exact Chebyshev polynomials) Suppose we have the same assumptions as in theorem 6.1 and further assume that, for any policy $\pi \in \Pi$, the value function $V^\pi$ is in $C^k[-1, 1]$ and the invariant density function $f^\pi$ is known. Then theorem 6.1 holds for the LS/RLS approximate policy iteration algorithm with Chebyshev polynomial approximation.
Proof:
Let $\varepsilon > 0$ and $B = \|\sqrt{c/f^\pi}\|_\infty$. Lemma 7.1 implies that there exists $N \in \mathbb{N}$ such that $\|\bar{V}^\pi - C_N\|_\infty \le \frac{\varepsilon}{2B}$. Let $T^\pi = \{\bar{t}_n \sqrt{c/f^\pi}\}_{n=0}^{\infty}$. It is easy to see that $T^\pi$ is an orthonormal basis set with respect to $f^\pi$. Finding the best least squares approximation for the value function of policy $\pi$ on the basis set $T^\pi_N$ in LS/RLSAPI reduces to the following least squares approximation problem,
\[
\inf_{g \in \operatorname{span} T^\pi_N} \int_{-1}^{1} \big(V^\pi(s) - g(s)\big)^2 f^\pi(s)\, ds.
\]
It can be verified that the solution is
\[
g^*(s) = V^\pi_N(s) = C_N(s)\sqrt{\frac{c(s)}{f^\pi(s)}}.
\]
Hence,
\[
\|V^\pi - V^\pi_N\|_\infty \le B \cdot \frac{\varepsilon}{2B} = \frac{\varepsilon}{2}.
\]
The above result is employed in the inner loop of the LS/RLSAPI algorithm to construct a finite set of basis functions of order $F = N + 1$ given a desired approximation error tolerance $\varepsilon$. Corollary 6.1 states that in each inner loop we can obtain $\hat{V}^\pi_N$, a statistical estimate of $V^\pi_N$, such that $E_{\mu^\pi}\|\hat{V}^\pi_N - V^\pi_N\|_\infty \le \frac{\varepsilon}{2}$. As a result, we have
\[
E_{\mu^\pi}\|\hat{V}^\pi_N - V^\pi\|_\infty \le E_{\mu^\pi}\|\hat{V}^\pi_N - V^\pi_N\|_\infty + \|V^\pi_N - V^\pi\|_\infty \le \varepsilon.
\]
Hence, theorem 6.1 applies.
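The change-of-weight construction in the proof can be exercised numerically. In the sketch below, the density $f^\pi$ and value function $V^\pi$ are hypothetical stand-ins (a linear density and $e^s$): we form $\bar{V}^\pi = V^\pi\sqrt{f^\pi/c}$, truncate its Chebyshev expansion, map back by $\sqrt{c/f^\pi}$, and verify the inequality $\|V^\pi - V^\pi_N\|_\infty \le B\|\bar{V}^\pi - C_N\|_\infty$ on an interior grid (the weight $c$ is unbounded at $\pm 1$, so $B$ is computed over the grid).

```python
import numpy as np

m = 400
theta = np.pi * (np.arange(m) + 0.5) / m
x = np.cos(theta)                        # Gauss-Chebyshev nodes in (-1, 1)

c = lambda s: (1.0 - s**2) ** -0.5       # Chebyshev weighting function
f_pi = lambda s: 0.5 + 0.3 * s           # hypothetical invariant density on [-1, 1]
V_pi = lambda s: np.exp(s)               # hypothetical policy value function
V_bar = lambda s: V_pi(s) * np.sqrt(f_pi(s) / c(s))

# Chebyshev coefficients of V_bar via Gauss-Chebyshev quadrature.
N = 15
cj = np.array([2.0 / m * np.sum(V_bar(x) * np.cos(j * theta)) for j in range(N + 1)])
cj[0] /= 2.0
C_N = lambda s: sum(cj[j] * np.cos(j * np.arccos(s)) for j in range(N + 1))
V_N = lambda s: C_N(s) * np.sqrt(c(s) / f_pi(s))   # mapped-back approximation

# Check the proof's inequality on an interior grid.
s = np.linspace(-0.99, 0.99, 1001)
B = np.max(np.sqrt(c(s) / f_pi(s)))
lhs = np.max(np.abs(V_pi(s) - V_N(s)))
rhs = B * np.max(np.abs(V_bar(s) - C_N(s)))
assert lhs <= rhs + 1e-12
```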
However, there are two technical difficulties in practice. First, we need to know the exact invariant density function $f^\pi$ in order to construct a finite basis set. In addition, the bound on the $k$-th derivative of $V^\pi\sqrt{f^\pi/c}$ may also be unknown. In an actual implementation of the algorithm, we may let the bound of $(V^\pi\sqrt{f^\pi/c})^{(k)}$ be some large number to determine the order of the basis set before evaluating the value function of the corresponding policy. To tackle the first difficulty, we need a procedure that produces approximations of $f^\pi$. Since we have sequential observations of states from the Markov chain, we can obtain estimates of the invariant density with increasing accuracy and construct approximate basis functions from the estimated density function. The following lemma and theorem provide theoretical support for the asymptotic properties of the algorithm using approximate Chebyshev basis functions.
Lemma 7.2 Suppose $\Phi$ is a positive Harris chain with invariant probability measure $\mu$. For $f, f_i \in L^1(S, \mathcal{B}(S), \mu)$, assume that $(f_i)$ and $f$ are bounded and $f_i \to f$ uniformly. Then
\[
\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} f_i(s_i) = \int_S \mu(ds) f(s)
\]
almost surely.
Proof:
Let $\varepsilon > 0$. Since $f_i \to f$ uniformly, there exists $N \in \mathbb{N}$ such that $|f_i(s) - f(s)| < \frac{\varepsilon}{4}$ for all $s \in S$ and $i > N$. Since $f_i$ and $f$ are bounded, there exists $B > 0$ such that $|f(s)| \le B$ and $|f_i(s)| \le B$ for all $s \in S$ and $i \in \mathbb{N}$. Hence, there exists $M_1 > N$ large enough that $\frac{2BN}{M_1} < \frac{\varepsilon}{4}$, so that for all $n \ge M_1$,
\[
\left|\frac{1}{n}\sum_{i=1}^{n}\big(f_i(s_i) - f(s_i)\big)\right|
\le \frac{1}{n}\sum_{i=1}^{N}|f_i(s_i) - f(s_i)| + \frac{1}{n}\sum_{i=N+1}^{n}|f_i(s_i) - f(s_i)|
< \frac{2BN}{n} + \frac{\varepsilon}{4} < \frac{\varepsilon}{2}.
\]
By lemma 4.5, there exists $M_2 \in \mathbb{N}$ such that for all $n > M_2$,
\[
\left|\frac{1}{n}\sum_{i=1}^{n} f(s_i) - \int_S \mu(ds) f(s)\right| < \frac{\varepsilon}{2}.
\]
Let $M = \max\{M_1, M_2\}$. For all $n > M$ we have
\[
\left|\frac{1}{n}\sum_{i=1}^{n} f_i(s_i) - \int_S \mu(ds) f(s)\right|
\le \left|\frac{1}{n}\sum_{i=1}^{n} f_i(s_i) - \frac{1}{n}\sum_{i=1}^{n} f(s_i)\right|
+ \left|\frac{1}{n}\sum_{i=1}^{n} f(s_i) - \int_S \mu(ds) f(s)\right| < \varepsilon.
\]
Hence, we conclude that $\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} f_i(s_i) = \int_S \mu(ds) f(s)$ almost surely.
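A small simulation illustrates Lemma 7.2. The chain below is a hypothetical AR(1) recursion with invariant law $N(0, 1)$ (a stand-in for a positive Harris chain), with $f(x) = \cos(x)$ and $f_i = f + 1/i$ converging uniformly; the path average of $f_i(s_i)$ approaches $\int f\, d\mu = E[\cos(S)] = e^{-1/2}$ for $S \sim N(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical positive Harris chain: s_{i+1} = 0.5 s_i + w_i,
# w_i ~ N(0, 0.75), whose invariant law is N(0, 1).
n = 200_000
s = np.empty(n)
s[0] = 3.0                      # start far from stationarity
w = rng.normal(0.0, np.sqrt(0.75), n - 1)
for i in range(n - 1):
    s[i + 1] = 0.5 * s[i] + w[i]

# f_i -> f uniformly, with f(x) = cos(x) and f_i(x) = cos(x) + 1/i.
idx = np.arange(1, n + 1)
avg = np.mean(np.cos(s) + 1.0 / idx)

# For S ~ N(0, 1), E[cos(S)] = exp(-1/2).
assert abs(avg - np.exp(-0.5)) < 0.02
```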
Theorem 7.2 (Mean convergence of LS/RLSAPI with approximate Chebyshev polynomials) Suppose, for any policy $\pi \in \Pi$, the value function $V^\pi$ is in $C^k[a, b]$ and there is a procedure producing a sequence of functions $(f^\pi_i)$ such that $f^\pi_i$ converges to the invariant density $f^\pi$ uniformly. We construct the set of approximate basis functions at each iteration $i$ in the inner loop by letting $T^{\pi,i}_N = \{\bar{t}_n \sqrt{c/f^\pi_i}\}_{n=0}^{N}$. Then, theorem 6.1 holds for the LS/RLS approximate policy iteration algorithm with approximate Chebyshev polynomials.
Proof:
Let the vector of approximate basis functions at iteration $i$ in the inner loop be $\phi_i(s)$, and write $\phi_{i,j} = \phi_i(s_j)$. Then, the parameter estimate becomes
\[
\theta_m = \left[\frac{1}{m}\sum_{i=1}^{m} \phi_{i,i}\big(\phi_{i,i} - \gamma\phi_{i,i+1}\big)'\right]^{-1}\left[\frac{1}{m}\sum_{i=1}^{m} \phi_{i,i} C_i\right].
\]
By lemma 7.2, $\theta_m \to \theta^*$ almost surely. Hence, for arbitrary $\varepsilon > 0$, we can also achieve a statistical estimate $\hat{V}^\pi_N$ of $V^\pi_N$ such that $E_{\mu^\pi}\|\hat{V}^\pi_N - V^\pi_N\|_\infty \le \frac{\varepsilon}{2}$. The rest of the proof is virtually the same as the proof of theorem 7.1.
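The parameter estimate $\theta_m$ can be sketched concretely. The example below uses a fixed basis $\phi(s) = (1, s)'$ (ignoring, for simplicity, the iteration-dependence of the approximate basis) on a hypothetical linear-Gaussian chain with reward $C(s) = s$ and $\gamma = 0.9$, where the true value function $V(s) = s/(1 - 0.9 \cdot 0.5)$ lies exactly in the span of the basis, so the estimate should recover it.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
n = 200_000

# Hypothetical fixed policy: s' = 0.5 s + w, w ~ N(0, 0.75), so the
# invariant law is N(0, 1); the observed contribution is C(s) = s.
s = np.empty(n + 1)
s[0] = 0.0
w = rng.normal(0.0, np.sqrt(0.75), n)
for i in range(n):
    s[i + 1] = 0.5 * s[i] + w[i]

# Sample version of the estimator:
# theta_m = [1/m sum phi_i (phi_i - gamma phi_{i+1})']^{-1} [1/m sum phi_i C_i]
Phi = np.stack([np.ones(n + 1), s])          # basis phi(s) = (1, s)'
P, Pnext = Phi[:, :-1], Phi[:, 1:]
A = P @ (P - gamma * Pnext).T / n
b = P @ s[:n] / n                            # C_i = s_i
theta = np.linalg.solve(A, b)

# True value function: V(s) = s / (1 - gamma * 0.5) = s / 0.55.
assert abs(theta[1] - 1.0 / 0.55) < 0.05 and abs(theta[0]) < 0.1
```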
7.3 Invariant density estimation
One of the key assumptions in theorem 7.2 is that the density estimates converge uniformly to the true invariant density of the chain. There are a variety of common methods for estimating density functions from a finite data set, including histogram, frequency polygon, kernel, nearest neighbor, orthogonal series, wavelet, spline, and likelihood-based procedures (Scott (1992), Silverman (1986), Simonoff (1996), Devroye & Gyorfi (1985), Wand & Jones (1995)). However, consistency and asymptotic normality are mostly established for independent and identically distributed (i.i.d.) samples. Roussas (1969), Rosenblatt (1970), Yakowitz (1989), Gyorfi (1981) and Castellana & Leadbetter (1986) studied the consistency and convergence properties of kernel density estimates for Markov processes under regularity conditions. Nevertheless, their convergence results are mostly obtained in the $L^1$ or $L^2$ sense. Kristensen (2008) proves uniform convergence of the kernel estimate of the density function under certain assumptions on the underlying Markov chain in the theorem below.
Assume that the time-homogeneous Markov chain has an invariant density $f$. Since the observed sampling sequence is not initialized according to the invariant measure (we have $\mu^\pi(\{s_0\}) = 1$), the process is non-stationary. For $n \ge 2$, define
\[
p_n(x|y) = \int_{\mathbb{R}^d} p(x|z)\, p_{n-1}(z|y)\, dz,
\]
where $p_1(x|y) = p(x|y)$, the transition density. Since $p_n \to f$ in the total variation norm (Meyn & Tweedie (1993)), asymptotically we can estimate the invariant density pointwise by
\[
\hat{f}_n(s) = \frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{s_i - s}{h}\right),
\]
where $K$ is a kernel function with bandwidth $h$ satisfying some regularity conditions (Kristensen (2008)). To obtain uniform convergence of $\hat{f}_n$ towards the invariant density $f$, the Strong Doeblin Condition (Holden (2000)) is imposed: there exist $n \ge 1$ and $\rho \in (0, 1)$ such that $p_n(y|x) \ge \rho f(y)$ for all $x, y$.
Theorem 7.3 (Uniform convergence of approximate invariant density) Assume that the Markov chain $\Phi$ satisfies the strong Doeblin condition, its transition density $s \mapsto p(s|s')$ is $k$ times differentiable with $\frac{\partial^k}{\partial s^k} p(s|s')$ uniformly continuous for all $s \in \mathbb{R}^d$, and $\|s\|^q f(s)$ is bounded for some $q \ge d$. Then,
\[
\sup_{s\in\mathbb{R}^d} |\hat{f}_n(s) - f(s)| = O_P(h^k) + O_P\!\left(\sqrt{\frac{\ln n}{nh^d}}\right).
\]
Proof:
See the proof of theorem 4 of Kristensen (2008).
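The kernel estimator $\hat{f}_n$ is straightforward to implement. The sketch below applies a Gaussian kernel (our choice; Kristensen (2008) allows a class of kernels) to a sample path of a hypothetical AR(1) chain with invariant law $N(0, 1)$, and compares the estimate with the true invariant density at a few points.

```python
import numpy as np

rng = np.random.default_rng(2)

# Path of a hypothetical chain s' = 0.5 s + w, w ~ N(0, 0.75), whose
# invariant density is the standard normal.
n = 100_000
s = np.empty(n)
s[0] = 2.0
w = rng.normal(0.0, np.sqrt(0.75), n - 1)
for i in range(n - 1):
    s[i + 1] = 0.5 * s[i] + w[i]

def kde(x, sample, h):
    """Kernel estimate f_n(x) = 1/(n h) sum_i K((s_i - x)/h), Gaussian K, d = 1."""
    u = (sample[None, :] - np.asarray(x)[:, None]) / h
    return np.mean(np.exp(-0.5 * u**2), axis=1) / (h * np.sqrt(2.0 * np.pi))

x = np.array([-1.0, 0.0, 1.0])
true = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
assert np.max(np.abs(kde(x, s, h=0.2) - true)) < 0.03
```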
7.4 Fourier series approximation
Another example of the special function space is $C^1_p[-\pi, \pi]$, the set of continuously differentiable functions with $f(-\pi) = f(\pi)$. The subscript $p$ indicates that $f$ can be extended to a periodic function over $\mathbb{R}$ with period $2\pi$. The corresponding weighting function is the constant function $\frac{1}{2\pi}$ and the orthogonal basis set is $\{e^{ins}\}_{n\in\mathbb{Z}}$. Again, it is easy to modify the weighting function and the orthogonal basis set to extend to an arbitrary interval $[a, b]$. The $N$-th order Fourier approximation of a given function $f \in C^1_p[-\pi, \pi]$ is
\[
S_N(s) = \sum_{n=-N}^{N} c_n e^{ins} = \frac{a_0}{2} + \sum_{k=1}^{N}\big(a_k\cos ks + b_k\sin ks\big),
\]
where $c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(s)e^{-ins}\,ds$, $a_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(s)\cos(ks)\,ds$ and $b_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(s)\sin(ks)\,ds$. $S_N$ converges uniformly to $f$ with rate $O(N^{-1/2})$ (Strichartz (2000)). Assuming that all policy value functions $V^\pi$ are in $C^1_p[-\pi, \pi]$, we can derive convergence results similar to those for the Chebyshev approximation, but the technical difficulties do not go away. Fourier series approximations can also be applied to non-periodic functions, but the convergence is not uniform (especially on the boundary) and many terms are needed for good fits (Judd (1998)). Konidaris & Osentoski (2008) empirically evaluate this function approximation scheme with the Fourier basis and demonstrate that it performs well compared to radial basis functions and the polynomial basis.
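As a sketch of the Fourier partial sums $S_N$, the code below computes $a_k$ and $b_k$ by the trapezoid rule (which is highly accurate for smooth periodic integrands) for the smooth periodic test function $e^{\cos s}$ (our choice) and checks that the sup-norm error decays as $N$ grows.

```python
import numpy as np

def fourier_partial_sum(f, N, m=2048):
    """N-th order Fourier partial sum of f on [-pi, pi], with coefficients
    a_k, b_k computed by the trapezoid rule on a uniform grid of m points."""
    s = np.linspace(-np.pi, np.pi, m, endpoint=False)
    fs = f(s)
    a = np.array([2.0 / m * np.sum(fs * np.cos(k * s)) for k in range(N + 1)])
    b = np.array([2.0 / m * np.sum(fs * np.sin(k * s)) for k in range(1, N + 1)])
    def SN(x):
        x = np.asarray(x, dtype=float)
        out = a[0] / 2 * np.ones_like(x)
        for k in range(1, N + 1):
            out += a[k] * np.cos(k * x) + b[k - 1] * np.sin(k * x)
        return out
    return SN

f = lambda s: np.exp(np.cos(s))       # smooth, 2*pi-periodic test function
x = np.linspace(-np.pi, np.pi, 1001)
err = lambda N: np.max(np.abs(f(x) - fourier_partial_sum(f, N)(x)))
# Error shrinks quickly for this smooth periodic function.
assert err(10) < err(4) < err(2)
```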
8 Conclusion
In this paper we propose a policy iteration algorithm for infinite-horizon Markov decision
process problems with continuous state and action spaces. With LS/RLS updating, the exact
algorithm is provably convergent under the assumptions that the stochastic system evolves
according to a positive Harris chain for any fixed policy in the policy space and the true
post-decision value functions of policies are bounded, continuous and spanned by a finite
set of known basis functions. If the basis functions are orthonormal, then we can relax an
invertibility condition of the correlation matrix that ensures convergence of the regression
parameters. We have also shown that the LS/RLSAPI algorithm is convergent in the mean,
meaning that the mean error between the approximate policy value function and the optimal
value function goes to zero as successive approximations become more accurate. If the true policy value functions are not spanned by the set of basis functions, the algorithm may not converge to the optimal value function, but an error bound between the approximate value function from LS/RLS policy iteration and the optimal value function can be derived as in Bertsekas & Tsitsiklis (1996). Furthermore, the mean convergence results can be extended to the case when the true value functions lie in special function spaces for which we can construct a finite set of basis functions from orthogonal polynomials. Our next step is to search for provably convergent algorithms using other function approximations (parametric or non-parametric) suitable for MDP problems with value functions of unknown form. More advanced and sophisticated value function updating rules and approximation techniques will be considered, including neural networks, local polynomial regression (Fan & Gijbels (1996)) and kernel-based reinforcement learning (Ormoneit & Sen (2002)).
Acknowledgement
The first author would like to thank Professor Erhan Cinlar, Ning Hao, Yang Feng, Ke Wan,
Jingjin Zhang, and Yue Niu for many inspiring discussions. Thanks to Peter Frazier for proofreading and helpful comments. This work was supported by Castle Lab at Princeton
University.
References
Antos, A., Munos, R. & Szepesvari, C. (2007a), Fitted Q-iteration in continuous action-space
MDPs, in ‘Proceedings of Neural Information Processing Systems Conference (NIPS),
Vancouver, Canada, 2007’.
Antos, A., Szepesvari, C. & Munos, R. (2007b), Value-Iteration Based Fitted Policy Iteration:
Learning with a Single Trajectory, in ‘Approximate Dynamic Programming and Reinforce-
ment Learning, 2007. ADPRL 2007. IEEE International Symposium on’, pp. 330–337.
Antos, A., Szepesvari, C. & Munos, R. (2008), ‘Learning near-optimal policies with Bellman-
residual minimization based fitted policy iteration and a single sample path’, Machine
Learning 71(1), 89–129.
Baird, L. (1995), ‘Residual algorithms: Reinforcement learning with function approxima-
tion’, Proceedings of the Twelfth International Conference on Machine Learning pp. 30–37.
Bellman, R. & Dreyfus, S. (1959), ‘Functional Approximations and Dynamic Programming’,
Mathematical Tables and Other Aids to Computation 13(68), 247–251.
Bellman, R., Kalaba, R. & Kotkin, B. (1963), ‘Polynomial approximation— a new com-
putational technique in dynamic programming: Allocation processes’, Mathematics of
Computation 17(82), 155–161.
Bertsekas, D. (1995), ‘A Counterexample to Temporal Differences Learning’, Neural Com-
putation 7(2), 270–279.
Bertsekas, D. & Shreve, S. (1978), Stochastic Optimal Control: The Discrete-Time Case,
Academic Press, Inc. Orlando, FL, USA.
Bertsekas, D. & Tsitsiklis, J. (1996), Neuro-Dynamic Programming, Athena Scientific Bel-
mont, MA.
Boyan, J. (1999), Least-squares temporal difference learning, in ‘Proceedings of the Sixteenth
International Conference on Machine Learning’, pp. 49–56.
Boyan, J. & Moore, A. (1995), ‘Generalization in Reinforcement Learning: Safely Approxi-
mating the Value Function’, Advances In Neural Information Processing Systems pp. 369–
376.
Bradtke, S. (1993), ‘Reinforcement Learning Applied to Linear Quadratic Regulation’, Ad-
vances In Neural Information Processing Systems pp. 295–302.
Bradtke, S. & Barto, A. (1996), ‘Linear Least-Squares algorithms for temporal difference
learning’, Machine Learning 22(1), 33–57.
Bradtke, S., Ydstie, B. & Barto, A. (1994), ‘Adaptive linear quadratic control using policy
iteration’, American Control Conference, 1994 3, 3475–3479.
Castellana, J. & Leadbetter, M. (1986), ‘On smoothed probability density estimation for
stationary processes’, Stochastic processes and their applications 21(2), 179–193.
Choi, D. & Van Roy, B. (2006), ‘A Generalized Kalman Filter for Fixed Point Approximation
and Efficient Temporal-Difference Learning’, Discrete Event Dynamic Systems 16(2), 207–
239.
Deisenroth, M. P., Peters, J. & Rasmussen, C. E. (2008), Approximate Dynamic Program-
ming with Gaussian Processes, in ‘Proceedings of the 2008 American Control Conference
(ACC 2008)’, pp. 4480–4485.
Devroye, L. & Gyorfi, L. (1985), Nonparametric density estimation. The L1 view, John Wiley
and Sons, New York.
Doob, J. (1953), Stochastic Processes, John Wiley & Sons, New York.
Engel, Y., Mannor, S. & Meir, R. (2005), ‘Reinforcement learning with Gaussian processes’,
ACM International Conference Proceeding Series 119, 201–208.
Fan, J. & Gijbels, I. (1996), Local Polynomial Modelling and Its Applications, Chapman &
Hall/CRC, London.
Gordon, G. (1995), ‘Stable function approximation in dynamic programming’, Proceedings
of the Twelfth International Conference on Machine Learning pp. 261–268.
Gordon, G. (2001), ‘Reinforcement learning with function approximation converges to a
region’, Advances in Neural Information Processing Systems 13, 1040–1046.
Gyorfi, L. (1981), ‘Strongly consistent density estimate from ergodic sample’, J. Multivariate
Anal. 11, 81–84.
Hernandez-Lerma, O. & Lasserre, J. (2001), ‘Further criteria for positive Harris recurrence
of Markov chains’, Proceedings-American Mathematical Society 129(5), 1521–1524.
Holden, L. (2000), ‘Convergence of Markov chains in the relative supremum norm’, Journal
of Applied Probability 37(4), 1074–1083.
Judd, K. (1998), Numerical Methods in Economics, MIT Press Cambridge, MA.
Konidaris, G. & Osentoski, S. (2008), ‘Value Function Approximation in Reinforcement
Learning using the Fourier Basis’, Technical Report UM-CS-2008-19.
Kristensen, D. (2008), ‘Uniform Convergence Rates of Kernel Estimators with Heterogenous,
Dependent Data’, CREATES Research Paper 2008-37.
Lagoudakis, M. & Parr, R. (2003), ‘Least-Squares Policy Iteration’, Journal of Machine
Learning Research 4(6), 1107–1149.
Landelius, T. & Knutsson, H. (1996), ‘Greedy adaptive critics for LQR problems: Conver-
gence proofs (Tech. Rep. No. LiTH-ISY-R-1896)’, Linkoping, Sweden: Computer Vision
Laboratory.
Luque, J. (1987), A nonlinear proximal point algorithm, in ‘26th IEEE Conference on Deci-
sion and Control’, Vol. 26, pp. 816–817.
Mahadevan, S. & Maggioni, M. (2007), ‘Proto-value Functions: A Laplacian Framework
for Learning Representation and Control in Markov Decision Processes’, The Journal of
Machine Learning Research 8, 2169–2231.
Melo, F., Lisboa, P. & Ribeiro, M. (2007), ‘Convergence of Q-learning with linear function
approximation’, Proceedings of the European Control Conference 2007 pp. 2671–2678.
Meyn, S. (1997), 'The policy iteration algorithm for average reward Markov decision processes
with general state space', Automatic Control, IEEE Transactions on 42(12), 1663–1680.
Meyn, S. & Tweedie, R. (1993), Markov chains and stochastic stability, Springer, New York.
Munos, R. & Szepesvari, C. (2005), Finite time bounds for sampling based fitted value
iteration, in ‘Proceedings of the 22nd international conference on Machine learning’, ACM
New York, NY, USA, pp. 880–887.
Nummelin, E. (1984), General Irreducible Markov Chains and Non-Negative Operators, Cam-
bridge University Press, Cambridge.
Ormoneit, D. & Sen, S. (2002), ‘Kernel-Based Reinforcement Learning’, Machine Learning
49(2), 161–178.
Papavassiliou, V. & Russell, S. (1999), ‘Convergence of Reinforcement Learning with General
Function Approximators’, Proceedings of the Sixteenth International Joint Conference on
Artificial Intelligence pp. 748–757.
Perkins, T. & Precup, D. (2003), ‘A Convergent Form of Approximate Policy Iteration’,
Advances In Neural Information Processing Systems pp. 1627–1634.
Powell, W. B. (2007), Approximate Dynamic Programming: Solving the curses of dimen-
sionality, John Wiley and Sons, New York.
Puterman, M. L. (1994), Markov Decision Processes, John Wiley & Sons, New York.
Reetz, D. (1977), ‘Approximate Solutions of a Discounted Markovian Decision Process’,
Bonner Mathematische Schriften 98, 77–92.
Rosenblatt, M. (1970), ‘Density estimates and Markov sequences’, Nonparametric Techniques
in Statistical Inference pp. 199–213.
Roussas, G. (1969), ‘Nonparametric estimation of the transition distribution function of a
Markov process’, Ann. Math. Statist 40, 1386–1400.
Royden, H. (1968), Real analysis, Macmillan New York.
Rummery, G. & Niranjan, M. (1994), On-line Q-learning using connectionist systems, Cam-
bridge University Engineering Department.
Schweitzer, P. & Seidmann, A. (1985), ‘Generalized polynomial approximations in Markovian
decision processes’, Journal of mathematical analysis and applications 110(2), 568–582.
Scott, D. (1992), Multivariate Density Estimation: Theory, Practice, and Visualization, John
Wiley and Sons, New York.
Silverman, B. (1986), Density Estimation for Statistics and Data Analysis, Chapman &
Hall/CRC London.
Simonoff, J. (1996), Smoothing Methods in Statistics, Springer, New York.
Stokey, N., Prescott, E. & Lucas, R. (1989), Recursive Methods in Economic Dynamics,
Harvard University Press Cambridge, MA.
Strichartz, R. (2000), The Way of Analysis, Jones & Bartlett Publishers, Sudbury, MA.
Sutton, R. & Barto, A. (1998), Reinforcement Learning: An Introduction, MIT Press Cam-
bridge, MA.
Szita, I. (2007), Rewarding Excursions: Extending Reinforcement Learning to Complex Do-
mains, Eotvos Lorand University, Budapest.
Tesauro, G. (1994), 'TD-Gammon, a self-teaching backgammon program, achieves master-
level play', Neural Computation 6(2), 215–219.
Tsitsiklis, J. & Van Roy, B. (1996), ‘Feature-based methods for large scale dynamic pro-
gramming’, Machine Learning 22(1), 59–94.
Tsitsiklis, J. & Van Roy, B. (1997), ‘An analysis of temporal-difference learning with function
approximation’, Automatic Control, IEEE Transactions on 42(5), 674–690.
Van Roy, B., Bertsekas, D., Lee, Y. & Tsitsiklis, J. (1997), A Neuro-Dynamic Programming
Approach to Retailer Inventory Management, in ‘Proceedings of the 36th IEEE Conference
on Decision and Control, 1997’, Vol. 4.
Wand, M. & Jones, M. (1995), Kernel Smoothing, Chapman & Hall/CRC London.
Whitt, W. (1978), ‘Approximations of Dynamic Programs, I’, Mathematics of Operations
Research 3(3), 231–243.
Yakowitz, S. (1989), ‘Nonparametric density and regression estimation for Markov sequences
without mixing assumptions’, Journal of Multivariate Analysis 30(1), 124–136.
Young, P. (1984), Recursive estimation and time-series analysis: an introduction, Springer-
Verlag, New York.