Reinforcement Learning with
Function Approximation
Joseph Christian G. Noel
November 2011
Abstract
Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The
main goal is for an agent to learn to act in an unknown environment in a way that
maximizes a reward that it receives from the environment. As the number of states
in an environment grows larger, it becomes more important for the agent to generalize
what it has learned from some states to other similar states. An agent is able to do
this with a number of function approximation techniques.
This report presents a general overview of reinforcement learning when combined
with function approximation. There are two main sources for this report: Neuro-Dynamic
Programming [Bertsekas and Tsitsiklis 1996] and Reinforcement Learning: An Introduction
[Sutton and Barto 1998]. [Bertsekas and Tsitsiklis 1996] discusses RL from a mathematician's
perspective and is very hard to read. [Sutton and Barto 1998] is intuitive, but does not discuss
theoretical issues such as convergence properties. The goal of this report is to create an
overview that draws on both sources and is both easy to read and sufficiently grounded in
theory. We restrict ourselves to online gradient-based methods in this report.
After discussing the mathematical theory we implement some of the techniques on the
cart pole and mountain car domains and report on the results.
Contents

Abstract

1 Introduction
  1.1 Artificial Intelligence
    1.1.1 Machine Learning
  1.2 Reinforcement Learning
    1.2.1 RL with Function Approximation

2 Background
  2.1 Model
  2.2 Rewards
  2.3 Markov Decision Processes
  2.4 Value Functions
  2.5 Bellman Equations for the Value Functions
  2.6 Exploration versus Exploitation

3 Reinforcement Learning Algorithms
  3.1 Dynamic Programming
    3.1.1 Policy Iteration
    3.1.2 Value Iteration
  3.2 Monte-Carlo Methods
  3.3 Temporal-Difference Learning
    3.3.1 TD(0)
    3.3.2 Eligibility Traces for TD(λ)
    3.3.3 Convergence of TD
    3.3.4 Sarsa(λ)
    3.3.5 Q-Learning
    3.3.6 Convergence of Q-Learning

4 Reinforcement Learning with Function Approximation
  4.1 Function Approximation (Regression)
  4.2 Gradient Descent Methods
    4.2.1 Stochastic Gradient Descent
    4.2.2 Convergence of SGD for Markov Processes
  4.3 TD Learning
    4.3.1 Control with Function Approximation
    4.3.2 Convergence of TD With Function Approximation
  4.4 Residual Gradients
    4.4.1 Convergence of Residual Gradients
    4.4.2 Control with Residual Gradients

5 Experimental Results
  5.1 Domains
    5.1.1 Mountain Car
    5.1.2 Cart Pole
  5.2 Results
    5.2.1 Optimal Parameter Values
    5.2.2 Results

6 Final Remarks

Bibliography
Chapter 1
Introduction
1.1 Artificial Intelligence
Artificial intelligence (AI) has been a common fixture in science fiction stories and
in people's imaginations for centuries. However, the field of formal AI research only
started in the summer of 1956 at a conference at Dartmouth College. Since then, the
goal of the field of artificial intelligence has been to create intelligent agents that can
mimic or go beyond human-level intelligence. Within the field of AI there are many
subfields, each studying a specific aspect of what we humans usually define as "intelligence".
Examples of these subfields are computer vision, logic, planning, robotics,
and machine learning. This report deals with a particular branch of machine learning
called reinforcement learning.
1.1.1 Machine Learning
Machine learning is concerned with the design and development of algorithms that
allow computers to improve their performance over time based on data. There are
three main types of machine learning algorithms: supervised learning, unsupervised
learning, and reinforcement learning. In supervised learning, the algorithm is given a
set of labeled examples for training, and uses them to infer a function mapping that will
enable it to generalize to unseen test data. In unsupervised learning, the algorithm is
given only unlabeled data and has to infer some inherent structure from it. We discuss
reinforcement learning in this report.
1.2 Reinforcement Learning
Reinforcement learning (RL) is a fundamental problem in artificial intelligence. In RL
an agent learns to act in an unknown environment. The agent interacts with the en-
vironment by performing actions which change the state of the environment. After
each action, the agent also receives a “reward” signal from the environment. The goal
of the agent is to maximize the cumulative reward it receives from the environment.
This interaction can be seen in Figure 1.1.
Figure 1.1: The Agent-Environment architecture in RL. Retrieved from [Sutton and Barto
1998]
As an example, imagine a reinforcement learning agent trying to learn how to play
the game of blackjack. The environment signals it receives can be the cards it has
in its hand as well as the one card shown by the dealer. The set of actions it can take
can be {hit, hold}. The reward signals the agent receives can be a reward of −1 when
it busts or is beaten by the dealer's hand, a reward of 1 when it beats the dealer's
hand or when the dealer busts, and a reward of 0 otherwise. Given enough
episodes/trials, the RL agent will eventually learn which action to take for each hand
it can hold in order to maximize the rewards it gets. In effect, it will learn how to play
blackjack optimally.
1.2.1 RL with Function Approximation
In the blackjack example above, the number of possible states is bounded by the number
of permutations of the 52 cards in a standard deck. However, as the number of states
in an environment grows larger and larger, it becomes infeasible for an agent to visit all
possible states enough times to find the optimal actions for those states. Thus, it
becomes important to be able to generalize the learning experience in a particular
state to the other states in the environment. A common way to do this is through
function approximation. The agent extracts the relevant information it needs from the
state through feature extraction, and uses the resulting feature vector to calculate the
approximate value of being in that state. This is done by taking a dot product
between the feature vector and a parameter vector. In this way, states with similar features
will also have similar values. The goal of RL with function approximation is then to
learn the best values for this parameter vector.
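As a minimal illustration (a sketch, not the exact implementation used later in this report), the Python snippet below computes a state's approximate value as the dot product of a feature vector and a parameter vector; the feature extractor and the weight values are hypothetical placeholders.

# Sketch: approximate the value of a state as the dot product <w, phi(s)>.
# The feature extractor and the weight values below are illustrative only.

def phi(state):
    # Map a raw state (here: position and velocity) to a feature vector.
    position, velocity = state
    return [position, velocity, 1.0]   # includes a constant bias feature

def value(w, state):
    # V_w(s) = <w, phi(s)>: states with similar features get similar values.
    return sum(wi * fi for wi, fi in zip(w, phi(state)))

w = [0.5, -1.2, 0.1]            # parameter vector learned by the RL algorithm
print(value(w, (0.3, -0.02)))   # approximate value of one example state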
Combining reinforcement learning with function approximation techniques allows
the agent to generalize and hence handle a large (even infinite) number of states.
Chapter 2
Background
2.1 Model
The reinforcement learning model consists of a set of states S, a set of actions A, and
transition rules between states depending on the action. At a given state s ∈ S at
time t, an agent chooses an action a ∈ A, transitions to a new state s′, and receives
a reward rt. A series of actions will eventually lead the agent to the terminal state or
goal state, which ends an episode. At this point the environment is reset and the agent
starts again from an initial state and the process repeats itself. Examples of episodes
can be a single game of chess or a single run through of a maze.
An agent follows a policy π, which describes how an agent should act at a given
time. Formally, π(s, a) is a mapping that gives the probability of taking action a when
in state s. A policy π is a proper policy if, when following it, there is a positive
probability that the agent will eventually reach the goal state, i.e., it will not cycle
through some states forever without ever terminating an episode. There is always at least one
policy that is better than or equal to all other policies, and we denote all these optimal
policies as π∗. Optimal policies will be explained in more detail in the section on value
functions.
2.2 Rewards
The goal of reinforcement learning is to maximize the expected reward Rt,
Rt = rt + rt+1 + rt+2 + ... + rT
where T is the final time step. This is the finite horizon model, wherein the agent
tries to maximize the reward for the next T steps without regard for any steps after
that. There are drawbacks with this model, however. First, in most domains
the agent does not know how many time steps it will take to reach the end state, and
may need to handle an unbounded number of time steps. Second, it is possible for an agent to
be "lazy" in a finite horizon model. For example, if the horizon is ten steps, the agent
could forever be trying to maximize the reward that happens ten steps from the current
time step, without ever doing the action that actually receives that reward.
These drawbacks are fixed by adding a discount factor γ to the sum,
Rt = rt + γrt+1 + γ²rt+2 + ... = ∑∞k=0 γ^k rt+k
where 0 ≤ γ < 1. This is the infinite discounted horizon model. The agent considers
all rewards into the future, but γ acts as a discount factor so that the infinite sum is
bounded to a finite value. The value of γ defines how much the agent takes into account
future rewards. When γ = 0, the agent takes the short-term view and tries to maximize
the reward only for the next time step. As γ → 1, the agent considers future rewards
more strongly and takes the longer term view.
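For concreteness, the short Python sketch below computes the discounted return of a finite reward sequence; the rewards and discount factors used here are arbitrary illustrative values.

def discounted_return(rewards, gamma):
    # Rt = sum over k of gamma^k * r_{t+k}, truncated at the end of the episode.
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# With gamma close to 1, future rewards matter almost as much as immediate ones;
# with gamma = 0, only the very next reward counts.
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # approximately 0.729
print(discounted_return([0, 0, 0, 1], gamma=0.0))  # 0.0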
2.3 Markov Decision Processes
An environment 〈S, A, T, R, γ〉 defines the set of possible states S, the set of actions A,
the transition probabilities T, the reward function R, and the discount factor γ. In general,
the probability at time t of an agent moving from state st to the next state s′ and receiving
reward r upon performing action at may depend on the entire history of states, actions, and
rewards before t. That is, what happens next is defined by a probability
Pr(st+1 = s′, rt+1 = r|st, at, rt, st−1, at−1, rt−1, ..., r1, s0, a0).
However, there is a class of environments for which the relevant information from
the history of all states, actions, and rewards before time t is encapsulated in the state
at time t. This is called the Markov property, and hence tasks which exhibit this
property are called Markov decision processes. The Markov property states that
Pr(st+1 = s′, rt+1 = r|st, at) = Pr(st+1 = s′, rt+1 = r|st, at, rt, st−1, at−1, rt−1, ..., r1, s0, a0).
2.4 Value Functions
As said previously, a reinforcement learning agent strives to maximize the expected
reward Rt from a given state at time t. It does this by estimating value functions for
states or state-action pairs. The state value function V π(s) estimates the expected
reward the agent will receive when starting in state s and following policy π thereafter,
V π(s) = Eπ(Rt|st = s).
The state-action value function Qπ(s, a) estimates the expected reward the agent
will receive when starting in state s, taking action a, and following policy π thereafter,
Qπ(s, a) = Eπ(Rt|st = s, at = a).
We can now properly define the optimal policy π∗. First, we define a policy π to
be better than or equal to policy π′ if and only if V π(s) ≥ V π′(s) for all s ∈ S. There
is always at least one policy that is better than or equal to all other policies, and we
denote all these optimal policies as π∗. The state value functions for these optimal
policies are given as

V ∗(s) = maxπ V π(s).

The state-action value functions for the optimal policies are given as

Q∗(s, a) = maxπ Qπ(s, a).

An optimal policy π∗ is therefore one that, in every state s ∈ S, chooses an action a
that maximizes Q∗(s, a),

π∗(s) = arg maxa Q∗(s, a).

Reinforcement learning uses the value functions to approximate the optimal policy:
acting greedily with respect to Q∗(s, a), or with respect to V ∗(s) using a one-step
lookahead, yields an optimal policy.
2.5 Bellman Equations for the Value Functions
Reinforcement learning uses the Bellman equations to reformulate the value functions.
The Bellman equations express the relationship between the value of a state and the
value of its successor state in the case of the state value function, and between the value
of a state-action pair and the value of the succeeding state-action pair in the case of the
state-action value function. This allows V π and Qπ to be defined recursively. The
Bellman equation for V π is

V π(s) = Eπ[rt+1 + γV π(st+1) | st = s].

The Bellman equation for Qπ is

Qπ(s, a) = Eπ[rt+1 + γQπ(st+1, at+1) | st = s, at = a].
2.6 Exploration versus Exploitation
One of the fundamental problems in reinforcement learning is balancing exploration
and exploitation. In exploitation, the agent exploits what it already knows and does a
greedy selection when choosing an action at a particular time step,
at = arg maxa Q(st, a).
In this way the agent is maximizing the rewards it receives given what it already
knows. However, what if there is another action that actually gives a better reward
than the one returned by a greedy selection? The agent's current value function estimates
may simply not reflect this yet. If the agent always chooses the greedy action,
it will never find out about the better action.
This is where exploration comes in. One method of exploration is ε-greedy. At a
With probability ε, the agent chooses an action at random from the set of available actions
instead of making a greedy selection. This allows the agent to discover actions that
are better than what it currently believes to be best, and eventually find the
optimal policy.
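A minimal sketch of ε-greedy selection, assuming the state-action value estimates are kept in a Python dictionary keyed by (state, action) pairs:

import random

def epsilon_greedy(Q, state, actions, epsilon):
    # With probability epsilon, explore: choose an action uniformly at random.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise exploit: choose the action with the highest current estimate.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))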
Another way of doing exploration is through optimistic initialization. The model
parameters for the value functions are initialized such that the expected rewards of the
states and actions appear higher than they actually are. The agent will then seek
out all these states and actions through greedy selection until the expected rewards
for these states and actions drop to their actual values. We use optimistic initialization to
enforce exploration in the RL domains we implement for this paper.
As the time step t → ∞, continuing exploration allows all states to eventually be
visited infinitely often, and this is a key requirement for the convergence of RL
algorithms to the optimal policy π∗.
Figure 2.1: Average performance for different values of ε on the 10-armed bandit. Retrieved
from [Sutton and Barto 1998]
Chapter 3
Reinforcement Learning
Algorithms
A variety of reinforcement learning algorithms have already been developed for finite
state space environments. In most of these algorithms the state and state-action value
functions are stored in a tabular format, with one entry per state or state-action pair. We
discuss these algorithms before moving on to their function approximation extensions
in the next chapter.
3.1 Dynamic Programming
Dynamic programming (DP) is a set of algorithms that can be used to compute optimal
policies. A complete and perfect model of the environment as a Markov decision process
is required. We show two popular DP methods, policy iteration and value iteration.
3.1.1 Policy Iteration
In policy iteration, DP first evaluates V π for an initial policy π, and then uses this to
find a better policy π′. It then repeats the process until π′ converges to the optimal
policy. Let P (s, a, s′) be the transition probability of moving from state s to s′ upon
doing action a, and let R(s, a, s′) be the returned reward after moving from state s to
state s′ upon doing action a. As before, π(s, a) is the probability of doing action a at
state s under policy π.
For a given policy π, V π(s) is calculated as
V π(s) = ∑a π(s, a) ∑s′ P (s, a, s′)[R(s, a, s′) + γV π(s′)].
Qπ can then be easily calculated from V π using

Qπ(s, a) = ∑s′ P (s, a, s′)[R(s, a, s′) + γV π(s′)].
After computing the value functions, we can easily get the improved policy π′ by
letting
π′(s) = arg maxa Qπ(s, a).
DP then repeats the process with π′. When π′ is as good as, but not better than, π,
so that V π = V π′, both π and π′ are optimal policies.
3.1.2 Value Iteration
In policy iteration each iteration does a policy evaluation and calculates V π, and this
step can take a long time because calculating V π is itself an iterative computation
that loops through the entire state space repeatedly. The value iteration algorithm
improves on this by combining policy improvement and a truncated policy evaluation
into a single update step
Vk+1(s) = maxa ∑s′ P (s, a, s′)[R(s, a, s′) + γVk(s′)].
For any initial V0, the sequence {Vk} will eventually converge to V ∗ as long as
0 < γ < 1.
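Below is a sketch of value iteration under the assumption that the model is given as a dictionary P mapping each (s, a) pair to a list of (probability, next state, reward) outcomes; the stopping threshold theta is an illustrative choice.

def value_iteration(states, actions, P, gamma, theta=1e-6):
    # Repeatedly apply V_{k+1}(s) = max_a sum_{s'} P(s,a,s')[R(s,a,s') + gamma * V_k(s')].
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V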
3.2 Monte-Carlo Methods
Monte-Carlo (MC) methods average the returns observed from a state to estimate its
value function. Unlike dynamic programming, Monte-Carlo does not require any
knowledge of the environment. All it needs is some way of generating episodes
by following the policy π. Each episode contains the states passed through by the
policy, the action taken in each of those states, and the return following the first
occurrence of each state or state-action pair. Let Returni(s) and Returni(s, a) be the
return Rt observed after the first occurrence of the state or state-action pair in episode i,
and let N be the total number of episodes that have been generated from π. MC
estimates V π and Qπ as
V π(s) = (∑Ni=1 Returni(s)) / N

Qπ(s, a) = (∑Ni=1 Returni(s, a)) / N
As in DP, policy improvement again follows as
π′(s) = arg maxa Qπ(s, a).
The process then starts again with a new set of episodes generated from π′. The
algorithm terminates when V π(s) ≈ V π′(s) or Qπ(s, a) ≈ Qπ′(s, a) for all states and
actions.
3.3 Temporal-Difference Learning
Temporal-difference (TD) learning is a combination of Monte-Carlo methods and dynamic
programming. Like MC methods, TD learns directly from experience and does
not need a model of the environment. Like DP, TD learning bootstraps by
updating value estimates from earlier estimates. This allows TD algorithms to update
the value functions before the end of an episode; they only need to wait for the
next time step. This property makes TD an online learning method. The simplest
TD algorithms focus on the policy evaluation or the prediction problem. Algorithms
like TD(0) do this by estimating the value function V π for a given policy π. More
sophisticated algorithms like Sarsa and Q-Learning go further by solving the control
problem in which they find an optimal policy π∗ instead of just using a given policy.
The most common TD learning algorithms are TD(λ), Sarsa(λ), and Q-Learning.
3.3.1 TD(0)
TD(0) is the simplest TD algorithm for evaluating a policy. It works by treating the
return Rt as the sum of the reward immediately following st and the expected returns
in the future. The state value function can then be updated at every time step by
V (st) = V (st) + α[rt+1 + γV (st+1)− V (st)]
where α > 0 is a learning rate parameter of the algorithm. TD(0) is thus easily
implemented as an on-line, fully incremental algorithm and does not need to wait for
the termination of the episode to begin updating the value estimates. TD(0) has been
proved to converge to V π for the states that are visited infinitely often. For convergence
to be guaranteed, π should be a proper policy and the learning rate αt should satisfy
the constraints below.
Learning rate constraints:
• αt > 0 for all t
• ∑∞t=0 αt = ∞
• ∑∞t=0 α²t < ∞
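A sketch of one TD(0) update for a tabular value function stored in a dictionary; the step size alpha is assumed to be chosen according to the constraints above.

def td0_update(V, s, r_next, s_next, alpha, gamma):
    # V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]
    td_error = r_next + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V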
3.3.2 Eligibility Traces for TD(λ)
The temporal-difference algorithms discussed in the previous section have so far all
been one-step methods, in that their update target uses only the next reward (plus the
current value estimate of the next state). This is in contrast with Monte-Carlo methods,
where all the rewards until the end of the episode are considered. Eligibility traces are a
method of bridging the gap between these two kinds of learning algorithms. As we have
seen, the return in Monte-Carlo methods is

RMCt = rt + γrt+1 + γ²rt+2 + ... + γ^(T−t−1)rT .
The one-step return in TD(0) is

RTD(0)t = rt + γV (st+1)

where γV (st+1) replaces the γrt+1 + γ²rt+2 + ... + γ^(T−t−1)rT terms of the Monte-Carlo
return. Eligibility traces interpolate between these two returns. The TD(λ) algorithm,
where λ is the eligibility trace parameter with 0 ≤ λ ≤ 1, defines the λ-return as

Rλt = (1 − λ) ∑∞n=1 λ^(n−1) R(n)t

where R(n)t is the n-step return, which uses the actual rewards for the next n steps and
the value estimate V (st+n) thereafter. When λ = 0, the λ-return reduces to the one-step
return RTD(0). When λ = 1, the λ-return becomes equal to RMC. Setting λ to values
between 0 and 1 allows a TD algorithm to vary between the two extremes.
Eligibility traces are very important for temporal-difference methods. In fact, the
convergence guarantees of the algorithms rely on the eligibility trace values having a few
specific properties. Let et(s) be the eligibility trace value for a state s at time t.
Then for TD to converge the following must hold:
Eligibility trace constraints:
• et(s) ≥ 0
• e0(s) = 0; eligibility traces are initially 0.
• et(s) ≤ et−1(s) if st ≠ s; in particular, eligibility traces remain at 0 until the first
time that state s is visited.
• et(s) ≤ et−1(s) + 1 if st = s; eligibility traces may increase by at most 1 with
every visit to state s.
• et(s) is completely determined by s0, s1, ..., st.
• et(s) is bounded above by a deterministic constant C.
3.3.3 Convergence of TD
Assume that the learning rate constraints in Section 3.3.1 and the eligibility trace
constraints in Section 3.3.2 hold. Then, if the policy π is a proper policy, TD converges
to V π with probability 1.
3.3.4 Sarsa (λ)
Sarsa(λ) is an on-policy control TD learning algorithm. On-policy methods estimate
the value of a policy while simultaneously using it for control. This is in contrast to
off-policy methods, which use two different policies: a behavior policy for generating
behaviors from the agent, and an estimation policy which is the policy to be evaluated
and improved. Sarsa uses only one policy and changes this policy along the way.
Sarsa gets its name from the state-action-reward-state-action cycle of the algorithm.
Instead of learning the state value function V π as in TD(0), Sarsa learns the state-action
value function Qπ for policy π. At each time step Sarsa updates the state-action value
through
Q(st, at) = Q(st, at) + α[rt+1 + γQ(st+1, at+1)−Q(st, at)].
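A sketch of one Sarsa(λ) step that combines the update above with the accumulating eligibility traces of Section 3.3.2; Q and e are assumed to be dictionaries keyed by (state, action) pairs.

def sarsa_lambda_step(Q, e, s, a, r, s2, a2, alpha, gamma, lam):
    # TD error for the observed (state, action, reward, state, action) transition.
    delta = r + gamma * Q.get((s2, a2), 0.0) - Q.get((s, a), 0.0)
    # Accumulating trace: the visited pair becomes more eligible for credit.
    e[(s, a)] = e.get((s, a), 0.0) + 1.0
    # Every eligible pair receives a share of the update, then the traces decay.
    for key in list(e):
        Q[key] = Q.get(key, 0.0) + alpha * delta * e[key]
        e[key] *= gamma * lam
    return Q, e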
3.3.5 Q-Learning
Q-learning is an off-policy control TD algorithm. Off-policy methods use two different
policies: a behavior policy for generating behaviors from the agent, and an estimation
policy which is the policy to be evaluated and improved. In off-policy algorithms
the two policies need not even be related. As in Sarsa, Q-learning uses the state-action
value function Q(s, a). The difference from Sarsa is that Q-learning directly
approximates the optimal state-action value function Q∗, irrespective of the policy
actually being followed by the agent. The simple one-step Q-learning algorithm
is defined by its update function
Q(st, at) = Q(st, at) + α[rt+1 + γ maxa′ Q(st+1, a′) − Q(st, at)].
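The corresponding one-step Q-learning update, sketched for a dictionary-based Q table and a finite set of actions:

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    # The target uses max over a' of Q(s', a'), regardless of the action taken next.
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
    return Q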
3.3.6 Convergence of Q-Learning
Assume that the learning rate constraints in Section 3.3.1 hold and that all state-action
pairs continue to be visited. Then Q-learning converges to the optimal state-action value
function Q∗ with probability 1.
Chapter 4
Reinforcement Learning with
Function Approximation
The reinforcement learning algorithms discussed in the previous chapter assume that
the value functions can be represented as a table with one entry for each state or state-
action pair. However, this is only practical for very few tasks with a limited number of
states and actions. In environments with large numbers of states and actions, using a
table to store the value functions becomes impractical and may even make computing
them intractable. Moreover, many environments have continuous state and action
spaces, making the required table infinite.
Another problem with tabular methods is that they do not generalize. Given
two states s and s′, the value of V (s) does not say anything about the value of V (s′).
Ideally, value functions should be able to generalize so that having a good estimate of
V (s) will help get a good estimate of V (s′).
Combining the traditional reinforcement learning algorithms with function approx-
imation techniques solves both these problems.
4.1 Function Approximation (Regression)
Function approximation takes example data generated by a function and generalizes
from it to construct a function that approximates the original one. Because
it needs sample data to learn the function, it is a form of the supervised learning
discussed earlier.
learning is
fw(x) = 〈w, φ(x)〉
where w and φ(x) are n-element vectors with w, φ(x) ∈ Rn. w is a vector of weight
values and φ(x) is a column vector of features of the input,
φ(x) = (φ1(x), φ2(x), ..., φn(x))T
For m states we define the n × m feature matrix Φ whose j-th column is the feature
vector of state sj,

Φ = [φ(s1) φ(s2) · · · φ(sm)], i.e. Φij = φi(sj).
In the case of reinforcement learning, these input values are the states or state-
action pairs. Translating this into the state and state-action value functions is simply
Vw(s) = 〈w, φ(s)〉
Qw(s, a) = 〈w, φ(s, a)〉
Finding the optimal policy means finding the values of w that best approximate
the optimal value functions V ∗ and Q∗; for policy evaluation it means approximating
V π and Qπ.
4.2 Gradient Descent Methods
Gradient-based methods are among the most widely used function optimization tech-
niques. To find a local minimum of a differentiable function, gradient descent takes
steps towards the negative of the gradient of the function at the current point. The
gradient of a function points to the direction of its greatest rate of increase, hence the
negative of the gradient points to its greatest rate of decrease. Gradients can easily
be calculated from the first-order derivatives of a function, making gradient descent a
first-order optimization algorithm.
One class of functions for which gradient descent works particularly well are convex
functions. A function f : X → R is convex if for all x1, x2 ∈ X and λ ∈ [0, 1],
f(λx1 + (1− λ)x2) ≤ λf(x1) + (1− λ)f(x2)
This means that over any interval of the domain, the function lies on or below the
straight line joining its values at the endpoints of the interval. As a consequence, every
local minimum of a convex function is also a global minimum.
In supervised learning, a common function to minimize is the squared error. If f(x)
is the unknown function that we are trying to learn and g(w, φ(x)) = 〈w, φ(x)〉 is our
estimator of f , the total squared error Err over all inputs x is
Err = (1/2) Ex[f(x) − g(w, φ(x))]²
which is the objective we want to minimize. However, it is impossible to calculate
the expectation because we do not know the values that f will return for all possible in-
puts. Usually, we only have a sample of n input-output pairs (x1, y1), (x2, y2), ...(xn, yn).
We can therefore only reduce the error over these empirical observations,

Err = (1/2) ∑ni=1 [yi − g(w, φ(xi))]².
To optimize g to be a more accurate estimate of f , we take the gradient of Err
over w and use this to update w
w = w − α∇wErr
where α is the step size. This update moves w a small step in the direction that
decreases Err. The update is repeated until w converges to a local optimum.
To calculate ∇wErr,

∇wErr = −∑ni=1 [yi − g(w, φ(xi))]∇wg(w, φ(xi)) = −∑ni=1 [yi − g(w, φ(xi))]φ(xi)

so the update above moves each weight in proportion to the prediction error times the
corresponding feature value.
This method of doing gradient descent over all samples x, y is called batch gradient
descent.
4.2.1 Stochastic Gradient Descent
However, there will be situations where it is not possible to compute the gradient over
all samples. This may be because the samples arrive one at a time, or because there are
too many of them (even infinitely many) for the full sum in the gradient to be tractable.
To do function approximation in these situations, the method we use is stochastic
gradient descent (SGD). As we shall see later, this is the method we will use to
incorporate function approximation into our TD learning algorithms.
Let err(x) = y − g(w, φ(x)) be the error on a single sample input-output pair (x, y).
Err is then half the sum of the squared per-sample errors,

Err = (1/2) ∑x err(x)².
Instead of taking the gradient of Err and using that to update w, we take the gradient
of the squared error of a single sample (x, y) and use it as a noisy estimate of ∇wErr,

∇w(1/2)err(x)² = −[y − g(w, φ(x))]∇wg(w, φ(x)) = −[y − g(w, φ(x))]φ(x).

We then use this single-sample gradient to update w,

w = w − α∇w(1/2)err(x)² = w + α[y − g(w, φ(x))]φ(x). (4.1)
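A sketch of the stochastic update in Equation 4.1 for the linear estimator g(w, φ(x)) = 〈w, φ(x)〉, processing one sample at a time; the feature vector is assumed to be a plain list of floats.

def sgd_step(w, phi_x, y, alpha):
    # Prediction and prediction error for this single sample.
    prediction = sum(wi * fi for wi, fi in zip(w, phi_x))
    error = y - prediction
    # w <- w + alpha * (y - <w, phi(x)>) * phi(x), a step along the negative
    # gradient of the per-sample squared error.
    return [wi + alpha * error * fi for wi, fi in zip(w, phi_x)]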
4.2.2 Convergence of SGD for Markov Processes
Stochastic gradient descent for RL is a special case because RL data is generated by
a Markov process. Hence, the convergence guarantee we show is specific to Markov
chains only. Let {Xt}t=1,2,... be a time-homogeneous Markov process on a state space χ,
let A(·) map every X ∈ χ to a d × d matrix, and let b(·) map every X ∈ χ to a
d-dimensional vector b(X). Under
the following assumptions:
1. The learning rates αt are deterministic, non-negative, and satisfy the learning-
rate constraints in Section 3.3.1.
2. The Markov process {Xt} has a steady-state distribution π such that limt→∞ Pr(Xt = x | X0) = π(x) for any initial state X0. E0[·] denotes the expectation with respect to this invariant distribution.
3. The matrix A = E0[A(Xt)] is negative definite.
4. There exists a constant K such that ‖A(X)‖ ≤ K and ‖b(X)‖ ≤ K for all X ∈ χ.
5. For any initial state X0, the expectations of A(Xt) and b(Xt) converge exponentially
fast to the steady-state expectations A and b.
The stochastic algorithm
wt+1 = wt + αt(A(Xt)wt + b(Xt)) (4.2)
converges with probability 1 to the unique solution w∗ of the system Aw∗ + b = 0.
This means that, under the above assumptions, SGD driven by a Markov process will
eventually converge to a local optimum.
Note that Equation 4.1 takes almost the same form as Equation 4.2. One can choose
A(·) and b(·) so that Equation 4.2 matches the SGD update in Equation 4.1, and hence
the update falls under the same convergence guarantee.
4.3 TD Learning
We wish to optimize Vt, our estimate of V π at time t. Recall that V π(s) is the reward
the agent expects to receive from being in state s and following policy π thereafter,
which we designate Rt. In RL, rewards arrive one sample at a time, and we need to
update our estimate of the value function from that single sample. Hence, we do not
have complete sample values of Rt nor of V π, and gradient descent on an objective built
from those values is not possible. Instead, the error we minimize is the Bellman error.
The Bellman error at a single time step t is defined as
e(st) = (1/2)[rt+1 + γVt(st+1) − Vt(st)]².
Recall that V (s) is approximated as a linear function Vw(s) = 〈w, φ(s)〉. However,
when we take the gradient of e(st), we treat the rt+1 + γVt(st+1) term as a constant
sample of Rt and not as a function of w. Therefore the gradient of e(st) is just

∇we(st) = −[rt+1 + γVt(st+1) − Vt(st)]∇wVt(st) = −[rt+1 + γVt(st+1) − Vt(st)]φ(st).
Taking a gradient descent step with this gradient results in the update

wt+1 = wt − α∇we(st) = wt + α[rt+1 + γVt(st+1) − Vt(st)]φ(st)
where α is again a step size value.
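A sketch of this update for the linear value function Vw(s) = 〈w, φ(s)〉; the feature mapping phi is an assumed placeholder that returns a list of floats.

def td_linear_update(w, phi, s, r, s2, alpha, gamma):
    f, f2 = phi(s), phi(s2)
    v = sum(wi * fi for wi, fi in zip(w, f))
    v2 = sum(wi * fi for wi, fi in zip(w, f2))
    # The target r + gamma * V(s') is treated as a constant when differentiating.
    delta = r + gamma * v2 - v
    return [wi + alpha * delta * fi for wi, fi in zip(w, f)]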
4.3.1 Control with Function Approximation
For state-action value functions, the Bellman error given state s, action a, at time t is
e(st, at) = (1/2)[rt+1 + γQt(st+1, at+1) − Qt(st, at)]².
Again, we treat the rt+1 + γQt(st+1, at+1) term as a constant sample of Rt and not as
a function of w. Therefore the gradient of e(st, at) is just

∇we(st, at) = −[rt+1 + γQt(st+1, at+1) − Qt(st, at)]φ(st, at)
and the corresponding gradient descent step is

wt+1 = wt + α[rt+1 + γQt(st+1, at+1) − Qt(st, at)]φ(st, at).
4.3.2 Convergence of TD With Function Approximation
We now provide convergence guarantees for policy evaluation under TD(0). We assume
the same step size constraints as in Section 3.3.1 hold. Additional constraints for
convergence are:
1. Under π the state sequence forms an aperiodic Markov chain, and all states are
visited an infinite number of times during an infinitely long trajectory.
2. The policy π is a proper policy.
3. The feature functions φ1, ..., φn are linearly independent on the state space (the
matrix Φ has full rank).
If all the constraints hold, then TD(0) converges to V π. This is based on the
convergence of SGD for Markov processes discussed earlier in Section 4.2.2.
4.4 Residual Gradients
Recall that when we took the gradient of e(st) while applying gradient descent to TD
learning, we treated the rt+1 + γVt(st+1) term as a constant sample of Rt and not as a
function of w. Since Vt(st+1) = 〈w, φ(st+1)〉 is in fact a function of w, TD learning is
not a true gradient descent method. Hence, TD(0) can diverge when the above
constraints are not met.
There is another form of RL algorithm with function approximation that is exactly
gradient descent, and it is called residual gradients (RG). With residual gradients, we
now treat Vt(st+1) as the function of w that it actually is. Since RG is a true gradient
descent method, its convergence is much more robust than that of TD(0).
The gradient of e(st) with respect to w now becomes
∇we(st) = [rt+1 + γVt(st+1) − Vt(st)][γ∇wVt(st+1) − ∇wVt(st)] = [rt+1 + γVt(st+1) − Vt(st)][γφ(st+1) − φ(st)].
The residual-gradient update for the weights is then
wt+1 = wt − α[rt+1 + γVt(st+1)− Vt(st)][γφ(st+1)− φ(st)].
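The same step written as a sketch for residual gradients; compared with the TD sketch in Section 4.3, the only change is that the gradient also flows through Vt(st+1).

def residual_gradient_update(w, phi, s, r, s2, alpha, gamma):
    f, f2 = phi(s), phi(s2)
    v = sum(wi * fi for wi, fi in zip(w, f))
    v2 = sum(wi * fi for wi, fi in zip(w, f2))
    delta = r + gamma * v2 - v
    # w <- w - alpha * delta * (gamma * phi(s') - phi(s))
    return [wi - alpha * delta * (gamma * f2i - fi)
            for wi, fi, f2i in zip(w, f, f2)]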
4.4.1 Convergence of Residual Gradients
Because residual gradients is a true stochastic gradient method, convergence for
policy evaluation with a fixed policy π is guaranteed by the result in Section 4.2.2. No other
constraints are necessary.
4.4.2 Control with Residual Gradients
For the control problem, the gradient for e(st, at) is now
∇we(st, at) = [rt+1 + γQt(st+1, at+1) − Qt(st, at)][γ∇wQt(st+1, at+1) − ∇wQt(st, at)] = [rt+1 + γQt(st+1, at+1) − Qt(st, at)][γφ(st+1, at+1) − φ(st, at)]
and the update for w becomes
wt+1 = wt − α[rt+1 + γQt(st+1, at+1) − Qt(st, at)][γφ(st+1, at+1) − φ(st, at)].
Chapter 5
Experimental Results
We now implement some of the techniques discussed in this paper to show the results
of RL with function approximation. The two techniques we use are Sarsa and residual
gradients. We implement them on two domains that require function approximation,
Cart Pole and Mountain Car. Additionally, we use two different feature mappings for
the state features, Radial Basis Function (RBF) coding and tile coding [Sutton and
Barto 1998].
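As an illustration of one such feature mapping, the sketch below computes RBF features for a single normalized state variable; the number of centers and the variance are illustrative placeholders rather than the exact configuration used in our experiments.

import math

def rbf_features(x, centers, variance):
    # phi_i(x) = exp(-(x - c_i)^2 / (2 * variance)) for each RBF center c_i.
    return [math.exp(-((x - c) ** 2) / (2.0 * variance)) for c in centers]

# Example: 10 centers spread evenly over a state variable normalized to [0, 1].
centers = [i / 9.0 for i in range(10)]
print(rbf_features(0.37, centers, variance=0.05))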
5.1 Domains
5.1.1 Mountain Car
In the mountain car domain, the agent tries to drive a car up a hill towards a goal
position. It has a two-dimensional state space: the position of the car on the hill and
the velocity of the car. Dimension s1 is the position of the car, a continuous value
bounded to [−1.2, 0.5]. Dimension s2 is the velocity, also a continuous value, bounded
to [−0.07, 0.07]. The agent can choose among three actions, a ∈ {−1, 0, 1}, which
correspond to accelerating left, not accelerating, and accelerating right. The goal of
the agent is to get the car into the rightmost position, that is, the state with
s1 = 0.5. At each time step the agent receives a reward of −1 until it reaches the goal
state, where it receives a reward of 1.
5.1.2 Cart Pole
In the cart pole domain, the agent tries to balance a pole hinged on top of a cart by
moving the cart along a frictionless track. It has a four-dimensional state space: the
position of the cart, the cart velocity, the angle of the pole, and the pole's angular
velocity. Dimension s1 is the position of the cart and is bounded by [−2.4, 2.4].
Dimension s2 is the cart velocity and is unbounded. Dimension s3 is the pole angle
and is bounded by [−12, 12] degrees; any angle outside of these bounds results in
failure. Dimension s4 is the angular velocity of the pole and is also unbounded. The
agent receives a reward of zero at each time step until the angle of the pole exceeds
the bounds, at which point the agent receives a reward of −1 and the episode ends.

Figure 5.1: The Mountain Car Domain
Figure 5.2: The Cart Pole Domain
5.2 Results
5.2.1 Optimal Parameter Values
Optimal parameter values for each domain were found by repeatedly testing different
values for each parameter and recording the best results. For tile coding we used 10
tilings; for RBF coding we used 10 radial basis functions with a variance of 0.05 each.
5.2.2 Results
As seen in Figures 5.3 and 5.4, Sarsa is able to converge faster than residual gradients.
This matches the discussion in Section 4.4: while TD learning is not a true stochastic
gradient descent method and hence diverges more often, when it does converge its rate
of convergence is faster than that of residual gradients, which is a true SGD method.
In all cases except one, tile coding converges faster than RBF coding.

                      α      γ    λ
Sarsa - Tile Coding   0.01   1    0.7
Sarsa - RBF Coding    0.01   1    0.7
RG - Tile Coding      0.1    1    0.7
RG - RBF Coding       0.2    1    0.9

Table 5.1: Optimal Parameter Values for the Cart Pole Domain

                      α      γ    λ
Sarsa - Tile Coding   0.1    1    0.9
Sarsa - RBF Coding    0.1    1    0.9
RG - Tile Coding      0.1    1    0.8
RG - RBF Coding       0.3    1    0.9

Table 5.2: Optimal Parameter Values for the Mountain Car Domain
Tuning parameters has a big effect on the learning performance of the agent. For
some sub-optimal parameter values, the agent never learns the optimal policy for the
domain.
Figure 5.3: Average Results for the Cart Pole domain. Higher is better.
Figure 5.4: Average Results for the Mountain Car domain. Lower is better.
Chapter 6
Final Remarks
As we have discussed in this report, combining reinforcement learning with function
approximation techniques allows an agent to learn to operate in environments with
very large, even infinite, numbers of states. It does this by letting the agent generalize
what it has learned in some states to other, similar states.
We first discussed traditional reinforcement learning methods that store the values
of the states in a tabular format, and then discussed the function approximation
extensions of these methods. We used linear function approximation, optimizing the
parameters with respect to a mean squared error using stochastic gradient descent.
The function approximation used in Sarsa is not a true SGD method, and hence there
are times when Sarsa will diverge. Residual gradients is a true SGD method and hence
its convergence is more robust. However, when Sarsa does converge, its rate of
convergence is usually faster than that of residual gradients. The faster convergence
of Sarsa is mainly an experimental result, not a theoretical one; there may be situations
in which residual gradients converges faster. Finally, we implemented the RL with
function approximation techniques discussed here on two domains and presented the results.
Although the RL theory for environments whose state values are stored in tabular formats
is quite mature, the theory for RL with function approximation as discussed here is
still very much under active development, with new techniques and methods still being
discovered. More work on this particular aspect of reinforcement learning is needed and
will provide better results in the future.
Bibliography
Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning.

Bertsekas, D. P. and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Athena Scientific.

Rummery, G. A. and Niranjan, M. 1994. On-line Q-learning using connectionist systems. Technical report.

Sutton, R. and Barto, A. 1998. Reinforcement Learning: An Introduction. The MIT Press.

Watkins, C. 1989. Learning from delayed rewards.

Watkins, C. and Dayan, P. 1992. Q-learning. Machine Learning.