Reinforcement Learning with Function Approximation

Joseph Christian G. Noel

November 2011


Abstract

Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is for an agent to learn to act in an unknown environment in a way that maximizes the reward it receives from the environment. As the number of states in an environment grows larger, it becomes more important for the agent to generalize what it has learned in some states to other, similar states. An agent can do this with a number of function approximation techniques.

This report presents a general overview of reinforcement learning combined with function approximation. There are two main sources for this report: Neuro-dynamic Programming [Bertsekas and Tsitsiklis 1996] and Reinforcement Learning: An Introduction [Sutton and Barto 1998]. [Bertsekas and Tsitsiklis 1996] discusses RL from a mathematician's perspective and is very hard to read. [Sutton and Barto 1998] is intuitive, but does not discuss theoretical issues such as convergence properties. The goal of this report is to create an overview that encompasses these two sources, is easy to read, and contains sufficient theoretical background. We restrict ourselves to online gradient-based methods in this report.

After discussing the mathematical theory, we implement some of the techniques on the cart pole and mountain car domains and report on the results.


Contents

Abstract

1 Introduction
  1.1 Artificial Intelligence
    1.1.1 Machine Learning
  1.2 Reinforcement Learning
    1.2.1 RL with Function Approximation

2 Background
  2.1 Model
  2.2 Rewards
  2.3 Markov Decision Processes
  2.4 Value Functions
  2.5 Bellman Equations for the Value Functions
  2.6 Exploration versus Exploitation

3 Reinforcement Learning Algorithms
  3.1 Dynamic Programming
    3.1.1 Policy Iteration
    3.1.2 Value Iteration
  3.2 Monte-Carlo Methods
  3.3 Temporal-Difference Learning
    3.3.1 TD(0)
    3.3.2 Eligibility Traces for TD(λ)
    3.3.3 Convergence of TD
    3.3.4 Sarsa(λ)
    3.3.5 Q-Learning
    3.3.6 Convergence of Q-Learning

4 Reinforcement Learning with Function Approximation
  4.1 Function Approximation (Regression)
  4.2 Gradient Descent Methods
    4.2.1 Stochastic Gradient Descent
    4.2.2 Convergence of SGD for Markov Processes
  4.3 TD Learning
    4.3.1 Control with Function Approximation
    4.3.2 Convergence of TD With Function Approximation
  4.4 Residual Gradients
    4.4.1 Convergence of Residual Gradients
    4.4.2 Control with Residual Gradients

5 Experimental Results
  5.1 Domains
    5.1.1 Mountain Car
    5.1.2 Cart Pole
  5.2 Results
    5.2.1 Optimal Parameter Values
    5.2.2 Results

6 Final Remarks

Bibliography


Chapter 1

Introduction

1.1 Artificial Intelligence

Artificial intelligence (AI) has been a common fixture in science fiction stories and in people's imaginations for centuries. However, the field of formal AI research only started in the summer of 1956, at a conference at Dartmouth College. Since then, the goal of the field of artificial intelligence has been to create intelligent agents that can mimic or go beyond human-level intelligence. Within the field of AI there are many subfields, each studying a specific aspect of what we humans usually define as "intelligence." Examples of these subfields are computer vision, logic, planning, robotics, and machine learning. This paper deals with a particular branch of machine learning called reinforcement learning.

1.1.1 Machine Learning

Machine learning is concerned with the design and development of algorithms that allow computers to improve their performance over time based on data. There are three main types of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the algorithm is given a set of labeled examples for training and uses them to infer a function mapping that enables it to generalize to unseen test data. In unsupervised learning, the algorithm is given only unlabeled data and has to infer some inherent structure from it. We discuss reinforcement learning in this report.

1.2 Reinforcement Learning

Reinforcement learning (RL) is a fundamental problem in artificial intelligence. In RL an agent learns to act in an unknown environment. The agent interacts with the environment by performing actions that change the state of the environment. After each action, the agent also receives a "reward" signal from the environment. The goal of the agent is to maximize the cumulative reward it receives from the environment. This interaction can be seen in Figure 1.1.


Figure 1.1: The Agent-Environment architecture in RL. Retrieved from [Sutton and Barto 1998].

As an example, imagine a reinforcement learning agent trying to learn how to play the game of blackjack. The environment signals it receives can be the cards it has in its hand as well as the one card shown by the dealer. The set of actions it can take can be {hit, hold}. The reward signals the agent receives can be a reward of −1 when it busts or is beaten by the dealer's hand, a reward of 1 when it beats the dealer's hand or when the dealer busts, and a reward of 0 the remaining times. Given enough episodes/trials, the RL agent will eventually learn the optimal action to take given the cards in its hand so as to maximize the rewards it gets. In effect, it will learn how to play blackjack optimally.

1.2.1 RL with Function Approximation

In the blackjack example above, the number of possible states is bounded by the number of permutations of the 52 cards in a normal card deck. However, as the number of states in an environment gets larger and larger, it becomes infeasible for an agent to visit all possible states enough times to find the optimal actions for those states. Thus, it becomes important to be able to generalize the learning experience in a particular state to the other states in the environment. A common way to do this is through function approximation. The agent extracts the relevant information it needs from the state through feature extraction, and uses the resulting feature vector to calculate the approximate value of being in that state. This is done by taking a dot product between the feature vector and a parameter vector. In this way, similar state features will also have similar values. The goal of RL with function approximation is then to learn the best values for this parameter vector.

Combining reinforcement learning with function approximation techniques allows the agent to generalize and hence handle a large (even infinite) number of states.


Chapter 2

Background

2.1 Model

The reinforcement learning model consists of a set of states S, a set of actions A, and transition rules between states depending on the action. At a given state s ∈ S at time t, an agent chooses an action a ∈ A, transitions to a new state s′, and receives a reward r_t. A series of actions eventually leads the agent to the terminal or goal state, which ends an episode. At this point the environment is reset, the agent starts again from an initial state, and the process repeats itself. Examples of episodes are a single game of chess or a single run through a maze.

An agent follows a policy π, which describes how the agent should act at a given time. Formally, π(s, a) is a mapping that gives the probability of taking action a when in state s. A policy π is a proper policy if, when following it, there is a positive probability that the agent will eventually reach the goal state; that is, it will not cycle through some states forever and never terminate an episode. There is always at least one policy that is better than or equal to all other policies, and we denote all such optimal policies as π∗. Optimal policies are explained in more detail in the section on value functions.

2.2 Rewards

The goal of reinforcement learning is to maximize the expected return R_t,

$$R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_T$$

where T is the final time step. This is the finite horizon model, in which the agent tries to maximize the reward for the next T steps without regard for any succeeding steps. There are drawbacks with this model, however. First, in most domains the agent does not know how many time steps it will take to reach the end state, and it needs to handle an infinite number of time steps properly. Second, it is possible for an agent to be "lazy" in a finite horizon model. For example, if the horizon is ten steps, the agent could forever be trying to maximize the reward that happens ten steps from the current time step, without ever doing the action that actually receives that reward.


These drawbacks are fixed by adding a discount factor γ to the sum,

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

where 0 ≤ γ < 1. This is the infinite discounted horizon model. The agent considers all rewards into the future, but γ acts as a discount factor so that the infinite sum is bounded to a finite value. The value of γ defines how much the agent takes future rewards into account. When γ = 0, the agent takes the short-term view and tries to maximize the reward only for the next time step. As γ → 1, the agent considers future rewards more strongly and takes the longer-term view.
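As a concrete illustration (a minimal sketch, not taken from the report), the discounted return of a finite sequence of sampled rewards can be accumulated backwards using the recursion R_t = r_t + γR_{t+1}:

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma^k * r_{t+k} for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: R_t = r_t + gamma * R_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Example: three steps of reward 1 with gamma = 0.9 gives 1 + 0.9 + 0.81
print(discounted_return([1.0, 1.0, 1.0], 0.9))  # 2.71
```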

2.3 Markov Decision Processes

An environment ⟨S, A, T, R, γ⟩ defines the set of possible states S, actions A, and transition probabilities T. In general, the transition probability at time t of an agent moving from state s_t to the next state s′ upon performing action a_t, and receiving reward r, depends on the history of all states and actions before t. That is, what happens next is defined by the probability

$$\Pr(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \dots, r_1, s_0, a_0).$$

However, there is a class of environments for which the relevant information from the history of all states, actions, and rewards before time t is encapsulated in the state at time t. This is called the Markov property, and tasks which exhibit this property are called Markov decision processes. The Markov property states that

$$\Pr(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t) = \Pr(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \dots, r_1, s_0, a_0).$$

2.4 Value Functions

As said previously, a reinforcement learning agent strives to maximize the expected return from a given state at time t. It does this by estimating value functions for a given state or state-action pair. The state value function V^π is an estimate of the expected return the agent receives when starting in state s and then following policy π thereafter,

$$V^\pi(s) = E_\pi(R_t \mid s_t = s).$$

The state-action value function Q^π is an estimate of the expected return the agent receives when starting in state s, taking action a, and then following policy π thereafter,


$$Q^\pi(s, a) = E_\pi(R_t \mid s_t = s, a_t = a).$$

We can now properly define the optimal policy π∗. First, we define a policy π to be better than or equal to a policy π′ if and only if V^π(s) ≥ V^π′(s) for all s ∈ S. There is always at least one policy that is better than or equal to all other policies, and we denote all these optimal policies as π∗. The state value function for the optimal policies is given as

$$V^*(s) = \max_\pi V^\pi(s).$$

The state-action value function for the optimal policies is given as

$$Q^*(s, a) = \max_\pi Q^\pi(s, a).$$

An optimal policy π∗ is therefore a policy that, in every state s ∈ S, chooses an action a that maximizes Q∗(s, a),

$$\pi^*(s) = \arg\max_a Q^*(s, a).$$

Reinforcement learning uses the value functions to approximate the optimal policy: simply choosing the action that is greedy with respect to V∗(s) or Q∗(s, a) at state s yields an optimal policy.

2.5 Bellman Equations for the Value Functions

Reinforcement learning uses the Bellman equations to reformulate the value functions. The Bellman equations express the relationship between the value of a state and the value of its successor state in the case of the state value function, and between the value of a state-action pair and the value of the succeeding state-action pair in the case of the state-action value function. This allows V^π and Q^π to be defined recursively. The Bellman equation for V^π is

$$V^\pi(s) = E_\pi\big[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\big].$$

The Bellman equation for Q^π is

$$Q^\pi(s, a) = E_\pi\big[r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a\big].$$

2.6 Exploration versus Exploitation

One of the fundamental problems in reinforcement learning is balancing exploration and exploitation. In exploitation, the agent exploits what it already knows and makes a greedy selection when choosing an action at a particular time step,

$$a_t = \arg\max_a Q(s_t, a).$$

In this way the agent maximizes the rewards it receives given what it already knows. However, what if there is another action that actually gives a better reward than the one returned by the greedy selection? The agent's current value function estimates may simply not reflect this yet. If the agent always chooses the greedy action, it will never find out about the better action.

This is where exploration comes in. One method of exploration is ε-greedy. With probability ε, the agent chooses an action randomly from the set of available actions instead of making a greedy selection. This allows the agent to discover actions that are actually better than what it currently perceives to be best, and eventually to find the optimal policy.
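A minimal sketch of ε-greedy action selection (the function and variable names here are illustrative, not from the report):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Return a random action with probability epsilon, else the greedy one.

    q_values: list of estimated action values Q(s, a) for the current state.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

# Example: three actions, 10% exploration
action = epsilon_greedy([0.1, 0.5, 0.3], epsilon=0.1)
```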

Another way of doing exploration is through optimistic initialization. The model parameters for the value functions are initialized such that the expected rewards from the states and actions are higher than they actually are. The agent then seeks out all these states and actions through greedy selection until their expected rewards drop to their actual values. We use optimistic initialization to enforce exploration in the RL domains we implement for this paper.

As the time step t → ∞, continuing exploration allows all states to eventually be visited infinitely often, and this is a key requirement for the convergence of RL algorithms to the optimal policy π∗.


Figure 2.1: Average performance for different values of ε on the 10-armed bandit. Retrieved from [Sutton and Barto 1998].


Chapter 3

Reinforcement Learning Algorithms

A variety of reinforcement learning algorithms have been developed for finite state-space environments. In most of these, the state and state-action value functions are stored in a tabular format corresponding to the state space. We discuss these algorithms before moving on to their function approximation extensions in the next chapter.

3.1 Dynamic Programming

Dynamic programming (DP) is a set of algorithms that can be used to compute optimal policies. A complete and perfect model of the environment as a Markov decision process is required. We describe two popular DP methods, policy iteration and value iteration.

3.1.1 Policy Iteration

In policy iteration, DP first evaluates V^π for an initial policy π, and then uses this to find a better policy π′. It repeats the process until π′ converges to the optimal policy. Let P(s, a, s′) be the transition probability of moving from state s to s′ upon taking action a, and let R(s, a, s′) be the reward returned after moving from state s to state s′ upon taking action a. As before, π(s, a) is the probability of taking action a in state s under policy π.

For a given policy π, V^π(s) is calculated as

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P(s, a, s')\,\big[R(s, a, s') + \gamma V^\pi(s')\big].$$

Q^π can then be easily calculated from V^π using the following equation:

$$Q^\pi(s, a) = \sum_{s'} P(s, a, s')\,\big[R(s, a, s') + \gamma V^\pi(s')\big].$$


After computing the value functions, we can easily obtain the improved policy π′ by letting

$$\pi'(s) = \arg\max_a Q^\pi(s, a).$$

DP then repeats the process with π′. When π′ is as good as, but not better than, π, so that V^π = V^π′, then both π and π′ are already optimal policies.

3.1.2 Value Iteration

In policy iteration, each iteration performs a policy evaluation to calculate V^π, and this step can take a long time because calculating V^π is itself an iterative computation that loops through the entire state space repeatedly. The value iteration algorithm improves on this by combining policy improvement and a truncated policy evaluation into a single update step,

$$V_{k+1}(s) = \max_a \sum_{s'} P(s, a, s')\,\big[R(s, a, s') + \gamma V_k(s')\big].$$

For any initial V_0, the sequence {V_k} will eventually converge to V∗ as long as 0 < γ < 1.
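A minimal sketch of tabular value iteration follows; the dictionary representations of the transition probabilities P and rewards R are assumptions made for illustration:

```python
def value_iteration(states, actions, P, R, gamma=0.95, theta=1e-6):
    """Tabular value iteration.

    P[s][a] is assumed to be a list of (probability, next_state) pairs and
    R[s][a][s2] the reward for that transition.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backups = [sum(p * (R[s][a][s2] + gamma * V[s2]) for p, s2 in P[s][a])
                       for a in actions]
            best = max(backups)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # stop once the value function has stabilised
            return V
```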

3.2 Monte-Carlo Methods

Monte-Carlo (MC) methods average the returns observed from a state to estimate its value function. Unlike dynamic programming, Monte-Carlo does not require any knowledge of the environment. All it needs is some function that generates episodes by following the policy π. Each episode contains the set of states passed through by the policy, the action taken at each of those states, and the return following the first occurrence of each state or state-action pair. Let Return(s) and Return(s, a) be the sets of return values R_t for a state or state-action pair, respectively, one for each episode, and let N be the total number of episodes that have been generated from π. MC estimates V^π and Q^π as

$$V^\pi(s) = \frac{\sum_{i}^{N} \text{Return}_i(s)}{N}, \qquad Q^\pi(s, a) = \frac{\sum_{i}^{N} \text{Return}_i(s, a)}{N}.$$

As in DP, policy improvement again follows as

$$\pi'(s) = \arg\max_a Q^\pi(s, a),$$

after which the process starts again with a new set of episodes generated from π′. The algorithm terminates when V^π(s) ≈ V^π′(s) or Q^π(s, a) ≈ Q^π′(s, a).
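A minimal sketch of first-visit Monte-Carlo policy evaluation, assuming a generate_episode() helper (not part of the report) that returns the (state, reward) pairs produced by following π until termination:

```python
from collections import defaultdict

def mc_policy_evaluation(generate_episode, num_episodes, gamma=1.0):
    """Estimate V^pi by averaging first-visit returns over many episodes."""
    returns = defaultdict(list)
    for _ in range(num_episodes):
        episode = generate_episode()            # list of (state, reward) pairs
        # Discounted return following each time step, computed backwards
        G, future = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            future[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                   # first visit to s in this episode
                seen.add(s)
                returns[s].append(future[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```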


3.3 Temporal-Difference Learning

Temporal-difference (TD) learning is a combination of Monte-Carlo methods and dynamic programming. Like MC methods, TD learns directly from experience and does not need a model of the environment. Like DP, TD learning bootstraps by updating value estimates from earlier estimates. This allows TD algorithms to update the value functions before the end of an episode; they only need to wait for the next time step. This property makes TD an online learning method. The simplest TD algorithms focus on policy evaluation, or the prediction problem. Algorithms like TD(0) do this by estimating the value function V^π for a given policy π. More sophisticated algorithms like Sarsa and Q-learning go further by solving the control problem, in which they find an optimal policy π∗ instead of just evaluating a given policy. The most common TD learning algorithms are TD(λ), Sarsa(λ), and Q-learning.

3.3.1 TD(0)

TD(0) is the simplest TD algorithm for evaluating a policy. It works by treating the return R_t as the sum of the reward immediately following s_t and the expected return in the future. The state value function can then be updated at every time step by

$$V(s_t) \leftarrow V(s_t) + \alpha\,\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big],$$

where α > 0 is a learning rate parameter of the algorithm. TD(0) is thus easily implemented as an online, fully incremental algorithm and does not need to wait for the termination of the episode to begin updating the value estimates. TD(0) has been proved to converge to V^π for the states that are visited infinitely often. For convergence to be guaranteed, π should be a proper policy and the learning rate α_t should satisfy the following constraints:

• α_t > 0 for all t
• $\sum_{t=0}^{\infty} \alpha_t = \infty$
• $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$
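A minimal sketch of one episode of tabular TD(0); the env_reset(), env_step(s, a), and policy(s) interfaces are assumptions, not part of the report:

```python
from collections import defaultdict

def td0_episode(env_reset, env_step, policy, V=None, alpha=0.1, gamma=0.99):
    """Run one episode of tabular TD(0) policy evaluation."""
    if V is None:
        V = defaultdict(float)
    s, done = env_reset(), False
    while not done:
        a = policy(s)
        s_next, r, done = env_step(s, a)
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])   # TD(0) update
        s = s_next
    return V
```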

3.3.2 Eligibility Traces for TD(λ)

The temporal-difference algorithms discussed so far have all been one-step methods, in that they consider only the next reward in their update target. This is in contrast with Monte-Carlo methods, where all the rewards until the end of the episode are considered. Eligibility traces are a method of bridging the gap between these two kinds of learning algorithms. As we have seen, the return in Monte-Carlo methods is

$$R_t^{MC} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t-1} r_T.$$


The one-step return in TD(0) is

$$R_t^{TD(0)} = r_t + \gamma V(s_{t+1}),$$

where γV(s_{t+1}) replaces the γr_{t+1} + γ²r_{t+2} + ... + γ^{T−t−1}r_T terms of the Monte-Carlo return. Eligibility traces interpolate between these two returns. The TD(λ) algorithm, where λ is the eligibility trace parameter and 0 ≤ λ ≤ 1, defines the λ-return as

$$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)},$$

where R_t^{(n)} is the n-step return. When λ = 0, the λ-return reduces to the one-step return R^{TD(0)}. When λ = 1, the λ-return becomes equal to R^{MC}. Setting λ to values between 0 and 1 allows a TD algorithm to vary in the space between the two extremes.

Eligibility traces are very important for temporal-difference methods. In fact, the convergence guarantees of the algorithms rely on the eligibility trace values having a few specific properties. Let e_t(s) be the eligibility trace value for any state s at any time t. Then, for TD to converge, the following must hold:

• e_t(s) ≥ 0.
• e_0(s) = 0: eligibility traces are initially 0.
• e_t(s) ≤ e_{t−1}(s) if s_t ≠ s: eligibility traces remain at 0 until the first time that state s is visited.
• e_t(s) ≤ e_{t−1}(s) + 1 if s_t = s: eligibility traces may increase by at most 1 with every visit to state s.
• e_t(s) is completely determined by s_0, s_1, ..., s_t.
• e_t(s) is bounded above by a deterministic constant C.

3.3.3 Convergence of TD

Assume that the learning rate constraints in Section 3.3.1 and the eligibility trace constraints in Section 3.3.2 hold. Then, if the policy π is a proper policy, TD converges to V^π with probability 1.

3.3.4 Sarsa(λ)

Sarsa(λ) is an on-policy control TD learning algorithm. On-policy methods estimate the value of a policy while simultaneously using it for control. This is in contrast to off-policy methods, which use two different policies: a behavior policy for generating the agent's behavior, and an estimation policy which is the policy to be evaluated and improved. Sarsa uses only one policy and changes this policy along the way.

Sarsa gets its name from the state-action-reward-state-action cycle of the algorithm. Instead of learning the state value function V^π as in TD(0), Sarsa learns the state-action value function Q^π for policy π. At each time step Sarsa updates the state-action value through

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\big[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big].$$
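A minimal sketch of one episode of tabular Sarsa(λ) with accumulating eligibility traces; env_reset and env_step are assumed interfaces, and Q is assumed to be a defaultdict(float) keyed by (state, action):

```python
import random
from collections import defaultdict

def sarsa_lambda_episode(env_reset, env_step, actions, Q, alpha=0.1,
                         gamma=0.99, lam=0.7, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating traces and epsilon-greedy control."""
    def choose(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    e = defaultdict(float)                # eligibility traces
    s, done = env_reset(), False
    a = choose(s)
    while not done:
        s2, r, done = env_step(s, a)
        a2 = None if done else choose(s2)
        target = r if done else r + gamma * Q[(s2, a2)]
        delta = target - Q[(s, a)]
        e[(s, a)] += 1.0                  # accumulate the trace of the visited pair
        for key in list(e):               # propagate the TD error along the traces
            Q[key] += alpha * delta * e[key]
            e[key] *= gamma * lam
        s, a = s2, a2
    return Q
```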

3.3.5 Q-Learning

Q-learning is an off-policy control TD algorithm. Off-policy methods use two different policies: a behavior policy for generating the agent's behavior, and an estimation policy which is the policy to be evaluated and improved. In off-policy algorithms the two policies need not even be related. As in Sarsa, Q-learning uses the state-action value function Q(s, a). The difference from Sarsa is that Q-learning directly approximates the optimal state-action value function Q∗, irrespective of the actual policy being followed by the agent. The simple one-step Q-learning algorithm is defined by its update,

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\big[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\big].$$
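The corresponding one-step Q-learning backup for a tabular Q, as a minimal sketch (the argument names are illustrative, and Q is assumed to be a defaultdict(float) keyed by (state, action)):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99,
                      done=False):
    """Apply one Q-learning backup using the max over next actions."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```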

3.3.6 Convergence of Q-Learning

Assume that the learning rate constraints in Section 3.3.1 hold and that every state-action pair is visited infinitely often. Then Q-learning converges to the optimal state-action value function Q∗ with probability 1.


Chapter 4

Reinforcement Learning with Function Approximation

The reinforcement learning algorithms discussed in the previous chapter assume that the value functions can be represented as a table with one entry for each state or state-action pair. However, this is only practical for tasks with a limited number of states and actions. In environments with large numbers of states and actions, using a table to store the value functions becomes impractical and may even make computing them intractable. Moreover, many environments have continuous state and action spaces, making the size of the table infinite.

Another problem with tabular methods is that they do not generalize. Given two states s and s′, the value of V(s) says nothing about the value of V(s′). Ideally, value functions should be able to generalize so that having a good estimate of V(s) helps produce a good estimate of V(s′).

Combining the traditional reinforcement learning algorithms with function approximation techniques solves both of these problems.

4.1 Function Approximation (Regression)

Function approximation takes example data generated by a function and generalizes from it to construct a function that approximates the original. Because it needs sample data to learn the function, it is a form of the supervised learning discussed earlier. A general form of function approximation used in reinforcement learning is

$$f_w(x) = \langle w, \phi(x) \rangle,$$

where w and φ(x) are n-element vectors with w, φ(x) ∈ R^n. Here w is a vector of weight values and φ(x) is a feature-mapping column vector of the input values,

$$\phi(x) = \big(\phi_1(x), \phi_2(x), \dots, \phi_n(x)\big)^T.$$


We define a matrix Φ for m states,

$$\Phi = \begin{pmatrix} \phi_1(s_1) & \phi_1(s_2) & \dots & \phi_1(s_m) \\ \phi_2(s_1) & \phi_2(s_2) & \dots & \phi_2(s_m) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_n(s_1) & \phi_n(s_2) & \dots & \phi_n(s_m) \end{pmatrix}.$$

In the case of reinforcement learning, the input values are the states or state-action pairs. Translating this into the state and state-action value functions is simply

$$V_w(s) = \langle w, \phi(s) \rangle, \qquad Q_w(s, a) = \langle w, \phi(s, a) \rangle.$$

Finding the optimal policy means finding the values of w that best approximate the optimal value functions V∗ and Q∗, or just V^π and Q^π under policy evaluation.
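A minimal sketch of the linear approximator V_w(s) = ⟨w, φ(s)⟩; the feature vector used in the example is arbitrary:

```python
import numpy as np

def linear_value(w, phi_s):
    """V_w(s) = <w, phi(s)> for a linear function approximator."""
    return float(np.dot(w, phi_s))

# Example with a hypothetical 4-feature encoding of a state
w = np.zeros(4)                         # parameter vector to be learned
phi_s = np.array([1.0, 0.2, 0.0, 0.5])
print(linear_value(w, phi_s))           # 0.0 before any learning
```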

4.2 Gradient Descent Methods

Gradient-based methods are among the most widely used function optimization techniques. To find a local minimum of a differentiable function, gradient descent takes steps in the direction of the negative gradient of the function at the current point. The gradient of a function points in the direction of its greatest rate of increase, hence the negative of the gradient points in the direction of its greatest rate of decrease. Gradients can easily be calculated from the first-order derivatives of a function, making gradient descent a first-order optimization algorithm.

One class of functions for which gradient descent works particularly well is the class of convex functions. A function f : X → R is convex if for all x_1, x_2 ∈ X and λ ∈ [0, 1],

$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$

That is, over any interval of the domain the function lies on or below the straight line joining its values at the endpoints of the interval. This means that a convex function has only one local minimum, which is also the global minimum.

In supervised learning, a common objective to minimize is the squared error. If f(x) is the unknown function that we are trying to learn and g(w, φ(x)) = ⟨w, φ(x)⟩ is our estimator of f, the total squared error Err over all inputs x is

$$Err = \frac{1}{2} E_x\big[f(x) - g(w, \phi(x))\big]^2,$$

which is the objective we want to minimize. However, it is impossible to calculate this expectation because we do not know the values that f would return for all possible inputs. Usually, we only have a sample of n input-output pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n).


We can therefore only reduce the error over these empirical observations,

$$Err = \frac{1}{2} \sum_{i=1}^{n} \big[y_i - g(w, \phi(x_i))\big]^2.$$

To optimize g to be a more accurate estimate of f, we take the gradient of Err with respect to w and use it to update w,

$$w \leftarrow w - \alpha \nabla_w Err,$$

where α is the step size. This update moves w a small gradient step in the direction that decreases Err, and w is updated until it converges to a local optimum.

The gradient of Err is

$$\nabla_w Err = -\sum_{i=1}^{n} \big[y_i - g(w, \phi(x_i))\big]\,\nabla_w g(w, \phi(x_i)) = -\sum_{i=1}^{n} \big[y_i - g(w, \phi(x_i))\big]\,\phi(x_i).$$

This method of performing gradient descent over all samples (x, y) at once is called batch gradient descent.
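A minimal sketch of batch gradient descent for the linear model g(w, x) = ⟨w, φ(x)⟩ on assumed toy data:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    """Minimise 1/2 * sum_i (y_i - <w, x_i>)^2 over a fixed batch (X, y)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        residual = y - X @ w     # y_i - g(w, phi(x_i)) for every sample
        grad = -X.T @ residual   # gradient of the summed squared error
        w -= alpha * grad        # step against the gradient
    return w

# Example: recover w = [2, -1] from noiseless samples
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -1.0])
print(batch_gradient_descent(X, y))   # approximately [ 2. -1.]
```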

4.2.1 Stochastic Gradient Descent

However, there are situations where it is not possible to compute the gradient over all samples. The samples may be arriving one at a time, or there may be so many of them (even infinitely many) that it is intractable to calculate their entire sum for the gradient. To do function approximation in these situations, the method we use is stochastic gradient descent (SGD). As we shall see later, this is the method we use to incorporate function approximation into our TD learning algorithms.

Let err(x) = y − g(w, φ(x)) be the error of a single sample input-output pair (x, y). Err is therefore the sum of the squares of err,

$$Err = \frac{1}{2} \sum_x err(x)^2.$$

Instead of taking the gradient of Err and using that to update w, we take the gradient of the single-sample squared error ½ err(x)² at one sample (x, y) and use it as a noisy estimate of ∇_w Err,


$$\nabla_w \tfrac{1}{2} err(x)^2 = -\big[y - g(w, \phi(x))\big]\,\nabla_w g(w, \phi(x)) = -\big[y - g(w, \phi(x))\big]\,\phi(x).$$

We then use this gradient to update w,

$$w \leftarrow w - \alpha \nabla_w \tfrac{1}{2} err(x)^2 = w + \alpha\,\big[y - g(w, \phi(x))\big]\,\phi(x). \tag{4.1}$$
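A minimal sketch of the single-sample update in Equation 4.1 (the names are illustrative):

```python
import numpy as np

def sgd_step(w, phi_x, y, alpha=0.05):
    """One stochastic gradient step on the squared error of a single sample."""
    error = y - np.dot(w, phi_x)      # err(x) = y - g(w, phi(x))
    return w + alpha * error * phi_x  # w <- w - alpha * grad of 1/2 err(x)^2

# Example: process noisy samples one at a time
w = np.zeros(2)
for phi_x, y in [(np.array([1.0, 0.0]), 2.0), (np.array([0.0, 1.0]), -1.0)]:
    w = sgd_step(w, phi_x, y)
```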

4.2.2 Convergence of SGD for Markov Processes

Stochastic gradient descent for RL is a special case because RL data is generated by a Markov process. Hence, the convergence guarantee we show is specific to Markov chains. Let {X_i}, i = 1, 2, ..., be a time-homogeneous Markov process, let A(·) be a mapping which maps every X ∈ χ to a d × d matrix, and let b(·) map each X to a vector b(X). Assume the following:

1. The learning rates α_t are deterministic, non-negative, and satisfy the learning-rate constraints in Section 3.3.1.
2. The Markov process {X_i} has a steady-state distribution π such that lim_{t→∞} P(X_t | X_0) = π(X_t). E_0[·] is the expectation with respect to this invariant distribution.
3. The matrix A = E_0[A(X_t)] is negative definite.
4. There exists a constant K such that ‖A(X)‖ ≤ K and ‖b(X)‖ ≤ K for all X ∈ χ.
5. For any initial state X_0, the expectations of A(X_t) and b(X_t) converge exponentially fast to the steady-state expectations A and b.

Under these assumptions, the stochastic algorithm

$$w_{t+1} = w_t + \alpha_t \big(A(X_t) w_t + b(X_t)\big) \tag{4.2}$$

converges with probability 1 to the unique solution w∗ of the system Aw∗ + b = 0. This means that, given the above assumptions, SGD will eventually converge to a local optimum for Markov processes.

Note that Equation 4.1 takes almost the same form as Equation 4.2. One can choose A and b so that Equation 4.2 matches the SGD update in Equation 4.1, and hence the update falls under the same convergence guarantee.

4.3 TD Learning

We wish to optimize V_t, our estimate of V^π at time t. Recall that V^π is the expected return the agent receives when starting in state s and following policy π thereafter, which we designated R_t. In RL, rewards come one sample at a time, and we need to be able to update our estimates of the value functions from this single sample. Hence, we do not have sample values of R_t nor of V^π, and gradient descent on objectives involving those values is not possible. Instead, the error we minimize is the Bellman error. The Bellman error at a single time step t is defined as

$$e(s_t) = \frac{1}{2}\big[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)\big]^2.$$

Recall that V(s) is approximated as a linear function V_w(s) = ⟨w, φ(s)⟩. However, when we take the gradient of e(s_t) we treat the r_{t+1} + γV_t(s_{t+1}) term as a sampled constant value of R_t and not as a function of w. The gradient of e(s_t) is therefore

$$\nabla_w e(s_t) = -\big[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)\big]\,\nabla_w V_t(s_t) = -\big[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)\big]\,\phi(s_t).$$

Using this gradient to update w results in the following update,

$$w_{t+1} = w_t - \alpha \nabla_w e(s_t) = w_t + \alpha\,\big[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)\big]\,\phi(s_t),$$

where α is again a step size.
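A minimal sketch of this semi-gradient TD(0) update for V_w(s) = ⟨w, φ(s)⟩; the bootstrap term is treated as a constant, exactly as above, and the function signature is an assumption:

```python
import numpy as np

def td0_linear_update(w, phi_s, phi_s_next, r, alpha=0.05, gamma=0.99,
                      done=False):
    """One semi-gradient TD(0) step for a linear value function."""
    v_s = np.dot(w, phi_s)
    v_next = 0.0 if done else np.dot(w, phi_s_next)
    delta = r + gamma * v_next - v_s   # Bellman (TD) error
    return w + alpha * delta * phi_s   # step along phi(s_t) only
```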

4.3.1 Control with Function Approximation

For state-action value functions, the Bellman error given state s and action a at time t is

$$e(s_t, a_t) = \frac{1}{2}\big[r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)\big]^2.$$

Again, we treat the r_{t+1} + γQ_t(s_{t+1}, a_{t+1}) term as a sampled constant value of R_t and not as a function of w. The gradient of e(s_t, a_t) is therefore

$$\nabla_w e(s_t, a_t) = -\big[r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)\big]\,\phi(s_t, a_t)$$

and we update w using this gradient to get

$$w_{t+1} = w_t + \alpha\,\big[r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)\big]\,\phi(s_t, a_t).$$
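A minimal sketch of one episode of this control scheme with an ε-greedy policy over Q_w(s, a) = ⟨w, φ(s, a)⟩; env_reset, env_step, and the feature map phi(s, a) are assumed interfaces:

```python
import random
import numpy as np

def sarsa_fa_episode(env_reset, env_step, phi, actions, w,
                     alpha=0.05, gamma=0.99, epsilon=0.1):
    """Semi-gradient Sarsa with a linear state-action value function."""
    def choose(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: np.dot(w, phi(s, a)))

    s, done = env_reset(), False
    a = choose(s)
    while not done:
        s2, r, done = env_step(s, a)
        q = np.dot(w, phi(s, a))
        if done:
            delta = r - q
        else:
            a2 = choose(s2)
            delta = r + gamma * np.dot(w, phi(s2, a2)) - q
        w = w + alpha * delta * phi(s, a)   # update along phi(s_t, a_t)
        if not done:
            s, a = s2, a2
    return w
```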

4.3.2 Convergence of TD With Function Approximation

We now provide convergence guarantees for policy evaluation under TD(0). Assume the same step-size constraints as in Section 3.3.1 hold. Additional constraints for convergence are:

1. The state space is an aperiodic Markov chain when we follow π, and all states are visited an infinite number of times during an infinitely long episode.
2. The policy π is a proper policy.
3. The feature mapping φ(s) is a linearly independent function on the state space (the matrix Φ has full rank).

If all the constraints hold, then TD(0) converges to V^π. This follows from the convergence of SGD for Markov processes discussed earlier in Section 4.2.2.

4.4 Residual Gradients

Recall that when we took the gradient of e(s_t) when applying gradient descent to TD learning, we treated the r_{t+1} + γV_t(s_{t+1}) term as a sampled constant value of R_t and not as a function of w. Since V_t(s_{t+1}) = ⟨w, φ(s_{t+1})⟩ is in fact a function of w, TD learning is not a proper gradient descent method. Hence, TD(0) can diverge when the above constraints are not met.

There is another form of RL algorithm with function approximation that is exactly gradient descent, called residual gradients (RG). With residual gradients we now treat V_t(s_{t+1}) as the function of w that it really is. Since RG is a proper gradient descent method, its convergence is much more robust than that of TD(0).

The gradient of e(s_t) with respect to w now becomes

$$\nabla_w e(s_t) = \big[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)\big]\,\big[\gamma \nabla_w V_t(s_{t+1}) - \nabla_w V_t(s_t)\big] = \big[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)\big]\,\big[\gamma \phi(s_{t+1}) - \phi(s_t)\big].$$

The update for the weights is now

$$w_{t+1} = w_t - \alpha\,\big[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)\big]\,\big[\gamma \phi(s_{t+1}) - \phi(s_t)\big].$$
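A minimal sketch of the residual-gradient update for a linear value function; terminal transitions are assumed to have zero successor value, and the function signature is illustrative:

```python
import numpy as np

def residual_gradient_update(w, phi_s, phi_s_next, r, alpha=0.05, gamma=0.99,
                             done=False):
    """True-gradient update: both V(s_t) and V(s_{t+1}) depend on w."""
    v_s = np.dot(w, phi_s)
    v_next = 0.0 if done else np.dot(w, phi_s_next)
    delta = r + gamma * v_next - v_s
    # Gradient of 1/2 * delta^2 with both value terms treated as functions of w
    grad = delta * (-phi_s) if done else delta * (gamma * phi_s_next - phi_s)
    return w - alpha * grad
```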

4.4.1 Convergence of Residual Gradients

Because residual gradients is a true stochastic gradient method, convergence for policy evaluation with a fixed policy π is guaranteed by the result in Section 4.2.2. No other constraints are necessary.


4.4.2 Control with Residual Gradients

For the control problem, the gradient of e(s_t, a_t) is now

$$\nabla_w e(s_t, a_t) = \big[r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)\big]\,\big[\gamma \nabla_w Q_t(s_{t+1}, a_{t+1}) - \nabla_w Q_t(s_t, a_t)\big] = \big[r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)\big]\,\big[\gamma \phi(s_{t+1}, a_{t+1}) - \phi(s_t, a_t)\big]$$

and the update for w becomes

$$w_{t+1} = w_t - \alpha\,\big[r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)\big]\,\big[\gamma \phi(s_{t+1}, a_{t+1}) - \phi(s_t, a_t)\big].$$


Chapter 5

Experimental Results

We now implement some of the techniques discussed in this paper to show the results of RL with function approximation. The two techniques we use are Sarsa and residual gradients. We implement them on two domains that require function approximation, Cart Pole and Mountain Car. Additionally, we use two different feature mappings for the state features, radial basis function (RBF) coding and tile coding [Sutton and Barto 1998].

5.1 Domains

5.1.1 Mountain Car

In the mountain car domain, the agent tries to drive a car up a hill towards the goal position. It has a two-dimensional state space: the position of the car on the hill and the velocity of the car. Dimension s_1 is the position of the car and is a continuous value bounded to [−1.2, 0.5]. Dimension s_2 is the velocity, also a continuous value, bounded to [−0.07, 0.07]. The agent can choose among three actions, a ∈ {−1, 0, 1}, which correspond to accelerating left, coasting, and accelerating right. The goal of the agent is to get the car to the rightmost position, that is, to a state with s_1 = 0.5. At each time step the agent receives a reward of −1 until it reaches the goal state, at which point it receives a reward of 1.

Figure 5.1: The Mountain Car Domain

5.1.2 Cart Pole

In the cart pole domain, the agent tries to balance a pole hinged on top of a cart by moving the cart along a frictionless track. It has a four-dimensional state space: the position of the cart, the cart velocity, the angle of the pole, and the pole's angular velocity. Dimension s_1 is the position of the cart and is bounded to [−2.4, 2.4]. Dimension s_2 is the cart velocity and is unbounded, (−∞, ∞). Dimension s_3 is the pole angle and is bounded to [−12, 12] degrees; any angle outside these bounds results in failure. Dimension s_4 is the angular velocity of the pole and is also unbounded, (−∞, ∞). The agent receives a reward of zero at each time step until the angle of the pole exceeds the bounds, at which point the agent receives a reward of −1 and the episode ends.

Figure 5.2: The Cart Pole Domain

5.2 Results

5.2.1 Optimal Parameter Values

Optimal parameter values for each domain were found by repeatedly testing different values for each parameter and recording the best results. For tile coding we used 10 tilings; for RBF coding we used 10 radial basis functions with a variance of 0.05 each. The resulting values are listed in Tables 5.1 and 5.2.
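As an illustration of the feature mappings, here is a minimal sketch of Gaussian RBF coding for a continuous state; the centre placement is an assumption, while the 0.05 variance mirrors the value used in the experiments:

```python
import numpy as np

def rbf_features(state, centers, variance=0.05):
    """Gaussian RBF coding: one feature per basis-function centre."""
    diffs = centers - np.asarray(state)        # offset from each centre
    sq_dist = np.sum(diffs ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * variance))

# Example: 10 centres spread over a (normalised) two-dimensional state space
centers = np.random.uniform(0.0, 1.0, size=(10, 2))
phi = rbf_features([0.3, 0.7], centers)
```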

5.2.2 Results

As seen in Figures 5.3 and 5.4, Sarsa converges faster than residual gradients. This matches the discussion in Section 4.4: while TD learning is not a true stochastic gradient descent method and hence diverges more often, when it does converge its rate of convergence is faster than that of residual gradients, which is a true SGD method. In all cases except one, tile coding converges faster than RBF coding.

                        α      γ    λ
Sarsa - Tile Coding     0.01   1    0.7
Sarsa - RBF Coding      0.01   1    0.7
RG - Tile Coding        0.1    1    0.7
RG - RBF Coding         0.2    1    0.9

Table 5.1: Optimal Parameter Values for the Cart Pole Domain

                        α      γ    λ
Sarsa - Tile Coding     0.1    1    0.9
Sarsa - RBF Coding      0.1    1    0.9
RG - Tile Coding        0.1    1    0.8
RG - RBF Coding         0.3    1    0.9

Table 5.2: Optimal Parameter Values for the Mountain Car Domain

Tuning the parameters has a large effect on the learning performance of the agent. For some sub-optimal parameter values, the agent never learns the optimal policy for the domain.

Figure 5.3: Average Results for the Cart Pole domain. Higher is better.


Figure 5.4: Average Results for the Mountain Car domain. Lower is better.


Chapter 6

Final Remarks

As we have discussed in this report, combining reinforcement learning with function approximation techniques allows an agent to learn to operate in environments with an infinitely large number of states. It does this by letting the agent generalize what it has learned in some states to other, similar states.

We first discussed traditional reinforcement learning methods that store the values of the states in a tabular format, and then proceeded to the function approximation extensions of these methods. We used linear function approximation, optimizing the parameters with respect to a mean squared error using stochastic gradient descent. The function approximation update used in Sarsa is not a true SGD method, and hence there are times when Sarsa will diverge. Residual gradients is a true SGD method, and hence its convergence is more robust. However, when Sarsa does converge, its rate of convergence is usually faster than that of residual gradients. The faster convergence of Sarsa is mainly an experimental result, not a theoretical one; there may be situations in which residual gradients converges faster. Finally, we implemented the discussed RL with function approximation techniques on two domains and showed the results.

Although the RL theory for environments whose state values are stored in tabular format is quite mature, the theory for RL with function approximation as discussed here is still very much under active development, with new techniques and methods still being discovered. More work on this particular aspect of reinforcement learning is needed and will provide better results in the future.


Bibliography

Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning.

Bertsekas, D. P. and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Athena Scientific.

Rummery, G. A. and Niranjan, M. 1994. On-line Q-learning using connectionist systems. Technical report.

Sutton, R. and Barto, A. 1998. Reinforcement Learning: An Introduction. The MIT Press.

Watkins, C. 1989. Learning from Delayed Rewards.

Watkins, C. and Dayan, P. 1992. Q-learning. Machine Learning.