ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 12: Generalization and Function Approximation
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee, Fall 2012
October 23, 2012
ECE 517: Reinforcement Learning in AI
Outline
- Introduction
- Value Prediction with function approximation
- Gradient Descent framework
  - On-Line Gradient-Descent TD(λ)
  - Linear methods
- Control with Function Approximation
Introduction
We have so far assumed a tabular view of value or state-value functions
This inherently limits our problem space to small state/action sets:
- Space requirements – storage of values
- Computation complexity – sweeping/updating the values
- Communication constraints – getting the data where it needs to go
Reality is very different – high-dimensional state representations are common
We will next look at generalization – an attempt by the agent to learn about a large state set while visiting/experiencing only a small subset of it
- People do it – how can machines achieve the same goal?
General Approach
Luckily, many approximation techniques have been developed, e.g. multivariate function approximation schemes
We will utilize such techniques in an RL context
Value Prediction with FA
As usual, let's start with prediction of V
Instead of using a table for V_t, the latter will be represented in a parameterized functional form with parameter vector θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T:
V_t(s) = f(s, θ_t)
We'll assume that V_t is a sufficiently smooth differentiable function of θ_t, for all s
For example, a neural network can be trained to predict V, where θ_t are the connection weights
We will require that n (the number of parameters) is much smaller than the size of the state set
When a single state is backed up, the change generalizes to affect the values of many other states
Adapt Supervised Learning Algorithms
Supervised learning system: inputs → outputs
Training info = desired (target) outputs
Error = (target output − actual output)
Training example = {input, target output}
Performance Measures
Let us assume that training examples all take the form (description of s_t, V^π(s_t))
A common performance metric is the mean-squared error (MSE) over a distribution P:
MSE(θ_t) = Σ_{s∈S} P(s) [V^π(s) − V_t(s)]²
Q: Why use P? Is MSE the best metric?
Let us assume that P is always the distribution of states at which backups are done
On-policy distribution: the distribution created while following the policy being evaluated
- Stronger results are available for this distribution
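The MSE definition can be checked numerically. A minimal sketch, where the state distribution and the two value vectors are illustrative, not from the slides:

```python
import numpy as np

# Toy setting: 4 states, a state distribution P, true values V^pi,
# and current approximate values V_t (all numbers are illustrative).
P = np.array([0.4, 0.3, 0.2, 0.1])        # on-policy state distribution
V_pi = np.array([1.0, 0.5, -0.5, 2.0])    # true values V^pi(s)
V_t = np.array([0.8, 0.6, -0.4, 1.5])     # approximations V_t(s)

# MSE(theta_t) = sum_s P(s) * (V^pi(s) - V_t(s))^2
mse = np.sum(P * (V_pi - V_t) ** 2)
```

Weighting by P means errors at rarely visited states barely matter, which is exactly why the choice of P is worth questioning.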
Gradient Descent
Let f be any function of the parameter space. Its gradient at any point θ_t in this space is:
∇_θ f(θ_t) = ( ∂f(θ_t)/∂θ(1), ∂f(θ_t)/∂θ(2), …, ∂f(θ_t)/∂θ(n) )^T
where θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T
We iteratively move down the gradient:
θ_{t+1} = θ_t − α ∇_θ f(θ_t)
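The iterative rule can be sketched on a simple quadratic, where the gradient is known in closed form. The target vector, step size, and iteration count below are illustrative:

```python
import numpy as np

# f(theta) = ||theta - target||^2 has gradient 2*(theta - target),
# so gradient descent should drive theta toward target.
target = np.array([1.0, -2.0, 0.5])   # illustrative minimizer

def f(theta):
    return np.sum((theta - target) ** 2)

def grad_f(theta):
    return 2.0 * (theta - target)

alpha = 0.1                  # step size
theta = np.zeros(3)          # theta_0
for _ in range(200):         # theta_{t+1} = theta_t - alpha * grad f(theta_t)
    theta = theta - alpha * grad_f(theta)
```

With this step size the error shrinks by a constant factor each step, so theta converges to the target geometrically.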
Gradient Descent in RL
Let's now consider the case where the target output, v_t, for sample t is not the true value (unavailable)
In such cases we perform an approximate update, such that
θ_{t+1} = θ_t − (1/2) α ∇_θ [v_t − V_t(s_t)]²
        = θ_t + α [v_t − V_t(s_t)] ∇_θ V_t(s_t)
where v_t is an unbiased estimate of the target output
Examples of v_t are:
- Monte Carlo methods: v_t = R_t
- TD(λ): v_t = R_t^λ (the λ-return)
The general gradient-descent method is guaranteed to converge to a local minimum
On-Line Gradient-Descent TD(λ)
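A minimal sketch of on-line gradient-descent TD(λ) for the linear case, where the gradient of V_t is just the feature vector. The 5-state random-walk environment, one-hot features, and constant step size are illustrative assumptions, not from the slides:

```python
import numpy as np

# On-line gradient-descent TD(lambda) with linear value prediction.
n_states, n_features = 5, 5
gamma, lam, alpha = 1.0, 0.8, 0.1

def phi(s):
    """One-hot features; with these, linear FA reduces to a table."""
    x = np.zeros(n_features)
    x[s] = 1.0
    return x

rng = np.random.default_rng(0)
theta = np.zeros(n_features)

for _ in range(500):                     # episodes of a 5-state random walk
    s = 2                                # start in the middle
    e = np.zeros(n_features)             # eligibility-trace vector
    done = False
    while not done:
        s2 = s + rng.choice([-1, 1])     # step left or right
        if s2 < 0:
            r, done = 0.0, True          # left terminal: reward 0
        elif s2 >= n_states:
            r, done = 1.0, True          # right terminal: reward 1
        else:
            r = 0.0
        v_next = 0.0 if done else theta @ phi(s2)
        delta = r + gamma * v_next - theta @ phi(s)  # TD error
        e = gamma * lam * e + phi(s)                 # accumulate trace
        theta = theta + alpha * delta * e            # gradient-descent update
        s = s2
```

For this walk the true values rise linearly from 1/6 to 5/6 across the states, and the learned parameters approach them.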
Residual Gradient Descent
The following statement is not completely accurate:
θ_{t+1} = θ_t + α [v_t − V_t(s_t)] ∇_θ V_t(s_t)
since it suggests that ∇_θ v_t = 0, which is not true, e.g.
v_t = r_{t+1} + γ V_t(s_{t+1})
so we should be writing (residual GD):
θ_{t+1} = θ_t + α [v_t − V_t(s_t)] (∇_θ V_t(s_t) − γ ∇_θ V_t(s_{t+1}))
Comment: the whole scheme is no longer supervised-learning based!
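The difference between the two updates can be traced through a single step with linear V, where each gradient is just a feature vector. The feature vectors and constants below are illustrative:

```python
import numpy as np

# One update step: semi-gradient TD (treats grad v_t = 0) vs. the
# residual-gradient update, which also differentiates through v_t.
gamma, alpha = 0.9, 0.1
phi_s = np.array([1.0, 0.0])          # features of s_t
phi_s_next = np.array([0.0, 1.0])     # features of s_{t+1}
theta = np.array([0.5, 1.0])
r = 0.0                               # r_{t+1}

v_t = r + gamma * (theta @ phi_s_next)   # v_t = r_{t+1} + gamma V_t(s_{t+1})
delta = v_t - theta @ phi_s

semi_grad = theta + alpha * delta * phi_s                        # grad v_t ignored
residual = theta + alpha * delta * (phi_s - gamma * phi_s_next)  # full gradient
```

The residual update also moves the parameter that controls V_t(s_{t+1}), in the opposite direction, which is what makes it a true gradient method on the Bellman error.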
Linear Methods
One of the most important special cases of GD FA
V_t becomes a linear function of the parameter vector θ_t
For every state s, there is a (real-valued) column vector of features φ_s = (φ_s(1), φ_s(2), …, φ_s(n))^T
The features can be constructed from the states in many ways
The linear approximate state-value function is given by
V_t(s) = θ_t^T φ_s = Σ_{i=1}^n θ_t(i) φ_s(i)
Q: what is ∇_θ V_t(s)?
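A minimal sketch of the linear form V_t(s) = θ_t^T φ_s, with illustrative numbers; note that the gradient with respect to θ is just the feature vector φ_s:

```python
import numpy as np

# Linear approximate value function: V_t(s) = theta^T phi_s.
theta = np.array([0.5, -1.0, 2.0])   # parameter vector theta_t (illustrative)
phi_s = np.array([1.0, 0.0, 0.5])    # feature vector for some state s

V = theta @ phi_s                    # = sum_i theta(i) * phi_s(i)
grad = phi_s                         # gradient of V_t(s) w.r.t. theta
```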
Nice Properties of Linear FA Methods
The gradient is very simple: ∇_θ V_t(s) = φ_s
For MSE, the error surface is simple: a quadratic surface with a single (global) minimum
Linear gradient-descent TD(λ) converges if:
- the step size decreases appropriately
- states are sampled on-line (from the on-policy distribution)
It converges to a parameter vector θ_∞ with the property:
MSE(θ_∞) ≤ [(1 − γλ)/(1 − γ)] MSE(θ*)
where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997)
Limitations of Pure Linear Methods
Many applications require a mixture (e.g. product) of the different feature components
- The linear form prohibits direct representation of the interactions between features
- Intuition: feature i is good only in the absence of feature j
Example: pole-balancing task
- High angular velocity can be good or bad …
  - If the angle is high → imminent danger of falling (bad state)
  - If the angle is low → the pole is righting itself (good state)
In such cases we need to introduce features that express a mixture of other features
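One common remedy is to append a product feature, so a linear learner can weight the interaction directly. The variable names and numbers below are illustrative:

```python
import numpy as np

# Pole-balancing intuition: angle and angular velocity interact, so add
# their product as an extra feature (all values are illustrative).
angle, ang_vel = 0.3, 1.5

base_features = np.array([angle, ang_vel])
augmented = np.append(base_features, angle * ang_vel)  # interaction term
```

A linear function of the augmented vector can now penalize "high angle AND high angular velocity" without penalizing either feature alone.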
Coarse Coding – Feature Composition/Extraction
Shaping Generalization in Coarse Coding
- If we train at one point (state), X, the parameters of all circles intersecting X will be affected
- Consequence: the value function at all points within the union of those circles will be affected
- The effect is greater for points that have more circles "in common" with X
Learning and Coarse Coding
All three cases have the same number of features (50); the learning rate is 0.2/m, where m is the number of features present in each example
Tile Coding
- A binary feature for each tile
- The number of features present at any one time is constant
- Binary features mean the weighted sum is easy to compute
- It is easy to compute the indices of the features present
Tile Coding Cont.
Irregular tilings
Hashing
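A common way to realize these properties can be sketched as follows. The tiling count, resolution, offsets, and flat-index scheme are illustrative assumptions (practical implementations often add hashing, as noted above):

```python
import numpy as np

# Tile coding over a 2-D continuous state in [0, 1] x [0, 1]:
# several grid tilings, each shifted by a small offset, with exactly
# one active (binary) feature per tiling.
n_tilings, tiles_per_dim = 4, 8
low, high = 0.0, 1.0                     # state-space bounds per dimension

def active_tiles(x, y):
    """Indices of the one active tile in each tiling (n_tilings total)."""
    width = (high - low) / tiles_per_dim
    grid = tiles_per_dim + 1             # tilings extend one extra tile
    indices = []
    for t in range(n_tilings):
        offset = t * width / n_tilings   # each tiling shifted slightly
        ix = int((x - low + offset) / width)
        iy = int((y - low + offset) / width)
        indices.append(t * grid * grid + iy * grid + ix)
    return indices

features = active_tiles(0.5, 0.5)
# The weighted sum is then just a sum over a few theta entries:
# V = sum(theta[i] for i in features)
```

Exactly n_tilings features are active for any state, so every value lookup touches the same small, fixed number of parameters.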
Control with Function Approximation
Learning state-action values
Training examples of the form: (description of (s_t, a_t), v_t)
The general gradient-descent rule:
θ_{t+1} = θ_t + α [v_t − Q_t(s_t, a_t)] ∇_θ Q_t(s_t, a_t)
Gradient-descent Sarsa(λ) (backward view):
θ_{t+1} = θ_t + α δ_t e_t
where
δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
e_t = γλ e_{t−1} + ∇_θ Q_t(s_t, a_t)
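These update equations can be traced through one step with linear state-action features, where ∇_θ Q_t(s, a) is the feature vector of the pair. All numbers below are illustrative:

```python
import numpy as np

# One backward-view Sarsa(lambda) update with linear Q.
gamma, lam, alpha = 0.9, 0.8, 0.1

phi_sa = np.array([1.0, 0.0, 1.0])       # features of (s_t, a_t)
phi_sa_next = np.array([0.0, 1.0, 1.0])  # features of (s_{t+1}, a_{t+1})

theta = np.array([0.2, 0.4, -0.1])       # parameters of Q_t
e = np.zeros(3)                          # eligibility traces
r = 1.0                                  # r_{t+1}

q = theta @ phi_sa                       # Q_t(s_t, a_t)
q_next = theta @ phi_sa_next             # Q_t(s_{t+1}, a_{t+1})

delta = r + gamma * q_next - q           # delta_t
e = gamma * lam * e + phi_sa             # e_t = gamma*lam*e_{t-1} + grad Q
theta = theta + alpha * delta * e        # theta_{t+1} = theta_t + alpha*delta_t*e_t
```

Only the parameters whose features were recently active (nonzero trace) get credit for the TD error.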
GPI with Linear Gradient Descent Sarsa(λ)
GPI Linear Gradient Descent Watkins' Q(λ)
Mountain-Car Task Example
Challenge: driving an underpowered car up a steep mountain road
- Gravity is stronger than its engine
Solution approach: build up enough inertia on the other slope to carry the car up the opposite slope
Example of a task where things can get worse in a sense (farther from the goal) before they get better
- Hard to solve using classic control schemes
Reward is −1 for all steps until the episode terminates
Actions: full throttle forward (+1), full throttle reverse (−1), and zero throttle (0)
Two 9×9 overlapping tilings were used to represent the continuous state space
Mountain-Car Task
Mountain-Car Results (five 9×9 tilings were used)
Summary
- Generalization is an important RL attribute
- Adapting supervised-learning function approximation methods
  - Each backup is treated as a learning example
- Gradient-descent methods
- Linear gradient-descent methods
  - Radial basis functions
  - Tile coding
- Nonlinear gradient-descent methods? NN backpropagation?
- Subtleties involving function approximation, bootstrapping and the on-policy/off-policy distinction