ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 12: Generalization and Function Approximation
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee, Fall 2012
October 23, 2012
ECE 517: Reinforcement Learning in AI
Outline
- Introduction
- Value Prediction with function approximation
- Gradient Descent framework
  - On-Line Gradient-Descent TD(λ)
  - Linear methods
- Control with Function Approximation
Introduction
We have so far assumed a tabular view of value or state-value functions
This inherently limits our problem space to small state/action sets:
- Space requirements – storage of values
- Computation complexity – sweeping/updating the values
- Communication constraints – getting the data where it needs to go
Reality is very different – high-dimensional state representations are common
We will next look at generalization – an attempt by the agent to learn about a large state set while visiting/experiencing only a small subset of it
- People do it – how can machines achieve the same goal?
General Approach
Luckily, many approximation techniques have been developed, e.g. multivariate function approximation schemes
We will utilize such techniques in an RL context
Value Prediction with FA
As usual, let's start with prediction of V
Instead of using a table for V_t, the latter will be represented in a parameterized functional form with parameter vector θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T:
V_t(s) = f(s, θ_t)
We'll assume that V_t is a sufficiently smooth differentiable function of θ_t, for all s
For example, a neural network can be trained to predict V, where θ_t are the connection weights
We will require that n (the number of parameters) is much smaller than the size of the state set
When a single state is backed up, the change generalizes to affect the values of many other states
Adapt Supervised Learning Algorithms
Supervised learning system: inputs → outputs
Training info = desired (target) outputs
Error = (target output − actual output)
Training example = {input, target output}
Performance Measures
Let us assume that training examples all take the form (description of s_t, V^π(s_t))
A common performance metric is the mean-squared error (MSE) over a distribution P:
MSE(θ_t) = Σ_{s∈S} P(s) [V^π(s) − V_t(s)]²
Q: Why use P? Is MSE the best metric?
Let us assume that P is always the distribution of states at which backups are done
On-policy distribution: the distribution created while following the policy being evaluated
- Stronger results are available for this distribution
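The MSE definition can be checked numerically. A minimal sketch, where the state distribution and the two value vectors are illustrative, not from the slides:

```python
import numpy as np

# Toy setting: 4 states, a state distribution P, true values V^pi,
# and current approximate values V_t (all numbers are illustrative).
P = np.array([0.4, 0.3, 0.2, 0.1])        # on-policy state distribution
V_pi = np.array([1.0, 0.5, -0.5, 2.0])    # true values V^pi(s)
V_t = np.array([0.8, 0.6, -0.4, 1.5])     # approximations V_t(s)

# MSE(theta_t) = sum_s P(s) * (V^pi(s) - V_t(s))^2
mse = np.sum(P * (V_pi - V_t) ** 2)
```

Weighting by P means errors at rarely visited states barely matter, which is exactly why the choice of P is worth questioning.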
Gradient Descent
Let f be any function of the parameter space. Its gradient at any point θ_t in this space is:
∇_θ f(θ_t) = ( ∂f(θ_t)/∂θ(1), ∂f(θ_t)/∂θ(2), …, ∂f(θ_t)/∂θ(n) )^T
where θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T
We iteratively move down the gradient:
θ_{t+1} = θ_t − α ∇_θ f(θ_t)
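The iterative rule can be sketched on a simple quadratic, where the gradient is known in closed form. The target vector, step size, and iteration count below are illustrative:

```python
import numpy as np

# f(theta) = ||theta - target||^2 has gradient 2*(theta - target),
# so gradient descent should drive theta toward target.
target = np.array([1.0, -2.0, 0.5])   # illustrative minimizer

def f(theta):
    return np.sum((theta - target) ** 2)

def grad_f(theta):
    return 2.0 * (theta - target)

alpha = 0.1                  # step size
theta = np.zeros(3)          # theta_0
for _ in range(200):         # theta_{t+1} = theta_t - alpha * grad f(theta_t)
    theta = theta - alpha * grad_f(theta)
```

With this step size the error shrinks by a constant factor each step, so theta converges to the target geometrically.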
Gradient Descent in RL
Let's now consider the case where the target output, v_t, for sample t is not the true value (unavailable)
In such cases we perform an approximate update, such that
θ_{t+1} = θ_t − (1/2) α ∇_θ [v_t − V_t(s_t)]²
        = θ_t + α [v_t − V_t(s_t)] ∇_θ V_t(s_t)
where v_t is an unbiased estimate of the target output
Examples of v_t are:
- Monte Carlo methods: v_t = R_t
- TD(λ): v_t = R_t^λ (the λ-return)
The general gradient-descent method is guaranteed to converge to a local minimum
On-Line Gradient-Descent TD(λ)
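A minimal sketch of on-line gradient-descent TD(λ) for the linear case, where the gradient of V_t is just the feature vector. The 5-state random-walk environment, one-hot features, and constant step size are illustrative assumptions, not from the slides:

```python
import numpy as np

# On-line gradient-descent TD(lambda) with linear value prediction.
n_states, n_features = 5, 5
gamma, lam, alpha = 1.0, 0.8, 0.1

def phi(s):
    """One-hot features; with these, linear FA reduces to a table."""
    x = np.zeros(n_features)
    x[s] = 1.0
    return x

rng = np.random.default_rng(0)
theta = np.zeros(n_features)

for _ in range(500):                     # episodes of a 5-state random walk
    s = 2                                # start in the middle
    e = np.zeros(n_features)             # eligibility-trace vector
    done = False
    while not done:
        s2 = s + rng.choice([-1, 1])     # step left or right
        if s2 < 0:
            r, done = 0.0, True          # left terminal: reward 0
        elif s2 >= n_states:
            r, done = 1.0, True          # right terminal: reward 1
        else:
            r = 0.0
        v_next = 0.0 if done else theta @ phi(s2)
        delta = r + gamma * v_next - theta @ phi(s)  # TD error
        e = gamma * lam * e + phi(s)                 # accumulate trace
        theta = theta + alpha * delta * e            # gradient-descent update
        s = s2
```

For this walk the true values rise linearly from 1/6 to 5/6 across the states, and the learned parameters approach them.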
Residual Gradient Descent
The following statement is not completely accurate:
θ_{t+1} = θ_t + α [v_t − V_t(s_t)] ∇_θ V_t(s_t)
since it suggests that ∇_θ v_t = 0, which is not true, e.g.
v_t = r_{t+1} + γ V_t(s_{t+1})
so we should be writing (residual GD):
θ_{t+1} = θ_t + α [v_t − V_t(s_t)] (∇_θ V_t(s_t) − γ ∇_θ V_t(s_{t+1}))
Comment: the whole scheme is no longer supervised-learning based!
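The difference between the two updates can be traced through a single step with linear V, where each gradient is just a feature vector. The feature vectors and constants below are illustrative:

```python
import numpy as np

# One update step: semi-gradient TD (treats grad v_t = 0) vs. the
# residual-gradient update, which also differentiates through v_t.
gamma, alpha = 0.9, 0.1
phi_s = np.array([1.0, 0.0])          # features of s_t
phi_s_next = np.array([0.0, 1.0])     # features of s_{t+1}
theta = np.array([0.5, 1.0])
r = 0.0                               # r_{t+1}

v_t = r + gamma * (theta @ phi_s_next)   # v_t = r_{t+1} + gamma V_t(s_{t+1})
delta = v_t - theta @ phi_s

semi_grad = theta + alpha * delta * phi_s                        # grad v_t ignored
residual = theta + alpha * delta * (phi_s - gamma * phi_s_next)  # full gradient
```

The residual update also moves the parameter that controls V_t(s_{t+1}), in the opposite direction, which is what makes it a true gradient method on the Bellman error.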
Linear Methods
One of the most important special cases of GD FA
V_t becomes a linear function of the parameter vector θ_t
For every state s, there is a (real-valued) column vector of features φ_s = (φ_s(1), φ_s(2), …, φ_s(n))^T
The features can be constructed from the states in many ways
The linear approximate state-value function is given by
V_t(s) = θ_t^T φ_s = Σ_{i=1}^n θ_t(i) φ_s(i)
Q: what is ∇_θ V_t(s)?
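A minimal sketch of the linear form V_t(s) = θ_t^T φ_s, with illustrative numbers; note that the gradient with respect to θ is just the feature vector φ_s:

```python
import numpy as np

# Linear approximate value function: V_t(s) = theta^T phi_s.
theta = np.array([0.5, -1.0, 2.0])   # parameter vector theta_t (illustrative)
phi_s = np.array([1.0, 0.0, 0.5])    # feature vector for some state s

V = theta @ phi_s                    # = sum_i theta(i) * phi_s(i)
grad = phi_s                         # gradient of V_t(s) w.r.t. theta
```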
Nice Properties of Linear FA Methods
The gradient is very simple: ∇_θ V_t(s) = φ_s
For MSE, the error surface is simple: a quadratic surface with a single (global) minimum
Linear gradient-descent TD(λ) converges if:
- the step size decreases appropriately
- states are sampled on-line (from the on-policy distribution)
It converges to a parameter vector θ_∞ with the property:
MSE(θ_∞) ≤ [(1 − γλ)/(1 − γ)] MSE(θ*)
where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997)
Limitations of Pure Linear Methods
Many applications require a mixture (e.g. product) of the different feature components
- The linear form prohibits direct representation of the interactions between features
- Intuition: feature i is good only in the absence of feature j
Example: pole-balancing task
- High angular velocity can be good or bad …
  - If the angle is high → imminent danger of falling (bad state)
  - If the angle is low → the pole is righting itself (good state)
In such cases we need to introduce features that express a mixture of other features
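One common remedy is to append a product feature, so a linear learner can weight the interaction directly. The variable names and numbers below are illustrative:

```python
import numpy as np

# Pole-balancing intuition: angle and angular velocity interact, so add
# their product as an extra feature (all values are illustrative).
angle, ang_vel = 0.3, 1.5

base_features = np.array([angle, ang_vel])
augmented = np.append(base_features, angle * ang_vel)  # interaction term
```

A linear function of the augmented vector can now penalize "high angle AND high angular velocity" without penalizing either feature alone.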
Coarse Coding – Feature Composition/Extraction
Shaping Generalization in Coarse Coding
- If we train at one point (state), X, the parameters of all circles intersecting X will be affected
- Consequence: the value function at all points within the union of those circles will be affected
- The effect is greater for points that have more circles "in common" with X
Learning and Coarse Coding
All three cases have the same number of features (50); the learning rate is 0.2/m, where m is the number of features present in each example
Tile Coding
- A binary feature for each tile
- The number of features present at any one time is constant
- Binary features mean the weighted sum is easy to compute
- It is easy to compute the indices of the features present
Tile Coding Cont.
Irregular tilings
Hashing
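A common way to realize these properties can be sketched as follows. The tiling count, resolution, offsets, and flat-index scheme are illustrative assumptions (practical implementations often add hashing, as noted above):

```python
import numpy as np

# Tile coding over a 2-D continuous state in [0, 1] x [0, 1]:
# several grid tilings, each shifted by a small offset, with exactly
# one active (binary) feature per tiling.
n_tilings, tiles_per_dim = 4, 8
low, high = 0.0, 1.0                     # state-space bounds per dimension

def active_tiles(x, y):
    """Indices of the one active tile in each tiling (n_tilings total)."""
    width = (high - low) / tiles_per_dim
    grid = tiles_per_dim + 1             # tilings extend one extra tile
    indices = []
    for t in range(n_tilings):
        offset = t * width / n_tilings   # each tiling shifted slightly
        ix = int((x - low + offset) / width)
        iy = int((y - low + offset) / width)
        indices.append(t * grid * grid + iy * grid + ix)
    return indices

features = active_tiles(0.5, 0.5)
# The weighted sum is then just a sum over a few theta entries:
# V = sum(theta[i] for i in features)
```

Exactly n_tilings features are active for any state, so every value lookup touches the same small, fixed number of parameters.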
Control with Function Approximation
Learning state-action values
Training examples of the form: (description of (s_t, a_t), v_t)
The general gradient-descent rule:
θ_{t+1} = θ_t + α [v_t − Q_t(s_t, a_t)] ∇_θ Q_t(s_t, a_t)
Gradient-descent Sarsa(λ) (backward view):
θ_{t+1} = θ_t + α δ_t e_t
where
δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
e_t = γλ e_{t−1} + ∇_θ Q_t(s_t, a_t)
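These update equations can be traced through one step with linear state-action features, where ∇_θ Q_t(s, a) is the feature vector of the pair. All numbers below are illustrative:

```python
import numpy as np

# One backward-view Sarsa(lambda) update with linear Q.
gamma, lam, alpha = 0.9, 0.8, 0.1

phi_sa = np.array([1.0, 0.0, 1.0])       # features of (s_t, a_t)
phi_sa_next = np.array([0.0, 1.0, 1.0])  # features of (s_{t+1}, a_{t+1})

theta = np.array([0.2, 0.4, -0.1])       # parameters of Q_t
e = np.zeros(3)                          # eligibility traces
r = 1.0                                  # r_{t+1}

q = theta @ phi_sa                       # Q_t(s_t, a_t)
q_next = theta @ phi_sa_next             # Q_t(s_{t+1}, a_{t+1})

delta = r + gamma * q_next - q           # delta_t
e = gamma * lam * e + phi_sa             # e_t = gamma*lam*e_{t-1} + grad Q
theta = theta + alpha * delta * e        # theta_{t+1} = theta_t + alpha*delta_t*e_t
```

Only the parameters whose features were recently active (nonzero trace) get credit for the TD error.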
GPI with Linear Gradient Descent Sarsa(λ)
GPI Linear Gradient Descent Watkins' Q(λ)
Mountain-Car Task Example
Challenge: driving an underpowered car up a steep mountain road
- Gravity is stronger than its engine
Solution approach: build up enough inertia on the other slope to carry the car up the opposite slope
Example of a task where things can get worse in a sense (farther from the goal) before they get better
- Hard to solve using classic control schemes
Reward is −1 for all steps until the episode terminates
Actions: full throttle forward (+1), full throttle reverse (−1), and zero throttle (0)
Two 9×9 overlapping tilings were used to represent the continuous state space
Mountain-Car Task
Mountain-Car Results (five 9×9 tilings were used)
Summary
- Generalization is an important RL attribute
- Adapting supervised-learning function approximation methods
  - Each backup is treated as a learning example
- Gradient-descent methods
- Linear gradient-descent methods
  - Radial basis functions
  - Tile coding
- Nonlinear gradient-descent methods? NN backpropagation?
- Subtleties involving function approximation, bootstrapping and the on-policy/off-policy distinction