ECE-517: Reinforcement Learning in Artificial Intelligence, Lecture 12: Generalization and Function Approximation




TRANSCRIPT

Slide 1: ECE-517: Reinforcement Learning in Artificial Intelligence. Lecture 12: Generalization and Function Approximation. Dr. Itamar Arel, College of Engineering, Department of Electrical Engineering and Computer Science, The University of Tennessee. Fall 2010, October 13, 2010.

Slide 2: Outline
- Introduction
- Value Prediction with function approximation
- Gradient Descent framework
- On-Line Gradient-Descent TD(λ)
- Linear methods
- Control with Function Approximation

Slide 3: Introduction
- We have so far assumed a tabular view of value or state-value functions, which inherently limits our problem space to small state/action sets:
  - Space requirements: storage of the values
  - Computation complexity: sweeping/updating the values
  - Communication constraints: getting the data where it needs to go
- Reality is very different: high-dimensional state representations are common.
- We will next look at generalization: an attempt by the agent to learn about a large state set while visiting/experiencing only a small subset of it.
- People do it; how can machines achieve the same goal?

Slide 4: General Approach
- Luckily, many approximation techniques have been developed, e.g. multivariate function approximation schemes.
- We will utilize such techniques in an RL context.

Slide 5: Value Prediction with FA
- As usual, let's start with prediction of V^π.
- Instead of using a table for V_t, the latter will be represented in a parameterized functional form with parameter (column) vector θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T.
- We'll assume that V_t is a sufficiently smooth, differentiable function of θ_t, for all s.
- For example, a neural network can be trained to predict V, where θ_t are the connection weights.
- We will require that the number of parameters n be much smaller than the number of states, so that when a single state is backed up, the change generalizes to affect the values of many other states.

Slide 6: Adapt Supervised Learning Algorithms
- A supervised learning system maps inputs to outputs.
- Training info = desired (target) outputs
- Error = (target output − actual output)
- Training example = {input, target output}

Slide 7: Performance Measures
- Let us assume that training examples all take the form (s, V^π(s)).
- A common performance metric is the mean-squared error (MSE) over a distribution P:
  MSE(θ_t) = Σ_s P(s) [V^π(s) − V_t(s)]²
- Q: Why use P? Is MSE the best metric?
- Let us assume that P is always the distribution of states at which backups are done.
- On-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.

Slide 8: Gradient Descent
- We iteratively move down the gradient:
  θ_{t+1} = θ_t − (1/2) α ∇_θ [V^π(s_t) − V_t(s_t)]² = θ_t + α [V^π(s_t) − V_t(s_t)] ∇_θ V_t(s_t)

Slide 9: Gradient Descent in RL
- Let's now consider the case where the target output, v_t, for sample t is not the true value (unavailable).
- In such cases we perform an approximate update:
  θ_{t+1} = θ_t + α [v_t − V_t(s_t)] ∇_θ V_t(s_t),
  where v_t is an unbiased estimate of the target output.
- Examples of v_t are: Monte Carlo methods, v_t = R_t; TD(λ), v_t = R_t^λ (the λ-return).
- The general gradient-descent method is guaranteed to converge to a local minimum.
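To make the slide 9 update concrete, here is a minimal sketch (not code from the lecture; the function name, the features callback, and the step-size value are illustrative) of gradient-descent value prediction with the Monte Carlo return as the target v_t. A linear-in-features V is used purely for concreteness, so ∇_θ V_t(s) is simply the feature vector.

```python
import numpy as np

def mc_gradient_value_prediction(episodes, features, n_params, alpha=0.01, gamma=1.0):
    """Gradient-descent value prediction with Monte Carlo targets (v_t = R_t).

    'episodes' is an iterable of episodes, each a list of (state, reward) pairs,
    where 'reward' is the reward received after leaving 'state'.
    'features(s)' maps a state to an n_params-dimensional vector; V is taken to be
    linear in these features only for concreteness, so grad_theta V(s) = features(s).
    """
    theta = np.zeros(n_params)
    for episode in episodes:
        # Compute the return R_t following each state visit (the unbiased target v_t).
        G = 0.0
        returns = []
        for _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # theta <- theta + alpha * [v_t - V_t(s_t)] * grad_theta V_t(s_t)
        for (s, _), v_t in zip(episode, returns):
            phi = features(s)
            theta += alpha * (v_t - theta @ phi) * phi
    return theta
```

If the true values V^π(s_t) were available instead of the sampled returns, the same loop would reduce to the exact gradient-descent update on slide 8.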
Slide 10: On-Line Gradient-Descent TD(λ)

Slide 11: Residual Gradient Descent
- The update above is not completely accurate, since it suggests that ∇_θ v_t = 0, which is not true when the target itself depends on θ, e.g. v_t = r_{t+1} + γ V_t(s_{t+1}).
- So we should be writing (residual GD):
  θ_{t+1} = θ_t + α [v_t − V_t(s_t)] [∇_θ V_t(s_t) − γ ∇_θ V_t(s_{t+1})]
- Comment: the whole scheme is no longer supervised-learning based!

Slide 12: Linear Methods
- One of the most important special cases of gradient-descent FA.
- V_t becomes a linear function of the parameter vector θ_t.
- For every state s, there is a (real-valued) column vector of features φ_s = (φ_s(1), …, φ_s(n))^T.
- The features can be constructed from the states in many ways.
- The linear approximate state-value function is given by
  V_t(s) = θ_t^T φ_s = Σ_i θ_t(i) φ_s(i)

Slide 13: Nice Properties of Linear FA Methods
- The gradient is very simple: ∇_θ V_t(s) = φ_s.
- For MSE, the error surface is simple: a quadratic surface with a single (global) minimum.
- Linear gradient-descent TD(λ) converges if the step size decreases appropriately and sampling is on-line (states sampled from the on-policy distribution).
- It converges to a parameter vector θ_∞ with the property
  MSE(θ_∞) ≤ [(1 − γλ)/(1 − γ)] MSE(θ*),
  where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997).

Slide 14: Limitations of Pure Linear Methods
- Many applications require a mixture (e.g. product) of the different feature components; the linear form prohibits direct representation of interactions between features.
- Intuition: feature i is good only in the absence of feature j.
- Example: the pole-balancing task. High angular velocity can be good or bad: if the angle is high, there is imminent danger of falling (bad state); if the angle is low, the pole is righting itself (good state).
- In such cases we need to introduce features that express a mixture of other features.

Slide 15: Coarse Coding (Feature Composition/Extraction)

Slide 16: Shaping Generalization in Coarse Coding
- If we train at one point (state), X, the parameters of all circles intersecting X will be affected.
- Consequence: the value function of all points within the union of the circles will be affected, with a greater effect for points that have more circles in common with X.

Slide 17: Learning and Coarse Coding
- All three cases have the same number of features (50); the learning rate is 0.2/m, where m is the number of features present in each example.

Slide 18: Tile Coding
- Binary feature for each tile.
- The number of features present at any one time is constant.
- Binary features mean the weighted sum is easy to compute (see the sketch below).
- It is easy to compute the indices of the features present.
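As a rough illustration of tile coding over a 2-D continuous state (my own sketch, not code from the lecture; the grid size, number of tilings, and offset scheme are arbitrary choices), a few offset grid tilings yield a constant number of active binary features, and the linear value function reduces to a sum of the weights of the active tiles:

```python
import numpy as np

def active_tiles(state, num_tilings=5, tiles_per_dim=9, low=0.0, high=1.0):
    """Indices of the tiles activated by a 2-D state under several offset grid
    tilings; exactly num_tilings binary features are active at any one time."""
    s = (np.asarray(state, dtype=float) - low) / (high - low)   # normalize to [0, 1]
    indices = []
    for t in range(num_tilings):
        offset = t / (num_tilings * tiles_per_dim)              # shift each tiling slightly
        coords = np.clip(np.floor((s + offset) * tiles_per_dim).astype(int),
                         0, tiles_per_dim - 1)
        # flatten (tiling, row, col) into a single feature index
        indices.append(t * tiles_per_dim ** 2 + coords[0] * tiles_per_dim + coords[1])
    return indices

def value(theta, state):
    """Linear value: with binary features the weighted sum reduces to a sum of
    the weights of the active tiles."""
    return sum(theta[i] for i in active_tiles(state))

theta = np.zeros(5 * 9 * 9)        # one weight per tile, across all tilings
print(value(theta, (0.3, 0.7)))    # 0.0 before any learning
```

Training at one state updates only the weights of its active tiles, so generalization is shaped by which tiles nearby states share, exactly as in the coarse-coding picture above.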
Slide 19: Tile Coding (cont.)
- Irregular tilings
- Hashing

Slide 20: Radial Basis Functions (RBFs)
- A natural extension of coarse coding to continuous-valued features, e.g. Gaussians:
  φ_s(i) = exp( −||s − c_i||² / (2σ_i²) )

Slide 21: Control with Function Approximation
- Learning state-action values, with training examples of the form (description of (s_t, a_t), v_t).
- The general gradient-descent rule:
  θ_{t+1} = θ_t + α [v_t − Q_t(s_t, a_t)] ∇_θ Q_t(s_t, a_t)
- Gradient-descent Sarsa(λ) (backward view):
  θ_{t+1} = θ_t + α δ_t e_t, where
  δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t) and
  e_t = γλ e_{t−1} + ∇_θ Q_t(s_t, a_t)
  (a code sketch appears after the summary below).

Slide 22: GPI with Linear Gradient-Descent Sarsa(λ)

Slide 23: GPI with Linear Gradient-Descent Watkins' Q(λ)

Slide 24: Mountain-Car Task Example
- Challenge: driving an underpowered car up a steep mountain road; gravity is stronger than its engine.
- Solution approach: build enough inertia from the other slope to carry the car up the opposite slope.
- An example of a task where things can get worse in a sense (farther from the goal) before they get better; hard to solve using classic control schemes.
- The reward is -1 for all steps until the episode terminates.
- Actions: full throttle forward (+1), full throttle reverse (-1), and zero throttle (0).
- Two 9×9 overlapping tilings were used to represent the continuous state space.

Slide 25: Mountain-Car Task

Slide 26: Mountain-Car Results (five 9×9 tilings were used)

Slide 27: Mountain Car with Radial Basis Functions

Slide 28: Tsitsiklis and Van Roy's Counterexample
- A counterexample to DP policy evaluation with least-squares linear function approximation.

Slide 29: Summary
- Generalization is an important RL attribute.
- Adapting supervised-learning function approximation methods: each backup is treated as a learning example.
- Gradient-descent methods; linear gradient-descent methods: radial basis functions, tile coding.
- Nonlinear gradient-descent methods? NN backpropagation?
- Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction.
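Returning to the control slides (21-26): the sketch below is one possible reading of gradient-descent Sarsa(λ) with binary (e.g. tile-coded) features, where the gradient of Q is 1 on the active features and 0 elsewhere. The environment interface (env.reset(), env.step(a) returning (next_state, reward, done)) and the feature routine active_tiles(state, action) are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

def sarsa_lambda(env, active_tiles, n_features, n_actions,
                 alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1, n_episodes=100):
    """Gradient-descent Sarsa(lambda), backward view, with binary features:
    Q(s, a) is the sum of theta over the active tiles of (s, a), so the
    gradient of Q is 1 on those components and 0 elsewhere."""
    theta = np.zeros(n_features)

    def q(s, a):
        return theta[active_tiles(s, a)].sum()

    def choose(s):
        if np.random.rand() < epsilon:                    # epsilon-greedy exploration
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        e = np.zeros(n_features)                          # eligibility traces
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            delta = r - q(s, a)                           # start TD error with r - Q(s, a)
            e[active_tiles(s, a)] += 1.0                  # accumulate traces on active features
            if not done:
                a_next = choose(s_next)
                delta += gamma * q(s_next, a_next)        # add the bootstrap term
            theta += alpha * delta * e                    # theta <- theta + alpha * delta_t * e_t
            e *= gamma * lam                              # e <- gamma * lambda * e
            if not done:
                s, a = s_next, a_next
    return theta
```

In practice the step size is often divided by the number of active features, in the spirit of the 0.2/m rule of thumb quoted on slide 17.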