ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 14: Planning and Learning
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2015
October 27, 2015

Final projects - logistics

Projects can be done in groups of up to 3 students
Details on projects will be posted soon
Students are encouraged to propose a topic
Please email me your top three choices for a project along with a preferred date for your presentation
Presentation dates: Nov. 17, 19, 24 and Dec. 1 + additional time slot (TBD)
Format: 20 min presentation + 5 min Q&A
~5 min for background and motivation
~15 min for description of your work, results, conclusions
Written report due: Monday, Dec. 7
Format similar to project report

Final projects – sample topics

DQN – playing Atari games using RL
Tetris player using RL (and NN)
Curiosity-based TD learning*
Reinforcement Learning of Local Shape in the Game of Go
AIBO learning to walk
Study of value function definitions for TD learning
Imitation learning in RL

Outline

Introduction
Use of environment models
Integration of planning and learning methods

Introduction

Earlier we discussed Monte Carlo and temporal-difference methods as distinct alternatives
Then showed how they can be seamlessly integrated by using eligibility traces, as in TD(λ)
Planning methods: e.g. Dynamic Programming and heuristic search
Rely on knowledge of a model
Model – any information that helps the agent predict the way the environment will behave
Learning methods: Monte Carlo and Temporal Difference Learning
Do not require a model
Our goal: explore the extent to which the two methods can be intermixed

The original idea

The original idea (cont.)

Models

Model: anything the agent can use to predict how the environment will respond to its actions
Distribution models: provide a description of all possibilities (of next states and rewards) and their probabilities
e.g. Dynamic Programming
Example – the sum of a dozen dice: produce all possible sums and their probabilities of occurring
Sample models: produce just one sample experience
In our example – produce individual sums drawn according to this probability distribution
Both types of models can be used to produce (mimic) simulated experience
Often sample models are much easier to come by (a sketch of both follows)
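
To make the dice example concrete, here is a minimal sketch of the two kinds of model (the function names are mine): the distribution model enumerates every possible sum of twelve dice with its probability, while the sample model just draws one sum from that same distribution.

```python
import random

NUM_DICE = 12

def distribution_model():
    """Distribution model: every possible sum and its probability,
    built by repeatedly convolving in one die's distribution."""
    dist = {0: 1.0}
    for _ in range(NUM_DICE):
        new = {}
        for total, p in dist.items():
            for face in range(1, 7):
                new[total + face] = new.get(total + face, 0.0) + p / 6
        dist = new
    return dist

def sample_model():
    """Sample model: a single sum, drawn according to that distribution."""
    return sum(random.randint(1, 6) for _ in range(NUM_DICE))
```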

Planning

Planning: any computational process that uses a model to create or improve a policy
Planning in AI:
State-space planning (such as in RL) – search for a policy
Plan-space planning (e.g., partial-order planners)
e.g. evolutionary methods
We take the following (unusual) view:
All state-space planning methods involve computing value functions, either explicitly or implicitly
They all apply backups to simulated experience

Planning (cont.)

Classical DP methods are state-space planning methods
Heuristic search methods are state-space planning methods
Learning methods require only experience as input; in many cases they can be applied to simulated experience just as well as to real experience
Example: a planning method based on Q-learning:
Random-Sample One-Step Tabular Q-Planning
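
A minimal sketch of that method, assuming a `sample_model(s, a) -> (r, s_next)` plus lists of `states` and `actions` supplied by the problem (these names are mine): pick a state-action pair at random, query the sample model, and apply a one-step tabular Q-learning backup to the simulated transition.

```python
import random
from collections import defaultdict

def q_planning(states, actions, sample_model, alpha=0.1, gamma=0.95, n_updates=10_000):
    Q = defaultdict(float)                  # Q[(s, a)], initialized to zero
    for _ in range(n_updates):
        s = random.choice(states)           # 1. select a state-action pair at random
        a = random.choice(actions)
        r, s_next = sample_model(s, a)      # 2. obtain a sample reward and next state
        best_next = max(Q[(s_next, b)] for b in actions)
        # 3. one-step tabular Q-learning backup on the simulated experience
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```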

Learning, Planning, and Acting

Two uses of real experience:
Model learning: to improve the model
Direct RL: to directly improve the value function and policy
Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.
Q: What are the advantages/disadvantages of each?

Direct vs. Indirect RL

Indirect methods:
Make fuller use of experience: get a better policy with fewer environment interactions
Direct methods:
Simpler
Not affected by bad models
But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
Q: Which scheme do you think applies to humans?

The Dyna-Q Architecture (Sutton 1990)

The Dyna-Q Algorithm

[Algorithm box: each step combines direct RL (a Q-learning update on the real transition), model learning (updating the model from that transition), and planning (the random-sample one-step tabular Q-planning method applied to the model); sketched below]
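
A sketch of tabular Dyna-Q under the usual assumptions (a deterministic environment, so the model can simply store the last observed outcome of each state-action pair); the `env.reset()`/`env.step(s, a)` interface and the other names are mine, and episode termination/reset handling is omitted for brevity.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, alpha=0.1, gamma=0.95,
           epsilon=0.1, n_planning=5, n_steps=10_000):
    Q = defaultdict(float)
    model = {}                                 # (s, a) -> (r, s'), deterministic model
    s = env.reset()
    for _ in range(n_steps):
        # act: epsilon-greedy with respect to the current Q
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        r, s_next = env.step(s, a)
        # (a) direct RL: one-step Q-learning on the real transition
        best = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
        # (b) model learning: remember what the environment just did
        model[(s, a)] = (r, s_next)
        # (c) planning: n random-sample one-step tabular Q-planning backups
        for _ in range(n_planning):
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            pbest = max(Q[(ps_next, b)] for b in actions)
            Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
        s = s_next
    return Q
```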

Dyna-Q on a Simple Maze

Rewards are 0 until the goal is reached, when the reward is 1

Dyna-Q Snapshots: Midway in 2nd Episode

Recall that in a planning context …
Exploration – trying actions that improve the model
Exploitation – behaving in the optimal way given the current model
Balance between the two is always a key challenge!

Variations on the Dyna-Q agent

(Regular) Dyna-Q
Soft exploration/exploitation with constant rewards
Dyna-Q+
Encourages exploration of state-action pairs that have not been visited in a long time (in real interaction with the environment)
If n is the number of steps elapsed between two consecutive visits to (s,a), then the reward grows as a function of n
Dyna-AC
Actor-Critic learning rather than Q-learning

More on Dyna-Q+

Uses an “exploration bonus”:
Keeps track of the time since each state-action pair was tried for real
An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
The agent (indirectly) “plans” how to visit long-unvisited states
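
As a concrete sketch of the bonus (the constant κ, the variable names, and the hook into the planning loop are all my own choices): planning backups use the modeled reward plus a bonus that grows with the time since the pair was last tried for real; Sutton and Barto use a bonus of the form κ√τ.

```python
import math

KAPPA = 1e-3   # small bonus weight; a tunable constant

def bonus_reward(r, s, a, t, last_visit):
    """Modeled reward plus exploration bonus for a planning backup.

    t is the current (real) time step; last_visit maps (s, a) to the
    step at which it was last tried in real interaction.
    """
    tau = t - last_visit.get((s, a), 0)   # steps since (s, a) was tried for real
    return r + KAPPA * math.sqrt(tau)
```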

When the Model is Wrong: Blocking Maze

The maze example was oversimplified
In reality many things could go wrong:
The environment could be stochastic
The model can be imperfect (local minimum, stochasticity, or no convergence)
Partial experience could be misleading
When the model is incorrect, the planning process will compute a suboptimal policy
This is actually a learning opportunity
Discovery and correction of the modeling error

When the Model is Wrong: Blocking Maze (cont.)

The changed environment is harder

Shortcut Maze

The changed environment is easier

Prioritized Sweeping

In the Dyna agents presented, simulated transitions are started in uniformly chosen state-action pairs
Probably not optimal
Which states or state-action pairs should be generated during planning?
Work backwards from states whose values have just changed:
Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
When a new backup occurs, insert predecessors according to their priorities
Always perform backups from the first in the queue
Moore and Atkeson, 1993; Peng and Williams, 1993

Prioritized Sweeping (cont.)
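
The original slide shows the prioritized-sweeping algorithm box. A minimal sketch follows, assuming a deterministic model (s, a) → (r, s′) and a `predecessors` map of the state-action pairs known to lead into each state; all names here are mine.

```python
import heapq

def prioritized_sweeping_backups(Q, model, predecessors, s, a, actions,
                                 alpha=0.1, gamma=0.95, theta=1e-4, n_backups=5):
    """After a real transition from (s, a), run up to n_backups planning
    backups, always taking the highest-priority pair from the queue.
    Q is a defaultdict(float) as in the Dyna-Q sketch above."""
    pq = []                                      # max-priority queue via negated keys
    r, s_next = model[(s, a)]
    p = abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
    if p > theta:
        heapq.heappush(pq, (-p, (s, a)))
    for _ in range(n_backups):
        if not pq:
            break
        _, (bs, ba) = heapq.heappop(pq)          # first in queue = largest change
        br, bs_next = model[(bs, ba)]
        Q[(bs, ba)] += alpha * (br + gamma * max(Q[(bs_next, b)] for b in actions)
                                - Q[(bs, ba)])
        # insert predecessors whose values would change a lot if backed up
        for ps, pa in predecessors.get(bs, ()):
            pr, _ = model[(ps, pa)]
            pp = abs(pr + gamma * max(Q[(bs, b)] for b in actions) - Q[(ps, pa)])
            if pp > theta:
                heapq.heappush(pq, (-pp, (ps, pa)))
    return Q
```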

Prioritized Sweeping vs. Dyna-Q

Both use N = 5 backups per environmental interaction

Trajectory Sampling

Trajectory sampling: perform backups along simulated trajectories
This samples from the on-policy distribution
Distribution constructed from experience (visits)
Advantages when function approximation is used
Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored (see the sketch below)
[Figure: initial states, the region reachable under optimal control, and irrelevant states]
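
A sketch of the idea, with `sample_model`, `start_states`, and `is_terminal` as assumed names: rather than sweeping state-action pairs uniformly, simulate a trajectory from a start state under the current ε-greedy policy and back up only along it, so computation follows the on-policy distribution.

```python
import random

def trajectory_backup(Q, actions, sample_model, start_states, is_terminal,
                      alpha=0.1, gamma=0.95, epsilon=0.1, max_len=100):
    """One simulated trajectory of backups; Q is a defaultdict(float)."""
    s = random.choice(start_states)
    for _ in range(max_len):
        if is_terminal(s):
            break
        # follow the current epsilon-greedy policy in simulation
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        r, s_next = sample_model(s, a)
        best = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
        s = s_next                        # back up only along the trajectory
    return Q
```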

Summary

Discussed the close relationship between planning and learning
Important distinction between distribution models and sample models
Looked at some ways to integrate planning and learning
Synergy among planning, acting, and model learning
Distribution of backups: focus of the computation
Prioritized sweeping
Trajectory sampling: backup along trajectories