ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 14: Planning and Learning
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2015
October 27, 2015

Final projects - logistics

Projects can be done in groups of up to 3 students
Details on projects will be posted soon
Students are encouraged to propose a topic
Please email me your top three choices for a project along with a preferred date for your presentation
Presentation dates: Nov. 17, 19, 24 and Dec. 1 + additional time slot (TBD)
Format: 20 min presentation + 5 min Q&A
~5 min for background and motivation
~15 min for description of your work, results, conclusions
Written report due: Monday, Dec. 7
Format similar to project report

Final projects – sample topics

DQN – playing Atari games using RL
Tetris player using RL (and NN)
Curiosity-based TD learning*
Reinforcement Learning of Local Shape in the Game of Go
AIBO learning to walk
Study of value function definitions for TD learning
Imitation learning in RL

Outline

Introduction
Use of environment models
Integration of planning and learning methods

Introduction

Earlier we discussed Monte Carlo and temporal-difference methods as distinct alternatives
Then showed how they can be seamlessly integrated by using eligibility traces, as in TD(λ)
Planning methods: e.g. Dynamic Programming and heuristic search
Rely on knowledge of a model
Model – any information that helps the agent predict the way the environment will behave
Learning methods: Monte Carlo and Temporal Difference Learning
Do not require a model
Our goal: explore the extent to which the two methods can be intermixed

The original idea

The original idea (cont.)

Models

Model: anything the agent can use to predict how the environment will respond to its actions
Distribution models: provide a description of all possibilities (of next states and rewards) and their probabilities
e.g. Dynamic Programming
Example – the sum of a dozen dice: produce all possible sums and their probabilities of occurring
Sample models: produce just one sample experience
In our example – produce individual sums drawn according to this probability distribution
Both types of models can be used to produce (mimic) simulated experience
Often sample models are much easier to come by (a sketch of both follows)
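
To make the dice example concrete, here is a minimal sketch of the two kinds of model (the function names are mine): the distribution model enumerates every possible sum of twelve dice with its probability, while the sample model just draws one sum from that same distribution.

```python
import random

NUM_DICE = 12

def distribution_model():
    """Distribution model: every possible sum and its probability,
    built by repeatedly convolving in one die's distribution."""
    dist = {0: 1.0}
    for _ in range(NUM_DICE):
        new = {}
        for total, p in dist.items():
            for face in range(1, 7):
                new[total + face] = new.get(total + face, 0.0) + p / 6
        dist = new
    return dist

def sample_model():
    """Sample model: a single sum, drawn according to that distribution."""
    return sum(random.randint(1, 6) for _ in range(NUM_DICE))
```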

Planning

Planning: any computational process that uses a model to create or improve a policy
Planning in AI:
State-space planning (such as in RL) – search for a policy
Plan-space planning (e.g., partial-order planners)
e.g. evolutionary methods
We take the following (unusual) view:
All state-space planning methods involve computing value functions, either explicitly or implicitly
They all apply backups to simulated experience

Planning (cont.)

Classical DP methods are state-space planning methods
Heuristic search methods are state-space planning methods
Learning methods require only experience as input; in many cases they can be applied to simulated experience just as well as to real experience
Example: a planning method based on Q-learning:
Random-Sample One-Step Tabular Q-Planning
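
A minimal sketch of that method, assuming a `sample_model(s, a) -> (r, s_next)` plus lists of `states` and `actions` supplied by the problem (these names are mine): pick a state-action pair at random, query the sample model, and apply a one-step tabular Q-learning backup to the simulated transition.

```python
import random
from collections import defaultdict

def q_planning(states, actions, sample_model, alpha=0.1, gamma=0.95, n_updates=10_000):
    Q = defaultdict(float)                  # Q[(s, a)], initialized to zero
    for _ in range(n_updates):
        s = random.choice(states)           # 1. select a state-action pair at random
        a = random.choice(actions)
        r, s_next = sample_model(s, a)      # 2. obtain a sample reward and next state
        best_next = max(Q[(s_next, b)] for b in actions)
        # 3. one-step tabular Q-learning backup on the simulated experience
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```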

Learning, Planning, and Acting

Two uses of real experience:
Model learning: to improve the model
Direct RL: to directly improve the value function and policy
Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.
Q: What are the advantages/disadvantages of each?

Direct vs. Indirect RL

Indirect methods:
Make fuller use of experience: get a better policy with fewer environment interactions
Direct methods:
Simpler
Not affected by bad models
But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
Q: Which scheme do you think applies to humans?

The Dyna-Q Architecture (Sutton 1990)

The Dyna-Q Algorithm

[Algorithm box: each step combines direct RL (a Q-learning update on the real transition), model learning (updating the model from that transition), and planning (the random-sample one-step tabular Q-planning method applied to the model); sketched below]
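
A sketch of tabular Dyna-Q under the usual assumptions (a deterministic environment, so the model can simply store the last observed outcome of each state-action pair); the `env.reset()`/`env.step(s, a)` interface and the other names are mine, and episode termination/reset handling is omitted for brevity.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, alpha=0.1, gamma=0.95,
           epsilon=0.1, n_planning=5, n_steps=10_000):
    Q = defaultdict(float)
    model = {}                                 # (s, a) -> (r, s'), deterministic model
    s = env.reset()
    for _ in range(n_steps):
        # act: epsilon-greedy with respect to the current Q
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        r, s_next = env.step(s, a)
        # (a) direct RL: one-step Q-learning on the real transition
        best = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
        # (b) model learning: remember what the environment just did
        model[(s, a)] = (r, s_next)
        # (c) planning: n random-sample one-step tabular Q-planning backups
        for _ in range(n_planning):
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            pbest = max(Q[(ps_next, b)] for b in actions)
            Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
        s = s_next
    return Q
```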

Dyna-Q on a Simple Maze

Rewards are 0 until the goal is reached, when the reward is 1

Dyna-Q Snapshots: Midway in 2nd Episode

Recall that in a planning context …
Exploration – trying actions that improve the model
Exploitation – behaving in the optimal way given the current model
Balance between the two is always a key challenge!

Variations on the Dyna-Q agent

(Regular) Dyna-Q
Soft exploration/exploitation with constant rewards
Dyna-Q+
Encourages exploration of state-action pairs that have not been visited in a long time (in real interaction with the environment)
If n is the number of steps elapsed between two consecutive visits to (s,a), then the reward grows as a function of n
Dyna-AC
Actor-Critic learning rather than Q-learning

More on Dyna-Q+

Uses an “exploration bonus”:
Keeps track of the time since each state-action pair was tried for real
An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
The agent (indirectly) “plans” how to visit long-unvisited states
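
As a concrete sketch of the bonus (the constant κ, the variable names, and the hook into the planning loop are all my own choices): planning backups use the modeled reward plus a bonus that grows with the time since the pair was last tried for real; Sutton and Barto use a bonus of the form κ√τ.

```python
import math

KAPPA = 1e-3   # small bonus weight; a tunable constant

def bonus_reward(r, s, a, t, last_visit):
    """Modeled reward plus exploration bonus for a planning backup.

    t is the current (real) time step; last_visit maps (s, a) to the
    step at which it was last tried in real interaction.
    """
    tau = t - last_visit.get((s, a), 0)   # steps since (s, a) was tried for real
    return r + KAPPA * math.sqrt(tau)
```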

When the Model is Wrong: Blocking Maze

The maze example was oversimplified
In reality many things could go wrong:
The environment could be stochastic
The model can be imperfect (local minimum, stochasticity, or no convergence)
Partial experience could be misleading
When the model is incorrect, the planning process will compute a suboptimal policy
This is actually a learning opportunity
Discovery and correction of the modeling error

When the Model is Wrong: Blocking Maze (cont.)

The changed environment is harder

Shortcut Maze

The changed environment is easier

Prioritized Sweeping

In the Dyna agents presented, simulated transitions are started in uniformly chosen state-action pairs
Probably not optimal
Which states or state-action pairs should be generated during planning?
Work backwards from states whose values have just changed:
Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
When a new backup occurs, insert predecessors according to their priorities
Always perform backups from the first in the queue
Moore and Atkeson, 1993; Peng and Williams, 1993

Prioritized Sweeping (cont.)
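
The original slide shows the prioritized-sweeping algorithm box. A minimal sketch follows, assuming a deterministic model (s, a) → (r, s′) and a `predecessors` map of the state-action pairs known to lead into each state; all names here are mine.

```python
import heapq

def prioritized_sweeping_backups(Q, model, predecessors, s, a, actions,
                                 alpha=0.1, gamma=0.95, theta=1e-4, n_backups=5):
    """After a real transition from (s, a), run up to n_backups planning
    backups, always taking the highest-priority pair from the queue.
    Q is a defaultdict(float) as in the Dyna-Q sketch above."""
    pq = []                                      # max-priority queue via negated keys
    r, s_next = model[(s, a)]
    p = abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
    if p > theta:
        heapq.heappush(pq, (-p, (s, a)))
    for _ in range(n_backups):
        if not pq:
            break
        _, (bs, ba) = heapq.heappop(pq)          # first in queue = largest change
        br, bs_next = model[(bs, ba)]
        Q[(bs, ba)] += alpha * (br + gamma * max(Q[(bs_next, b)] for b in actions)
                                - Q[(bs, ba)])
        # insert predecessors whose values would change a lot if backed up
        for ps, pa in predecessors.get(bs, ()):
            pr, _ = model[(ps, pa)]
            pp = abs(pr + gamma * max(Q[(bs, b)] for b in actions) - Q[(ps, pa)])
            if pp > theta:
                heapq.heappush(pq, (-pp, (ps, pa)))
    return Q
```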

Prioritized Sweeping vs. Dyna-Q

Both use N = 5 backups per environmental interaction

Trajectory Sampling

Trajectory sampling: perform backups along simulated trajectories
This samples from the on-policy distribution
Distribution constructed from experience (visits)
Advantages when function approximation is used
Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored (see the sketch below)
[Figure: initial states, the region reachable under optimal control, and irrelevant states]
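
A sketch of the idea, with `sample_model`, `start_states`, and `is_terminal` as assumed names: rather than sweeping state-action pairs uniformly, simulate a trajectory from a start state under the current ε-greedy policy and back up only along it, so computation follows the on-policy distribution.

```python
import random

def trajectory_backup(Q, actions, sample_model, start_states, is_terminal,
                      alpha=0.1, gamma=0.95, epsilon=0.1, max_len=100):
    """One simulated trajectory of backups; Q is a defaultdict(float)."""
    s = random.choice(start_states)
    for _ in range(max_len):
        if is_terminal(s):
            break
        # follow the current epsilon-greedy policy in simulation
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        r, s_next = sample_model(s, a)
        best = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
        s = s_next                        # back up only along the trajectory
    return Q
```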

Summary

Discussed the close relationship between planning and learning
Important distinction between distribution models and sample models
Looked at some ways to integrate planning and learning
Synergy among planning, acting, and model learning
Distribution of backups: focus of the computation
Prioritized sweeping
Trajectory sampling: backup along trajectories