Source slides: rail.eecs.berkeley.edu/deeprlcourse-fa19/static/slides/lec-19.pdf
Exploration: Part 2
CS 285: Deep Reinforcement Learning, Decision Making, and Control
Sergey Levine
Class Notes
1. Homework 4 due today!
Recap: what’s the problem?
this is easy (mostly) vs. this is impossible
Why?
Recap: classes of exploration methods in deep RL
• Optimistic exploration:
  • new state = good state
  • requires estimating state visitation frequencies or novelty
  • typically realized by means of exploration bonuses
• Thompson sampling style algorithms:
  • learn distribution over Q-functions or policies
  • sample and act according to sample
• Information gain style algorithms:
  • reason about information gain from visiting new states
Posterior sampling in deep RL
Thompson sampling: What do we sample?
How do we represent the distribution?
since Q-learning is off-policy, we don’t care which Q-function was used to collect data
Bootstrap
Osband et al. “Deep Exploration via Bootstrapped DQN”
Why does this work?
Osband et al. “Deep Exploration via Bootstrapped DQN”
Exploring with random actions (e.g., epsilon-greedy): oscillate back and forth, might not go to a coherent or interesting place
Exploring with random Q-functions: commit to a randomized but internally consistent strategy for an entire episode
+ no change to original reward function
- very good bonuses often do better
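A minimal sketch of the bootstrapped-Q idea in PyTorch (module and variable names are mine, not Osband et al.'s code): K Q-heads share a trunk, each head is trained on its own bootstrap of the data, and the agent commits to one randomly sampled head per episode.

```python
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    """Shared trunk with K independent Q-heads, one per bootstrap."""
    def __init__(self, obs_dim, n_actions, n_heads=10, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(hidden, n_actions) for _ in range(n_heads)
        )

    def forward(self, obs, head):
        return self.heads[head](self.trunk(obs))

# Per episode: sample one head and act greedily with it throughout,
# which yields a randomized but internally consistent strategy:
#   head = random.randrange(len(qnet.heads))
#   action = qnet(obs, head).argmax(-1)
# Training: each transition gets a random bootstrap mask, and each head
# only receives gradients from transitions whose mask includes it.
```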
Reasoning about information gain (approximately)
Info gain: $\mathcal{IG}(z, y) = E_y\big[\mathcal{H}(p(z)) - \mathcal{H}(p(z \mid y))\big]$, the expected reduction in uncertainty about $z$ (e.g., the reward or the dynamics) from observing $y$
Generally intractable to use exactly, regardless of what is being estimated!
Reasoning about information gain (approximately)
A few approximations:
• prediction gain: $\log p_{\theta'}(s) - \log p_\theta(s)$ (Schmidhuber '91, Bellemare '16); intuition: if the density changed a lot, the state was novel (sketched below)
• variational information gain (Houthooft et al., "VIME")
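As a rough sketch of the prediction-gain bonus (the `density_model` interface here, `log_prob` and `update`, is hypothetical): fit the density model on the new state and measure how much its log-likelihood of that state changes.

```python
def prediction_gain_bonus(density_model, state):
    """Exploration bonus: log p_theta'(s) - log p_theta(s).

    `density_model` is any state-density estimator with a `log_prob`
    query and an `update` fitting step (hypothetical interface)."""
    log_p_before = density_model.log_prob(state)
    density_model.update(state)               # one fitting step on s
    log_p_after = density_model.log_prob(state)
    return log_p_after - log_p_before         # large when s was novel
```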
Reasoning about information gain (approximately)
VIME implementation:
Houthooft et al. “VIME”
+ appealing mathematical formalism
- models are more complex, generally harder to use effectively
Approximate IG: use $D_{\mathrm{KL}}\big(q(\theta; \phi')\,\|\,q(\theta; \phi)\big)$ as the exploration bonus, where $\phi'$ are the variational parameters after updating on the newest transition
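A sketch of the VIME-style bonus under a mean-field Gaussian variational posterior over the dynamics model's weights (the `(mean, log_std)` parameterization is an assumption for illustration, not VIME's exact implementation):

```python
import torch.distributions as D

def vime_bonus(phi, phi_new):
    """KL between updated and old posteriors over model weights.

    phi, phi_new: (mean, log_std) tensors of a factorized Gaussian
    q(theta; phi); the bonus is D_KL(q(.; phi') || q(.; phi))."""
    q_old = D.Normal(phi[0], phi[1].exp())
    q_new = D.Normal(phi_new[0], phi_new[1].exp())
    return D.kl_divergence(q_new, q_old).sum()   # sum over weights
```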
Exploration with model errors
Stadie et al. 2015:
• encode image observations using auto-encoder
• build predictive model on auto-encoder latent states
• use model error as exploration bonus
Schmidhuber et al. (see, e.g., "Formal Theory of Creativity, Fun, and Intrinsic Motivation"):
• exploration bonus for model error
• exploration bonus for model gradient
• many other variations
Many others!
(figure: example states labeled low novelty vs. high novelty)
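A sketch of the Stadie et al. recipe (the `encoder` and `dynamics` modules are hypothetical placeholders): predict the next latent state and use the prediction error as the bonus.

```python
import torch
import torch.nn.functional as F

def model_error_bonus(encoder, dynamics, obs, action, next_obs):
    """Bonus = error of a latent-space dynamics model: high error
    means the transition looks novel to the model."""
    with torch.no_grad():
        z, z_next = encoder(obs), encoder(next_obs)
        z_pred = dynamics(z, action)           # predicted next latent
    return F.mse_loss(z_pred, z_next)
```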
Recap: classes of exploration methods in deep RL
• Optimistic exploration:
  • Exploration with counts and pseudo-counts
  • Different models for estimating densities
• Thompson sampling style algorithms:
  • Maintain a distribution over models via bootstrapping
  • Distribution over Q-functions
• Information gain style algorithms:
  • Generally intractable
  • Can use variational approximation to information gain
Suggested readings
Schmidhuber. (1991). A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers.
Stadie, Levine, Abbeel (2015). Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models.
Osband, Blundell, Pritzel, Van Roy. (2016). Deep Exploration via Bootstrapped DQN.
Houthooft, Chen, Duan, Schulman, De Turck, Abbeel. (2016). VIME: Variational Information Maximizing Exploration.
Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos. (2016). Unifying Count-Based Exploration and Intrinsic Motivation.
Tang, Houthooft, Foote, Stooke, Chen, Duan, Schulman, De Turck, Abbeel. (2016). #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning.
Fu, Co-Reyes, Levine. (2017). EX2: Exploration with Exemplar Models for Deep Reinforcement Learning.
Break
Imitation vs. Reinforcement Learning

| imitation learning | reinforcement learning |
| --- | --- |
| Requires demonstrations | Requires reward function |
| Must address distributional shift | Must address exploration |
| Simple, stable supervised learning | Potentially non-convergent RL |
| Only as good as the demo | Can become arbitrarily good |

Can we get the best of both?
e.g., what if we have demonstrations and rewards?
Imitation Learning
(diagram: training data → supervised learning → policy)
Reinforcement Learning
Addressing distributional shift with RL?
(diagram, IRL-style loop: update reward r using samples & demos; use r to generate policy π; collect samples from π, which plays the role of the generator, and repeat)
Addressing distributional shift with RL?
IRL already addresses distributional shift via RL
this part is regular “forward” RL
But it doesn’t use a known reward function!
Simplest combination: pretrain & finetune
• Demonstrations can overcome exploration: show us how to do the task
• Reinforcement learning can improve beyond performance of the demonstrator
• Idea: initialize with imitation learning, then finetune with reinforcement learning!
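A minimal sketch of pretrain & finetune (every interface here, `policy.log_prob`, `collect_rollouts`, `policy_gradient_step`, is a hypothetical placeholder rather than a specific library's API):

```python
def pretrain_and_finetune(policy, optimizer, demo_loader, env,
                          collect_rollouts, policy_gradient_step,
                          bc_epochs=10, rl_iters=1000):
    """Stage 1: behavior cloning on demos. Stage 2: ordinary RL."""
    for _ in range(bc_epochs):
        for obs, act in demo_loader:
            loss = -policy.log_prob(obs, act).mean()   # maximize demo likelihood
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    for _ in range(rl_iters):
        trajs = collect_rollouts(policy, env)          # on-policy data
        policy_gradient_step(policy, trajs, optimizer) # any PG update
```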
Simplest combination: pretrain & finetune
Muelling et al. ‘13
Simplest combination: pretrain & finetune
Pretrain & finetune
vs. DAgger
What’s the problem?
Pretrain & finetune
can be very bad (due to distribution shift)
first batch of (very) bad data can destroy initialization
Can we avoid forgetting the demonstrations?
Off-policy reinforcement learning
• Off-policy RL can use any data
• If we let it use demonstrations as off-policy samples, can that mitigate the exploration challenges?
• Since demonstrations are provided as data in every iteration, they are never forgotten
• But the policy can still become better than the demos, since it is not forced to mimic them
off-policy policy gradient (with importance sampling)
off-policy Q-learning
Policy gradient with demonstrations
the idea: estimate the objective with importance sampling over a dataset $\mathcal{D}$ that includes demonstrations and experience,

$$J(\theta) = E_{\tau \sim q}\!\left[\frac{p_\theta(\tau)}{q(\tau)}\, r(\tau)\right] \approx \frac{1}{N} \sum_{\tau_i \in \mathcal{D}} \frac{p_\theta(\tau_i)}{q(\tau_i)}\, r(\tau_i)$$

Why is this a good idea? Don't we want on-policy samples? The variance-optimal importance sampling distribution is $q^*(\tau) \propto p_\theta(\tau)\,|r(\tau)|$, which concentrates probability on high-reward trajectories; good demonstrations are exactly such samples.
Policy gradient with demonstrations
How do we construct the sampling distribution? With data from many sources (demos, past policies), $q$ is a mixture whose exact normalization is awkward, so this works best with self-normalized importance sampling:

standard IS: $E_{p_\theta}[r] \approx \frac{1}{N} \sum_i \frac{p_\theta(\tau_i)}{q(\tau_i)}\, r(\tau_i)$

self-normalized IS: $E_{p_\theta}[r] \approx \frac{\sum_i w_i\, r(\tau_i)}{\sum_j w_j}, \qquad w_i = \frac{p_\theta(\tau_i)}{q(\tau_i)}$
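A small, self-contained sketch of the two estimators (NumPy; trajectory log-probabilities are assumed to be given):

```python
import numpy as np

def is_estimates(returns, logp_pi, logq):
    """Standard vs. self-normalized IS estimates of E_pi[R].

    returns: R(tau_i) for trajectories from the sampling mixture q
    (demos + past policies); logp_pi, logq: log-probs of each
    trajectory under the current policy and under q."""
    w = np.exp(logp_pi - logq)                   # importance weights
    standard = np.mean(w * returns)              # needs q normalized exactly
    self_norm = np.sum(w * returns) / np.sum(w)  # tolerates unnormalized q
    return standard, self_norm
```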
Example: importance sampling with demos
Levine, Koltun ’13. “Guided policy search”
Q-learning with demonstrations
• Q-learning is already off-policy, no need to bother with importance weights!
• Simple solution: drop demonstrations into the replay buffer
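A minimal sketch of that solution (a plain buffer; Vecerik et al. additionally keep the demos in a protected portion of the buffer and use prioritized sampling):

```python
import random
from collections import deque

def seed_buffer_with_demos(demos, capacity=100_000):
    """Replay buffer pre-filled with demo transitions (s, a, r, s', done);
    agent experience is appended later via buffer.append(...)."""
    buffer = deque(maxlen=capacity)
    buffer.extend(demos)   # note: a maxlen deque can eventually evict demos
    return buffer

def sample_batch(buffer, batch_size=256):
    return random.sample(list(buffer), batch_size)
```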
Q-learning with demonstrations
Vecerik et al., ‘17, “Leveraging Demonstrations for Deep Reinforcement Learning…”
What’s the problem?
Importance sampling: recipe for getting stuck
Q-learning: just good data is not enough
More problems with Q learning
(diagram: a fixed dataset of transitions, the "replay buffer", feeds off-policy Q-learning; see, e.g., Riedmiller, Neural Fitted Q-Iteration '05; Ernst et al., Tree-Based Batch Mode RL '05)

what action will this pick? The $\max_{a'}$ in the Bellman target can select actions never seen in the dataset, whose Q-values are likely to be erroneously large, and these errors compound through bootstrapping (a small illustration below).
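A toy illustration of the problem (made-up numbers): the estimated Q-values are accurate only for actions that appear in the data, but the target's max happily selects an unseen, overestimated action.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)                    # all 10 actions are equally mediocre
seen = np.array([0, 1, 2])               # only these actions are in the buffer
q_hat = true_q + rng.normal(0.0, 1.0, 10)             # large errors off-support
q_hat[seen] = true_q[seen] + rng.normal(0.0, 0.1, 3)  # small errors on-support

print(q_hat.argmax())   # the max tends to land on an unseen action whose
                        # value is pure noise; bootstrapping compounds this
```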
More problems with Q learning
See: Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction.
See also: Fujimoto, Meger, Precup. Off-Policy Deep Reinforcement Learning without Exploration.
(figure: learning curves comparing naïve RL against distribution matching, BCQ, on random data)

BEAR: only use values inside the support region (a support constraint), and be pessimistic w.r.t. epistemic uncertainty
So far…
• Pure imitation learning
  • Easy and stable supervised learning
  • Distributional shift
  • No chance to get better than the demonstrations
• Pure reinforcement learning
  • Unbiased reinforcement learning, can get arbitrarily good
  • Challenging exploration and optimization problem
• Initialize & finetune
  • Almost the best of both worlds
  • …but can forget demo initialization due to distributional shift
• Pure reinforcement learning, with demos as off-policy data
  • Unbiased reinforcement learning, can get arbitrarily good
  • Demonstrations don't always help
• Can we strike a compromise? A little bit of supervised, a little bit of RL?
Imitation as an auxiliary loss function

hybrid objective: $J(\theta) = E_{\pi_\theta}\!\big[\textstyle\sum_t r(s_t, a_t)\big]$ (or some variant of this) $\;+\; \lambda \sum_{(s,a) \in \mathcal{D}_{\text{demo}}} \log \pi_\theta(a \mid s)$ (or some variant of this)

need to be careful in choosing this weight $\lambda$
Example: hybrid policy gradient
(the objective combines the standard policy gradient with a term that increases demo likelihood)
Rajeswaran et al., ‘17, “Learning Complex Dexterous Manipulation…”
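A sketch of such a hybrid update (hypothetical interfaces; Rajeswaran et al.'s DAPG also decays the demo term's weight over training):

```python
def hybrid_pg_loss(policy, rollouts, demos, lam=0.1):
    """Standard policy gradient loss plus a weighted behavior-cloning
    term on the demonstrations."""
    obs, act, adv = rollouts                           # on-policy samples
    pg_loss = -(policy.log_prob(obs, act) * adv).mean()
    demo_obs, demo_act = demos
    bc_loss = -policy.log_prob(demo_obs, demo_act).mean()  # demo likelihood
    return pg_loss + lam * bc_loss                     # lam needs tuning
```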
Example: hybrid Q-learning
Hester et al., ‘17, “Learning from Demonstrations…”
loss terms: Q-learning loss, n-step Q-learning loss, and a regularization loss (because why not…)
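A sketch of a combined loss in this spirit (hypothetical interfaces; Hester et al.'s DQfD also includes a large-margin supervised term on demo actions, sketched here, plus prioritized replay):

```python
import torch
import torch.nn.functional as F

def hybrid_q_loss(q, q_target, batch, demo_batch,
                  gamma=0.99, lam_margin=1.0, lam_l2=1e-5, margin=0.8):
    """1-step TD loss + large-margin demo loss + L2 regularization;
    an n-step TD term would be added analogously from n-step returns."""
    s, a, r, s2, done = batch
    with torch.no_grad():                              # Bellman target
        y = r + gamma * (1 - done) * q_target(s2).max(-1).values
    q_sa = q(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)
    td_loss = F.smooth_l1_loss(q_sa, y)

    demo_s, demo_a = demo_batch                        # push demo actions up
    qd = q(demo_s)
    margins = torch.full_like(qd, margin)
    margins.scatter_(-1, demo_a.unsqueeze(-1), 0.0)    # no margin on demo action
    margin_loss = ((qd + margins).max(-1).values
                   - qd.gather(-1, demo_a.unsqueeze(-1)).squeeze(-1)).mean()

    l2 = sum((p ** 2).sum() for p in q.parameters())   # "because why not"
    return td_loss + lam_margin * margin_loss + lam_l2 * l2
```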
What’s the problem?
• Need to tune the weight
• The design of the objective, esp. for imitation, takes a lot of care
• Algorithm becomes problem-dependent
• Pure imitation learning
  • Easy and stable supervised learning
  • Distributional shift
  • No chance to get better than the demonstrations
• Pure reinforcement learning
  • Unbiased reinforcement learning, can get arbitrarily good
  • Challenging exploration and optimization problem
• Initialize & finetune
  • Almost the best of both worlds
  • …but can forget demo initialization due to distributional shift
• Pure reinforcement learning, with demos as off-policy data
  • Unbiased reinforcement learning, can get arbitrarily good
  • Demonstrations don't always help
• Hybrid objective, imitation as an "auxiliary loss"
  • Like initialization & finetuning, almost the best of both worlds
  • No forgetting
  • But no longer pure RL, may be biased, may require lots of tuning