Off-policy deep RL · 2018-07-06

TRANSCRIPT

Off-policy deep RL
Rémi Munos
DeepMind, Paris
How to do fundamental research in deep RL?

Ingredients of satisfying research:
1. The need
● Observe limitations of current approaches
● Identify the core problem
2. The idea
● Design an algorithm
3. The benefit
● Theoretical analysis in a simplified setting
● Improved numerical performance
Off-policy deep RL
● The need
○ Limitations of DQN and A3C
○ Off-policy, multi-step RL
● The idea
○ Truncated importance sampling while preserving the contraction property
○ The algorithm: Retrace
● The benefit
○ Convergence to the optimal policy in finite state spaces
○ Practical algorithms (ACER, Reactor, MPO, IMPALA)
Two desired properties of an RL algorithm:

● Off-policy learning
○ use memory replay
○ do exploration
○ tolerate a lag between acting and learning

● Multi-step learning
○ propagate rewards rapidly
○ avoid accumulation of approximation/estimation errors
○ allow learning from sequences (RNNs)

Example: Q-learning (and DQN) is off-policy but does not use multi-step returns; policy gradient (and A3C) uses returns but is on-policy.

Both properties are important in deep RL. Can we have both simultaneously?
Off-policy reinforcement learning
Behavior policy μ, target policy π.

Observe trajectory (x_0, a_0, r_0, x_1, a_1, r_1, ...)
where a_t ~ μ(·|x_t), r_t = r(x_t, a_t), and x_{t+1} ~ p(·|x_t, a_t).

Goal:
● Policy evaluation: estimate Q^π(x, a)
● Optimal control: estimate Q*(x, a)
Off-policy credit assignment problem
Behavior policy μ, target policy π.

Can we use the TD errors δ_t = r_t + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t) to estimate Q^π(x, a) for all (x, a)?
Importance sampling
Reweight the trace by the product of IS ratios: use trace coefficients c_s = π(a_s|x_s) / μ(a_s|x_s).
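As a concrete illustration of this reweighting, here is a minimal sketch of the full importance-sampling return estimate G = Σ_t γ^t (ρ_0 ... ρ_t) r_t (the function name `is_return` is illustrative, not from the talk; indexing conventions for the ratio product vary across presentations):

```python
def is_return(rewards, ratios, gamma=0.99):
    """Full importance-sampling return estimate: each reward r_t is
    reweighted by the product of IS ratios rho_s = pi(a_s|x_s)/mu(a_s|x_s)
    for all steps s up to and including t."""
    discount, weight, total = 1.0, 1.0, 0.0
    for r, rho in zip(rewards, ratios):
        weight *= rho                  # cumulative product of IS ratios
        total += discount * weight * r # discounted, reweighted reward
        discount *= gamma
    return total
```

The cumulative product `weight` is unbounded: a few large ratios multiply the entire tail of the trajectory, which is exactly the variance problem raised on the next slide.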
Importance sampling

This yields an unbiased estimate of Q^π(x, a), but with large (possibly infinite) variance. Not stable! [Precup, Sutton, Singh, 2000], [Mahmood, Yu, White, Sutton, 2015], ...
Q(λ) algorithm [Harutyunyan et al., 2016]

Cut the traces by a constant: c_s = λ.
Q(λ) algorithm [Harutyunyan et al., 2016]

Works if μ and π are close enough: λ < (1 − γ) / (γ ε), where ε = max_x ‖π(·|x) − μ(·|x)‖₁.

May not work otherwise. No guarantee!
Tree-backup TB(λ) algorithm [Precup, Sutton, Singh, 2000]

Reweight the traces by the product of target-policy probabilities: c_s = λ π(a_s|x_s).
Tree-backup TB(λ) algorithm [Precup, Sutton, Singh, 2000]

Works for arbitrary policies π and μ.

But it cuts the traces unnecessarily when near on-policy. Not efficient!
General off-policy return-based algorithm:

ΔQ(x, a) = Σ_{t≥0} γ^t (c_1 c_2 ... c_t) δ_t,   with δ_t = r_t + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t)

Algorithm | Trace coefficient c_s     | Problem
IS        | π(a_s|x_s) / μ(a_s|x_s)   | high variance
Q(λ)      | λ                         | no guarantee
TB(λ)     | λ π(a_s|x_s)              | not efficient
Off-policy policy evaluation:

Theorem 1: Assume a finite state space. Generate trajectories according to the behavior policy μ. Update all states along the trajectories according to:

Q(x_s, a_s) ← Q(x_s, a_s) + α Σ_{t≥s} γ^{t−s} (c_{s+1} ... c_t) δ_t

Assume all states are visited infinitely often. Under the usual stochastic-approximation assumptions, if 0 ≤ c_s ≤ π(a_s|x_s) / μ(a_s|x_s), then Q → Q^π a.s.

These are sufficient conditions for a safe algorithm (works for any π and μ).
Off-policy return-based operator

Lemma: Assume the traces satisfy 0 ≤ c_s ≤ π(a_s|x_s) / μ(a_s|x_s).

Then the off-policy return-based operator:

G Q(x, a) = Q(x, a) + E_μ [ Σ_{t≥0} γ^t (c_1 ... c_t) δ_t ]

is a contraction mapping (whatever π and μ) and Q^π is its fixed point.
Proof [part 1]

Expanding the operator and subtracting the fixed point, G Q − Q^π can be written as a linear combination of the errors (Q − Q^π), weighted by non-negative coefficients which sum to...
Proof [part 2]

The sum of the coefficients is at most γ (using c_s ≤ π(a_s|x_s) / μ(a_s|x_s)).

Thus ‖G Q − Q^π‖ ≤ γ ‖Q − Q^π‖.
Tradeoff for the trace coefficients

● Contraction coefficient of the expected operator:
○ γ when c_s = 0 (one-step Bellman update)
○ 0 when c_s = π(a_s|x_s)/μ(a_s|x_s) (full Monte-Carlo rollouts)

● Variance of the estimate (can be infinite for c_s = π(a_s|x_s)/μ(a_s|x_s)):
○ Large c_s: uses multi-step returns, but large variance
○ Small c_s: low variance, but does not use multi-step returns
Retrace(λ) [Munos et al., 2016]

Our recommendation: c_s = λ min(1, π(a_s|x_s) / μ(a_s|x_s))

Properties:
● Low variance, since c_s ≤ 1
● Safe (off-policy): cuts the traces when needed
● Efficient (on-policy): cuts the traces only when needed. Note that c_s = λ when π = μ (on-policy).
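To make the recipe concrete, here is a tabular sketch of one Retrace(λ) pass over a trajectory, combining the trace coefficient above with the Theorem-1-style update. All names (`retrace_update`, the dict-based `q`, `pi`, `mu`) are illustrative choices, not from the talk:

```python
def retrace_update(q, traj, pi, mu, gamma=0.99, lam=1.0, alpha=0.1):
    """One Retrace(lambda) pass over a trajectory of (state, action, reward).
    q:  dict (state, action) -> value estimate
    pi: dict state -> {action: target-policy probability}
    mu: dict state -> {action: behavior-policy probability}"""
    T = len(traj)
    # TD errors under the target policy:
    # delta_t = r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t)
    deltas = []
    for t, (x, a, r) in enumerate(traj):
        if t + 1 < T:
            x_next = traj[t + 1][0]
            v_next = sum(p * q[(x_next, b)] for b, p in pi[x_next].items())
        else:
            v_next = 0.0  # assume the trajectory ends at a terminal state
        deltas.append(r + gamma * v_next - q[(x, a)])
    # Retrace traces: c_s = lam * min(1, pi(a_s|x_s)/mu(a_s|x_s))
    for s in range(T):
        x_s, a_s, _ = traj[s]
        g, c, total = 1.0, 1.0, 0.0
        for t in range(s, T):
            if t > s:
                x_t, a_t, _ = traj[t]
                c *= lam * min(1.0, pi[x_t][a_t] / mu[x_t][a_t])
                g *= gamma
            total += g * c * deltas[t]
        q[(x_s, a_s)] += alpha * total
    return q
```

On-policy (π = μ) the min() clamp never fires, so c_s = λ and the full λ-return is used; off-policy the trace shrinks exactly where μ and π disagree.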
Summary

Algorithm  | Trace coefficient c_s              | Problem
IS         | π(a_s|x_s) / μ(a_s|x_s)            | high variance
Q(λ)       | λ                                  | not safe (off-policy)
TB(λ)      | λ π(a_s|x_s)                       | not efficient (on-policy)
Retrace(λ) | λ min(1, π(a_s|x_s) / μ(a_s|x_s))  | none!
Retrace(λ) for optimal control

Let (μ_k) and (π_k) be sequences of behavior and target policies.

Theorem 2: Under the previous assumptions (+ a technical assumption), assume the (π_k) are "increasingly greedy" w.r.t. the (Q_k). Then, a.s., Q_k → Q*.
Remarks
● If the (π_k) are greedy policies, we obtain convergence of Watkins's Q(λ) to Q*
(an open problem since 1989)

● "Increasingly greedy" allows for smoother traces, thus faster convergence

● The behavior policies μ_k do not need to become greedy w.r.t. Q_k → no GLIE assumption (Greedy in the Limit with Infinite Exploration)
(the first return-based algorithm converging to Q* without GLIE)
Theoretical guarantees of Retrace
Under the assumption of a finite state space:
● Convergence to the optimal policy
● Cuts the traces when - and only when - needed
● Adjusts the length of the backup to the "off-policy-ness" of the data

Should be useful in deep RL since it allows memory replay, exploration, distributed acting, and learning from sequences.

Now, does it work in practice?
Retrace for deep RL: several actor-critic architectures at DeepMind:

● ACER (Actor-Critic with Experience Replay) [Wang et al., 2017]. Policy gradient. Works for continuous actions.

● Reactor (Retrace-Actor) [Gruslys et al., 2018]. Uses β-LOO to update the policy. Uses an LSTM.

● MPO (Maximum a Posteriori Policy Optimization) [Abdolmaleki et al., 2018]. Soft (KL-regularized) policy improvement.

● IMPALA (IMPortance Weighted Actor-Learner Architecture) [Espeholt et al., 2018]. Heavily distributed agent. Uses V-trace.
Reactor [Gruslys et al., 2018]

[Architecture diagram: a recurrent network takes observations and predicts Q(x, a) and π(x, a); trajectories sampled from experience replay (behavior policy µ(x, a)) are used to evaluate and correct the critic and to optimize the policy.]
Reactor performance on Atari
Control suite with MPO

MPO (Maximum a Posteriori Policy Optimization) [Abdolmaleki et al., 2018] on the DeepMind Control Suite (a set of continuous control tasks intended to serve as performance benchmarks for RL agents).
See: https://www.youtube.com/watch?v=he_BPw32PwU
IMPALA [Espeholt et al., 2018]
IMPortance Weighted Actor-Learner Architecture
● Heavily distributed architecture: many actors (CPU), one (or more) learner (GPU)
● Actors generate trajectories and place them into a queue.
● The learner dequeues and performs parameter updates.

Stale experience → requires off-policy learning: V-trace.
[Architecture diagram: many actors send observations to the learner; the learner sends parameters back to the actors; learning uses V-trace advantage actor-critic.]
V-trace: off-policy algorithm using V-values

V-trace = modified version of Retrace where we learn V instead of Q.
● The V-trace corrected estimate for the value V(x_s) is:

v_s = V(x_s) + Σ_{t≥s} γ^{t−s} (c_s ... c_{t−1}) ρ_t (r_t + γ V(x_{t+1}) − V(x_t))

● where ρ_t = min(ρ̄, π(a_t|x_t)/μ(a_t|x_t)) and c_i = min(c̄, π(a_i|x_i)/μ(a_i|x_i)).
● Converges to the value of a policy in between μ and π (depending on ρ̄).
● The V-trace update for the value function moves V(x_s) towards the target v_s.
● The V-trace update for the policy follows the gradient ρ_s ∇ log π(a_s|x_s) (r_s + γ v_{s+1} − V(x_s)).
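The target computation can be sketched with the backward recursion v_s = V(x_s) + δ_s + γ c_s (v_{s+1} − V(x_{s+1})). A minimal NumPy sketch, assuming the clipped ρ_t and c_i share the same raw ratio and λ is omitted (`vtrace_targets` is an illustrative name, not the IMPALA implementation):

```python
import numpy as np

def vtrace_targets(values, rewards, ratios, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets v_s for one trajectory, computed backwards.
    values: length T+1 (includes the bootstrap value at the final state)
    rewards, ratios: length T, with ratios[t] = pi(a_t|x_t)/mu(a_t|x_t)."""
    T = len(rewards)
    vs = np.zeros(T)
    acc = 0.0  # carries v_{s+1} - V(x_{s+1}) backwards
    for s in reversed(range(T)):
        rho = min(rho_bar, ratios[s])  # clipped IS ratio for the TD error
        c = min(c_bar, ratios[s])      # clipped trace coefficient
        delta = rho * (rewards[s] + gamma * values[s + 1] - values[s])
        acc = delta + gamma * c * acc
        vs[s] = values[s] + acc
    return vs
```

On-policy with all ratios equal to 1, this reduces to the usual n-step bootstrapped return, matching the "V-trace = modified Retrace on V" description above.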
IMPALA Architectures
Small CNN-LSTM
Deep ResNet CNN-LSTM
DMLab-30 Task Set
● Set of 30 cognitive tasks in DeepMind Lab 3D environment.
● Many of the tasks are procedurally generated.
Task categories: grounded language, memory, outdoor foraging, navigation.
DMLab-30 Task Set
Individual task: one agent trained per task.

Multi-task: a single agent trained on all tasks simultaneously.
Performance of IMPALA on DMLab-30
● IMPALA outperforms A3C (10x more data efficient, 2x overall final performance)
● Positive transfer in multi-task training
IMPALA Videos - Mushroom foraging task
Mushroom foraging task. The agent must collect mushrooms within a naturalistic terrain environment to maximise score. The mushrooms do not regrow. The map is randomly generated. The spawn location is randomized for each episode. Foraging task
See: https://github.com/deepmind/lab/tree/master/game_scripts/levels/contributed/dmlab30
IMPALA Videos - Select Located Object task
The agent is asked to collect a specified coloured object in a specified coloured room. Example: “Pick the red object in the blue room.” Language task
IMPALA Videos - Object Locations
The agent must collect as many apples as possible before the episode ends to maximise their score. Upon collecting all of the apples, the level will reset, repeating until the episode ends. Apple locations, level layout and theme are randomized per episode. Navigation task
IMPALA Videos - Obstructed Goals
Agents are required to find the goal as fast as possible, but now with randomly opened and closed doors. Navigation task
IMPALA Videos - Keys Doors Puzzle
A procedural planning puzzle. The agent must reach the goal object, located in a position that is blocked by a series of coloured doors. Coloured keys can be used to open matching doors once. Collecting keys in the wrong order can make the goal unreachable. Requires planning
IMPALA Videos - Select Non-matching Object
The agent must choose an object that is different from the one it has seen before. The agent is placed into a first room containing an object and a teleport pad. Touching the pad teleports the agent to a second room containing two objects, one of which matches the object in the previous room. Requires memory
IMPALA Videos - Watermaze
The agent must find a hidden platform which, when found, generates a reward. This is difficult to find the first time, but in subsequent trials the agent should try to remember where it is and go straight back to this place. Tests episodic memory and navigation ability. Requires episodic memory
We need more fundamental research in deep RL!
Deep RL is a super exciting research topic.

We need new ideas, new algorithms, new theories!
Please join the fun!
References (used in the slides):
● [Pontryagin, 1956] See: Optimal Processes of Regulation (1960 English version)
● [Bellman, 1957] Dynamic Programming
● [Sutton, 1988] Learning to predict by the methods of temporal differences
● [Watkins, 1989] Learning From Delayed Rewards
● [Precup, Sutton, Singh, 2000] Eligibility traces for off-policy policy evaluation
● [Mnih et al., 2015] Human-level control through deep reinforcement learning
● [Mnih et al., 2016] Asynchronous Methods for Deep Reinforcement Learning
● [Mahmood, Yu, White, Sutton, 2015] Emphatic Temporal-Difference Learning
● [Harutyunyan et al., 2016] Q(λ) with Off-Policy Corrections
● [Munos et al., 2016] Safe and Efficient Off-Policy Reinforcement Learning
● [Wang et al., 2017] Sample Efficient Actor-Critic with Experience Replay
● [Gruslys et al., 2018] The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning
● [Abdolmaleki et al., 2018] Maximum a Posteriori Policy Optimization
● [Espeholt et al., 2018] IMPALA: Importance Weighted Actor-Learner Architecture