
Multi-task Deep Reinforcement Learning with PopArt

Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, Hado van Hasselt

Abstract

The reinforcement learning community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at a time, each new task requiring a brand new agent instance to be trained. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential-decision tasks at once. A general issue in multi-task learning is that a balance must be found between the needs of multiple tasks competing for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to solve. Such tasks appear more salient to the learning process, for instance because of the density or magnitude of the in-task rewards. This causes the algorithm to focus on those salient tasks at the expense of generality. We propose to automatically adapt the contribution of each task to the agent's updates, so that all tasks have a similar impact on the learning dynamics. This resulted in state-of-the-art performance on learning to play all games in a set of 57 diverse Atari games. Excitingly, our method learned a single trained policy, with a single set of weights, that exceeds median human performance. To our knowledge, this was the first time a single agent surpassed human-level performance on this multi-task domain. The same approach also demonstrated state-of-the-art performance on a set of 30 tasks in the 3D reinforcement learning platform DeepMind Lab.

Introduction

In recent years, the field of deep reinforcement learning (RL) has enjoyed many successes. Deep RL agents have been applied to board games such as Go (Silver et al. 2016) and chess (Silver et al. 2017), continuous control (Lillicrap et al. 2016; Duan et al. 2016), classic video-games such as Atari (Mnih et al. 2015; Hessel et al. 2018; Gruslys et al. 2018; Schulman et al. 2015; Schulman et al. 2017; Bacon, Harb, and Precup 2017), and 3D first person environments (Mnih et al. 2016; Jaderberg et al. 2016). While the results are impressive, they were achieved on one task at a time, each task requiring a new agent instance to be trained from scratch.

Multi-task and transfer learning remain important open problems in deep RL. There are at least four different strains of multi-task reinforcement learning that have been explored in the literature: off-policy learning of many predictions about the same stream of experience (Schmidhuber 1990; Sutton et al. 2011; Jaderberg et al. 2016), continual learning in a sequence of tasks (Ring 1994; Thrun 1996; Thrun 2012; Rusu et al. 2016), distillation of task-specific experts into a single shared model (Parisotto, Ba, and Salakhutdinov 2015; Rusu et al. 2015; Schmitt et al. 2018; Teh et al. 2017), and parallel learning of multiple tasks at once (Sharma and Ravindran 2017; Caruana 1998). We will focus on the latter.

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Parallel multi-task learning has recently achieved remarkable success in enabling a single system to learn a large number of diverse tasks. The Importance Weighted Actor-Learner Architecture, henceforth IMPALA (Espeholt et al. 2018), achieved a 59.7% median human normalised score across 57 Atari games, and a 49.4% mean human normalised score across 30 DeepMind Lab levels. These results are state of the art for multi-task RL, but they are far from the human-level performance demonstrated by deep RL agents on the same domains, when trained on each task individually.

Part of why multi-task learning is much harder than single-task learning is that a balance must be found between the needs of multiple tasks that compete for the limited resources of a single learning system (for instance, for its limited representation capacity). We observed that the naive transposition of common RL algorithms to the multi-task setting may not perform well in this respect. More specifically, the saliency of a task for the agent increases with the scale of the returns observed in that task, and these may differ arbitrarily across tasks. This affects value-based algorithms such as Q-learning (Watkins 1989), as well as policy-based algorithms such as REINFORCE (Williams 1992).

The problem of scaling individual rewards appropriately is not novel, and has often been addressed through reward clipping (Mnih et al. 2015). This heuristic changes the agent's objective, e.g., if all rewards are non-negative the algorithm optimises the frequency of rewards rather than their cumulative sum. If the two objectives are sufficiently well aligned, clipping can be effective. However, the scale of returns also depends on the rewards' sparsity. This implies that, even with reward clipping, in a multi-task setting the magnitude of updates can still differ significantly between tasks, causing some tasks to have a larger impact on the learning dynamics than other equally important ones.



Note that both the sparsity and the magnitude of rewards collected in an environment are inherently non-stationary, because the agent is learning to actively maximise the total amount of rewards it can collect. These non-stationary learning dynamics make it impossible to normalise the learning updates a priori, even if we would be willing to pour significant domain knowledge into the design of the algorithm.

To summarise, in IMPALA the magnitude of updates resulting from experience gathered in each environment depends on: 1) the scale of rewards, 2) the sparsity of rewards, 3) the competence of the agent. In this paper we use PopArt normalisation (van Hasselt et al. 2016) to derive an actor-critic update invariant to these factors, enabling large performance improvements in parallel multi-task agents. We demonstrated this on the Atari-57 benchmark, where a single agent achieved a median normalised score of 110%, and on DmLab-30, where it achieved a mean score of 72.8%.

Background

Reinforcement learning (RL) is a framework for learning and decision-making under uncertainty (Sutton and Barto 2018). A learning system, the agent, must learn to interact with the environment it is embedded in, so as to maximise a scalar reward signal. The RL problem is often formalised as a Markov decision process (Bellman 1957): a tuple $(\mathcal{S}, \mathcal{A}, p, \gamma)$, where $\mathcal{S}, \mathcal{A}$ are finite sets of states and actions, $p$ denotes the dynamics, such that $p(r, s' \mid s, a)$ is the probability of observing reward $r$ and state $s'$ when executing action $a$ in state $s$, and $\gamma \in [0, 1]$ discounts future rewards. The policy maps states $s \in \mathcal{S}$ to probability distributions over actions $\pi(A \mid S = s)$, thus specifying the behaviour of the agent. The return $G_t = R_{t+1} + \gamma R_{t+2} + \dots$ is the $\gamma$-discounted sum of rewards collected by an agent from state $S_t$ onward under policy $\pi$. We define action values and state values as $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ and $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, respectively. The agent's objective is to find a policy to maximise such values.

In multi-task reinforcement learning, a single agent must learn to master $N$ different environments $\mathcal{T} = \{D_i = (\mathcal{S}_i, \mathcal{A}_i, p_i, \gamma)\}_{i=1}^{N}$, each with its own distinct dynamics (Brunskill and Li 2013). Particularly interesting is the case in which the action space and transition dynamics are at least partially shared. For instance, the environments might follow the same physical rules, while the set of interconnected states and obtainable rewards differ. We may formalise this as a single larger MDP, whose state space is $\mathcal{S} = \{\{(s_j, i)\}_{s_j \in \mathcal{S}_i}\}_{i=1}^{N}$. The task index $i$ may be latent, or may be exposed to the agent's policy. In this paper, we use the task index at training time, for the value estimates used to compute the policy updates, but not at testing time: our algorithm will return a single general policy $\pi(A \mid S)$ which is a function only of the individual environment's state $S$ and is not conditioned directly on the task index $i$. This is more challenging than the standard multi-task learning setup, which typically allows conditioning the model on the task index even at evaluation (Romera-Paredes et al. 2013; Collobert and Weston 2008), because our agents will need to infer what task to solve purely from the stream of raw observations and/or early rewards in the episode.

Actor-critic

In our experiments, we use an actor-critic algorithm to learn a policy $\pi_\eta(A|S)$ and a value estimate $v_\theta(s)$, which are both outputs of a deep neural network. We update the agent's policy using the REINFORCE-style stochastic gradient $(G_t - v_\theta(S_t)) \nabla_\eta \log \pi(A_t|S_t)$ (Williams 1992), where $v_\theta(S_t)$ is used as a baseline to reduce variance. In addition we use a multi-step return $G^v_t$ that bootstraps on the value estimates after a limited number of transitions, both to reduce variance further and to allow us to update the policy before $G_t$ fully resolves at the end of an episode. The value function $v_\theta(S)$ is instead updated to minimise the squared loss with respect to the (truncated and bootstrapped) return:

$\Delta\theta \propto -\nabla_\theta \big(G^v_t - v_\theta(S_t)\big)^2 = \big(G^v_t - v_\theta(S_t)\big)\,\nabla_\theta v_\theta(S_t)$ ,   (1)

$\Delta\eta \propto \big(G^\pi_t - v_\theta(S_t)\big)\,\nabla_\eta \log \pi_\eta(A_t|S_t)$ ,   (2)

where $G^v_t$ and $G^\pi_t$ are stochastic estimates of $v_\pi(S_t)$ and $q_\pi(S_t, A_t)$, respectively. Note how both updates depend linearly on the scale of returns, which, as previously argued, depends on the scale and sparsity of rewards, and on the agent's competence.
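To make the two updates concrete, the following minimal sketch (NumPy, with a linear value function and a softmax policy over state features; all names are illustrative, not the agents' actual implementation) applies Equations 1 and 2 to a single transition:

    import numpy as np

    def actor_critic_update(phi, a, G_v, G_pi, theta, eta, lr_v=1e-2, lr_pi=1e-2):
        # phi: feature vector for S_t; a: action index A_t
        # theta: weights of the linear baseline v_theta(s) = theta . phi
        # eta: softmax policy weights, shape [num_actions, num_features]
        v = theta @ phi                               # baseline v_theta(S_t)
        theta = theta + lr_v * (G_v - v) * phi        # Eq. (1): (G_t^v - v) * grad_theta v
        logits = eta @ phi
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad_log_pi = -np.outer(probs, phi)           # grad_eta log pi(a|s) for a softmax policy
        grad_log_pi[a] += phi
        eta = eta + lr_pi * (G_pi - v) * grad_log_pi  # Eq. (2): REINFORCE with baseline
        return theta, eta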

Efficient multi-task learning in simulation

We use the IMPALA agent architecture (Espeholt et al. 2018), proposed for reinforcement learning in simulated environments. In IMPALA the agent is distributed across multiple threads, processes or machines. Several actors run on CPU, generating rollouts of experience, consisting of a fixed number of interactions (100 in our experiments) with their own copy of the environment, and then enqueue the rollouts in a shared queue. Actors receive the latest copy of the network's parameters from the learner before each rollout. A single GPU learner processes rollouts from all actors, in batches, and updates a deep network. The network is a deep convolutional ResNet (He et al. 2015), followed by an LSTM recurrent layer (Hochreiter and Schmidhuber 1997). Policy and values are all linear functions of the LSTM's output.

Despite the large network used for estimating the policy $\pi_\eta$ and values $v_\theta$, the decoupled nature of the agent enables data to be processed very efficiently: on the order of hundreds of thousands of frames per second (Espeholt et al. 2018). The setup easily supports the multi-task setting by simply assigning different environments to each of the actors and then running the single policy $\pi(A|S)$ on each of them. The data in the queue can also be easily labelled with the task id, if useful at training time. Note that an efficient implementation of IMPALA is available open-source¹, and that, while we use this agent for our experiments, our approach can be applied to other data-parallel multi-task agents (e.g. A3C).
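The actor-learner split above can be illustrated with a schematic Python sketch; the env, policy_step, get_latest_params and update_fn objects are hypothetical stand-ins for the real components:

    import queue

    ROLLOUT_LENGTH = 100                        # interactions per rollout, as in the experiments
    rollout_queue = queue.Queue(maxsize=64)     # shared queue between actors and the learner

    def actor_loop(env, get_latest_params, policy_step):
        # Each actor owns one environment and syncs parameters with the learner before each rollout.
        observation = env.reset()
        while True:
            params = get_latest_params()
            rollout = []
            for _ in range(ROLLOUT_LENGTH):
                action, logits = policy_step(params, observation)
                next_observation, reward, done = env.step(action)
                rollout.append((observation, action, reward, logits, env.task_id))
                observation = env.reset() if done else next_observation
            rollout_queue.put(rollout)          # rollouts carry the task id of the actor's environment

    def learner_loop(batch_size, update_fn):
        # The single GPU learner batches rollouts from all actors and applies one update per batch.
        while True:
            batch = [rollout_queue.get() for _ in range(batch_size)]
            update_fn(batch)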

Off-policy corrections

Because we use a distributed queue-based learning setup, the data consumed by the learning algorithm might be slightly off-policy, as the policy parameters change between acting and learning. We can use importance sampling corrections $\rho_t = \pi(A_t|S_t) / \mu(A_t|S_t)$ to compensate for this (Precup, Sutton, and Singh 2000).

¹ www.github.com/deepmind/scalable_agent


In particular, we can write the n-step return as $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n v(S_{t+n}) = v(S_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t} \delta_k$, where $\delta_t = R_{t+1} + \gamma v(S_{t+1}) - v(S_t)$, and then apply appropriate importance sampling corrections to each error term (Sutton et al. 2014) to get $G_t = v(S_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t} \big(\prod_{i=t}^{k} \rho_i\big) \delta_k$. This is unbiased, but has high variance. To reduce variance, we can further clip most of the importance-sampling ratios, e.g., as $c_t = \min(1, \rho_t)$. This leads to the v-trace return (Espeholt et al. 2018)

$G^{\text{v-trace}}_t = v(S_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t} \left( \prod_{i=t}^{k} c_i \right) \delta_k$   (3)

A very similar target was proposed for the ABQ(ζ) algorithm (Mahmood 2017), where the product $\rho_t \lambda_t$ was considered and the trace parameter $\lambda_t$ was then chosen adaptively to lead to exactly the same behaviour, i.e. $c_t = \rho_t \lambda_t = \min(1, \rho_t)$. This shows that this form of clipping does not impair the validity of the off-policy corrections, in the same sense that bootstrapping in general does not change the semantics of a return. The returns used by the value and policy updates defined in Equations 1 and 2 are then

$G^v_t = G^{\text{v-trace}}_t \quad \text{and} \quad G^\pi_t = R_{t+1} + \gamma\, G^{\text{v-trace}}_{t+1}$ .   (4)

This is the same algorithm as used by Espeholt et al. (2018) in the experiments on the IMPALA architecture.
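A minimal NumPy sketch of Equations 3 and 4 over a single rollout (variable names are illustrative; the open-source IMPALA implementation is the reference for the full batched version):

    import numpy as np

    def vtrace_targets(rewards, values, bootstrap_value, rhos, gamma=0.99):
        # rewards[k] = R_{t+k+1}, values[k] = v(S_{t+k}), rhos[k] = pi/mu ratio at step t+k
        n = len(rewards)
        values_ext = np.append(values, bootstrap_value)   # v(S_t), ..., v(S_{t+n})
        cs = np.minimum(1.0, rhos)                        # clipped ratios c_t = min(1, rho_t)
        deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
        corrections = np.zeros(n)
        acc = 0.0
        for k in reversed(range(n)):                      # backward recursion over the rollout
            acc = cs[k] * (deltas[k] + gamma * acc)
            corrections[k] = acc
        value_targets = values_ext[:-1] + corrections     # G_t^{v-trace} of Eq. (3)
        # Eq. (4); the final entry simply bootstraps on v(S_{t+n}).
        policy_targets = rewards + gamma * np.append(value_targets[1:], bootstrap_value)
        return value_targets, policy_targets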

Adaptive normalisation

In this section we use PopArt normalisation (van Hasselt et al. 2016), which was introduced for value-based RL, to derive a scale-invariant algorithm for actor-critic agents. For simplicity, we first consider the single-task setting, then we extend it to the multi-task setting (the focus of this work).

Scale invariant updates

In order to normalise both the baseline and the policy gradient updates, we first parameterise the value estimate $v_{\mu,\sigma,\theta}(S)$ as a linear transformation of a suitably normalised value prediction $n_\theta(S)$. We further assume that the normalised value prediction is itself the output of a linear function, for instance the last fully connected layer of a deep neural net:

$v_{\mu,\sigma,\theta}(s) = \sigma \cdot n_\theta(s) + \mu = \sigma \cdot \big(\underbrace{w^{\top} f_{\theta \setminus \{w,b\}}(s) + b}_{=\, n_\theta(s)}\big) + \mu$ .   (5)

As proposed by van Hasselt et al., µ and σ can be updated so as to track the mean and standard deviation of the values. First and second moments of the targets can be estimated online as

$\mu_t = (1-\beta)\,\mu_{t-1} + \beta\, G^v_t \, , \qquad \nu_t = (1-\beta)\,\nu_{t-1} + \beta\, (G^v_t)^2 \, ,$   (6)

and then used to derive the estimated standard deviation as $\sigma_t = \sqrt{\nu_t - \mu_t^2}$. Note that the fixed decay rate β determines the horizon used to compute the statistics. We can then use the normalised value estimate $n_\theta(S)$ and the statistics µ and σ to normalise the actor-critic loss, both in its value and policy component; this results in the scale-invariant updates:

$\Delta\theta \propto \left( \dfrac{G^v_t - \mu}{\sigma} - n_\theta(S_t) \right) \nabla_\theta n_\theta(S_t)$ ,   (7)

$\Delta\eta \propto \left( \dfrac{G^\pi_t - \mu}{\sigma} - n_\theta(S_t) \right) \nabla_\eta \log \pi_\eta(A_t|S_t)$ .   (8)

If we optimise the new objective naively, we are at risk of making the problem harder: the normalised targets for the values are non-stationary, since they depend on the statistics µ and σ. The PopArt normalisation algorithm prevents this, by updating the last layer of the normalised value network to preserve unnormalised value estimates $v_{\mu,\sigma,\theta}$ under any change in the statistics $\mu \to \mu'$ and $\sigma \to \sigma'$:

$w' = \dfrac{\sigma}{\sigma'}\, w \, , \qquad b' = \dfrac{\sigma b + \mu - \mu'}{\sigma'} \, .$   (9)

This extends PopArt's scale-invariant updates to the actor-critic setting, and can help to make the tuning of hyperparameters easier, but it is not sufficient to tackle the challenging multi-task RL setting that we focus on in this paper. For this, a single pair of normalisation statistics is not sufficient.
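The following minimal NumPy sketch (illustrative names, not the paper's TensorFlow code) puts Equations 5, 6 and 9 together for the single-task case: the statistics are updated from the return targets, and the last linear layer is then rescaled so that the unnormalised value predictions are preserved.

    import numpy as np

    class PopArt:
        # Single-task PopArt bookkeeping for a linear output layer (Eqs. 5, 6, 9).

        def __init__(self, num_features, beta=3e-4):
            self.w = np.zeros(num_features)   # last-layer weights of n_theta
            self.b = 0.0                      # last-layer bias of n_theta
            self.mu, self.nu = 0.0, 1.0       # first and second moments of the targets
            self.beta = beta

        @property
        def sigma(self):
            return np.sqrt(self.nu - self.mu ** 2)

        def normalized_value(self, features):
            return self.w @ features + self.b                              # n_theta(s)

        def value(self, features):
            return self.sigma * self.normalized_value(features) + self.mu  # Eq. (5)

        def update_stats(self, target):
            # Track the statistics (Eq. 6), then preserve unnormalised outputs (Eq. 9).
            old_mu, old_sigma = self.mu, self.sigma
            self.mu = (1 - self.beta) * self.mu + self.beta * target
            self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
            new_sigma = self.sigma
            self.w *= old_sigma / new_sigma
            self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma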

Scale invariant updates for multi-task learning

Let $D_i$ be an environment in some finite set $\mathcal{T} = \{D_i\}_{i=1}^{N}$, and let $\pi(A|S)$ be a task-agnostic policy that takes a state $S$ from any of the environments $D_i$ and maps it to a probability distribution over the shared action space $\mathcal{A}$. Consider now a multi-task value function $v(S)$ with $N$ outputs, one for each task. We can use for $v$ the same parametrisation as in Equation 5, but with vectors of statistics $\mu, \sigma \in \mathbb{R}^N$, and a vector-valued function $n_\theta(s) = (n^1_\theta(s), \dots, n^N_\theta(s))^{\top}$:

$v_{\mu,\sigma,\theta}(S) = \sigma \odot n_\theta(S) + \mu = \sigma \odot \big(W f_{\theta \setminus \{W,b\}}(S) + b\big) + \mu$   (10)

where $W$ and $b$ denote the parameters of the last fully connected layer in $n_\theta(s)$. Given a rollout $\{S_{i,k}, A_k, R_{i,k}\}_{k=t}^{t+n}$, generated under the task-agnostic policy $\pi_\eta(A|S)$ in environment $D_i$, we can adapt the updates in Equations 7 and 8 to provide scale-invariant updates also in the multi-task setting:

$\Delta\theta \propto \left( \dfrac{G^{v,i}_t - \mu_i}{\sigma_i} - n^i_\theta(S_t) \right) \nabla_\theta n^i_\theta(S_t)$ ,   (11)

$\Delta\eta \propto \left( \dfrac{G^{\pi,i}_t - \mu_i}{\sigma_i} - n^i_\theta(S_t) \right) \nabla_\eta \log \pi_\eta(A_t|S_t)$ ,   (12)

where the targets $G^{\cdot,i}_t$ use the value estimates for environment $D_i$ for bootstrapping. For each rollout, only the $i$-th head in the value net is updated, while the same policy network is updated irrespective of the task, using the appropriate rescaling for the updates to the parameters η. As in the single-task case, when updating the statistics µ and σ we also need to update $W$ and $b$ to preserve unnormalised outputs,

$w'_i = \dfrac{\sigma_i}{\sigma'_i}\, w_i \, , \qquad b'_i = \dfrac{\sigma_i b_i + \mu_i - \mu'_i}{\sigma'_i} \, ,$   (13)

where $w_i$ is the $i$-th row of matrix $W$, and $\mu_i$, $\sigma_i$, $b_i$ are the $i$-th elements of the corresponding parameter vectors. Note that in all updates only the values, but not the policy, are conditioned on the task index, which ensures that the resulting agent can then be run in a fully task-agnostic way, since values are only used to reduce the variance of the policy updates at training time and are not needed for action selection.
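A hedged NumPy sketch of the multi-task bookkeeping (illustrative names; only the last linear layer of the value heads is shown): each task keeps its own statistics and value head, the critic update of Equation 11 touches only head i, and Equations 6 and 13 are then applied to that head alone. The scaled error it returns is the same quantity that modulates the policy gradient in Equation 12.

    import numpy as np

    class MultiTaskPopArt:
        # Per-task PopArt statistics with one value head per task (Eqs. 10-13).

        def __init__(self, num_tasks, num_features, beta=3e-4):
            self.W = np.zeros((num_tasks, num_features))   # last layer of the value heads
            self.b = np.zeros(num_tasks)
            self.mu = np.zeros(num_tasks)
            self.nu = np.ones(num_tasks)
            self.beta = beta

        def sigma(self, i):
            return np.sqrt(self.nu[i] - self.mu[i] ** 2)

        def update_task(self, i, target, features, lr=1e-3):
            # Normalised critic update (Eq. 11): only head i is touched.
            n_i = self.W[i] @ features + self.b[i]
            err = (target - self.mu[i]) / self.sigma(i) - n_i
            self.W[i] += lr * err * features
            self.b[i] += lr * err
            # Update statistics for task i (Eq. 6), then preserve its unnormalised output (Eq. 13).
            old_mu, old_sigma = self.mu[i], self.sigma(i)
            self.mu[i] = (1 - self.beta) * self.mu[i] + self.beta * target
            self.nu[i] = (1 - self.beta) * self.nu[i] + self.beta * target ** 2
            new_sigma = self.sigma(i)
            self.W[i] *= old_sigma / new_sigma
            self.b[i] = (old_sigma * self.b[i] + old_mu - self.mu[i]) / new_sigma
            return err   # the same scaled error modulates the policy gradient (Eq. 12)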


Table 1: Summary of results: aggregate scores for IMPALA and PopArt-IMPALA. We report the median human normalised score for Atari-57, and the mean capped human normalised score for DmLab-30. In Atari, Random and Human refer to whether the trained agent is evaluated with random or human starts. In DmLab-30 the test score includes evaluation on the held-out levels.

                    Atari-57            Atari-57 (unclipped)    DmLab-30
Agent               Random    Human     Random    Human         Train    Test
IMPALA              59.7%     28.5%     0.3%      1.0%          60.6%    58.4%
PopArt-IMPALA       110.7%    101.5%    107.0%    93.7%         73.5%    72.8%

Experiments

We evaluated our approach on two challenging multi-task benchmarks, Atari-57 and DmLab-30, based on Atari and DeepMind Lab respectively, and introduced by Espeholt et al. We also consider a new benchmark, consisting of the same games as Atari-57, but with the original unclipped rewards. We demonstrate state-of-the-art performance on all benchmarks. To aggregate scores across many tasks, we normalise the scores on each task based on the scores of a human player and of a random agent on that same task (van Hasselt, Guez, and Silver 2016). All experiments use population-based training (PBT) to tune hyperparameters (Jaderberg et al. 2017). As in Espeholt et al., we report learning curves as a function of the number of frames processed by one instance of the tuning population, summed across tasks.
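Concretely, the per-task normalisation referred to above is the standard human normalised score,

$\text{normalised score} = \dfrac{\text{score}_{\text{agent}} - \text{score}_{\text{random}}}{\text{score}_{\text{human}} - \text{score}_{\text{random}}} \times 100\% ,$

which is then aggregated with the median over games for Atari-57, and with the mean (each task's score capped at 100%) for DmLab-30.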

Domains

Atari-57 is a collection of 57 classic Atari 2600 games. The ALE (Bellemare et al. 2013) exposes them as RL environments. Most prior work has focused on training agents for individual games (Mnih et al. 2015; Hessel et al. 2018; Gruslys et al. 2018; Schulman et al. 2015; Schulman et al. 2017; Bacon, Harb, and Precup 2017). Multi-task learning on this platform has not been as successful, due to the large number of environments, inconsistent dynamics and very different reward structures. Prior work on multi-task RL in the ALE has therefore focused on smaller subsets of games (Rusu et al. 2015; Sharma and Ravindran 2017). Atari has a particularly diverse reward structure; consequently, it is a perfect domain to fully assess how well our agents can deal with extreme differences in the scale of returns. Thus, we train all agents both with and without reward clipping, to compare performance degradation as returns get more diverse in the unclipped version of the environment. In both cases, at the end of training, we test agents both with random starts (Mnih et al. 2015) and human starts (Nair et al. 2015); aggregate results are reported in Table 1 accordingly.

DmLab-30 is a collection of 30 visually rich, partially observable RL environments (Beattie et al. 2016). This benchmark has strong internal consistency (all levels are played with a first-person camera in a 3D environment with consistent dynamics). However, the tasks themselves are quite diverse, and were designed to test distinct skills in RL agents: among these navigation, memory, planning, laser-tagging, and language grounding. The levels can also differ visually in non-trivial ways, as they include both natural environments and maze-like levels. Two levels (rooms_collect_good_objects and rooms_exploit_deferred_effects) have held-out test versions, therefore Table 1 reports both train and test aggregate scores. We observed that the original IMPALA agent suffers from an artificial bottleneck in performance, due to the fact that some of the tasks cannot be solved with the action set available to the agent. As a first step, we thus fix this issue by equipping it with a larger action set, resulting in a stronger IMPALA baseline than reported in the original paper. We also run multiple independent PBT experiments, to assess the variability of results across multiple replications.

Atari-57 results

Figures 1 and 2 show the median human normalised performance across the entire set of 57 Atari games in the ALE, when training agents with and without reward clipping, respectively. The curves are plotted as a function of the total number of frames seen by each agent.

PopArt-IMPALA (orange line) achieves a median performance of 110% with reward clipping and a median performance of 101% in the unclipped version of Atari-57. Recall that here we are measuring the median performance of a single trained agent across all games, rather than the median over the performance of a set of individually trained agents, as has been more common in the Atari domain. To our knowledge, both agents are the first to surpass median human performance across the entire set of 57 Atari games.

The IMPALA agent (blue line) performs much worse. The baseline barely reaches 60% with reward clipping, and its median performance is close to 0% in the unclipped setup. The large decrease in the performance of the baseline IMPALA agent once clipping is removed is in stark contrast with what we observed for PopArt-IMPALA, which achieved almost the same performance in the two training regimes.

Since the level-specific value predictions used by multi-task PopArt effectively increase the capacity of the network, we also ran an additional experiment to disentangle the contribution of the increased network capacity from the contribution of the adaptive normalisation. For this purpose, we train a second baseline that uses level-specific value predictions, but does not use PopArt to adaptively normalise the learning updates. The experiments show that this MultiHead-IMPALA agent (pink line) actually performs slightly worse than the original IMPALA both with and without clipping, confirming that the performance boost of PopArt-IMPALA is indeed due to the adaptive rescaling.



Figure 1: Atari-57 (reward clipping). Median human normalised score across all Atari levels, as a function of the total number of frames seen by the agents across all levels. We compare PopArt-IMPALA to IMPALA and to an additional baseline, MultiHead-IMPALA, that uses task-specific value predictions but no adaptive normalisation. All three agents are trained with the clipped reward scheme.


We highlight that in our experiments a single instance of multi-task PopArt-IMPALA has processed the same number of frames as a collection of 57 expert DQN agents ($57 \times 200\text{M} = 1.14 \times 10^{10}$), while achieving better performance. Despite the large CPU requirements, on a cloud service, training multi-task PopArt-IMPALA can also be competitive in terms of cost, since it exceeds the performance of a vanilla DQN in just 2.5 days, with a smaller GPU footprint.

Normalisation statistics

It is insightful to observe the different normalisation statistics across games, and how they adapt during training. Figure 3 (top row) plots the shift µ for a selection of Atari games, in the unclipped training regime. The scale σ is visualised in the same figure by shading the area in the range [µ − σ, µ + σ]. The statistics differ by orders of magnitude across games: in crazy_climber the shift exceeds 2500, while in bowling it never goes above 15. The adaptivity of the proposed normalisation emerges clearly in crazy_climber and qbert, where the statistics span multiple orders of magnitude during training. The bottom row in Figure 3 shows the corresponding agent's undiscounted episode return: it follows the same patterns as the statistics (with differences in magnitude due to discounting). Finally, note how the statistics can even track the instabilities in the agent's performance, as in qbert.

DmLab-30 results

Figure 4 shows, as a function of the total number of frames processed by each agent, the mean human normalised performance across all 30 DeepMind Lab levels, where each level's score is capped at 100%. For all agents, we ran three independent PBT experiments.


Figure 2: Atari-57 (unclipped). Median human normalised score across all Atari levels, as a function of the total number of frames seen by the agents across all levels. We here compare the same set of agents as in Figure 1, but now all agents are trained without using reward clipping. The approximately flat lines corresponding to the baselines indicate no learning at all on at least 50% of the games.

Figure 3: Normalisation statistics. Top: learned statistics, without reward clipping, for four distinct Atari games (breakout, crazy_climber, qbert, seaquest). The shaded region is [µ − σ, µ + σ]. Bottom: undiscounted returns.

In Figure 4 we plot the learning curves for each experiment and, for each agent, fill in the area between the best and worst experiment.

Compared to the original paper, our IMPALA baseline uses a richer action set, which includes more possible horizontal rotations, as well as vertical rotations (details in the Appendix). Fine-grained horizontal control is useful on lasertag levels, while vertical rotations are necessary for a few psychlab levels. Note that this new baseline (solid blue in Figure 4) performs much better than the original IMPALA agent, which we also train and report for completeness (dashed blue). Including PopArt normalisation (in orange) on top of our baseline results in largely improved scores. Note how the agents achieve clearly separated performance levels, with the new action set dominating the original paper's one, and with PopArt-IMPALA dominating IMPALA for all three replications of the experiment.



Figure 4: DmLab-30. Mean capped human normalised score of IMPALA (blue) and PopArt-IMPALA (orange) across the DmLab-30 benchmark, as a function of the number of frames (summed across all levels). The shaded region is bounded by the best and worst run among 3 PBT experiments. For reference, we also plot the performance of IMPALA with the limited action set from the original paper (dashed).

Extensions

In this section, we explore the combination of the proposed PopArt-IMPALA agent with pixel control (Jaderberg et al. 2016), to further improve data efficiency and make training IMPALA-like agents on large multi-task benchmarks cheaper and more practical. Pixel control is an unsupervised auxiliary task introduced to help learning good state representations. As shown in Figure 5, the combination of PopArt-IMPALA with pixel control (red line) matches the final performance of the vanilla PopArt-IMPALA (orange line) with a fraction of the data (∼2B frames). This is on top of the large improvement in data efficiency already provided by PopArt, meaning that the pixel-control-augmented PopArt-IMPALA needs less than 1/10th of the data to match our own IMPALA baseline's performance (and 1/30th of the frames to match the original published IMPALA). Importantly, since both PopArt and pixel control only add a very small computational cost, this improvement in data efficiency directly translates into a large reduction of the cost of training IMPALA agents on large multi-task benchmarks. Note, finally, that other orthogonal advances in deep RL could also be combined to further improve performance, similarly to what was done by Rainbow (Hessel et al. 2018) in the context of value-based reinforcement learning.

Implementation notes

We implemented all agents in TensorFlow. For each batch of rollouts processed by the learner, we average the $G^v_t$ targets within a rollout, and for each rollout in the batch we perform one online update of PopArt's normalisation statistics with decay $\beta = 3 \times 10^{-4}$. Note that β did not require any tuning. To prevent numerical issues, we clip the scale σ to the range [0.0001, 1e6]. We do not back-propagate gradients into µ and σ, which are exclusively updated as in Equation 6.
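As an illustration of this bookkeeping, a minimal NumPy sketch (names are illustrative) of one such online update, with the clipping described above and with the statistics held outside of the optimiser so that no gradients flow through them:

    import numpy as np

    BETA = 3e-4                          # decay for the PopArt statistics (not tuned)
    SIGMA_MIN, SIGMA_MAX = 1e-4, 1e6     # clipping range for the scale

    def update_statistics(mu, nu, targets):
        # One online update of mu and nu from a rollout's return targets (Eq. 6).
        # mu and nu are plain (non-trainable) scalars per task.
        g = float(np.mean(targets))      # average the G_t^v targets within the rollout
        mu = (1 - BETA) * mu + BETA * g
        nu = (1 - BETA) * nu + BETA * g ** 2
        sigma = np.clip(np.sqrt(max(nu - mu ** 2, 0.0)), SIGMA_MIN, SIGMA_MAX)
        return mu, nu, sigma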


Figure 5: DmLab-30 (with pixel control). Mean capped human normalised score of PopArt-IMPALA with pixel control (red) across the DmLab-30 benchmark, as a function of the total number of frames across all tasks. The shaded region is bounded by the best and worst run among 3 PBT experiments. Dotted lines mark the points where Pixel-PopArt-IMPALA matches PopArt-IMPALA and the two IMPALA baselines.

The weights W of the last layer of the value function are updated according to Equations 11 and 13. Note that we first apply the actor-critic updates (11), then update the statistics (6), and finally apply the output-preserving updates (13). For more just-in-time rescaling of updates we could invert this order, but this was not necessary. As anticipated, in all experiments we used population-based training (PBT) to adapt hyperparameters during training (Jaderberg et al. 2017). As in the IMPALA paper, we use PBT to tune the learning rate, the entropy cost, the optimiser's epsilon, and, in the Atari experiments, the max gradient norm. In Atari-57 we used populations of 24 instances, in DmLab-30 just 8 instances. All hyperparameters are reported in the Appendix.
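Schematically, assuming a hypothetical popart object exposing the operations above, each learner batch is processed in that order:

    def process_batch(batch, popart, apply_actor_critic_updates):
        # Order described in the text: updates (11)-(12), then statistics (6), then Eq. (13).
        for rollout in batch:
            task_id, value_targets = apply_actor_critic_updates(rollout, popart)  # Eqs. (11)-(12)
            popart.update_statistics(task_id, value_targets)                      # Eq. (6)
            popart.preserve_outputs(task_id)                                      # Eq. (13)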

Discussion

In this paper we propose a scale-invariant actor-critic algorithm that enables significantly improved performance in multi-task reinforcement learning settings. Being able to acquire knowledge about a wide range of facts and skills has long been considered an essential feature for an RL agent to demonstrate intelligent behaviour (Sutton et al. 2011; Degris and Modayil 2012; Legg and Hutter 2007). To ask our algorithms to master multiple tasks is therefore a natural step as we progress towards increasingly powerful agents.

The wide-spread adoption of deep learning in RL is quite timely in this regard, since sharing parts of a neural network across multiple tasks is also a powerful way of building robust representations. This is particularly important for RL, because rewards on individual tasks can be sparse, and therefore sharing representations across tasks can be vital to bootstrap learning. Several agents (Jaderberg et al. 2016; Lample and Chaplot 2016; Shelhamer et al. 2016; Mirowski et al. 2016) demonstrated this by improving performance on a single external task by learning off-policy about auxiliary tasks defined on the same stream of experience (e.g. pixel control, immediate reward prediction or auto-encoding).

Multi-task learning, as considered in this paper, where we get to execute, in parallel, the policies learned for each task, has potential additional benefits, including deep exploration (Osband et al. 2016) and policy composition (Mankowitz et al. 2018; Todorov 2009). By learning on-policy about tasks, it may also be easier to scale to much more diverse tasks: if we only learn about some task off-policy from experience generated while pursuing a very different one, we might never observe any reward. A limitation of our approach is that it can be complicated to implement parallel learning outside of simulation, but recent work on parallel training of robots (Levine et al. 2016) suggests that this is not necessarily an insurmountable obstacle if sufficient resources are available.

Adoption of parallel multi-task RL has up to now been fairly limited. That the scaling issues considered in this paper may have been a factor in this limited adoption is suggested by the wider use of this kind of learning in supervised settings (Johnson et al. 2017; Lu et al. 2016; Misra et al. 2016; Hashimoto et al. 2016), where loss functions are naturally well scaled (e.g. cross entropy), or can be easily scaled thanks to the stationarity of the training distribution. We therefore hope and believe that the work presented here can enable more research on multi-task RL.

We also believe that PopArt's adaptive normalisation can be combined with other research in multi-task reinforcement learning that previously did not scale as effectively to large numbers of diverse tasks. We highlight as potential candidates policy distillation (Parisotto, Ba, and Salakhutdinov 2015; Rusu et al. 2015; Schmitt et al. 2018; Teh et al. 2017) and active sampling of the task distribution the agent trains on (Sharma and Ravindran 2017). The combination of PopArt-IMPALA with active sampling might be particularly promising, since it may allow a more efficient use of parallel data generation by focusing it on the tasks most amenable to learning. Elastic weight consolidation (Kirkpatrick et al. 2017) and other work from the continual learning literature (Ring 1994; Mcclelland, Mcnaughton, and O'Reilly 1995) might also be adapted to parallel learning setups to reduce interference (French 1999) among tasks.

References

[Bacon, Harb, and Precup 2017] Bacon, P.; Harb, J.; and Precup, D. 2017. The option-critic architecture. AAAI Conference on Artificial Intelligence.

[Beattie et al. 2016] Beattie, C.; Leibo, J. Z.; Teplyashin, D.; Ward, T.; Wainwright, M.; Kuttler, H.; Lefrancq, A.; Green, S.; Valdes, V.; Sadik, A.; Schrittwieser, J.; Anderson, K.; York, S.; Cant, M.; Cain, A.; Bolton, A.; Gaffney, S.; King, H.; Hassabis, D.; Legg, S.; and Petersen, S. 2016. DeepMind Lab. CoRR abs/1612.03801.

[Bellemare et al. 2013] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The arcade learning environment: An evaluation platform for general agents. JAIR.

[Bellman 1957] Bellman, R. 1957. A Markovian decision process. Journal of Mathematics and Mechanics.

[Brunskill and Li 2013] Brunskill, E., and Li, L. 2013. Sample complexity of multi-task reinforcement learning. CoRR abs/1309.6821.

[Caruana 1998] Caruana, R. 1998. Multitask learning. In Learning to Learn.

[Collobert and Weston 2008] Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML.

[Degris and Modayil 2012] Degris, T., and Modayil, J. 2012. Scaling-up knowledge for a cognizant robot. In AAAI Spring Symposium: Designing Intelligent Robots.

[Duan et al. 2016] Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. In ICML.

[Espeholt et al. 2018] Espeholt, L.; Soyer, H.; Munos, R.; Simonyan, K.; Mnih, V.; Ward, T.; Doron, Y.; Firoiu, V.; Harley, T.; Dunning, I.; Legg, S.; and Kavukcuoglu, K. 2018. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In ICML.

[French 1999] French, R. M. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences.

[Gruslys et al. 2018] Gruslys, A.; Azar, M. G.; Bellemare, M. G.; and Munos, R. 2018. The Reactor: A sample-efficient actor-critic architecture. ICLR.

[Hashimoto et al. 2016] Hashimoto, K.; Xiong, C.; Tsuruoka, Y.; and Socher, R. 2016. A joint many-task model: Growing a neural network for multiple NLP tasks. CoRR abs/1611.01587.

[He et al. 2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

[Hessel et al. 2018] Hessel, M.; Modayil, J.; van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M. G.; and Silver, D. 2018. Rainbow: Combining improvements in deep reinforcement learning. AAAI Conference on Artificial Intelligence.

[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.

[Jaderberg et al. 2016] Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2016. Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397.

[Jaderberg et al. 2017] Jaderberg, M.; Dalibard, V.; Osindero, S.; Czarnecki, W. M.; Donahue, J.; Razavi, A.; Vinyals, O.; Green, T.; Dunning, I.; Simonyan, K.; Fernando, C.; and Kavukcuoglu, K. 2017. Population based training of neural networks. CoRR abs/1711.09846.

[Johnson et al. 2017] Johnson, M.; Schuster, M.; Le, Q. V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viegas, F. B.; Wattenberg, M.; Corrado, G.; Hughes, M.; and Dean, J. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5.

[Kirkpatrick et al. 2017] Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; Hassabis, D.; Clopath, C.; Kumaran, D.; and Hadsell, R. 2017. Overcoming catastrophic forgetting in neural networks. PNAS.

[Lample and Chaplot 2016] Lample, G., and Chaplot, D. S. 2016. Playing FPS games with deep reinforcement learning. CoRR abs/1609.05521.

[Legg and Hutter 2007] Legg, S., and Hutter, M. 2007. Universal intelligence: A definition of machine intelligence. Minds and Machines.

[Levine et al. 2016] Levine, S.; Pastor, P.; Krizhevsky, A.; and Quillen, D. 2016. Learning hand-eye coordination for robotic grasping with large-scale data collection. In ISER.


[Lillicrap et al. 2016] Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous control with deep reinforcement learning. In ICLR.

[Lu et al. 2016] Lu, Y.; Kumar, A.; Zhai, S.; Cheng, Y.; Javidi, T.; and Feris, R. S. 2016. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. CoRR abs/1611.05377.

[Mahmood 2017] Mahmood, A. 2017. Incremental off-policy reinforcement learning algorithms. Ph.D. Dissertation, University of Alberta.

[Mankowitz et al. 2018] Mankowitz, D. J.; Zıdek, A.; Barreto, A.; Horgan, D.; Hessel, M.; Quan, J.; Oh, J.; van Hasselt, H.; Silver, D.; and Schaul, T. 2018. Unicorn: Continual learning with a universal, off-policy agent. CoRR abs/1802.08294.

[Mcclelland, Mcnaughton, and O'Reilly 1995] Mcclelland, J. L.; Mcnaughton, B. L.; and O'Reilly, R. C. 1995. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review.

[Mirowski et al. 2016] Mirowski, P.; Pascanu, R.; Viola, F.; Soyer, H.; Ballard, A. J.; Banino, A.; Denil, M.; Goroshin, R.; Sifre, L.; Kavukcuoglu, K.; Kumaran, D.; and Hadsell, R. 2016. Learning to navigate in complex environments. CoRR abs/1611.03673.

[Misra et al. 2016] Misra, I.; Shrivastava, A.; Gupta, A.; and Hebert, M. 2016. Cross-stitch networks for multi-task learning. CoRR abs/1604.03539.

[Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature.

[Mnih et al. 2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In ICML.

[Nair et al. 2015] Nair, A.; Srinivasan, P.; Blackwell, S.; Alcicek, C.; Fearon, R.; De Maria, A.; Panneershelvam, V.; Suleyman, M.; Beattie, C.; Petersen, S.; Legg, S.; Mnih, V.; Kavukcuoglu, K.; and Silver, D. 2015. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296.

[Osband et al. 2016] Osband, I.; Blundell, C.; Pritzel, A.; and Van Roy, B. 2016. Deep exploration via bootstrapped DQN. In NIPS.

[Parisotto, Ba, and Salakhutdinov 2015] Parisotto, E.; Ba, L. J.; and Salakhutdinov, R. 2015. Actor-mimic: Deep multitask and transfer reinforcement learning. CoRR abs/1511.06342.

[Precup, Sutton, and Singh 2000] Precup, D.; Sutton, R. S.; and Singh, S. P. 2000. Eligibility traces for off-policy policy evaluation. In ICML.

[Ring 1994] Ring, M. 1994. Continual learning in reinforcement environments.

[Romera-Paredes et al. 2013] Romera-Paredes, B.; Aung, H.; Bianchi-Berthouze, N.; and Pontil, M. 2013. Multilinear multitask learning. In ICML.

[Rusu et al. 2015] Rusu, A. A.; Colmenarejo, S. G.; Gulcehre, C.; Desjardins, G.; Kirkpatrick, J.; Pascanu, R.; Mnih, V.; Kavukcuoglu, K.; and Hadsell, R. 2015. Policy distillation. CoRR abs/1511.06295.

[Rusu et al. 2016] Rusu, A. A.; Rabinowitz, N. C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; and Hadsell, R. 2016. Progressive neural networks. CoRR abs/1606.04671.

[Schmidhuber 1990] Schmidhuber, J. 1990. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In IJCNN.

[Schmitt et al. 2018] Schmitt, S.; Hudson, J. J.; Zıdek, A.; Osindero, S.; Doersch, C.; Czarnecki, W. M.; Leibo, J. Z.; Kuttler, H.; Zisserman, A.; Simonyan, K.; and Eslami, S. M. A. 2018. Kickstarting deep reinforcement learning. CoRR abs/1803.03835.

[Schulman et al. 2015] Schulman, J.; Levine, S.; Moritz, P.; Jordan, M. I.; and Abbeel, P. 2015. Trust region policy optimization. CoRR abs/1502.05477.

[Schulman et al. 2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.

[Sharma and Ravindran 2017] Sharma, S., and Ravindran, B. 2017. Online multi-task learning using active sampling. CoRR abs/1702.06053.

[Shelhamer et al. 2016] Shelhamer, E.; Mahmoudieh, P.; Argus, M.; and Darrell, T. 2016. Loss is its own reward: Self-supervision for reinforcement learning. CoRR abs/1612.07307.

[Silver et al. 2016] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.

[Silver et al. 2017] Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T. P.; Simonyan, K.; and Hassabis, D. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR abs/1712.01815.

[Sutton and Barto 2018] Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.

[Sutton et al. 2011] Sutton, R. S.; Modayil, J.; Delp, M.; Degris, T.; Pilarski, P. M.; White, A.; and Precup, D. 2011. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In AAMAS.

[Sutton et al. 2014] Sutton, R. S.; Mahmood, A. R.; Precup, D.; and van Hasselt, H. 2014. A new Q(λ) with interim forward view and Monte Carlo equivalence. In ICML.

[Teh et al. 2017] Teh, Y. W.; Bapst, V.; Czarnecki, W. M.; Quan, J.; Kirkpatrick, J.; Hadsell, R.; Heess, N.; and Pascanu, R. 2017. Distral: Robust multitask reinforcement learning. CoRR abs/1707.04175.

[Thrun 1996] Thrun, S. 1996. Is learning the n-th thing any easier than learning the first? In NIPS.

[Thrun 2012] Thrun, S. 2012. Explanation-based neural network learning: A lifelong learning approach. Springer.

[Todorov 2009] Todorov, E. 2009. Compositionality of optimal control laws. In NIPS.

[van Hasselt et al. 2016] van Hasselt, H.; Guez, A.; Hessel, M.; Mnih, V.; and Silver, D. 2016. Learning values across many orders of magnitude. In NIPS.

[van Hasselt, Guez, and Silver 2016] van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence.

[Watkins 1989] Watkins, C. J. C. H. 1989. Learning from Delayed Rewards. Ph.D. Dissertation, King's College, Cambridge, England.

[Williams 1992] Williams, R. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.


Appendix

In this Appendix we report additional details about the results presented in the main text, as well as present additional experiments on the DmLab-30 benchmark. We also report the breakdown per level of the scores of IMPALA and PopArt-IMPALA on the Atari-57 and DmLab-30 benchmarks. Finally, we report the hyperparameters used to train the baseline agents as well as PopArt-IMPALA. These hyperparameters are mostly the same as in Espeholt et al., but we report them for completeness and to ease reproducibility.

Hyper-parameter tuning

In our experiments we used Population-Based Training (PBT) to tune hyper-parameters. In our DmLab-30 experiments, however, we used smaller populations than in the original IMPALA paper. For completeness, we also report here the results of running PopArt-IMPALA and IMPALA with the larger population size used by Espeholt et al. Due to the increased cost of using larger populations, in this case we only report one PBT tuning experiment per agent, rather than the 3 reported in the main text.

The learning curves for both IMPALA and PopArt-IMPALA are shown in Figure 6, together with horizontal dashed lines marking the average final performance of the agents trained with the smaller population of just 8 instances. The performance of both the IMPALA and PopArt-IMPALA agents at the end of training is very similar whether hyperparameters are tuned with 8 or 24 PBT instances, suggesting that the large populations used for hyper-parameter tuning by Espeholt et al. may not be necessary.

Note, however, that we have observed larger discrepancies between experiments where small and large population sizes are used for tuning hyper-parameters when training the lower-performing IMPALA agent that used the more limited action set, as presented in the original IMPALA paper.


Figure 6: Larger populations. Mean capped human normalised score across the DmLab-30 benchmark as a function of the total number of frames seen by the agents across all levels. The solid lines plot the performance of IMPALA (blue) and PopArt-IMPALA (orange) when tuning hyper-parameters with a large PBT population of 24 instances. Dashed lines correspond to the final performance of these same agents, after 10B frames, in the previous experiments where hyper-parameters were tuned with a population of 8.

Pixel Control

Pixel control (Jaderberg et al. 2016) is an unsupervised auxiliary task introduced to help learning good state representations. We report here also the performance of combining pixel control with IMPALA without also using PopArt. As shown in Figure 7, pixel control increases the performance of both the PopArt-IMPALA agent as well as that of the IMPALA baseline. PopArt still guarantees a noticeable boost in performance, with the median human normalised score of Pixel-PopArt-IMPALA (red line) exceeding the score of Pixel-IMPALA (green line) by approximately 10 points.

We implemented the pixel control task as described in the original paper, only adapting the scheme to the rectangular observations used in DmLab-30. We split the (72 × 96) observations into an 18 × 24 grid of 4 × 4 cells. For each location in the grid we define a distinct pseudo-reward $r_{i,j}$, equal to the absolute value of the difference between pixel intensities in consecutive frames, averaged across the 16 pixels of cell $c_{i,j}$. For each cell, we train action values with multi-step Q-learning, accumulating rewards until the end of a rollout and then bootstrapping. We use a discount γ = 0.9. Learning is fully off-policy, on experience generated by the actors, which follow the main policy π as usual.
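A minimal NumPy sketch of these pseudo-rewards (the averaging over colour channels is our assumption for multi-channel observations; names are illustrative):

    import numpy as np

    CELL = 4  # 4x4 pixel cells, giving an 18x24 grid for 72x96 observations

    def pixel_control_rewards(prev_frame, frame):
        # Pseudo-rewards r_{i,j}: mean absolute intensity change within each 4x4 cell.
        # prev_frame, frame: arrays of shape (72, 96) or (72, 96, channels).
        diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
        if diff.ndim == 3:                       # average over colour channels first (assumption)
            diff = diff.mean(axis=-1)
        h, w = diff.shape
        cells = diff.reshape(h // CELL, CELL, w // CELL, CELL)
        return cells.mean(axis=(1, 3))           # shape (18, 24), one pseudo-reward per cell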

We use a deep deconvolutional network for the action value predictions associated with each pseudo-reward $r_{i,j}$. First, we feed the LSTM's output to a fully connected layer, reshape the output tensor to 6 × 9 × 32, and apply a deconvolution with 3 × 3 kernels that outputs an 8 × 11 × 32 tensor. From this, we compute a spatial grid of Q-values using a dueling network architecture: we use a deconvolution with 1 output channel for the state values across the grid and a deconvolution with |A| channels for the advantage estimates of each cell. Output deconvolutions use 4 × 4 kernels with stride 2. The additional head is only evaluated on the learner; actors do not execute it.
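A Keras-style sketch of the shapes described above (layer names, and the standard dueling combination with a mean-subtracted advantage, are our choices for illustration, not necessarily the exact original implementation):

    import tensorflow as tf

    def pixel_control_head(lstm_output, num_actions):
        # Dueling deconvolutional head producing an 18 x 24 x |A| grid of Q-values.
        x = tf.keras.layers.Dense(6 * 9 * 32, activation="relu")(lstm_output)
        x = tf.keras.layers.Reshape((6, 9, 32))(x)
        x = tf.keras.layers.Conv2DTranspose(32, 3, strides=1, padding="valid",
                                            activation="relu")(x)            # 8 x 11 x 32
        values = tf.keras.layers.Conv2DTranspose(1, 4, strides=2,
                                                 padding="valid")(x)          # 18 x 24 x 1
        advantages = tf.keras.layers.Conv2DTranspose(num_actions, 4, strides=2,
                                                     padding="valid")(x)      # 18 x 24 x |A|
        # Dueling combination: Q = V + (A - mean_a A)
        return values + advantages - tf.reduce_mean(advantages, axis=-1, keepdims=True)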


Figure 7: Pixel Control. Mean capped human normalised score across the DmLab-30 benchmark as a function of the total number of frames (summed across levels). Solid lines plot the performance of PopArt-IMPALA (red) and IMPALA (green), after augmenting both with pixel control. Dashed lines mark the points at which Pixel-PopArt-IMPALA matches the final performance of previous agents. Note how, thanks to the improved data efficiency, we train for 2B frames, compared to 10B in previous experiments.


Atari-57 Score breakdown

In this section we use barplots to report the final performance of the agents on each of the levels in the Atari-57 multi-task benchmark. To compute these scores, we take the final trained agent and evaluate it with a frozen policy on each of the levels for 200 episodes. The same trained policy is evaluated on all levels, and the policy is not given any information about the task it is being evaluated on. For Atari, we compare PopArt-IMPALA, with and without reward clipping, to an IMPALA baseline. In all cases the height of the bars in the plot denotes the human normalised score. For the Atari results we additionally use a logarithmic scale for the x-axis, because in this domain games may differ in their normalised performance by several orders of magnitude.

Figure 8: Atari-57 breakdown: human normalised score for IMPALA and PopArt-IMPALA, as measured in a separate evaluation phase at the end of training, broken down for the 57 games. For PopArt-IMPALA we report the scores both with and without reward clipping.

[Figure 8 barplot omitted: per-game human normalised scores on a logarithmic x-axis (from -1 to 10000), one bar group for each of the 57 Atari games from skiing to atlantis; legend: PopArt-IMPALA, PopArt-IMPALA (unclipped), IMPALA.]

DmLab-30 Score breakdown

In this section we use barplots to report the final performance of the agents on each of the levels in the DmLab-30 multi-task benchmark. To compute these scores, we take the final trained agent and evaluate it with a frozen policy on each of the levels for 500 episodes. We perform the evaluation over a larger number of episodes (compared to Atari) because the variance of the mean episode return is typically higher in DeepMind Lab. As before, the same trained policy is evaluated on all levels, and the policy is not given any information about the task it is being evaluated on. Also in DmLab-30 we perform a three-way comparison: we compare PopArt-IMPALA to our improved IMPALA baseline and, for completeness, to the original paper's IMPALA.

Figure 9: DmLab-30 breakdown: human normalised score for the original paper's IMPALA, our improved IMPALA baseline, and PopArt-IMPALA, as measured at the end of training, broken down for the 30 tasks; all agents used 8 instances for population based training.

[Figure 9 barplot omitted: per-task human normalised scores (x-axis 0 to 140), one bar group for each of the 30 DmLab-30 levels from rooms_exploit_deferred_effects_test to lasertag_one_opponent_small; legend: PopArt-IMPALA, IMPALA-original, IMPALA.]


DeepMind Lab action discretisation

DeepMind Lab's native action space is a 7-dimensional continuous space, whose dimensions correspond to rotating horizontally/vertically, strafing left/right, moving forward/backward, tagging, crouching, and jumping.

Despite the native action space being continuous, previous work on this platform has typically relied on a coarse discretisation of the action space. We follow the same approach in our experiments.

Below we list the discretisations used by the agents considered in our experiments. This includes the discretisation used by IMPALA, as well as the one we introduce in this paper to unlock some levels in DmLab-30 that cannot be solved under the original IMPALA discretisation.

Table 1: Action discretisation used by IMPALA: we report below the discretisation of DeepMind Lab's action space, as used by the original IMPALA agent in Espeholt et al.

Action            Native DmLab Action
Forward (FW)      [  0,  0,  0,  1, 0, 0, 0]
Backward (BW)     [  0,  0,  0, -1, 0, 0, 0]
Strafe Left       [  0,  0, -1,  0, 0, 0, 0]
Strafe Right      [  0,  0,  1,  0, 0, 0, 0]
Look Left (LL)    [-20,  0,  0,  0, 0, 0, 0]
Look Right (LR)   [ 20,  0,  0,  0, 0, 0, 0]
FW + LL           [-20,  0,  0,  1, 0, 0, 0]
FW + LR           [ 20,  0,  0,  1, 0, 0, 0]
Fire              [  0,  0,  0,  0, 1, 0, 0]

Table 2: Action discretisation of DeepMind Lab's action space, as used by our version of IMPALA and by PopArt-IMPALA.

Action            Native DmLab Action
FW                [  0,   0,  0,  1, 0, 0, 0]
BW                [  0,   0,  0, -1, 0, 0, 0]
Strafe Left       [  0,   0, -1,  0, 0, 0, 0]
Strafe Right      [  0,   0,  1,  0, 0, 0, 0]
Small LL          [-10,   0,  0,  0, 0, 0, 0]
Small LR          [ 10,   0,  0,  0, 0, 0, 0]
Large LL          [-60,   0,  0,  0, 0, 0, 0]
Large LR          [ 60,   0,  0,  0, 0, 0, 0]
Look Down         [  0,  10,  0,  0, 0, 0, 0]
Look Up           [  0, -10,  0,  0, 0, 0, 0]
FW + Small LL     [-10,   0,  0,  1, 0, 0, 0]
FW + Small LR     [ 10,   0,  0,  1, 0, 0, 0]
FW + Large LL     [-60,   0,  0,  1, 0, 0, 0]
FW + Large LR     [ 60,   0,  0,  1, 0, 0, 0]
Fire              [  0,   0,  0,  0, 1, 0, 0]
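For illustration, the extended discretisation of Table 2 amounts to a simple lookup table from the policy's discrete action index to a native 7-dimensional DeepMind Lab action; the sketch below is ours, with the per-dimension ordering taken from the description at the start of this section.

```python
import numpy as np

# Native action layout, following the description above:
# [horizontal rotation, vertical rotation, strafe, move, fire, crouch, jump].
DMLAB_ACTION_SET = np.array([
    [  0,   0,  0,  1, 0, 0, 0],   # FW
    [  0,   0,  0, -1, 0, 0, 0],   # BW
    [  0,   0, -1,  0, 0, 0, 0],   # Strafe Left
    [  0,   0,  1,  0, 0, 0, 0],   # Strafe Right
    [-10,   0,  0,  0, 0, 0, 0],   # Small LL
    [ 10,   0,  0,  0, 0, 0, 0],   # Small LR
    [-60,   0,  0,  0, 0, 0, 0],   # Large LL
    [ 60,   0,  0,  0, 0, 0, 0],   # Large LR
    [  0,  10,  0,  0, 0, 0, 0],   # Look Down
    [  0, -10,  0,  0, 0, 0, 0],   # Look Up
    [-10,   0,  0,  1, 0, 0, 0],   # FW + Small LL
    [ 10,   0,  0,  1, 0, 0, 0],   # FW + Small LR
    [-60,   0,  0,  1, 0, 0, 0],   # FW + Large LL
    [ 60,   0,  0,  1, 0, 0, 0],   # FW + Large LR
    [  0,   0,  0,  0, 1, 0, 0],   # Fire
], dtype=np.intc)

def to_native_action(action_index: int) -> np.ndarray:
    """Map the agent's discrete action index to the native DmLab action."""
    return DMLAB_ACTION_SET[action_index]
```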

Fixed Hyperparameters

Table 3: PopArt-specific hyperparameters: these are held fixed during training and were only very lightly tuned. The lower bound is used to avoid numerical issues when rewards are extremely sparse.

Hyperparameter              Value
Statistics learning rate    0.0003
Scale lower bound           0.0001
Scale upper bound           1e6
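As a reference for how these values are used, the sketch below maintains per-task normalisation statistics with the statistics learning rate above and clamps the resulting scale to the given bounds; the compensating rescaling of the value head's weights (the output-preserving step of PopArt) is omitted, and all names are ours.

```python
import numpy as np

class PopArtStatistics:
    """Per-task shift/scale statistics for value normalisation (sketch).

    `beta` is the statistics learning rate; the scale is clamped to
    [scale_min, scale_max] to avoid numerical issues when rewards are
    extremely sparse.
    """
    def __init__(self, num_tasks, beta=3e-4, scale_min=1e-4, scale_max=1e6):
        self.beta = beta
        self.scale_min, self.scale_max = scale_min, scale_max
        self.mean = np.zeros(num_tasks)      # first moment of the targets
        self.mean_sq = np.ones(num_tasks)    # second moment of the targets

    def update(self, task_id, target):
        """Move the statistics of `task_id` towards a new value target."""
        b = self.beta
        self.mean[task_id] = (1 - b) * self.mean[task_id] + b * target
        self.mean_sq[task_id] = (1 - b) * self.mean_sq[task_id] + b * target ** 2

    def scale(self, task_id):
        """Clamped standard deviation used to normalise the task's targets."""
        var = self.mean_sq[task_id] - self.mean[task_id] ** 2
        return float(np.clip(np.sqrt(max(var, 0.0)), self.scale_min, self.scale_max))
```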

Table 4: DeepMind Lab preprocessing. As in previous work on DeepMind Lab, we render the observations at a resolution of [72, 96] and use 4 action repeats. We also employ the optimistic asymmetric rescaling (OAR) of rewards that was introduced in Espeholt et al. for exploration.

Hyperparameter              Value
Image Height                72
Image Width                 96
Number of action repeats    4
Reward Rescaling            0.3 · min(tanh(r), 0) + 5 · max(tanh(r), 0)
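A minimal illustration of the OAR transformation listed above (the function name is ours):

```python
import numpy as np

def optimistic_asymmetric_rescale(reward):
    """Squash the reward with tanh, then down-weight its negative part
    (factor 0.3) and up-weight its positive part (factor 5)."""
    squashed = np.tanh(reward)
    return 0.3 * min(squashed, 0.0) + 5.0 * max(squashed, 0.0)
```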

Table 5: Atari preprocessing. The standard Atari preprocessing is used in the Atari experiments. Since the introduction of DQN, these settings have become standard practice when training deep RL agents on Atari. Note, however, that we report experiments training agents both with and without reward clipping.

Hyperparameter                       Value
Image Height                         84
Image Width                          84
Grey scaling                         True
Max-pooling 2 consecutive frames     True
Frame Stacking                       4
End of episode on life loss          True
Reward Clipping (if used)            [-1, 1]
Number of action repeats             4
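For reference, a sketch of how the frame-level settings above could be applied; OpenCV is used here only for grey scaling and resizing, the function names are ours, and frame stacking, action repeats, and the life-loss signal would be handled by the environment loop.

```python
import numpy as np
import cv2

def preprocess_frame(frame, prev_frame):
    """Max-pool two consecutive raw RGB frames (to remove flicker),
    convert to grey scale, and resize to 84x84."""
    pooled = np.maximum(frame, prev_frame)
    grey = cv2.cvtColor(pooled, cv2.COLOR_RGB2GRAY)
    return cv2.resize(grey, (84, 84), interpolation=cv2.INTER_AREA)

def clip_reward(reward):
    """Reward clipping to [-1, 1], only used in the clipped variants."""
    return float(np.clip(reward, -1.0, 1.0))
```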

Table 6: Other agent hyperparameters: these hyperparameters are the same as those used by Espeholt et al.

Hyperparameter          Value
Unroll length           20 (Atari), 100 (DmLab)
Discount γ              0.99
Baseline loss weight    0.5
Batch size              32
Optimiser               RMSProp
RMSProp momentum        0.0


Network Architecture

Table 7: Network hyperparameters. The network architecture is described in detail in Espeholt et al. For completeness, we also report in the table below the complete specification of the network. Convolutional layers are specified according to the pattern (num layers, kernel size, stride).

Hyperparameter                   Value
Convolutional Stack
- Number of sections             3
- Channels per section           [16, 32, 32]
- Activation Function            ReLU
ResNet section
- Conv                           1 / 3x3 / 1
- Max-Pool                       1 / 3x3 / 2
- Conv                           2 / 3x3 / 1
- Skip                           Identity
- Conv                           2 / 3x3 / 1
- Skip                           Identity
Language preprocessing
- Word embeddings                20
- Sentence embedding             LSTM / 64
Fully connected layer            256
LSTM (DmLab-only)                256
Network Heads
- Value                          Linear
- Policy                         Linear + softmax
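The convolutional torso specified above can be sketched as follows; PyTorch is used purely for illustration, and the 'same'-style padding and the placement of the ReLUs are our assumptions rather than details taken from the table.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions (stride 1) with an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        y = self.conv1(torch.relu(x))
        y = self.conv2(torch.relu(y))
        return x + y                                   # skip identity

class ConvolutionalStack(nn.Module):
    """Three sections with [16, 32, 32] channels; each section is a 3x3
    convolution, a 3x3/stride-2 max-pool, and two residual blocks."""
    def __init__(self, in_channels=3, channels=(16, 32, 32)):
        super().__init__()
        layers = []
        for out_channels in channels:
            layers += [
                nn.Conv2d(in_channels, out_channels, 3, padding=1),
                nn.MaxPool2d(3, stride=2, padding=1),
                ResidualBlock(out_channels),
                ResidualBlock(out_channels),
            ]
            in_channels = out_channels
        self.stack = nn.Sequential(*layers)

    def forward(self, x):                              # x: (B, C, H, W)
        # Per the table, the stack is followed by a ReLU, a 256-unit fully
        # connected layer and, for DmLab, a 256-unit LSTM (not shown here).
        return torch.relu(self.stack(x))
```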

Population Based Training

Table 8: Population Based Training: we use PBT for tuning hyper-parameters, as described in Espeholt et al., with population size and fitness function as defined below.

Hyperparameter              Value
Population Size (Atari)     24
Population Size (DmLab)     8
Fitness                     Mean capped human normalised score (cap = 100)
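The fitness can be computed, for example, as follows; the per-task human and random reference scores and the argument names are ours.

```python
import numpy as np

def mean_capped_normalised_score(agent, human, random, cap=100.0):
    """Mean over tasks of the human normalised score (in %), with each
    task's contribution capped at `cap` (here 100)."""
    agent, human, random = (np.asarray(x, dtype=np.float64)
                            for x in (agent, human, random))
    normalised = 100.0 * (agent - random) / (human - random)
    return float(np.mean(np.minimum(normalised, cap)))
```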

Table 9: Hyperparameters tuned with population based training are listed below; note that these are the same as those used by all baseline agents we compare to, to ensure fair comparisons.

Hyperparameter      Distribution
Entropy cost        Log-uniform on [5e-5, 1e-2]
Learning rate       Log-uniform on [5e-6, 5e-3]
RMSProp epsilon     Categorical on [1e-1, 1e-3, 1e-5, 1e-7]
Max Grad Norm       Uniform on [10, 100]
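For illustration, initial hyper-parameter values for one population member could be sampled from these distributions as follows (function and key names are ours):

```python
import numpy as np

rng = np.random.default_rng()

def sample_initial_hyperparameters():
    """Draw one set of initial hyper-parameters from the distributions of Table 9."""
    return {
        # Log-uniform: sample uniformly in log-space, then exponentiate.
        "entropy_cost": float(np.exp(rng.uniform(np.log(5e-5), np.log(1e-2)))),
        "learning_rate": float(np.exp(rng.uniform(np.log(5e-6), np.log(5e-3)))),
        "rmsprop_epsilon": float(rng.choice([1e-1, 1e-3, 1e-5, 1e-7])),
        "max_grad_norm": float(rng.uniform(10.0, 100.0)),
    }
```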