Self-Tuning Deep Reinforcement Learning
Tom Zahavy 1 Zhongwen Xu 1 Vivek Veeriah 1 Matteo Hessel 1 Junhyuk Oh 1 Hado van Hasselt 1
David Silver 1 Satinder Singh 1 2
Abstract

Reinforcement learning (RL) algorithms often require expensive manual or automated hyperparameter searches in order to perform well on a new domain. This need is particularly acute in modern deep RL architectures which often incorporate many modules and multiple loss functions. In this paper, we take a step towards addressing this issue by using metagradients (Xu et al., 2018) to tune these hyperparameters via differentiable cross validation, whilst the agent interacts with and learns from the environment. We present the Self-Tuning Actor Critic (STAC) which uses this process to tune the hyperparameters of the usual loss function of the IMPALA actor critic agent (Espeholt et al., 2018), to learn the hyperparameters that define auxiliary loss functions, and to balance trade-offs in off-policy learning by introducing and adapting the hyperparameters of a novel leaky V-trace operator. The method is simple to use, sample efficient and does not require a significant increase in compute. Ablative studies show that the overall performance of STAC improves as we adapt more hyperparameters. When applied to 57 games on the Atari 2600 environment over 200 million frames our algorithm improves the median human normalized score of the baseline from 243% to 364%.
1. Introduction

Reinforcement Learning (RL) algorithms often have multiple hyperparameters that require careful tuning; this is especially true for modern deep RL architectures, which often incorporate many modules and many loss functions. Training a deep RL agent is thus typically performed in two nested optimization loops. In the inner training loop, we fix a set of hyperparameters, and optimize the agent parameters with respect to these fixed hyperparameters. In the outer (manual or automated) tuning loop, we search for good
1Work done at DeepMind, London. 2University of Michigan. Correspondence to: Tom Zahavy <[email protected]>.
hyperparameters, evaluating them either in terms of their inner loop performance, or in terms of their performance on some special validation data. Inner loop training is typically differentiable and can be efficiently performed using back propagation, while the optimization of the outer loop is typically performed via gradient-free optimization. Since only the hyperparameters (but not the agent policy itself) are transferred between outer loop iterations, we refer to this as a “multiple lifetime” approach. For example, random search for hyperparameters (Bergstra & Bengio, 2012) falls under this category, and population based training (Jaderberg et al., 2017) also shares many of its properties. The cost of relying on multiple lifetimes to tune hyperparameters is often not accounted for when the performance of algorithms is reported. The impact of this is mild when users can rely on hyperparameters established in the literature, but the cost of hyperparameter tuning across multiple lifetimes manifests itself when algorithms are applied to new domains.

This motivates a significant body of work on tuning hyperparameters online, within a single agent lifetime. Previous work in this area has often focused on solutions tailored to specific hyperparameters. For instance, Schaul et al. (2019) proposed a non-stationary bandit algorithm to adapt the exploration-exploitation trade-off. Mann et al. (2016) and White & White (2016) proposed algorithms to adapt λ (the eligibility trace coefficient). Rowland et al. (2019) introduced an α coefficient into V-trace to account for the variance-contraction trade-off in off-policy learning. In another line of work, metagradients were used to adapt the optimiser's parameters (Sutton, 1992; Snoek et al., 2012; Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2017; Young et al., 2018).
In this paper we build on the metagradient approach to tune hyperparameters. In previous work, online metagradients have been used to learn the discount factor or the λ coefficient (Xu et al., 2018), to discover intrinsic rewards (Zheng et al., 2018) and auxiliary tasks (Veeriah et al., 2019). This is achieved by representing an inner (training) loss function as a function of both the agent parameters and a set of hyperparameters. In the inner loop, the parameters of the agent are trained to minimize this inner loss function w.r.t the current values of the hyperparameters; in the outer loop, the hyperparameters are adapted via back propagation to minimize the outer (validation) loss.
arXiv:2002.12928v2 [stat.ML] 2 Mar 2020
It is perhaps surprising that we may choose to optimize a different loss function in the inner loop, instead of the outer loss we ultimately care about. However, this is not a new idea. Regularization, for example, is a technique that changes the objective in the inner loop to balance the bias-variance trade-off and avoid the risk of overfitting. In model-based RL, it was shown that the policy found using a smaller discount factor can actually be better than a policy learned with the true discount factor (Jiang et al., 2015). Auxiliary tasks (Jaderberg et al., 2016) are another example, where gradients are taken w.r.t unsupervised loss functions in order to improve the agent's representation. Finally, it is well known that, in order to maximize the long term cumulative reward efficiently (the objective of the outer loop), RL agents must explore, i.e., act according to a different objective in the inner loop (accounting, for instance, for uncertainty).
This paper makes the following contributions. First, we show that it is feasible to use metagradients to simultaneously tune many critical hyperparameters (controlling important trade-offs in a reinforcement learning agent), as long as they are differentiable w.r.t a validation/outer loss. Importantly, we show that this can be done online, within a single lifetime, requiring only 30% additional compute.
We demonstrate this by introducing two novel deep RL architectures that extend IMPALA (Espeholt et al., 2018), a distributed actor-critic, by adding additional components with many more new hyperparameters to be tuned. The first agent, referred to as a Self-Tuning Actor Critic (STAC), introduces a leaky V-trace operator that mixes importance sampling (IS) weights with truncated IS weights. The mixing coefficient in leaky V-trace is differentiable (unlike the original V-trace) but similarly balances the variance-contraction trade-off in off-policy learning. The second architecture is STACX (STAC with auXiliary tasks). Inspired by Fedus et al. (2019), STACX augments STAC with parametric auxiliary loss functions (each with its own hyperparameters).
These agents allow us to show empirically, through extensive ablation studies, that performance consistently improves as we expose more hyperparameters to metagradients. In particular, when applied to 57 Atari games (Bellemare et al., 2013), STACX achieved a normalized median score of 364%, a new state-of-the-art for online model-free agents.
2. Background

In the following, we consider three types of parameters:
1. θ – the agent parameters.
2. ζ – the hyperparameters.
3. η ⊂ ζ – the metaparameters.
θ denotes the parameters of the agent and parameterises, for example, the value function and the policy; these parameters are randomly initialised at the beginning of an agent's lifetime, and updated using back-propagation on a suitable inner loss function. ζ denotes the hyperparameters, including, for example, the parameters of the optimizer (e.g. the learning rate) or the parameters of the loss function (e.g. the discount factor); these may be tuned over the course of many lifetimes (for instance via random search) to optimize an outer (validation) loss function. In a typical deep RL setup, only these first two types of parameters need to be considered. In metagradient algorithms a third set of parameters must be specified: the metaparameters, denoted η; these are a subset of the differentiable parameters in ζ that start with some initial value (itself a hyperparameter), but that are then adapted during the course of training.
2.1. The metagradient approach
Metagradient RL (Xu et al., 2018) is a general framework for adapting, online, within a single lifetime, the differentiable hyperparameters η. Consider an inner loss that is a function of both the parameters θ and the metaparameters η: Linner(θ; η). On each step of an inner loop, θ can be optimized with a fixed η to minimize the inner loss Linner(θ; η):
θ(ηt) ≐ θt+1 = θt − ∇θLinner(θt; ηt).   (1)
In an outer loop, η can then be optimized to minimize theouter loss by taking a metagradient step. As θ(η) is a func-tion of η this corresponds to updating the η parameters bydifferentiating the outer loss w.r.t η,
ηt+1 = ηt −∇ηLouter(θ(ηt)). (2)
The algorithm is general, as it implements a specific case ofonline cross-validation, and can be applied, in principle, toany differentiable meta-parameter η used by the inner loss.
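As a concrete (toy) illustration of the two loops in Eqs. (1)-(2), the sketch below runs them on a 1-D problem with hand-derived gradients; the quadratic losses, learning rates, and variable names are our own illustrative assumptions, not the paper's agent.

```python
# Toy metagradient sketch. Inner loss: L_inner(theta; eta) = (theta - eta)^2,
# so the inner loop trains theta toward the metaparameter eta.
# Outer loss: L_outer(theta) = (theta - 2)^2, with 2 the "true" target that
# only the outer loop sees. All gradients are derived by hand.
INNER_LR, OUTER_LR = 0.1, 0.1

def inner_step(theta, eta):
    # Eq. (1): theta(eta) = theta - lr * dL_inner/dtheta, with eta held fixed.
    return theta - INNER_LR * 2.0 * (theta - eta)

def outer_step(theta, eta):
    # Eq. (2): differentiate L_outer(theta(eta)) w.r.t. eta through the
    # inner update, using hand-derived chain-rule terms.
    theta_new = inner_step(theta, eta)
    d_louter_d_theta = 2.0 * (theta_new - 2.0)  # outer-loss gradient at theta(eta)
    d_theta_d_eta = 2.0 * INNER_LR              # d theta(eta) / d eta
    return eta - OUTER_LR * d_louter_d_theta * d_theta_d_eta

theta, eta = 0.0, 0.0
for _ in range(2000):
    eta = outer_step(theta, eta)    # outer loop adapts the metaparameter
    theta = inner_step(theta, eta)  # inner loop trains the parameters
```

Because the outer loss is evaluated at θ(η), η is steered toward the value that makes inner-loop training land where the outer loss wants it; here both θ and η converge to 2.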
2.2. IMPALA
Specific instantiations of the metagradient RL framework require specification of the inner and outer loss functions. Since our agent builds on the IMPALA actor critic agent (Espeholt et al., 2018), we now provide a brief introduction.
IMPALA maintains a policy πθ(at|xt) and a value function vθ(xt) that are parameterized with parameters θ. The policy and the value function are trained via an actor-critic update with entropy regularization; such an update is often represented (with slight abuse of notation) as the gradient of the following pseudo-loss function
L(θ) = gv · (vs − Vθ(xs))² − gp · ρs log πθ(as|xs)(rs + γvs+1 − Vθ(xs)) − ge · Σa πθ(a|xs) log πθ(a|xs),   (3)
where gv, gp, ge are suitable loss coefficients. We refer to the policy that generates the data for these updates as the behaviour policy µ(at|xt). In the on-policy case, where µ(at|xt) = π(at|xt) and ρs = 1, vs is the n-step bootstrapped return vs = Σ_{t=s}^{s+n−1} γ^{t−s} rt + γ^n V(xs+n).
IMPALA uses a distributed actor-critic architecture that assigns copies of the policy parameters to multiple actors on different machines to achieve higher sample throughput. As a result, the target policy π on the learner machine can be several updates ahead of the actor's policy µ that generated the data used in an update. Such off-policy discrepancy can lead to biased updates, requiring the updates to be weighted with IS weights for stable learning. Specifically, IMPALA (Espeholt et al., 2018) uses truncated IS weights to balance the variance-contraction trade-off on these off-policy updates. This corresponds to instantiating Eq. (3) with
vs = V(xs) + Σ_{t=s}^{s+n−1} γ^{t−s} (Π_{i=s}^{t−1} ci) δtV,   (4)

where we define δtV = ρt(rt + γV(xt+1) − V(xt)) and we set ρt = min(ρ̄, π(at|xt)/µ(at|xt)), ci = λ min(c̄, π(ai|xi)/µ(ai|xi)), for suitable truncation levels ρ̄ and c̄.
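For concreteness, Eq. (4) can be computed with the standard backward recursion over the trace; the sketch below is our own illustrative helper (function name and toy inputs are assumptions, not the paper's code).

```python
import numpy as np

def vtrace_targets(values, rewards, is_ratios, gamma,
                   rho_bar=1.0, c_bar=1.0, lam=1.0):
    """Sketch of the V-trace targets of Eq. (4). `values` holds V(x_s) for
    s = 0..n (last entry is the bootstrap value); `is_ratios` holds pi/mu."""
    n = len(rewards)
    rho = np.minimum(rho_bar, is_ratios)      # truncated rho_t
    c = lam * np.minimum(c_bar, is_ratios)    # truncated trace weights c_i
    deltas = rho * (rewards + gamma * values[1:] - values[:-1])  # delta_t V
    vs = np.zeros(n)
    acc = 0.0
    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    for t in reversed(range(n)):
        acc = deltas[t] + gamma * c[t] * acc
        vs[t] = values[t] + acc
    return vs

# On-policy sanity check: with all ratios equal to 1, v_s reduces to the
# n-step bootstrapped return.
vs = vtrace_targets(np.zeros(3), np.array([1.0, 1.0]),
                    np.array([1.0, 1.0]), gamma=0.9)
```

With zero values, unit rewards and γ = 0.9, the targets are the plain discounted returns (1 + 0.9 and 1).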
2.3. Metagradient IMPALA
The metagradient agent in (Xu et al., 2018) uses the metagradient update rules from the previous section with the actor-critic loss function of Espeholt et al. (2018). More specifically, the inner loss is a parameterised version of the IMPALA loss with metaparameters η based on Eq. (3),
L(θ; η) = 0.5 · (vs − Vθ(xs))² − ρs log πθ(as|xs)(rs + γvs+1 − Vθ(xs)) − 0.01 · Σa πθ(a|xs) log πθ(a|xs),
where η = {γ, λ}. Notice that γ and λ also affect the innerloss through the definition of vs (Eq. (4)).
The outer loss is defined to be the policy gradient loss
Louter(η) = Louter(θ(η)) = −ρs log πθ(η)(as|xs)(rs + γvs+1 − Vθ(η)(xs)).
3. Self-Tuning actor-critic agents

We first consider a slightly extended version of the metagradient IMPALA agent (Xu et al., 2018). Specifically, we allow the metagradient to adapt learning rates for individual components in the loss,
L(θ; η) = gv · (vs − Vθ(xs))² − gp · ρs log πθ(as|xs)(rs + γvs+1 − Vθ(xs)) − ge · Σa πθ(a|xs) log πθ(a|xs).   (5)
The outer loss is defined using Eq. (3),
Louter(η) = Louter(θ(η))
=gouterv ·
(vs − Vθ(η)(xs)
)2
−gouterp · ρs log πθ(η)(as|xs)(rs + γvs+1 − Vθ(η)(xs))
−goutere ·
∑a
πθ(η)(as|xs) log πθ(η)(as|xs)
+gouterkl · KL
(πθ(η), πθ
). (6)
Notice the new Kullback–Leibler (KL) term in Eq. (6), which discourages the η-update from changing the policy too much.
Compared to the work by Xu et al. (2018), which only self-tuned the inner-loss γ and λ (hence η = {γ, λ}), also self-tuning the inner-loss coefficients corresponds to setting η = {γ, λ, gv, gp, ge}. These metaparameters allow for loss-specific learning rates and support dynamically balancing exploration with exploitation by adapting the entropy loss weight1. The hyperparameters of STAC include the initialisations of the metaparameters, the hyperparameters of the outer loss {γ^outer, λ^outer, gv^outer, gp^outer, ge^outer}, the KL coefficient gkl^outer (set to 1), and the learning rate of the ADAM meta-optimizer (set to 1e−3).
      IMPALA                  Self-Tuning IMPALA
θ     vθ, πθ                  vθ, πθ
ζ     {γ, λ, gv, gp, ge}      {γ^outer, λ^outer, gv^outer, gp^outer, ge^outer}, initialisations, ADAM parameters, gkl^outer
η     –                       {γ, λ, gv, gp, ge}

Table 1. Parameters in IMPALA and self-tuning IMPALA.
To set the initial values of the metaparameters of self-tuningIMPALA, we use a simple “rule of thumb” and set themto the values of the corresponding parameters in the outerloss (e.g., the initial value of the inner-loss λ is set equal tothe outer-loss λ). For outer-loss hyperparameters that arecommon to IMPALA we default to the IMPALA settings.
In the next two sections, we show how embracing self tuningvia metagradients enables us to augment this agent with aparameterised Leaky V-trace operator and with self-tunedauxiliary loss functions. These ideas are examples of howthe ability to self-tune metaparameters via metagradients canbe used to introduce novel ideas into RL algorithms withoutrequiring extensive tuning of the new hyperparameters.
1There are a few additional subtle differences between this self-tuning IMPALA agent and the metagradient agent from Xu et al. (2018). For example, we do not use the γ embedding used in (Xu et al., 2018). These differences are further discussed in the supplementary where we also reproduce the results of Xu et al. (2018) in our code base.
3.1. STAC
All the hyperparameters that we considered for self-tuning so far have the property that they appear explicitly in the definition of the loss function and can be directly differentiated. The truncation levels in the V-trace operator within IMPALA, on the other hand, are equivalent to applying a ReLU activation and are non-differentiable.
Motivated by the study of non-linear activations in Deep Learning (Xu et al., 2015), we now introduce an agent based on a variant of the V-trace operator that we call leaky V-trace. We will refer to this agent as the Self-Tuning Actor Critic (STAC). Leaky V-trace uses a leaky rectifier (Maas et al., 2013) to truncate the importance sampling weights, which allows for a small non-zero gradient when the unit is saturated. We show that the degree of leakiness can control certain trade-offs in off-policy learning, similarly to V-trace, but in a manner that is differentiable.
Before we introduce Leaky V-trace, let us first recall how the off-policy trade-offs are represented in V-trace using the coefficients ρ̄, c̄. The weight ρt = min(ρ̄, π(at|xt)/µ(at|xt)) appears in the definition of the temporal difference δtV and defines the fixed point of this update rule. The fixed point of this update is the value function V^{πρ̄} of the policy πρ̄ that is somewhere between the behaviour policy µ and the target policy π, controlled by the hyperparameter ρ̄,

πρ̄(a|x) = min(ρ̄µ(a|x), π(a|x)) / Σb min(ρ̄µ(b|x), π(b|x)).   (7)
The product of the weights cs, ..., ct−1 in Eq. (4) measures how much a temporal difference δtV observed at time t impacts the update of the value function. The truncation level c̄ is used to control the speed of convergence by trading off the update variance for a larger contraction rate, similar to Retrace(λ) (Munos et al., 2016). By clipping the importance weights, the variance associated with the update rule is reduced relative to importance-weighted returns. On the other hand, the clipping of the importance weights effectively cuts the traces in the update, resulting in the update placing less weight on later TD errors, and thus worsening the contraction rate of the corresponding operator.
Following this interpretation of the off-policy coefficients, we now propose a variation of V-trace which we call leaky V-trace, with new parameters αρ ≥ αc,
ISt = π(at|xt)/µ(at|xt),
ρt = αρ min(ρ̄, ISt) + (1 − αρ)ISt,
ci = λ(αc min(c̄, ISi) + (1 − αc)ISi),
vs = V(xs) + Σ_{t=s}^{s+n−1} γ^{t−s} (Π_{i=s}^{t−1} ci) δtV,
δtV = ρt(rt + γV(xt+1) − V(xt)).   (8)
We highlight that for αρ = 1, αc = 1, Leaky V-trace isexactly equivalent to V-trace, while for αρ = 0, αc = 0, itis equivalent to canonical importance sampling. For othervalues we get a mixture of the truncated and non truncatedimportance sampling weights.
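The weight computation in Eq. (8) is a one-liner per coefficient; the sketch below (our own helper, not the paper's code) makes the two limiting cases explicit.

```python
import numpy as np

def leaky_weights(is_ratios, rho_bar=1.0, c_bar=1.0, lam=1.0,
                  alpha_rho=1.0, alpha_c=1.0):
    # Eq. (8): mix truncated and untruncated importance-sampling weights.
    is_ratios = np.asarray(is_ratios, dtype=float)
    rho = alpha_rho * np.minimum(rho_bar, is_ratios) + (1.0 - alpha_rho) * is_ratios
    c = lam * (alpha_c * np.minimum(c_bar, is_ratios) + (1.0 - alpha_c) * is_ratios)
    return rho, c

ratios = np.array([0.5, 1.0, 3.0])
rho_v, c_v = leaky_weights(ratios, alpha_rho=1.0, alpha_c=1.0)    # V-trace limit
rho_is, c_is = leaky_weights(ratios, alpha_rho=0.0, alpha_c=0.0)  # pure IS limit
```

At α = 1 the ratio above the truncation level (3.0 here) is clipped to 1; at α = 0 the raw ratios pass through unchanged.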
Theorem 1 below suggests that Leaky V-trace is a contraction mapping, and that the value function it converges to is given by V^{πρ̄,αρ}, where

πρ̄,αρ(a|x) = [αρ min(ρ̄µ(a|x), π(a|x)) + (1 − αρ)π(a|x)] / [αρ Σb min(ρ̄µ(b|x), π(b|x)) + 1 − αρ],   (9)
is a policy that mixes (and then re-normalizes) the targetpolicy with the V-trace policy Eq. (7) 2. A more formalstatement of Theorem 1 and a detailed proof (which closelyfollows that of Espeholt et al. (2018) for the original v-traceoperator) can be found in the supplementary material.
Theorem 1. The leaky V-trace operator defined by Eq. (8) isa contraction operator and it converges to the value functionof the policy defined by Eq. (9).
Similar to ρ̄, the new parameter αρ controls the fixed point of the update rule, and defines a value function that interpolates between the value function of the target policy π and the behaviour policy µ. Specifically, the parameter αc allows the importance weights to “leak back”, creating the opposite effect to clipping.
Since Theorem 1 requires us to have αρ ≥ αc, our main STAC implementation parametrises the loss with a single parameter α = αρ = αc. In addition, we also experimented with a version of STAC that learns both αρ and αc. Quite interestingly, this variation of STAC learns the rule αρ ≥ αc on its own (see the experiments section for more details).
Note that low values of αc lead to importance samplingwhich is high contraction but high variance. On the otherhand, high values of αc lead to V-trace, which is lowercontraction and lower variance than importance sampling.Thus exposing αc to meta-learning enables STAC to directlycontrol the contraction/variance trade-off.
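This trade-off is easy to see numerically; the simulation below (a stand-in using log-normal importance ratios, not the paper's data) compares the variance of the trace weight at the two extremes of αc.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in importance ratios pi/mu: log-normal, with the heavy right tail
# typical of off-policy data. An illustrative assumption, not Atari data.
is_ratios = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

def leaky_c(alpha_c, c_bar=1.0):
    # The (lambda = 1) trace weight of leaky V-trace, Eq. (8).
    return alpha_c * np.minimum(c_bar, is_ratios) + (1.0 - alpha_c) * is_ratios

var_is = leaky_c(0.0).var()      # alpha_c = 0: raw IS weights, high variance
var_vtrace = leaky_c(1.0).var()  # alpha_c = 1: clipped weights, low variance
```

The clipped weights have far smaller variance because the tail above c̄ is removed, which is exactly the variance-for-contraction trade that αc lets STAC tune.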
In summary, the metaparameters for STAC are {γ, λ, gv, gp, ge, α}. To keep things simple, when using Leaky V-trace we make two simplifications w.r.t the hyperparameters. First, we use V-trace to initialise Leaky V-trace, i.e., we initialise α = 1. Second, we fix the outer loss to be V-trace, i.e. we set α^outer = 1.
2Note that α-trace (Rowland et al., 2019), another adaptive algorithm for off-policy learning, mixes the V-trace policy with the behaviour policy; Leaky V-trace mixes it with the target policy.
3.2. STAC with auxiliary tasks (STACX)
Next, we introduce a new agent that extends STAC with auxiliary policies, value functions, and respective auxiliary loss functions; this is new because the parameters that define the auxiliary tasks (the discount factors in this case) are self-tuned. As this agent has a new architecture in addition to an extended set of metaparameters, we give it a different acronym and denote it by STACX (STAC with auxiliary tasks). The auxiliary losses have the same parametric form as the main objective and can be used to regularize its objective and improve its representations.
Figure 1. Block diagrams of STAC and STACX.
STACX's architecture has a shared representation layer θshared, from which it splits into n different heads (Fig. 1). For the shared representation layer we use the deep residual net from (Espeholt et al., 2018). Each head has a policy and a corresponding value function that are represented using a 2-layer MLP with parameters {θi}_{i=1}^n. Each one of these heads is trained in the inner loop to minimize a loss function L(θi; ηi) parametrised by its own set of metaparameters ηi.
The policy of the STACX agent is defined to be the policy of a specific head (i = 1). The hyperparameters {ηi}_{i=1}^n are trained in the outer loop to improve the performance of this single head. Thus, the role of the auxiliary heads is to act as auxiliary tasks (Jaderberg et al., 2016) and improve the shared representation θshared. Finally, notice that each head has its own policy πi, but the behaviour policy is fixed to be π1. Thus, to optimize the auxiliary heads we use (Leaky) V-trace for off-policy corrections3.
3We also considered two extensions of this approach. (1) Random ensemble: The policy head is chosen at random from [1, .., n], and the hyperparameters are differentiated w.r.t the performance of each one of the heads in the outer loop. (2) Average ensemble: The actor policy is defined to be the average logits of the heads, and we learn one additional head for the value function of this policy. The metagradient in the outer loop is taken with respect to the actor policy, and/or, each one of the heads individually. While these extensions seem interesting, in all of our experiments they
The metaparameters for STACX are {γi, λi, gv^i, gp^i, ge^i, αi}_{i=1}^3. Since the outer loss is defined only w.r.t head 1, introducing the auxiliary tasks into STACX does not require new hyperparameters for the outer loss. In addition, we use the same initialisation values for all the auxiliary tasks. Thus, STACX has exactly the same hyperparameters as STAC.
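A minimal sketch of the head structure (shared torso, n = 3 heads, head 1 acting) may help; the layer sizes and the tiny linear/tanh stacks below are placeholders for the paper's residual network and 2-layer MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    # Placeholder stack of weight matrices (tanh between layers, linear last).
    return [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for w in params[:-1]:
        x = np.tanh(x @ w)
    return x @ params[-1]

N_HEADS, OBS, HID, ACTS = 3, 8, 16, 4
theta_shared = mlp([OBS, HID])  # shared torso (the paper uses a residual net)
# Each head outputs policy logits and a value, and has its own eta_i.
heads = [mlp([HID, HID, ACTS + 1]) for _ in range(N_HEADS)]

x = rng.normal(size=OBS)
h = forward(theta_shared, x)
outputs = [forward(head, h) for head in heads]
# Only head 1 (index 0) acts and defines the outer loss; heads 2..n are
# auxiliary tasks whose inner losses shape theta_shared.
policy_logits, value = outputs[0][:ACTS], outputs[0][ACTS]
```

The auxiliary heads never act; their gradients reach the behaviour policy only through the shared torso, which is why they serve purely as representation shaping.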
Figure 2. Median human-normalized scores during training for IMPALA, STAC, STACX, the metagradient agent of Xu et al., and the fixed-auxiliary-tasks baseline.
4. Experiments

In all of our experiments (with the exception of the robustness experiments in Section 4.3) we use the IMPALA hyperparameters both for the IMPALA baseline and for the outer loss of the STAC agent, i.e., λ^outer = 1, α^outer = 1, gv^outer = gp^outer = ge^outer = 1. We use γ = 0.995 as it was found to improve the performance of IMPALA considerably (Xu et al., 2018).
4.1. Atari learning curves
We start by evaluating STAC and STACX in the Arcade Learning Environment (Bellemare et al., 2013, ALE). Fig. 2 presents the normalized median scores4 during training. We found STACX to learn faster and achieve higher final performance than STAC. We also compare these agents with versions of them without self-tuning (fixing the metaparameters). The version of STACX with fixed unsupervised auxiliary tasks achieved a normalized median score of 247%, similar to that of UNREAL (Jaderberg et al., 2016) but not much better than IMPALA. In Fig. 3 we report the relative improvement of STACX over IMPALA in the individual levels (an equivalent figure for STAC may be found in the supplementary material).
always led to a small decrease in performance when compared toour auxiliary task agent without these extensions. Similar findingswere reported in (Fedus et al., 2019).
4Normalized median scores are computed as follows. For each Atari game, we compute the human normalized score after 200M frames of training and average this over 3 different seeds; we then report the overall median score over the 57 Atari domains.
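The footnote's score computation can be sketched as below; the helper name and the three-game raw scores are made up for illustration (the real computation averages 3 seeds over all 57 games).

```python
import numpy as np

def human_normalized_median(agent, random, human):
    # Per-game human-normalized score, then the median over games (footnote 4).
    agent, random, human = map(np.asarray, (agent, random, human))
    scores = (agent - random) / (human - random)
    return float(np.median(scores))

# Hypothetical raw scores for three games, for illustration only.
median = human_normalized_median(agent=[120.0, 50.0, 900.0],
                                 random=[10.0, 0.0, 100.0],
                                 human=[100.0, 200.0, 500.0])
```

A median of 1.0 would mean the agent matches the human on the middle game; values above 1 (e.g., 3.64 for STACX) mean super-human performance on at least half the games.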
Figure 3. Mean human-normalized scores after 200M frames: relative improvement in percent of STACX over IMPALA, shown per game for all 57 Atari games (bar chart omitted; x-axis spans 35% to 500%).
STACX achieved a median score of 364%, a new state-of-the-art result in the ALE benchmark for training online model-free agents for 200M frames. In fact, there are only two agents that reported better performance after 200M frames: LASER (Schmitt et al., 2019) achieved a normalized median score of 431% and MuZero (Schrittwieser et al., 2019) achieved 731%. These papers propose algorithmic modifications that are orthogonal to our approach and can be combined in future work; LASER combines IMPALA with a uniform large-scale experience replay; MuZero uses replay and a tree-based search with a learned model.
4.2. Ablative analysis
Next, we perform an ablative study of our approach by training different variations of STAC and STACX. The results are summarized in Fig. 4. In red, we can see different baselines5, and in green and blue, we can see different ablations of STAC and STACX respectively. In these ablations η corresponds to the subset of the hyperparameters that are being self-tuned, e.g., η = {γ} corresponds to self-tuning a single loss function where only γ is self-tuned (and the other hyperparameters are fixed).
Inspecting Fig. 4 we observe that the performance of STAC and STACX consistently improves as they self-tune more metaparameters. These metaparameters control different trade-offs in reinforcement learning: the discount factor controls the effective horizon, the loss coefficients affect learning rates, and the Leaky V-trace coefficient controls the variance-contraction-bias trade-off in off-policy RL.
We have also experimented with adding more auxiliary losses, i.e., having 5 or 8 auxiliary loss functions. These variations performed better than having a single loss function but slightly worse than having 3. This can be further explained by Fig. 9 (Section 4.4), which shows that the auxiliary heads are self-tuned to similar metaparameters.
Figure 4. Ablative studies of STAC (green) and STACX (blue) compared to several baselines (red). Median normalized scores across 57 Atari games after training for 200M frames. (Bar chart omitted; the bars cover the baselines, including Nature DQN at 95%, IMPALA at 243% and Xu et al., and ablations that self-tune successively larger subsets of η, with scores rising up to 364% for full STACX.)
Finally, we experimented with a few more versions of STACX. One variation allows STACX to self-tune both αρ and αc without enforcing the relation αρ ≥ αc. This version performed slightly worse than STACX (achieving a median score of 353%) and we further discuss it in Section 4.4 (Fig. 10). In another variation we self-tune α together with a single truncation parameter ρ̄ = c̄. This variation performed much worse, achieving a median score of 301%, which may be explained by ρ̄ not being differentiable.
5We further discuss reproducing the results of Xu et al. (2018)in the supplementary material.
Figure 5. The number of games (y-axis) in which ablative variations of STAC and STACX improve over the IMPALA baseline by at least x percent (x-axis: percentage of relative improvement).
In Fig. 5, we further summarize the relative improvement of ablative variations of STAC (green, yellow, and blue; the bottom three lines) and STACX (light blue and red; the top two lines) over the IMPALA baseline. For each value of x ∈ [0, 100] (the x-axis), we measure the number of games in which an ablative version of the STAC(X) agent is better than the IMPALA agent by at least x percent and subtract from it the number of games in which the IMPALA agent is better than STAC(X). Clearly, we can see that STACX (light blue) improves the performance of IMPALA by a large margin. Moreover, we observe that allowing the STAC(X) agents to self-tune more metaparameters consistently improves their performance in more games.
4.3. Robustness
It is important to note that our algorithm is not hyperparameter free. For instance, we still need to choose hyperparameter settings for the outer loss. Additionally, each hyperparameter in the inner loss that we expose to metagradients still requires an initialization (itself a hyperparameter). Therefore, in this section, we investigate the robustness of STACX to its hyperparameters.
We begin with the hyperparameters of the outer loss. In these experiments we compare the robustness of STACX with that of IMPALA in the following manner. For each hyperparameter (γ, gv) we select 5 perturbations. For STACX we perturb the hyperparameter in the outer loss (γ^outer, gv^outer) and for IMPALA we perturb the corresponding hyperparameter (γ, gv). We randomly selected 5 Atari levels and present the mean and standard deviation across 3 random seeds after 200M frames of training.
Fig. 6 presents the results for the discount factor. We cansee that overall (in 72% of the configurations measuredby mean), STACX indeed performs better than IMPALA.Similarly, Fig. 7 shows the robustness of STACX to thecritic weight (gv), where STACX improves over IMPALAin 80% of the configurations.
Figure 6. Robustness to the discount factor. Mean and confidence intervals (over 3 seeds), after 200M frames of training, on Chopper Command, Jamesbond, Defender, Beam Rider and Crazy Climber, for γ ∈ {0.99, 0.992, 0.995, 0.997, 0.999}. Left (blue) bars correspond to STACX and right (red) bars to IMPALA. STACX is better than IMPALA in 72% of the runs measured by mean.
Figure 7. Robustness to the critic weight gv. Mean and confidence intervals (over 3 seeds), after 200M frames of training, on the same five games, for gv ∈ {0.25, 0.35, 0.5, 0.75, 1}. Left (blue) bars correspond to STACX and right (red) bars to IMPALA. STACX is better than IMPALA in 80% of the runs measured by mean.
Next, we investigate the robustness of STACX to the initialisation of the metaparameters in Fig. 8. We selected values that are close to 1, as our design principle is to initialise the metaparameters to be similar to the hyperparameters in the outer loss. We observe that overall, the method is quite robust to different initialisations.
Figure 8. Robustness to the initialisation of the metaparameters. Mean and confidence intervals (over 6 seeds), after 200M frames of training. Columns correspond to different games (Chopper Command, Jamesbond, Defender, Beam Rider, Crazy Climber). Bottom: perturbing γ^init ∈ {0.99, 0.992, 0.995, 0.997, 0.999}. Top: perturbing all the metaparameter initialisations, i.e., setting all the hyperparameters {γ^init, λ^init, gv^init, ge^init, gp^init, α^init}_{i=1}^3 to a single fixed value in {0.99, 0.992, 0.995, 0.997, 0.999}.
4.4. Adaptivity
In Fig. 9 we visualize the metaparameters of STACX during training. As there are many metaparameters, seeds and levels, we restrict ourselves to a single seed (arbitrarily chosen to be seed 1) and a single game (Jamesbond). More examples can be found in the supplementary material. For each hyperparameter we plot the values associated with the three different heads, where the policy head (head number 1) is presented in blue and the auxiliary heads (2 and 3) are presented in orange and magenta.
Inspecting Fig. 9, we can see that the two auxiliary heads self-tuned their metaparameters to relatively similar values, but different from those of the main head. The discount factor of the main head, for example, converges to the value of the discount factor in the outer loss (0.995), while the discount factors of the auxiliary heads change quite a lot during training and learn about horizons that differ from that of the main head.
We also observe non-trivial behaviour in the self-tuning of the loss-function coefficients (ge, gp, gv), the λ coefficient, and the off-policy coefficient α. For instance, we found that at the beginning of training α is self-tuned to a high value (close to 1), making the update quite similar to V-trace; towards the end of training, STACX self-tunes it to lower values, which makes the update closer to importance sampling.
[Figure 9: per-head traces of γ, λ, g_p, g_v, g_e and α over 200M frames; head 1, head 2, head 3.]
Figure 9. Adaptivity in Jamesbond.
Finally, we also noticed an interesting behaviour in the version of STACX where we expose both the αρ and αc coefficients to self-tuning, without imposing αρ ≥ αc (Theorem 1). This variation of STACX achieved a median score of 353%. Quite interestingly, the metagradient discovered αρ ≥ αc on its own: it self-tunes αρ so that it is greater than or equal to αc 91.2% of the time (averaged over time, seeds, and levels), and so that αρ ≥ 0.99αc 99.2% of the time. Fig. 10 shows an example of this in Jamesbond.
[Figure 10: traces of αρ and αc for the three heads over 200M frames.]
Figure 10. Discovery of αρ ≥ αc.
5. Summary

In this work we demonstrated that it is feasible to use metagradients to simultaneously tune many critical hyperparameters (controlling important trade-offs in a reinforcement learning agent), as long as they are differentiable; we showed that this can be done online, within a single lifetime. We did so by presenting STAC and STACX, actor-critic algorithms that self-tune a large number of hyperparameters of a very different nature. We showed that the performance of these agents improves as they self-tune more hyperparameters, and we demonstrated that STAC and STACX are computationally efficient and robust to their own hyperparameters.
References

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1):281–305, 2012.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G., and Larochelle, H. Hyperbolic discounting and learning over multiple horizons. arXiv preprint arXiv:1902.06865, 2019.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, pp. 1165–1173, 2017.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181–1189, 2015.

Maas, A. L., Hannun, A. Y., and Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.

Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122, 2015.

Mann, T. A., Penedones, H., Mannor, S., and Hester, T. Adaptive lambda least-squares temporal difference learning. arXiv preprint arXiv:1612.09465, 2016.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.

Pedregosa, F. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737–746, 2016.

Rowland, M., Dabney, W., and Munos, R. Adaptive trade-offs in off-policy learning. arXiv preprint arXiv:1910.07478, 2019.

Schaul, T., Borsa, D., Ding, D., Szepesvari, D., Ostrovski, G., Dabney, W., and Osindero, S. Adapting behaviour for learning progress. arXiv preprint arXiv:1912.06910, 2019.

Schmitt, S., Hessel, M., and Simonyan, K. Off-policy actor-critic with shared experience replay. arXiv preprint arXiv:1909.11583, 2019.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.

Sutton, R. S. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pp. 171–176, 1992.

Veeriah, V., Hessel, M., Xu, Z., Rajendran, J., Lewis, R. L., Oh, J., van Hasselt, H. P., Silver, D., and Singh, S. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pp. 9306–9317, 2019.

White, M. and White, A. A greedy approach to adapting the trace parameter for temporal difference learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 557–565, 2016.

Xu, B., Wang, N., Chen, T., and Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.

Xu, Z., van Hasselt, H. P., and Silver, D. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396–2407, 2018.

Young, K., Wang, B., and Taylor, M. E. Metatrace: Online step-size tuning by meta-gradient descent for reinforcement learning control. arXiv preprint arXiv:1805.04514, 2018.

Zheng, Z., Oh, J., and Singh, S. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 4644–4654, 2018.
6. Additional results
[Figure 11: per-game bars of relative human-normalized score, on a scale from 35% to 500%, across all 57 Atari games (from Skiing at the low end to Jamesbond at the high end).]
Figure 11. Mean human-normalized scores after 200M frames: relative improvement, in percent, of STAC over the IMPALA baseline.
7. Analysis of Leaky V-trace

Define the Leaky V-trace operator $\mathcal{R}$:

$$\mathcal{R}V(x) = V(x) + \mathbb{E}_\mu\left[\sum_{t\ge 0}\gamma^t\left(\Pi_{i=0}^{t-1}c_i\right)\rho_t\left(r_t+\gamma V(x_{t+1})-V(x_t)\right)\,\Big|\,x_0=x,\mu\right], \qquad (10)$$

where the expectation $\mathbb{E}_\mu$ is with respect to the behaviour policy $\mu$ which has generated the trajectory $(x_t)_{t\ge 0}$, i.e., $x_0=x$, $x_{t+1}\sim P(\cdot|x_t,a_t)$, $a_t\sim\mu(\cdot|x_t)$. Similar to (Espeholt et al., 2018), we consider the infinite-horizon operator, but very similar results hold for the n-step truncated operator.
Let

$$\mathrm{IS}(x_t) = \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}$$

be the importance sampling weights, let

$$\bar\rho_t = \min(\bar\rho, \mathrm{IS}(x_t)), \qquad \bar c_t = \min(\bar c, \mathrm{IS}(x_t))$$

be truncated importance sampling weights with $\bar\rho \ge \bar c$, and let

$$\rho_t = \alpha_\rho \bar\rho_t + (1-\alpha_\rho)\,\mathrm{IS}(x_t), \qquad c_t = \alpha_c \bar c_t + (1-\alpha_c)\,\mathrm{IS}(x_t)$$

be the Leaky importance sampling weights, with leaky coefficients $\alpha_\rho \ge \alpha_c$.
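The leaky weights are a convex mixture of the truncated and the plain importance sampling weights. The sketch below is a minimal NumPy illustration of this definition (variable names and the example distributions are our own, not from the paper's code; ρ̄ = c̄ = 1 as in the usual V-trace defaults):

```python
import numpy as np

def leaky_weights(pi, mu, rho_bar=1.0, c_bar=1.0, alpha_rho=1.0, alpha_c=1.0):
    """Leaky V-trace weights: mix truncated and plain IS weights."""
    is_w = pi / mu                                           # IS(x_t) = pi(a_t|x_t) / mu(a_t|x_t)
    rho = alpha_rho * np.minimum(rho_bar, is_w) + (1 - alpha_rho) * is_w
    c = alpha_c * np.minimum(c_bar, is_w) + (1 - alpha_c) * is_w
    return rho, c

# Toy action probabilities under the target (pi) and behaviour (mu) policies.
pi = np.array([0.5, 0.1, 0.9])
mu = np.array([0.25, 0.5, 0.3])
rho1, _ = leaky_weights(pi, mu, alpha_rho=1.0, alpha_c=1.0)  # alpha = 1: V-trace (clipped)
rho0, _ = leaky_weights(pi, mu, alpha_rho=0.0, alpha_c=0.0)  # alpha = 0: plain IS
```

Setting α = 1 recovers the truncated V-trace weights, while α = 0 leaks the full importance sampling ratio through, exactly the interpolation used by STACX.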
Theorem 2 (Restatement of Theorem 1). Assume that there exists $\beta\in(0,1]$ such that $\mathbb{E}_\mu \bar\rho_0 \ge \beta$. Then the operator $\mathcal{R}$ defined by Eq. (10) has a unique fixed point $V^{\pi_{\bar\rho,\alpha_\rho}}$, which is the value function of the policy $\pi_{\bar\rho,\alpha_\rho}$ defined by

$$\pi_{\bar\rho,\alpha_\rho}(a|x) = \frac{\alpha_\rho \min\left(\bar\rho\,\mu(a|x), \pi(a|x)\right) + (1-\alpha_\rho)\pi(a|x)}{\sum_b \alpha_\rho \min\left(\bar\rho\,\mu(b|x), \pi(b|x)\right) + (1-\alpha_\rho)\pi(b|x)}.$$

Furthermore, $\mathcal{R}$ is an $\eta$-contraction mapping in sup-norm, with

$$\eta = \gamma^{-1} - (\gamma^{-1}-1)\,\mathbb{E}_\mu\left[\sum_{t\ge 0}\gamma^t\left(\Pi_{i=0}^{t-2}c_i\right)\rho_{t-1}\right] \le 1 - (1-\gamma)\left(\alpha_\rho\beta + 1 - \alpha_\rho\right) < 1,$$

where $c_{-1}=1$, $\rho_{-1}=1$, and $\Pi_{s=0}^{t-2}c_s = 1$ for $t=0,1$.
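The bound on the contraction modulus interpolates linearly in α_ρ between the V-trace rate and γ. A quick numeric check of this bound (not part of the proof; the values of γ and β are arbitrary):

```python
import numpy as np

def eta_bound(gamma, beta, alpha_rho):
    # Upper bound on the contraction modulus of Leaky V-trace (Theorem 1).
    return 1.0 - (1.0 - gamma) * (alpha_rho * beta + 1.0 - alpha_rho)

gamma, beta = 0.99, 0.5
# alpha_rho = 1 recovers the V-trace rate 1 - (1 - gamma) * beta ...
assert np.isclose(eta_bound(gamma, beta, 1.0), 1.0 - (1.0 - gamma) * beta)
# ... and alpha_rho = 0 recovers eta = gamma.
assert np.isclose(eta_bound(gamma, beta, 0.0), gamma)
# The bound shrinks (contraction improves) as alpha_rho decreases.
alphas = np.linspace(0.0, 1.0, 5)
assert np.all(np.diff(eta_bound(gamma, beta, alphas)) >= 0)
```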
Proof. The proof follows the proof of V-trace from (Espeholt et al., 2018), with adaptations for the leaky V-trace coefficients. We have that

$$\mathcal{R}V_1(x) - \mathcal{R}V_2(x) = \mathbb{E}_\mu\sum_{t\ge 0}\gamma^t\left(\Pi_{s=0}^{t-2}c_s\right)\left[\rho_{t-1} - c_{t-1}\rho_t\right]\left(V_1(x_t) - V_2(x_t)\right).$$

Denote by $\kappa_t = \rho_{t-1} - c_{t-1}\rho_t$, and notice that

$$\mathbb{E}_\mu \rho_t = \alpha_\rho\,\mathbb{E}_\mu \bar\rho_t + (1-\alpha_\rho)\,\mathbb{E}_\mu \mathrm{IS}(x_t) \le 1,$$

since $\mathbb{E}_\mu \mathrm{IS}(x_t) = 1$ and $\mathbb{E}_\mu \bar\rho_t \le 1$. Furthermore, since $\bar\rho \ge \bar c$ and $\alpha_\rho \ge \alpha_c$, we have that $\rho_t \ge c_t$ for all $t$. Thus, the coefficients $\kappa_t$ are non-negative in expectation, since

$$\mathbb{E}_\mu \kappa_t = \mathbb{E}_\mu\left[\rho_{t-1} - c_{t-1}\rho_t\right] \ge \mathbb{E}_\mu\left[c_{t-1}(1-\rho_t)\right] \ge 0.$$

Thus, $\mathcal{R}V_1(x) - \mathcal{R}V_2(x)$ is a linear combination of the values $V_1 - V_2$ at the other states, weighted by non-negative coefficients whose sum is
$$\begin{aligned}
\sum_{t\ge 0}\gamma^t\,\mathbb{E}_\mu\left(\Pi_{s=0}^{t-2}c_s\right)\left[\rho_{t-1} - c_{t-1}\rho_t\right]
&= \sum_{t\ge 0}\gamma^t\,\mathbb{E}_\mu\left(\Pi_{s=0}^{t-2}c_s\right)\rho_{t-1} - \sum_{t\ge 0}\gamma^t\,\mathbb{E}_\mu\left(\Pi_{s=0}^{t-1}c_s\right)\rho_t \\
&= \gamma^{-1} - (\gamma^{-1}-1)\sum_{t\ge 0}\gamma^t\,\mathbb{E}_\mu\left(\Pi_{s=0}^{t-2}c_s\right)\rho_{t-1} \\
&\le \gamma^{-1} - (\gamma^{-1}-1)\left(1 + \gamma\,\mathbb{E}_\mu\rho_0\right) \qquad (11) \\
&= 1 - (1-\gamma)\,\mathbb{E}_\mu\rho_0 \\
&= 1 - (1-\gamma)\,\mathbb{E}_\mu\left(\alpha_\rho\bar\rho_0 + (1-\alpha_\rho)\,\mathrm{IS}(x_0)\right) \\
&\le 1 - (1-\gamma)\left(\alpha_\rho\beta + 1 - \alpha_\rho\right) < 1, \qquad (12)
\end{aligned}$$

where Eq. (11) holds since we expanded only the first two elements of the sum, and all the elements of this sum are non-negative, and Eq. (12) holds by the assumption.
We deduce that $\|\mathcal{R}V_1 - \mathcal{R}V_2\|_\infty \le \eta\,\|V_1 - V_2\|_\infty$, with $\eta = 1 - (1-\gamma)(\alpha_\rho\beta + 1 - \alpha_\rho) < 1$, so $\mathcal{R}$ is a contraction mapping. Furthermore, we can see that the parameter $\alpha_\rho$ controls the contraction rate: for $\alpha_\rho = 1$ we get the contraction rate of V-trace, $\eta = 1 - (1-\gamma)\beta$, and as $\alpha_\rho$ gets smaller we get a better contraction, with $\alpha_\rho = 0$ giving $\eta = \gamma$.
Thus $\mathcal{R}$ possesses a unique fixed point. Let us now prove that this fixed point is $V^{\pi_{\bar\rho,\alpha_\rho}}$, where

$$\pi_{\bar\rho,\alpha_\rho}(a|x) = \frac{\alpha_\rho \min\left(\bar\rho\,\mu(a|x), \pi(a|x)\right) + (1-\alpha_\rho)\pi(a|x)}{\sum_b \alpha_\rho \min\left(\bar\rho\,\mu(b|x), \pi(b|x)\right) + (1-\alpha_\rho)\pi(b|x)} \qquad (13)$$

is a policy that mixes the target policy with the V-trace policy.
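The mixture policy in Eq. (13) can be sanity-checked numerically. The sketch below (toy distributions of our own choosing) confirms that it normalises and that α_ρ = 0 recovers the target policy π itself:

```python
import numpy as np

def mixture_policy(pi, mu, rho_bar=1.0, alpha_rho=1.0):
    """Policy evaluated by Leaky V-trace (Eq. 13): mixes pi with the V-trace policy."""
    unnorm = alpha_rho * np.minimum(rho_bar * mu, pi) + (1 - alpha_rho) * pi
    return unnorm / unnorm.sum()

# Toy target and behaviour distributions over three actions.
pi = np.array([0.7, 0.2, 0.1])
mu = np.array([0.2, 0.4, 0.4])
for a in [0.0, 0.3, 1.0]:
    assert np.isclose(mixture_policy(pi, mu, alpha_rho=a).sum(), 1.0)
# alpha_rho = 0: plain importance sampling evaluates the target policy pi.
assert np.allclose(mixture_policy(pi, mu, alpha_rho=0.0), pi)
```

At α_ρ = 1 the expression reduces to the V-trace policy (the clipped, renormalised target), so the fixed-point bias shrinks as α_ρ decreases, mirroring the contraction-rate trade-off above.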
We have:

$$\begin{aligned}
&\mathbb{E}_\mu\left[\rho_t\left(r_t + \gamma V^{\pi_{\bar\rho,\alpha_\rho}}(x_{t+1}) - V^{\pi_{\bar\rho,\alpha_\rho}}(x_t)\right)\,\big|\,x_t\right] \\
&= \mathbb{E}_\mu\left(\alpha_\rho\bar\rho_t + (1-\alpha_\rho)\,\mathrm{IS}(x_t)\right)\left(r_t + \gamma V^{\pi_{\bar\rho,\alpha_\rho}}(x_{t+1}) - V^{\pi_{\bar\rho,\alpha_\rho}}(x_t)\right) \\
&= \sum_a \mu(a|x_t)\left(\alpha_\rho\min\left(\bar\rho, \frac{\pi(a|x_t)}{\mu(a|x_t)}\right) + (1-\alpha_\rho)\frac{\pi(a|x_t)}{\mu(a|x_t)}\right)\left(r_t + \gamma V^{\pi_{\bar\rho,\alpha_\rho}}(x_{t+1}) - V^{\pi_{\bar\rho,\alpha_\rho}}(x_t)\right) \\
&= \sum_a \left(\alpha_\rho\min\left(\bar\rho\,\mu(a|x_t), \pi(a|x_t)\right) + (1-\alpha_\rho)\pi(a|x_t)\right)\left(r_t + \gamma V^{\pi_{\bar\rho,\alpha_\rho}}(x_{t+1}) - V^{\pi_{\bar\rho,\alpha_\rho}}(x_t)\right) \\
&= \sum_a \pi_{\bar\rho,\alpha_\rho}(a|x_t)\left(r_t + \gamma V^{\pi_{\bar\rho,\alpha_\rho}}(x_{t+1}) - V^{\pi_{\bar\rho,\alpha_\rho}}(x_t)\right)\cdot\sum_b\left(\alpha_\rho\min\left(\bar\rho\,\mu(b|x_t), \pi(b|x_t)\right) + (1-\alpha_\rho)\pi(b|x_t)\right) = 0,
\end{aligned}$$

where the last equality holds since the left factor (up to the summation over $b$) equals zero, as this is the Bellman equation for $V^{\pi_{\bar\rho,\alpha_\rho}}$. We deduce that $\mathcal{R}V^{\pi_{\bar\rho,\alpha_\rho}} = V^{\pi_{\bar\rho,\alpha_\rho}}$; thus, $V^{\pi_{\bar\rho,\alpha_\rho}}$ is the unique fixed point of $\mathcal{R}$.
8. Reproducibility

Inspecting the results in Fig. 4, one may notice that there are small differences between the results of IMPALA, and of using metagradients to tune only λ and γ, compared to the results that were reported in (Xu et al., 2018).
We investigated the possible reasons for these differences. First, our method was implemented in a different code base: our code is written in JAX, whereas the implementation in (Xu et al., 2018) was written in TensorFlow. This may explain the small difference in final performance between our IMPALA baseline (Fig. 4) and the result of Xu et al., which is slightly higher (257.1).
Second, Xu et al. observed that embedding the γ hyperparameter into the θ network improved their results significantly, reaching a final performance (when learning γ, λ) of 287.7 (see section 1.4 in (Xu et al., 2018) for more details). Our method, on the other hand, only achieved a score of 240 in this ablative study. We further investigated this difference by introducing the γ embedding into our architecture. With γ embedding, our method achieved a score of 280.6, which almost reproduces the results in (Xu et al., 2018). We then introduced the same embedding mechanism to our model with auxiliary loss functions; in this case, for auxiliary loss i we embed γi. We experimented with two variants, one that shares the embedding weights across the auxiliary tasks and one that learns a specific embedding for each auxiliary task. Both of these variants performed similarly (306.8 and 307.7, respectively), which is better than our previous result with embedding and without auxiliary losses (280.6). However, the architecture with auxiliary losses actually performed better without the embedding (353.4), and we therefore ended up not using the γ embedding in our architecture. We leave it to future work to further investigate methods of combining the embedding mechanisms with the auxiliary loss functions.
9. Individual level learning curves
[Figure 12: per-game, per-seed learning curves of the metaparameters (γ, λ, g_p, g_v, g_e, α) and reward over 200M frames, for alien, amidar, assault, asterix, asteroids and atlantis (seeds 1–3).]
Figure 12. Metaparameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.
[Figure 12 (continued): per-game, per-seed learning curves for bank_heist, battle_zone, beam_rider, berzerk, bowling and boxing (seeds 1–3).]
0.986
0.988
0.990
0.992
0.994
0.996
0.998
boxi
ng, s
eed=
3
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.990
0.992
0.994
0.996
0.998
1.000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.990
0.992
0.994
0.996
0.998
1.000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.990
0.992
0.994
0.996
0.998
1.000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.988
0.990
0.992
0.994
0.996
0.998
1.000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.5 1.0 1.5 2.01e8
2
0
2
4
6
Figure 13. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads,blue is the main (policy) head.
Self-Tuning Deep Reinforcement Learning
[Figure 14. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head. Panels plot the meta-parameters (including gp, gv, ge) and episode reward over 2e8 frames for breakout, centipede, chopper_command, crazy_climber, defender, and demon_attack, seeds 1–3.]
[Figure 15. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head. Panels plot the meta-parameters (including gp, gv, ge) and episode reward over 2e8 frames for double_dunk, enduro, fishing_derby, freeway, frostbite, and gopher, seeds 1–3.]
[Figure 16. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head. Panels plot the meta-parameters (including gp, gv, ge) and episode reward over 2e8 frames for gravitar, hero, ice_hockey, jamesbond, kangaroo, and krull, seeds 1–3.]
Self-Tuning Deep Reinforcement Learning
[Plot panels for kung_fu_master, montezuma_revenge, ms_pacman, name_this_game, phoenix, and pitfall (seeds 1–3 each): per-head meta-parameters (including gp, gv, ge) and the reward curve over 0–2.0×10^8 frames; axis-tick residue omitted.]
Figure 17. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.
[Plot panels for pong, private_eye, qbert, riverraid, road_runner, and robotank (seeds 1–3 each): per-head meta-parameters (including gp, gv, ge) and the reward curve over 0–2.0×10^8 frames; axis-tick residue omitted.]
Figure 18. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.
[Plot panels for seaquest, skiing, solaris, space_invaders, star_gunner, and surround (seeds 1–3 each): per-head meta-parameters (including gp, gv, ge) and the reward curve over 0–2.0×10^8 frames; axis-tick residue omitted.]
Figure 19. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.
[Plot panels for tennis, time_pilot, and tutankham (seeds 1–3 each) and up_n_down (seed 1), continuing the per-game meta-parameter and reward curves over 0–2.0×10^8 frames; axis-tick residue omitted. The figure continues beyond this excerpt.]
0.994
0.996
0.998
1.000
up_n
_dow
n, se
ed=2
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.986
0.988
0.990
0.992
0.994
0.996
0.998
1.000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.0
0.2
0.4
0.6
0.8
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.980
0.985
0.990
0.995
1.000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.5 1.0 1.5 2.01e8
0
50000
100000
150000
200000
250000
300000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.986
0.988
0.990
0.992
0.994
0.996
0.998
1.000
up_n
_dow
n, se
ed=3
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.5
0.6
0.7
0.8
0.9
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.0
0.2
0.4
0.6
0.8
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.0
0.2
0.4
0.6
0.8
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.2
0.4
0.6
0.8
1.0
0.0 0.5 1.0 1.5 2.01e8
0
50000
100000
150000
200000
250000
300000
350000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.986
0.988
0.990
0.992
0.994
0.996
vent
ure,
seed
=1
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.990
0.992
0.994
0.996
0.998
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.990
0.991
0.992
0.993
0.994
0.995
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.990
0.991
0.992
0.993
0.994
0.995
0.996
0.997
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.9896
0.9897
0.9898
0.9899
0.9900
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.9896
0.9898
0.9900
0.9902
0.9904
0.0 0.5 1.0 1.5 2.01e8
0.04
0.02
0.00
0.02
0.04
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.986
0.988
0.990
0.992
0.994
0.996
0.998
vent
ure,
seed
=2
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.988
0.990
0.992
0.994
0.996
0.998
1.000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.990
0.992
0.994
0.996
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.990
0.992
0.994
0.996
0.998
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.98950.98960.98970.98980.98990.99000.99010.99020.9903
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.9901
0.9902
0.9903
0.9904
0.0 0.5 1.0 1.5 2.01e8
0.04
0.02
0.00
0.02
0.04
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.986
0.988
0.990
0.992
0.994
0.996
0.998
vent
ure,
seed
=3
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.990
0.992
0.994
0.996
0.998
1.000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.988
0.990
0.992
0.994
0.996
0.998
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.990
0.992
0.994
0.996
0.998
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.9896
0.9897
0.9898
0.9899
0.9900
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.9901
0.9902
0.9903
0.9904
0.9905
0.0 0.5 1.0 1.5 2.01e8
0.04
0.02
0.00
0.02
0.04
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.825
0.850
0.875
0.900
0.925
0.950
0.975
1.000
vide
o_pi
nbal
l, se
ed=1
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.970
0.975
0.980
0.985
0.990
0.995
1.000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.0
0.2
0.4
0.6
0.8
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.5
0.6
0.7
0.8
0.9
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.0
0.2
0.4
0.6
0.8
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.975
0.980
0.985
0.990
0.995
1.000
0.0 0.5 1.0 1.5 2.01e8
0
100000
200000
300000
400000
500000
600000
700000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.95
0.96
0.97
0.98
0.99
vide
o_pi
nbal
l, se
ed=2
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.960
0.965
0.970
0.975
0.980
0.985
0.990
0.995
1.000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.0
0.2
0.4
0.6
0.8
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.98000.98250.98500.98750.99000.99250.99500.99751.0000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.0
0.2
0.4
0.6
0.8
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.90
0.92
0.94
0.96
0.98
1.00
0.0 0.5 1.0 1.5 2.01e8
0
100000
200000
300000
400000
500000
600000
700000
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.960
0.965
0.970
0.975
0.980
0.985
0.990
0.995
vide
o_pi
nbal
l, se
ed=3
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.92
0.94
0.96
0.98
1.00
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.0
0.2
0.4
0.6
0.8
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.75
0.80
0.85
0.90
0.95
1.00
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.0
0.2
0.4
0.6
0.8
1.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0.90
0.92
0.94
0.96
0.98
1.00
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.001e8
0
100000
200000
300000
400000
500000
600000
700000
Figure 20. Meta-parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.
[Plot data for Figure 21 omitted in this extraction. The figure contains one row of panels per game and seed — wizard_of_wor, yars_revenge, and zaxxon, each with seeds 1–3 — showing the meta-parameter traces (columns labeled gp, gv, and ge) and the episode reward over 2×10⁸ environment frames.]
Figure 21. Meta-parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.