
SUBMITTED TO IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient

Samuele Tosatto, João Carvalho and Jan Peters

Abstract—Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency, as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods either suffer from high bias or high variance, often delivering unreliable estimates. The price of inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited, and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation, which can be solved in closed form. The solution is differentiable w.r.t. the policy parameters and gives access to an estimation of the policy gradient. In this way, we avoid the high variance of importance sampling approaches, and the high bias of semi-gradient methods. We empirically analyze the quality of our gradient estimate against state-of-the-art methods, and show that it outperforms the baselines in terms of sample efficiency on classical control tasks.

Index Terms—Reinforcement Learning, Policy Gradient, Nonparametric Estimation.


1 INTRODUCTION

REINFORCEMENT LEARNING has made overwhelming progress in recent years, especially when applied to board and computer games, or simulated tasks [1]–[3]. However, in comparison, only little improvement has been achieved on real-world tasks. One of the reasons for this gap is that the vast majority of reinforcement learning approaches are on-policy. On-policy algorithms require that the samples are collected using the optimization policy; this implies that a) there is little control over the environment and b) samples must be discarded after each policy improvement, causing high sample inefficiency. In contrast, off-policy techniques are theoretically more sample efficient, because they decouple the procedures of data acquisition and policy update, allowing for the possibility of sample reuse. Furthermore, off-policy estimation enables offline (or batch) reinforcement learning, in which the algorithm extracts the optimal policy from a fixed dataset. This property is crucial in many real-world applications, since it allows a decoupled data-acquisition process and, subsequently, a safer interaction with the environment. However, classical off-policy algorithms like Q-learning with function approximation and its offline version, fitted Q-iteration [4], [5], are not guaranteed to converge [6], [7], and allow only discrete actions. More recent semi-gradient¹ off-policy techniques, like Off-Policy Actor-Critic (Off-PAC) [9], Deep Deterministic Policy Gradient (DDPG) [10], [11], and Soft Actor-Critic (SAC) [2], often perform sub-optimally, especially when the collected data is strongly off-policy, due to the biased semi-gradient update [12]. In recent years, offline techniques like Conservative Q-Learning (CQL) [13], Bootstrapping Error Accumulation

• Samuele Tosatto is with the Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada. E-mail: [email protected].

• João Carvalho and Jan Peters are with the Technische Universität Darmstadt, Darmstadt, Germany, FG Intelligent Autonomous Systems. E-mail: {name}.{surname}@tu-darmstadt.de

1. We adopt the terminology from [8].

Fig. 1: In the off-policy reinforcement learning scheme, the policy can be optimized using an off-policy dataset. This allows for safer interaction with the system and for better sample efficiency.

Reduction (BEAR) [14] and Behavior Regularized Actor Critic (BRAC) [15], give particular treatment to the out-of-distribution (OOD) policy improvement, while still relying on the semi-gradient estimate. Another class of off-policy gradient estimators uses importance sampling [16]–[18] to deliver an unbiased estimate of the gradient, but suffers from high variance and is generally only applicable with stochastic policies. Moreover, these algorithms require full knowledge of the behavioral policy, making them unsuitable when the data stems from a human demonstrator. On the other hand, model-based approaches aim to approximate the environment's dynamics, allowing the generation of synthetic samples. The generative model can, in principle, be used by an on-policy reinforcement learning algorithm, therefore circumventing the difficulty of off-policy estimation. However, the generation of synthetic samples is also problematic, as they are affected by the error of the approximated dynamics. To mitigate this problem, state-of-the-art techniques aim to quantify and penalize the



most uncertain regions of the state-action space, like Model-based Offline Policy Optimization (MOPO) [19] and Model-based Offline Reinforcement Learning (MOReL) [20]. To address all previously highlighted issues, we propose a new algorithm, the Nonparametric Off-policy Policy Gradient (NOPG) [21]. NOPG constructs a nonparametric model of the dynamics and the reward signal, and builds a full-gradient estimate based on the closed-form solution of a nonparametric Bellman equation. Our approach, in contrast to the majority of model-based approaches, does not require the generation of artificial data. On the other hand, while model-free approaches are built on either semi-gradient estimation or importance sampling, NOPG computes a full-gradient estimate without importance sampling estimators, and allows for the use of human demonstrations. Figure 1 shows the offline scheme of NOPG. A behavioral policy, represented by a human demonstrator, provides (possibly suboptimal) trajectories that solve a task. NOPG optimizes a policy from offline and off-policy samples.

Contribution. This paper introduces a nonparametric Bellman equation and the full gradient derived from its closed-form solution. We study the properties of the nonparametric Bellman equation, focusing on the bias of its closed-form solution. We empirically analyze the bias and the variance of the proposed gradient estimator. We compare its effectiveness and efficiency w.r.t. state-of-the-art online and offline techniques, observing that NOPG exhibits high sample efficiency.

2 PROBLEM STATEMENT

Consider the reinforcement learning problem of an agent interacting with a given environment, as abstracted by a Markov decision process (MDP) defined over the tuple (S, A, γ, P, R, µ0), where S ≡ R^{d_s} is the state space and A ≡ R^{d_a} the action space. The transition-based discount factor γ is a stochastic mapping from S × A × S to [0, 1), which allows for the unification of episodic and continuing tasks [22]. The discount factor allows the transformation of a particular MDP into an equivalent MDP where, at each state-action pair, there is a 1 − γ probability of transitioning to an absorbing state with zero reward. A variable discount factor can, therefore, be interpreted as a variable probability of transitioning to an absorbing state. To keep the theory simple, however, we assume that γ(s, a, s′) ≤ γc with γc < 1. The transition probability from a state s to s′ given an action a is governed by the conditional density p(s′|s, a). The stochastic reward signal R for a transition (s, a, s′) ∈ S × A × S is drawn from a distribution R(s, a, s′) with mean value E_{s′}[R(s, a, s′)] = r(s, a). The initial distribution µ0(s) denotes the probability of the state s ∈ S being a starting state. A policy π is a stochastic or deterministic mapping from S onto A, usually parametrized by a set of parameters θ.

We define an episode as τ ≡ {s_t, a_t, r_t, γ_t}_{t=0}^∞ where

s_0 ∼ µ_0(·);  a_t ∼ π(· | s_t);  s_{t+1} ∼ p(· | s_t, a_t);  r_t ∼ R(s_t, a_t, s_{t+1});  γ_t ∼ γ(s_t, a_t, s_{t+1}).

In this paper we consider the discounted infinite-horizon setting, where the objective is to maximize the expected return

$$J_\pi = \mathbb{E}_\tau\left[\sum_{t=0}^{\infty} r_t \prod_{i=0}^{t} \gamma_i\right]. \quad (1)$$

It is important to introduce two quantities: the stationary state visitation µπ and the value function Vπ. We naturally extend the stationary state visitation defined by [23] with the transition-based discount factor

$$\mu_\pi(s) = \mathbb{E}_\tau\left[\sum_{t=0}^{\infty} p(s = s_t \mid \pi, \mu_0) \prod_{i=1}^{t} \gamma_i\right],$$

or, equivalently, as the fixed point of

$$\mu_\pi(s) = \mu_0(s) + \int_\mathcal{S}\int_\mathcal{A} p_\gamma(s \mid s', a')\, \pi(a' \mid s')\, \mu_\pi(s')\, \mathrm{d}s'\, \mathrm{d}a',$$

where, from now on, pγ(s′|s,a) = p(s′|s,a) E[γ(s,a,s′)]. The value function

$$V_\pi(s) = \mathbb{E}_\tau\left[\sum_{t=0}^{\infty} r_t\, p(r_t \mid s_0 = s, \pi) \prod_{i=0}^{t} \gamma_i\right]$$

corresponds to the fixed point of the Bellman equation,

$$V_\pi(s) = \int_\mathcal{A} \pi(a \mid s)\left(r(s,a) + \int_\mathcal{S} V_\pi(s')\, p_\gamma(s' \mid s,a)\, \mathrm{d}s'\right)\mathrm{d}a.$$

The state-action value function is defined as

$$Q_\pi(s,a) = r(s,a) + \int_\mathcal{S} V_\pi(s')\, p_\gamma(s' \mid s,a)\, \mathrm{d}s'.$$

The expected return (1) can be formulated as

$$J_\pi = \int_\mathcal{S} \mu_0(s)\, V_\pi(s)\, \mathrm{d}s = \int_\mathcal{S}\int_\mathcal{A} \mu_\pi(s)\, \pi(a \mid s)\, r(s,a)\, \mathrm{d}a\, \mathrm{d}s.$$

Policy Gradient Theorem. Objective (1) is usually maximized via gradient ascent. The gradient of Jπ w.r.t. the policy parameters θ is

$$\nabla_\theta J_\pi = \int_\mathcal{S}\int_\mathcal{A} \mu_\pi(s)\, \pi_\theta(a \mid s)\, Q_\pi(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\, \mathrm{d}a\, \mathrm{d}s,$$

as stated in the policy gradient theorem [23]. When it is possible to interact with the environment with the policy πθ, one can approximate the integral by considering the state-action visitation as a distribution (up to a normalization factor) and use the samples to perform a Monte-Carlo (MC) estimation [24]. The Q-function can be estimated via Monte-Carlo sampling, approximate dynamic programming or by direct Bellman minimization. In the off-policy setting, we do not have access to the state visitation µπ induced by the policy, but instead observe a different state distribution. While estimating the Q-function under the new state distribution is well established in the literature [4], [25], the shift in the state visitation µπ(s) is more difficult to correct for. State-of-the-art techniques either omit this shift (we refer to these algorithms as semi-gradient), or they try to estimate it via importance sampling correction. These approaches are discussed in detail in Section 4.
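To make the Monte-Carlo estimation described above concrete, the following sketch (ours, not part of the paper) estimates the policy gradient for a toy one-dimensional system with a linear-Gaussian policy; the dynamics, the reward, the policy form and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, horizon=50, sigma=0.5):
    """Simulate a toy 1-D system s' = s + a + noise under a
    linear-Gaussian policy a ~ N(theta * s, sigma^2); illustrative only."""
    s, traj = rng.normal(), []
    for _ in range(horizon):
        a = theta * s + sigma * rng.normal()
        r = -(s ** 2 + 0.1 * a ** 2)                  # quadratic cost as reward
        traj.append((s, a, r))
        s = s + a + 0.05 * rng.normal()
    return traj

def mc_policy_gradient(theta, n_traj=200, sigma=0.5, gamma=0.99):
    """Monte-Carlo estimate of the policy gradient theorem:
    grad J ~ E[ sum_t gamma^t * grad log pi(a_t|s_t) * G_t ],
    where G_t is the discounted return from time t."""
    grad = 0.0
    for _ in range(n_traj):
        traj = rollout(theta, sigma=sigma)
        rewards = np.array([r for (_, _, r) in traj])
        for t, (s, a, _) in enumerate(traj):
            g_t = np.sum(gamma ** np.arange(len(traj) - t) * rewards[t:])
            score = (a - theta * s) * s / sigma ** 2  # Gaussian score function
            grad += (gamma ** t) * score * g_t
    return grad / n_traj

print(mc_policy_gradient(theta=-0.2))
```

Note that this on-policy estimator requires fresh rollouts from πθ after every update, which is exactly the limitation that the off-policy setting discussed next tries to avoid.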


3 NONPARAMETRIC OFF-POLICY POLICY GRADIENT

In this section, we introduce a nonparametric Bellman equation with a closed-form solution, which carries the dependency on the policy's parameters. We derive the gradient of the solution and discuss the properties of the proposed estimator.

3.1 A Nonparametric Bellman Equation

Nonparametric methods require the storage of all samples to make predictions, but usually exhibit a low bias. They gained popularity thanks to their general simplicity and their theoretical properties. In this paper, we focus on kernel density estimation and Nadaraya-Watson kernel regression. In this context, it is usual to assume the kernel κ to be a symmetric, real function with positive co-domain and with ∫ κ(x, y) dx = 1 ∀y. The goal of kernel density estimation is to predict the density of a distribution p(x). After the collection of n samples x_i ∼ p(·), it is possible to obtain the estimated density

$$p(x) = \frac{1}{n}\sum_{i=1}^{n} \kappa(x, x_i).$$

Nadaraya-Watson kernel regression builds on kernel density estimation to predict the output of a function f : R^d → R, assuming a dataset of n samples {x_i, y_i}_{i=1}^n with x_i ∼ p(·) and y_i = f(x_i) + ε_i (where ε_i is zero-mean noise with finite variance). The prediction

$$f(x) = \frac{\int_{\mathbb{R}} y\, p(x, y)\, \mathrm{d}y}{p(x)} = \frac{\sum_{i=1}^{n} y_i\, \kappa(x, x_i)}{\sum_{i=1}^{n} \kappa(x, x_i)}$$

is constructed from the expectation of the estimated conditional density, taking into account that ∫_R y κ(y, y_i) dy = y_i.
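For concreteness, the following minimal sketch (our illustration) implements the kernel density estimate and the Nadaraya-Watson prediction above with a normalized Gaussian kernel on synthetic one-dimensional data; the data-generating function and the bandwidth are arbitrary assumptions.

```python
import numpy as np

def gaussian_kernel(x, xi, h):
    """Normalized Gaussian kernel kappa(x, xi) with bandwidth h."""
    return np.exp(-0.5 * ((x - xi) / h) ** 2) / (np.sqrt(2 * np.pi) * h)

# Synthetic dataset: y_i = f(x_i) + noise, with x_i drawn from some distribution.
rng = np.random.default_rng(0)
x_data = rng.uniform(-3, 3, size=200)
y_data = np.sin(x_data) + 0.1 * rng.normal(size=200)

def kde(x, h=0.3):
    """Kernel density estimate p(x) = (1/n) sum_i kappa(x, x_i)."""
    return np.mean(gaussian_kernel(x, x_data, h))

def nadaraya_watson(x, h=0.3):
    """f(x) = sum_i y_i kappa(x, x_i) / sum_i kappa(x, x_i)."""
    w = gaussian_kernel(x, x_data, h)
    return np.sum(w * y_data) / np.sum(w)

print(kde(0.5), nadaraya_watson(0.5), np.sin(0.5))
```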

Nonparametric Bellman equations have been developed in a number of prior works. [26]–[28] used nonparametric models such as Gaussian Processes for approximate dynamic programming. [29] have shown that these methods differ mainly in their use of regularization. [30] provided a Bellman equation using kernel density estimation and a general overview of nonparametric dynamic programming. In contrast to prior work, our formulation preserves the dependency on the policy, enabling the computation of the policy gradient in closed form. Moreover, we upper-bound the bias of the Nadaraya-Watson kernel regression to prove that our value function estimate is consistent w.r.t. the classical Bellman equation under smoothness assumptions. We focus on the maximization of the average return in the infinite-horizon case, formulated as the starting-state objective ∫_S µ0(s) Vπ(s) ds [23].

Definition 1. The discounted infinite-horizon objective is defined by Jπ = ∫ µ0(s) Vπ(s) ds. Under a stochastic policy the objective is subject to the Bellman equation constraint

$$V_\pi(s) = \int_\mathcal{A} \pi_\theta(a \mid s)\left(r(s,a) + \gamma \int_\mathcal{S} V_\pi(s')\, p(s' \mid s,a)\, \mathrm{d}s'\right)\mathrm{d}a, \quad (2)$$

while in the case of a deterministic policy the constraint is given as

$$V_\pi(s) = r(s, \pi_\theta(s)) + \gamma \int_\mathcal{S} V_\pi(s')\, p(s' \mid s, \pi_\theta(s))\, \mathrm{d}s'.$$

Maximizing the objective in Definition 1 analytically is not possible, excluding special cases such as under linear-quadratic assumptions [31] or a finite state-action space. Extracting an expression for the gradient of Jπ w.r.t. the policy parameters θ is also not straightforward, given the infinite set of possibly non-convex constraints represented by the recursion over Vπ. Nevertheless, it is possible to transform the constraints in Definition 1 into a finite set of linear constraints via nonparametric modeling, thus leading to an expression of the value function with simple algebraic manipulation [30].

3.1.1 Nonparametric Modeling

Assume a set of n observations D ≡ {si, ai, ri, s′i, γi}_{i=1}^n sampled from interaction with an environment, with si, ai ∼ β(·, ·), s′i ∼ p(·|si, ai), ri ∼ R(si, ai) and γi ∼ γ(si, ai, s′i). We define the kernels ψ : S × S → R+, ϕ : A × A → R+ and φ : S × S → R+ as normalized, symmetric and positive-definite functions with bandwidths hψ, hϕ, hφ, respectively. We define ψi(s) = ψ(s, si), ϕi(a) = ϕ(a, ai), and φi(s) = φ(s, s′i). Following [30], the mean reward r(s, a) and the transition conditional p(s′|s, a) are approximated by Nadaraya-Watson regression [32], [33] and kernel density estimation, respectively,

$$r(s,a) := \frac{\sum_{i=1}^{n} \psi_i(s)\varphi_i(a)\, r_i}{\sum_{i=1}^{n} \psi_i(s)\varphi_i(a)}, \qquad p(s' \mid s,a) := \frac{\sum_{i=1}^{n} \phi_i(s')\psi_i(s)\varphi_i(a)}{\sum_{i=1}^{n} \psi_i(s)\varphi_i(a)},$$

$$\gamma(s,a,s') := \frac{\sum_{i=1}^{n} \gamma_i\, \psi_i(s)\varphi_i(a)\phi_i(s')}{\sum_{i=1}^{n} \psi_i(s)\varphi_i(a)\phi_i(s')},$$

and, therefore, by the product of p and γ we obtain

$$p_\gamma(s' \mid s,a) := p(s' \mid s,a)\, \gamma(s,a,s') = \frac{\sum_{i=1}^{n} \gamma_i\, \psi_i(s)\varphi_i(a)\phi_i(s')}{\sum_{i=1}^{n} \psi_i(s)\varphi_i(a)}.$$

Inserting the reward and transition kernels into the Bellman equation for the stochastic policy case we obtain the nonparametric Bellman equation (NPBE)

$$V_\pi(s) = \int_\mathcal{A} \pi_\theta(a \mid s)\left(r(s,a) + \int_\mathcal{S} V_\pi(s')\, p_\gamma(s' \mid s,a)\, \mathrm{d}s'\right)\mathrm{d}a
= \sum_i \int_\mathcal{A} \frac{\pi_\theta(a \mid s)\, \psi_i(s)\varphi_i(a)}{\sum_j \psi_j(s)\varphi_j(a)}\, \mathrm{d}a \;\times\; \left(r_i + \gamma_i \int_\mathcal{S} \phi_i(s')\, V_\pi(s')\, \mathrm{d}s'\right). \quad (3)$$

Equation (3) can be conveniently expressed in matrix form by introducing the vector of responsibilities ε^π_i(s) = ∫ πθ(a|s) ψi(s)ϕi(a) / Σ_j ψj(s)ϕj(a) da, which assigns each state s a weight relative to its distance to a sample i under the current policy.

Definition 2. The nonparametric Bellman equation on the dataset D is formally defined as

$$V_\pi(s) = \varepsilon_\pi^\intercal(s)\left(r + \int_\mathcal{S} \phi_\gamma(s')\, V_\pi(s')\, \mathrm{d}s'\right), \quad (4)$$


with φγ(s) = [γ1 φ1(s), . . . , γn φn(s)]ᵀ, r = [r1, . . . , rn]ᵀ, επ(s) = [ε^π_1(s), . . . , ε^π_n(s)]ᵀ,

$$\varepsilon_i(s,a) = \frac{\psi_i(s)\varphi_i(a)}{\sum_j \psi_j(s)\varphi_j(a)} \qquad \text{and} \qquad \varepsilon_i^\pi(s) = \begin{cases} \int \pi_\theta(a \mid s)\, \varepsilon_i(s,a)\, \mathrm{d}a & \text{if } \pi \text{ is stochastic,} \\ \varepsilon_i(s, \pi_\theta(s)) & \text{otherwise.} \end{cases}$$

From Equation (4) we deduce that the value function must be of the form εᵀπ(s) qπ, indicating that it can also be seen as a form of Nadaraya-Watson kernel regression,

$$\varepsilon_\pi^\intercal(s)\, q_\pi = \varepsilon_\pi^\intercal(s)\left(r + \int_\mathcal{S} \phi_\gamma(s')\, \varepsilon_\pi^\intercal(s')\, q_\pi\, \mathrm{d}s'\right). \quad (5)$$

Notice that, trivially, every qπ which satisfies

$$q_\pi = r + \int_\mathcal{S} \phi_\gamma(s')\, \varepsilon_\pi^\intercal(s')\, q_\pi\, \mathrm{d}s' \quad (6)$$

also satisfies Equation (5). Theorem 1 demonstrates that the algebraic solution of Equation (6) is the only solution of the nonparametric Bellman Equation (4).

Theorem 1. The nonparametric Bellman equation has a unique fixed-point solution

$$V_\pi^*(s) := \varepsilon_\pi^\intercal(s)\, \Lambda_\pi^{-1} r,$$

with Λπ := I − Pγπ and Pγπ := ∫_S φγ(s′) εᵀπ(s′) ds′, where Λπ is always invertible since Pγπ is a strictly sub-stochastic matrix (Frobenius' Theorem). The statement is valid also for n → ∞, provided R is bounded.

The transition matrix Pγπ is strictly sub-stochastic since each row Pγπ,i = γi ∫ φi(s′) εᵀπ(s′) ds′ is composed of the convolution between φi, which by definition integrates to one, and 0 ≤ εᵀπ(s′) ≤ 1, as can be seen in Definition 2. The proof of Theorem 1 is provided in the supplementary material.

3.2 Nonparametric Gradient Estimation

With the closed-form solution of V∗π(s) from Theorem 1, it is possible to compute the analytical gradient of Jπ w.r.t. the policy parameters

$$\nabla_\theta V_\pi^*(s) = \left(\frac{\partial}{\partial\theta}\varepsilon_\pi^\intercal(s)\right)\Lambda_\pi^{-1} r + \varepsilon_\pi^\intercal(s)\,\frac{\partial}{\partial\theta}\Lambda_\pi^{-1} r
= \underbrace{\left(\frac{\partial}{\partial\theta}\varepsilon_\pi^\intercal(s)\right)\Lambda_\pi^{-1} r}_{A} \;+\; \underbrace{\varepsilon_\pi^\intercal(s)\,\Lambda_\pi^{-1}\left(\frac{\partial}{\partial\theta}\mathrm{P}_\pi^\gamma\right)\Lambda_\pi^{-1} r}_{B}. \quad (7)$$
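The second equality in Equation (7) follows from the standard identity for the derivative of a matrix inverse; for completeness (our addition, using only the definitions above),

$$\frac{\partial}{\partial\theta}\Lambda_\pi^{-1} = -\Lambda_\pi^{-1}\left(\frac{\partial}{\partial\theta}\Lambda_\pi\right)\Lambda_\pi^{-1} = \Lambda_\pi^{-1}\left(\frac{\partial}{\partial\theta}\mathrm{P}_\pi^\gamma\right)\Lambda_\pi^{-1}, \qquad \text{since } \Lambda_\pi = I - \mathrm{P}_\pi^\gamma,$$

which, right-multiplied by r and left-multiplied by εᵀπ(s), yields term B.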

Substituting the result of Equation (7) into the return specified in Definition 1, and introducing εᵀπ,0 := ∫ µ0(s) εᵀπ(s) ds, qπ = Λπ⁻¹ r, and µπ = Λπ⁻ᵀ επ,0, we obtain

$$\nabla_\theta J_\pi = \left(\frac{\partial}{\partial\theta}\varepsilon_{\pi,0}^\intercal\right) q_\pi + \mu_\pi^\intercal\left(\frac{\partial}{\partial\theta}\mathrm{P}_\pi^\gamma\right) q_\pi, \quad (8)$$

Algorithm 1 Nonparametric Off-Policy Policy Gradient

input: dataset {si, ai, ri, s′i, γi}_{i=1}^n, where πθ indicates the policy to optimize and ψ, ϕ, φ the kernels for state, action and next state, respectively.
while not converged do
    Compute εᵀπ(s) as in Definition 2 and εᵀπ,0 := ∫ µ0(s) εᵀπ(s) ds.
    Estimate Pγπ as defined in Theorem 1 using MC (φi(s) is a distribution).
    Solve r = Λπ qπ and επ,0 = Λᵀπ µπ for qπ and µπ using conjugate gradient.
    Update θ using Equation (8).
end while

where qπ and µπ can be estimated via conjugate gradient to avoid the inversion of Λπ. It is interesting to notice that (8) is closely related to the Policy Gradient Theorem [23],

$$\nabla_\theta J_\pi = \int_{\mathcal{S}\times\mathcal{A}} \underbrace{\left(\mu_0(s) + \phi_\gamma^\intercal(s)\, \mu_\pi\right)}_{\text{Stationary distribution}}\; \underbrace{\varepsilon^\intercal(s,a)\, q_\pi}_{Q_\pi(s,a)}\; \nabla_\theta \pi_\theta(a \mid s)\, \mathrm{d}a\, \mathrm{d}s,$$

and to the Deterministic Policy Gradient Theorem (in its on-policy formulation) [10],

$$\nabla_\theta J_\pi = \int_{\mathcal{S}} \underbrace{\left(\mu_0(s) + \phi_\gamma^\intercal(s)\, \mu_\pi\right)}_{\text{Stationary distribution}}\; \underbrace{\nabla_a \varepsilon^\intercal(s,a)\, q_\pi}_{\nabla_a Q_\pi(s,a)}\; \nabla_\theta \pi_\theta(s)\, \mathrm{d}s.$$

In a later analysis, we will consider the state-action surrogate return JπS(s, a) = (µ0(s) + φᵀγ(s) µπ) εᵀ(s, a) qπ. The terms A and B in Equation (7) correspond to the terms in Equation (10). In contrast to semi-gradient actor-critic methods, where the gradient bias is affected by both the critic bias and the semi-gradient approximation [8], [12], our estimate is the full gradient, and the only source of bias is introduced by the estimation of Vπ, which we analyze in Section 3.3. The term µπ can be interpreted as the support of the state distribution, as it satisfies µᵀπ = εᵀπ,0 + µᵀπ Pγπ. In Section 5, more specifically in Figure 7, we empirically show that εᵀπ(s) µπ provides an estimate of the state distribution over the whole state space. The quantities εᵀπ,0 and Pγπ,ij are estimated via Monte-Carlo sampling, which is unbiased but computationally demanding, or using other techniques such as the unscented transform or numerical quadrature. The matrix Pγπ is of dimension n × n, which can be memory-demanding. In practice, we notice that the matrix is often sparse. By taking advantage of conjugate gradient and sparsification we are able to achieve a computational complexity of O(n²) per policy update and a memory complexity of O(n). Further details on the computational and memory complexity can be found in the supplementary material. A schematic of our implementation is summarized in Algorithm 1.
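As a minimal illustration of the linear solves in Algorithm 1 (our sketch, not the authors' implementation), the snippet below builds a placeholder strictly sub-stochastic Pγπ, sparsifies it, and solves Λπ qπ = r and Λᵀπ µπ = επ,0 iteratively. The paper mentions conjugate gradient; since Λπ is generally nonsymmetric, we use BiCGSTAB here as a stand-in iterative solver, and all numerical values are arbitrary placeholders.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(0)
n = 500                                              # number of samples in the dataset

# Placeholder nonparametric quantities (in NOPG they come from the kernels):
r = rng.normal(size=n)                               # reward vector [r_1, ..., r_n]
eps_pi_0 = rng.dirichlet(np.ones(n))                 # responsibilities of the initial distribution
P = 0.99 * rng.dirichlet(np.ones(n), size=n)         # strictly sub-stochastic P^gamma_pi

# Sparsify: drop tiny entries to reduce memory and per-iteration cost.
P[P < 1e-4] = 0.0
P_gamma = sp.csr_matrix(P)
Lambda = sp.identity(n, format="csr") - P_gamma      # Lambda_pi = I - P^gamma_pi

# Solve Lambda q = r and Lambda^T mu = eps_pi_0 without ever forming Lambda^{-1}.
q_pi, info_q = spla.bicgstab(Lambda, r, atol=1e-10)
mu_pi, info_mu = spla.bicgstab(Lambda.T, eps_pi_0, atol=1e-10)

J_pi = eps_pi_0 @ q_pi                               # estimated return, eps_pi_0^T q_pi
print("solver flags:", info_q, info_mu, " estimated return:", J_pi)
```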

3.3 A Theoretical Analysis

Nonparametric estimates of the transition dynamics and reward enjoy favorable properties for an off-policy learning setting.


A well-known asymptotic behavior of the Nadaraya-Watson kernel regression,

$$\mathbb{E}\left[\lim_{n\to\infty} f_n(x)\right] - f(x) \approx h_n^2\left(\frac{1}{2} f''(x) + \frac{f'(x)\,\beta'(x)}{\beta(x)}\right)\int u^2 K(u)\, \mathrm{d}u,$$

shows how the bias is related to the regression function f(x), as well as to the samples' distribution β(x) [34], [35]. However, this asymptotic behavior is valid only for infinitesimal bandwidth and infinite samples (h → 0, nh → ∞), and requires knowledge of the regression function and of the sampling distribution.

In a recent work, we propose an upper bound on the bias that is also valid for finite bandwidths [36]. We show, under some Lipschitz conditions, that the bound on the Nadaraya-Watson kernel regression bias does not depend on the samples' distribution, which is a desirable property in off-policy scenarios. The analysis is extended to a multidimensional input space. For clarity of exposition, we report the main result in its simplest formulation, and later use it to infer the bound on the NPBE bias.

Theorem 2. Let f : R^d → R be a Lipschitz continuous function with constant Lf. Assume a set {xi, yi}_{i=1}^n of i.i.d. samples from a log-Lipschitz distribution β with Lipschitz constant Lβ. Assume yi = f(xi) + εi, where εi is i.i.d. and zero-mean. The bias of the Nadaraya-Watson kernel regression with Gaussian kernels in the limit of infinite samples n → ∞ is bounded by

$$\left|\mathbb{E}\left[\lim_{n\to\infty} f_n(x)\right] - f(x)\right| \;\le\; \frac{L_f \sum_{k=1}^{d} h_k\left(\prod_{i\neq k}^{d}\chi_i\right)\left(\frac{1}{\sqrt{2\pi}} + \frac{L_\beta h_k}{2}\chi_k\right)}{\prod_{i=1}^{d} e^{\frac{L_\beta^2 h_i^2}{2}}\left(1 - \operatorname{erf}\left(\frac{h_i L_\beta}{\sqrt{2}}\right)\right)},$$

where

$$\chi_i = e^{\frac{L_\beta^2 h_i^2}{2}}\left(1 + \operatorname{erf}\left(\frac{h_i L_\beta}{\sqrt{2}}\right)\right),$$

h > 0 ∈ R^d is the vector of bandwidths and erf is the error function.

Building on Theorem 2, we show that the solution of the NPBE is consistent with the solution of the true Bellman equation. Moreover, although the bound is not affected directly by β(s), a smoother sample distribution β(s) plays favorably in the bias term (a low Lβ is preferred).

Theorem 3. Consider an arbitrary MDP M with a transition density p and a stochastic reward function R(s, a) = r(s, a) + εs,a, where r(s, a) is a Lipschitz continuous function with constant LR and εs,a denotes zero-mean noise. Assume |R(s, a)| ≤ Rmax and a dataset Dn sampled from a log-Lipschitz distribution β defined over the state-action space with Lipschitz constant Lβ. Let VD be the unique solution of a nonparametric Bellman equation with Gaussian kernels ψ, ϕ, φ with positive bandwidths hψ, hϕ, hφ defined over the dataset limn→∞ Dn. Assume VD to be Lipschitz continuous with constant LV. The bias of such an estimator is bounded by

$$\left|\bar{V}(s) - V^*(s)\right| \le \frac{1}{1 - \gamma_c}\left(A_{\mathrm{Bias}} + \gamma_c L_V \sum_{k=1}^{d_s} \frac{h_{\phi,k}}{\sqrt{2\pi}}\right), \quad (9)$$

where V̄(s) = ED[VD(s)], ABias is the bound on the bias provided in Theorem 2 with Lf = LR, h = [hψ, hϕ], d = ds + da, and V*(s) is the fixed point of the ordinary Bellman equation.²

Theorem 3 shows that the value function provided by Theorem 1 is consistent when the bandwidth approaches infinitesimal values. Moreover, it is interesting to notice that the error can be decomposed into ABias, which is the bias component dependent on the reward's approximation, and the remaining term, which depends on the smoothness of the value function and the bandwidth of φ, and can be read as the error of the transition model.

The bound shows that smoother reward functions, state transitions and sample distributions play favorably against the estimation bias. Notice that the bias persists even at the support points. This issue is known in Nadaraya-Watson kernel regression. However, the bias can be controlled and lowered by reducing the bandwidth. The i.i.d. assumption required by Theorem 3 is not restrictive: even if the samples are collected by a sequential process that interacts with the MDP, the sample distribution will eventually converge to the stationary distribution of the MDP. Furthermore, our algorithm considers all the samples simultaneously; therefore, inter-correlations between samples do not affect the estimation.

Fig. 2: The classic effect (known as boundary bias) of the Nadaraya-Watson regression predicting a constant function in low-density regions is beneficial in our case, as it prevents the policy from moving into those areas, since the gradient gets close to zero. (Plot legend: ground truth, prediction, samples; the low-gradient region lies outside the data support.)

Beyond the bias analysis in the limit of infinite data, it is important to understand the variance and the bias for finite data. We provide a finite-sample analysis in the empirical section. It is known that the lowest mean squared error is obtained by setting a higher bandwidth in the presence of scarce data, and decreasing it with more data. In our implementation of NOPG, we select the bandwidth using cross-validation.
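The sketch below (ours, with synthetic data) illustrates the kind of bandwidth selection just mentioned: it scores a few candidate Gaussian-kernel bandwidths for a Nadaraya-Watson regressor by k-fold held-out mean squared error; the candidate grid and the data-generating function are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=300)
y = np.cos(x) + 0.2 * rng.normal(size=300)

def nw_predict(x_query, x_train, y_train, h):
    """Nadaraya-Watson prediction with a Gaussian kernel of bandwidth h."""
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / h) ** 2)
    return (w @ y_train) / np.clip(w.sum(axis=1), 1e-300, None)

def cv_mse(h, k=5):
    """k-fold cross-validated mean squared error for bandwidth h."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        pred = nw_predict(x[fold], x[train], y[train], h)
        errs.append(np.mean((pred - y[fold]) ** 2))
    return float(np.mean(errs))

scores = {h: cv_mse(h) for h in [0.05, 0.1, 0.2, 0.4, 0.8]}
best_h = min(scores, key=scores.get)
print(scores, "-> selected bandwidth:", best_h)
```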

2. Complete proofs of the theorems and precise definitions can be found in the supplementary material.


3.3.1 A Trust Region

Very commonly, in order to prevent harmful policy optimization, the policy is constrained to stay close to the data [37], to avoid taking large steps [3], [38], or to circumvent large variance in the estimation [39], [40]. These techniques, which keep the policy updates in a trusted region, prevent incorrect and dangerous estimates of the gradient. Even if we do not include any explicit constraint of this kind, the Nadaraya-Watson kernel regression automatically discourages policy improvements towards low-data areas. In fact, as depicted in Figure 2, the Nadaraya-Watson kernel regression tends to predict a constant function in low-density regions. Usually, this characteristic is regarded as an issue, as it causes the so-called boundary bias. In our case, this effect turns out to be beneficial, as it constrains the policy to stay close to the samples, where the model is more correct.
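The flattening effect described above is easy to reproduce numerically; in the toy snippet below (our illustration, with arbitrary data), the Nadaraya-Watson prediction becomes nearly constant far from the data, and its numerical derivative, which drives the update in that region, vanishes.

```python
import numpy as np

rng = np.random.default_rng(2)
x_data = rng.uniform(-1, 1, size=100)          # samples only in [-1, 1]
y_data = 2.0 * x_data                          # ground truth f(x) = 2x

def nw(x, h=0.2):
    w = np.exp(-0.5 * ((x - x_data) / h) ** 2)
    return np.sum(w * y_data) / np.sum(w)

def num_grad(x, eps=1e-4):
    return (nw(x + eps) - nw(x - eps)) / (2 * eps)

for x in [0.0, 1.0, 2.0, 4.0]:
    print(f"x={x:3.1f}  prediction={nw(x):+.3f}  d/dx={num_grad(x):+.5f}")
# Far from the data (x = 4.0) the prediction saturates near the boundary value
# and the derivative vanishes, discouraging updates into low-density regions.
```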

4 RELATED WORK

Off-policy and offline policy optimization became increasingly popular in recent years and have been explored in many different flavors. Batch reinforcement learning, in the model-free view, has been elaborated both as a value-based and a policy gradient technique (we include actor-critic in this last class). Recently, offline reinforcement learning has also been proposed in the model-based formulation. In this section, we detail both the advantages and disadvantages of the proposed approaches. As we will see, our solution shares some advantages typical of model-free policy gradient algorithms while using an approximated model of the dynamics and reward.

4.1 Model-Free

The model-free offline formulation aims to improve the policy based purely on the samples collected in the dataset, without generating synthetic samples from an approximated model. We can divide this broad category into two main families: value-based approaches and policy gradients.

4.1.1 Value-Based

Value-based techniques are constructed on approximate dynamic programming theory, leveraging the policy-improvement theorem and the contraction property of the Bellman operator. Examples of offline approximate dynamic programming algorithms are Fitted Q-Iteration (FQI) and Neural FQI (NFQI) [4], [5]. The max operator used in the optimal Bellman operator usually restricts the applicability to discrete action spaces (with few exceptions, [41]). Furthermore, the projected error can prevent convergence to a satisfying solution [6]. These algorithms usually also suffer from the delusion bias [7]. Despite these issues, there has been a recent revival of value-based techniques in the context of offline reinforcement learning. Batch-Constrained Q-learning (BCQ) [12], to the best of our knowledge, first defined the extrapolation error, which is partially caused by optimizing unseen (out-of-distribution, OOD) state-action pairs in the dataset. This source of bias has been extensively studied in subsequent works, both in the model-free and the model-based setting. Value-based methods using nonparametric kernel-based approaches have also been used [26], [27], [42]. An interesting discussion in [29] shows that kernelized least squares temporal difference approaches also approximate the reward and transition models, and therefore they can be considered model-based.

4.1.2 Policy Gradient

This class of algorithms leverages the policy gradient theorem [10], [23]. Examples of policy gradients are REINFORCE [24], G(PO)MDP [43], natural policy gradient [44], and Trust-Region Policy Optimization (TRPO) [3]. Often, these approaches make use of approximate dynamic programming to estimate the Q-function (or related quantities) in order to obtain lower variance [2], [11], [38], [45]. The policy gradient theorem, however, defines the policy gradient w.r.t. on-policy samples. To overcome this problem, two main techniques have been introduced: semi-gradient approaches, which rely on the omission of one term in the computation of the gradient and therefore introduce an irreducible source of bias, and importance sampling solutions, which are unbiased but suffer from high variance.

Semi-Gradient: The off-policy policy gradient theorem was introduced together with the first off-policy actor-critic algorithm [9]. Since then, it has been used by the vast majority of state-of-the-art off-policy algorithms [2], [10], [11]. Nonetheless, it is important to note that this theorem and its successors introduce two approximations to the original policy gradient theorem [23]. First, semi-gradient approaches consider a modified discounted infinite-horizon return objective Jπ = ∫ ρβ(s) Vπ(s) ds, where ρβ(s) is the state distribution under the behavioral policy πβ. Second, the gradient estimate is modified to be

$$\begin{aligned}
\nabla_\theta J_\pi &= \nabla_\theta \int_\mathcal{S} \rho_\beta(s)\, V_\pi(s)\, \mathrm{d}s
= \nabla_\theta \int_\mathcal{S} \rho_\beta(s) \int_\mathcal{A} \pi_\theta(a \mid s)\, Q_\pi(s,a)\, \mathrm{d}a\, \mathrm{d}s \\
&= \int_\mathcal{S} \rho_\beta(s) \int_\mathcal{A} \underbrace{\nabla_\theta \pi_\theta(a \mid s)\, Q_\pi(s,a)}_{A} + \underbrace{\pi_\theta(a \mid s)\, \nabla_\theta Q_\pi(s,a)}_{B}\, \mathrm{d}a\, \mathrm{d}s \quad (10) \\
&\approx \int_\mathcal{S} \rho_\beta(s) \int_\mathcal{A} \nabla_\theta \pi_\theta(a \mid s)\, Q_\pi(s,a)\, \mathrm{d}a\, \mathrm{d}s,
\end{aligned}$$

where the term B, related to the derivative of Qπ, is ignored. The authors provide a proof that this biased gradient, or semi-gradient, still converges to the optimal policy in a tabular setting [8], [9]. However, further approximations (e.g., given by the critic and by the finite sample size) might prevent convergence to a satisfactory solution. Although these algorithms work correctly when sampling from the replay memory (which discards the oldest samples), they have been shown to fail with samples generated via a completely different process [8], [12]. The source of bias introduced in the semi-gradient estimation depends fully on the distribution mismatch, and cannot be removed by an increase of the dataset size or by a more powerful function approximator. Still, mainly due to its simplicity, semi-gradient estimation has been used in the offline scenario. The authors of these works do not tackle the distribution


Fig. 3: Some of the benchmarking tasks: (a) LQG, (b) Pendulum-v0, (c) MountainCar-v0, (d) Real Cart-Pole. Sub-figure (a) shows the return landscape of the LQG problem: in the gradient analysis, we obtain the gradient of the policy with parameters θ1, θ2 by sampling from a policy interpolated with the parameters θ′1, θ′2. Sub-figures (b) and (c) depict the OpenAI environments used. The real system in (d) has been used to evaluate the policy learned to stabilize the cart-pole task.

mismatch and its impact on the gradient estimation, but rather aim to prevent the policy from taking OOD actions. In this line of thought, Batch-Constrained Q-learning (BCQ) [12] introduces a regularization that keeps the policy close to the behavioral policy, Bootstrapping Error Accumulation Reduction (BEAR) [14] considers instead a pessimistic estimate of the Q-function which penalizes the uncertainty, and Behavior Regularized Actor Critic (BRAC) [15] proposes a generalization of the aforementioned approaches.

Importance-Sampling: One way to obtain an unbiased estimate of the policy gradient in an off-policy scenario is to re-weight every trajectory via importance sampling [16]–[18]. An example of the gradient estimation via G(PO)MDP [43] with importance sampling is given by

$$\nabla_\theta J_\pi = \mathbb{E}\left[\sum_{t=0}^{T-1} \rho_t\left(\prod_{j=0}^{t-1}\gamma_j\right) r_t \sum_{i=0}^{t} \nabla_\theta \log \pi_\theta(a_i \mid s_i)\right], \quad (11)$$

where ρt = Π_{z=0}^t πθ(az|sz)/πβ(az|sz). This technique applies only to stochastic policies and requires knowledge of the behavioral policy πβ. Moreover, Equation (11) shows that path-wise importance sampling (PWIS) needs a trajectory-based dataset, since it needs to keep track of the past in the correction term ρt, hence introducing more restrictions on its applicability. Additionally, importance sampling suffers from high variance [46]. Recent works have helped to make PWIS more reliable. For example, [8], building on the emphatic weighting framework [47], proposed a trade-off between PWIS and semi-gradient approaches. Another possibility consists in restricting the gradient improvement to a safe region, where the importance sampling does not suffer from too high variance [39]. Another interesting line of research is to estimate the importance sampling correction at the level of state distributions instead of at the classic trajectory level [48]–[50]. We note that, despite their nice theoretical properties, all these promising algorithms have been applied to low-dimensional problems, as importance sampling suffers from the curse of dimensionality.

4.2 Model-Based

While all model-free offline techniques rely purely on the samples contained in the offline dataset, model-based techniques aim to approximate the transition dynamics from that dataset. The approximated model is then used to generate new artificial samples. These approaches do not suffer from the distribution mismatch problem described above, as the samples can be generated on-policy by the model (and therefore, in principle, any online algorithm can be used). This advantage is overshadowed by the disadvantage of having unrealistic trajectories generated by the approximated model. For this reason, many recent works focus on discarding unrealistic samples, relying on a pessimistic approach towards uncertainty. In some sense, we encounter again the problem of quantifying our uncertainty, given the limited information contained in the dataset, and of preventing the policy or the model from using or generating uncertain samples. In this view, Probabilistic Ensembles with Trajectory Sampling (PETS) [40] uses an ensemble of probabilistic models of the dynamics to quantify both the epistemic and the aleatoric uncertainty, and uses model predictive control to act in the real environment. Model-Based Offline Planning (MBOP) [51] uses a behavioral-cloning policy as a prior for the planning algorithm. Plan Online and Learn Offline (POLO) [52] proposes to learn the value function from a known model of the dynamics and to use model predictive control on the environment. Both Model-Based Offline Reinforcement Learning (MOReL) [20] and Model-based Offline Policy Optimization (MOPO) [19] instead learn the model to train a parametric policy. MOReL learns a pessimistic version of the real MDP, by partitioning the state-action space into known and unknown regions and penalizing policies that visit the unknown region. Instead, MOPO builds a pessimistic MDP by introducing a reward penalization towards model uncertainty.

NOPG, instead, provides a theoretical framework built on a nonparametric approximation of the reward and the state transition. This approximation is used to compute the policy gradient in closed form. For this reason, our method differs from classic model-based solutions, since it does not generate synthetic trajectories. Most of the state-of-the-art model-free offline algorithms, on the other hand, utilize a biased and inconsistent gradient estimate. Instead, our approach delivers a full-gradient estimate that allows a trade-off between bias and variance. The quality of the gradient estimate results in a particularly sample-efficient


policy optimization, as seen in the empirical section.

5 EMPIRICAL EVALUATION

In this section, we analyze our method. We divide our experiments into two parts: the analysis of the gradient, and the analysis of the policy optimization using gradient ascent. The gradient analysis comprises an empirical evaluation of the bias, the variance and the gradient direction w.r.t. the ground truth, in relation to quantities such as the size of the dataset or its degree of "off-policiness". In the policy optimization analysis, instead, we aim both to compare the sample efficiency of our method against state-of-the-art policy gradient algorithms, and to study its applicability to unstructured and human-demonstrated datasets.

5.1 Benchmarking Tasks

In the following, we give a brief description of the tasks involved in the empirical analysis.

5.1.1 Linear Quadratic Gaussian Controller

A very classical control problem consists of linear dynamics, quadratic reward and Gaussian noise. The main advantage of this control problem lies in the fact that it is fully solvable in closed form, using the Riccati equations, which makes it appropriate for verifying the correctness of our algorithm. In our specific scenario, we have a policy encoded with two parameters for illustration purposes. The LQG is defined as

$$\max_\theta \sum_{t=0}^{\infty} \gamma^t r_t \quad \text{s.t.} \quad s_{t+1} = A s_t + B a_t; \quad r_t = -s_t^\intercal Q s_t - a_t^\intercal R a_t; \quad a_t = \Theta s_t + \Sigma \epsilon_t; \quad \epsilon_t \sim \mathcal{N}(0, I),$$

with A, B, Q, R, Σ diagonal matrices and Θ = diag(θ), where θ are considered the policy's parameters. In the stochastic-policy experiments, πθ(a|s) = N(a | Θs, Σ), while for the deterministic case Σ = 0 and πθ(s) = Θs. For further details, please refer to the supplementary material.
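To make the setup concrete, the sketch below (ours) simulates a one-dimensional instance of the LQG above and evaluates the discounted return of the linear-Gaussian policy by Monte-Carlo rollouts; the coefficients A, B, Q, R, Σ and the parameter grid are placeholder choices, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
A, B, Q, R = 1.0, 1.0, 1.0, 0.1        # placeholder 1-D LQG coefficients
Sigma, gamma = 0.1, 0.95

def lqg_return(theta, horizon=200, n_rollouts=100):
    """Monte-Carlo estimate of J(theta) for pi_theta(a|s) = N(a | theta*s, Sigma^2)."""
    returns = []
    for _ in range(n_rollouts):
        s, ret = rng.normal(), 0.0
        for t in range(horizon):
            a = theta * s + Sigma * rng.normal()
            ret += (gamma ** t) * (-(Q * s ** 2 + R * a ** 2))
            s = A * s + B * a
        returns.append(ret)
    return np.mean(returns)

# Sweep the single policy parameter; values near the LQR solution score best.
for theta in [-1.2, -0.9, -0.6, -0.3, 0.0]:
    print(f"theta={theta:+.1f}  J={lqg_return(theta):.2f}")
```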

5.1.2 OpenAI Pendulum-v0

The OpenAI Pendulum-v0 [53] is a popular benchmark in reinforcement learning. It simulates a simple under-actuated inverted pendulum. The goal is to swing the pendulum up until it reaches the top position, and then to keep it stable. The state of the system is fully described by the angle of the pendulum ω and its angular velocity ω̇. The applied torque τ ∈ [−2, 2] corresponds to the agent's action. One of the advantages of such a system is that its well-known value function is two-dimensional.

5.1.3 Quanser Cart-Pole

The cart-pole is another classical task in reinforcement learning. It consists of an actuated cart moving on a track, to which a pole is attached. The goal is to actuate the cart in a way that balances the pole in the top position. Differently from the inverted pendulum, the system has a further degree of complexity, and the state space comprises the position of the cart on the track x, the velocity of the cart ẋ, the angle of the pendulum ω and its angular velocity ω̇.

Acronym        Description                                                         Typology
NOPG-D         Our method with deterministic policy.                               NOPG
NOPG-S         Our method with stochastic policy.                                  NOPG
G(PO)MDP+N     G(PO)MDP with normalized importance sampling.                       PWIS
G(PO)MDP+BN    G(PO)MDP with normalized importance sampling and
               generalized baselines.                                              PWIS
DPG+Q          Offline version of the deterministic policy gradient theorem
               with an oracle for the Q-function.                                  SG
DDPG           Deep Deterministic Policy Gradient.                                 SG
TD3            Improved version of DDPG.                                           SG
SAC            Soft Actor-Critic.                                                  SG
BEAR           Bootstrapping Error Accumulation Reduction.                         SG
BRAC           Behavior Regularized Actor Critic.                                  SG
MOPO           Model-based Offline Policy Optimization.                            MB
MOReL          Model-Based Offline Reinforcement Learning.                         MB

TABLE 1: Acronyms used in the paper to refer to practical implementations of the algorithms (SG: semi-gradient, PWIS: path-wise importance sampling, MB: model-based).

5.1.4 OpenAI Mountain-Car

The mountain-car (also known as car-on-hill) consists of an under-powered car that must reach the top of a hill. The car is placed in the valley connecting two hills. In order to reach the goal position, it must first go in the opposite direction to gain momentum. Its state is described by the x-position of the car and by its velocity ẋ. The episodes terminate when the car reaches the goal. In contrast to the swing-up pendulum, which is hardly controllable by a human being, this car system is ideal for providing human-demonstrated data.

5.1.5 U-Maze

U-Maze is an environment from the D4RL dataset [54]. It consists of a simple two-dimensional maze, in which a ball should reach a goal position. The state representation is 4-dimensional (2D position and velocity), and the 2D action represents the velocity of the ball.

5.1.6 Hopper

The Hopper is a popular one-legged robot, with an 11-dimensional state and a 3-dimensional action space, that should hop forward as fast as possible in a two-dimensional world. We use the implementation offered by MuJoCo [55]. Also in this case, we test NOPG on a dataset provided by D4RL.

5.2 Algorithms Used for Comparisons

To provide an analysis of the gradient, we compare our algorithm against G(PO)MDP with importance sampling, and against offline DPG (DPG with a fixed dataset). Instead of using the naïve form of G(PO)MDP with importance sampling, which suffers from high variance, we use normalized importance sampling [56], [57] (which introduces some bias but drastically reduces the variance), and generalized baselines [58] (which also introduce some bias, as they are


estimated from the same dataset). The offline version of DPG suffers from three different sources of bias: the semi-gradient, the critic approximation, and the improper use of the discounted state distribution [59], [60]. To mitigate these issues and focus more on the semi-gradient's contribution to the bias, we provide an oracle Q-function (we denote this version as DPG+Q). For the policy improvement, instead, we compare both with online algorithms (Figure 9) such as TD3 [12] and SAC [2], and with offline algorithms (BEAR [14], BRAC [15], MOPO [19] and MOReL [20]), as depicted in Figure 10. A full list of the algorithms used in the comparisons, with a brief description, is available in Table 1.

5.3 Analysis of the Gradient

We want to compare the bias and variance of our gradient estimator w.r.t. the already discussed classical estimators. Therefore, we use the LQG setting described in Section 5.1.1, which allows us to compute the true gradient. Our goal is to estimate the gradient w.r.t. the diagonal policy parameters θ1, θ2 of πθ, while sampling from a policy whose parameters are a linear combination of Θ and Θ′. The hyper-parameter α determines the mixing between the two parameter sets. When α = 1 the behavioral policy has parameters Θ′, while when α = 0 the dataset is sampled using Θ. In Figure 4, we can visualize the difference between the two policies with parameters Θ and Θ′. Although not completely disjoint, they are fairly far apart in probability space, especially if we take into account that such distance propagates along the length of the trajectories.

5.3.1 Sample Analysis

We want to study how the bias, the variance and the direction of the estimated gradient vary w.r.t. the dataset's size. We are particularly interested in the off-policy sampling strategy, and in this set of experiments we use a constant α = 0.5. Figure 6a depicts these quantities w.r.t. the number of collected samples. As expected, a general trend for all algorithms is that with a higher number of samples we are able to reduce the variance. The importance-sampling-based G(PO)MDP algorithms eventually obtain a low bias as well. Remarkably, NOPG has both significantly lower

Fig. 4: Evaluated at the initial state, the optimization policy with parameters θ1, θ2 and the behavioral policy with parameters θ′1, θ′2 exhibit a fair distance in probability space.

bias and variance, and its gradient direction is also more accurate than that of the G(PO)MDP algorithms (note the different scales of the y-axis). Between DPG+Q and NOPG there is no sensible difference, but we should take into account the already-mentioned advantage of DPG+Q of having access to the true Q-function.

5.3.2 Off-Policy Analysis

We want to estimate the bias and the variance w.r.t. different degrees of "off-policiness" α, as defined at the beginning of Section 5.3. We want to highlight that in the deterministic experiment the behavioral policy remains stochastic. This is needed to ensure the stochastic generation of datasets, which is essential to estimate the bias and the variance of the estimator. As depicted in Figure 6, the variance of importance-sampling-based techniques tends to increase when the dataset is off-policy. On the contrary, NOPG seems to be more subject to an increase in bias. This trend is also noticeable in DPG+Q, where the bias component is the one playing the major role in the mean squared error. The gradient direction of NOPG seems however unbiased, while DPG+Q has a slight bias but remarkably less variance (note the different scales of the y-axis). We remark that DPG+Q uses an oracle for the Q-function, which supposedly results in lower variance and bias³. The positive bias of DPG+Q in the on-policy case (α = 0) is caused by the improper use of discounting. In general, NOPG shows a decrease in bias and variance of orders of magnitude when compared to the other algorithms.

5.3.3 Bandwidth Analysis

In the previous analysis, we kept the bandwidth parameters of our algorithm fixed, even though a dynamic adaptation of this parameter w.r.t. the size of the dataset might have improved the bias/variance trade-off. We are now interested in studying how the bandwidth impacts the gradient estimation. For this purpose, we generated datasets of 1000 samples with α = 0.5. We set all the bandwidths of state, action and next state, for each dimension, equal to κ. From Figure 8, we evince that a lower bandwidth corresponds to a higher variance, while a larger bandwidth approaches a constant bias and the variance tends to zero. This result is in line with the theory.

5.3.4 Trust Region

In Section 3.3.1, we claimed that the nonparametric technique used has the effect of not considering OOD actions and, more in general, state-action pairs that are in a low-density region: in fact, the magnitude of the gradient in these regions is low. To appreciate this effect, we considered the Pendulum-v0. We generated the data using a Gaussian policy N(µ = 0, Σ = 0.2I). Figure 5a depicts the generated dataset. Subsequently, we generated a set of linear policies a = θ0 sin ω + θ1 cos ω + θ2 ω̇ + θ3, where the parameters θ0 = θ1 = θ2 = 0, θ3 ∈ [−2, 2], and ω and ω̇ represent the angle and the angular velocity of the pendulum. When θ3

3. Furthermore, we suspect that the particular choice of an LQG task tends to mitigate the problems of DPG, as the fast convergence to a stationary distribution due to the stable attractor, combined with the improper discounting, results in a coincidental correction of the state distribution.


Fig. 5: (a) The dataset used for the experiment in Section 5.3.4. The blue plane represents a policy with constant action; by setting different values of θ3 we can obtain different policies, and for θ3 ≈ 0 the policy fits the data best. (b) When the policy has low log-likelihood, the gradient quickly approaches zero (i.e., ‖∇θJπ‖ → 0). (c) The magnitude of the gradient w.r.t. the state decreases in low-density regions, where the prediction is most uncertain.

is close to 0, the policy is close to the samples contained in the data, while if θ3 is close either to −2 or 2, then it is more distant. Figure 5b shows the estimated policy gradient for different policies. In particular, each point represents the log-likelihood of the policy (w.r.t. the actions contained in the dataset) on the x-axis, and the logarithm of the magnitude of the gradient on the y-axis. There is a clear correlation between the log-likelihood and the magnitude of the gradient, which tells us that the gradient is close to zero for unlikely policies, therefore supporting our claim. Furthermore, we investigate the contribution of individual state-action pairs to the gradient. The heatmap in Figure 5c shows that the highest gradient magnitude appears on the diagonals of the state space. Looking at the generated data in Figure 5a, we notice that the majority of samples are also present on the aforementioned diagonals, forming an "X" shape. Hence, our experiment shows that the magnitude of the gradient also depends on the density of the state-action space, with lower magnitude corresponding to lower density of the state space.

5.4 Policy Improvement

In the previous section, we analyzed the statistical properties of our estimator. Conversely, in this section, we use the NOPG estimate to fully optimize the policy. In its current state, NOPG is a batch algorithm, meaning that it receives a set of data as input and outputs an optimized policy, without any interaction with the environment. We study the sample efficiency of the overall algorithm, comparing it with both other batch and online algorithms. Please notice that online algorithms, such as DDPG-On, TD3 and SAC, can acquire more valuable samples during the optimization process. Therefore, in a direct comparison, batch algorithms are at a disadvantage.

5.4.1 Uniform Grid

In this experiment we analyze the performance of NOPG under a uniformly sampled dataset since, as the theory suggests, this scenario should yield the least biased estimate of NOPG. We generate datasets from a grid over the state-action space of the pendulum environment with different granularities. We test our algorithm by optimizing a policy encoded with a neural network for a fixed number of iterations. The policy is composed of a single hidden layer with 50 neurons and ReLU activations. This configuration is fixed across all the different experiments and algorithms for the remainder of this document. The resulting policy is evaluated on trajectories of 500 steps starting from the bottom position. The leftmost plot in Figure 9 depicts the performance against different dataset sizes, showing that NOPG is able to solve the task with 450 samples. Figure 7 is an example of the value function and state-distribution estimates of NOPG-D at the beginning and after 300 optimization steps. The ability to predict the state distribution is particularly interesting for robotics, as it is possible to predict in advance whether the policy will move towards dangerous states. Note that this experiment is not applicable to PWIS, as it does not admit non-trajectory-based data.

5.4.2 Comparison with Online Algorithms

In contrast to the uniform grid experiment, here we collect the datasets using trajectories from a random agent in the pendulum and the cart-pole environments. In the pendulum task, the trajectories are generated starting from the upright position and applying a policy composed of a mixture of two Gaussians. The policies are evaluated starting from the bottom position with an episode length of 500 steps. The datasets used in the cart-pole experiments are collected using a uniform policy starting from the upright position until the end of the episode, which occurs when the absolute value of the angle θ surpasses 3 deg. The optimized policy is evaluated for 10⁴ steps. The reward is rt = cos θt. Since θ is defined as 0 in the upright position, a return of 10⁴ indicates an optimal policy behavior.

We analyze the sample efficiency by testing NOPG in an offline fashion with pre-collected samples, on a different number of trajectories. In addition, we provide the learning curves of DDPG, TD3 and SAC using the implementations in Mushroom [61]. For a fixed size of the dataset, we optimize DDPG-Off and NOPG for a fixed number of steps. For NOPG, which is offline, we select the policy from the


[Figure 6: plot panels omitted. (a) Sample Analysis: squared bias, variance and MSE of NOPG-S, NOPG-D, G(PO)MDP+N, G(PO)MDP+BN and DPG+Q, together with the angle between the estimated and the true gradient, as a function of the number of samples. (b) Off-Policy Analysis: the same quantities as a function of the off-policy coefficient α.]

Fig. 6: Bias, variance, MSE and gradient direction analysis. The MSE plots are equipped with a 95% interval obtained with bootstrapping techniques. The direction analysis plots describe the distribution of the angle between the estimates and the ground-truth gradient. NOPG exhibits favorable bias, variance and gradient direction compared to PWIS and semi-gradient approaches.


[Figure 7: plot panels omitted. Phase portraits over (α, ω) of the state distribution µ_π0 and value function V_π0 before any policy improvement, and of µ_π300 and V_π300 after 300 offline updates.]

Fig. 7: A phase portrait of the state distribution µπ and value function Vπ estimated in the swing-up pendulum task with NOPG-D. Green corresponds to higher values. The two leftmost figures show the estimates before any policy improvement, while the two rightmost show them after 300 offline updates of NOPG-D. Notice that the algorithm finds a very good approximation of the optimal value function and is able to predict that the system will reach the goal state ((α, ω) = (0, 0)).

[Figure 8: plot panels omitted. Squared bias, variance and MSE of NOPG-S, together with the angle between the estimated and the true gradient, as a function of the bandwidth κ.]

Fig. 8: A lower bandwidth corresponds to higher variance, while a higher bandwidth increases the bias up to a plateau.

The two rightmost plots in Figure 9 highlight that our algorithm has superior sample efficiency, by more than one order of magnitude, w.r.t. the considered online algorithms (note the log scale on the x-axis).

To validate the policy learned in the simulated cart-pole (Figure 3), we apply the final learned controller to a real Quanser cart-pole and observe a successful stabilizing behavior, as can be seen in the supplementary video.

5.4.3 Comparison with Offline Algorithms

We use the same environments described in Section 5.4.2 to compare against state-of-the-art offline algorithms. To this end, we used the same Cart-Pole and Pendulum-v0 datasets to train BRAC, BEAR, MOPO and MOReL. Furthermore, to allow a fairer comparison using the D4RL dataset, we tested our algorithm against the aforementioned baselines on the U-Maze. To perform the usual sample analysis, we sub-sampled the dataset into smaller datasets. The results of this analysis can be viewed in Figure 10. NOPG exhibits competitive performance w.r.t. the baselines. In more detail, BEAR and MOReL perform suboptimally in our experiments, and MOPO performs similarly to NOPG on Pendulum-v0 while failing on the Quanser Cart-Pole. BRAC exhibits a behavior similar to that of NOPG. All the algorithms, however, seem to fail on the U-Maze, probably due to the scarcity of data and its sparse reward.

5.4.4 Human Demonstrated Data

In robotics, learning from human demonstrations is crucial in order to obtain better sample efficiency and to avoid dangerous policies. This experiment is designed to showcase the ability of our algorithm to deal with such demonstrations without the need for explicit knowledge of the underlying behavioral policy. The experiment is executed in a completely offline fashion after collecting the human dataset, i.e., without any further interaction with the environment. This setting is different from classical imitation learning with subsequent optimization [62]. As the environment we choose the continuous mountain car task from OpenAI Gym. We provide 10 demonstrations recorded by a human operator and assign a reward of −1 to every step. A demonstration ends when the human operator surpasses the limit of 500 steps or arrives at the goal position. The human operator deliberately provides sub-optimal trajectories, as we are interested in analyzing whether NOPG is able to take advantage of the human demonstrations to learn a better policy than that of the human, without any further interaction with the environment. To obtain a sample analysis, we evaluate NOPG on randomly selected subsets of the trajectories from the human demonstrations. Figure 11 shows the average performance as a function of the number of demonstrations, as well as an example of a human-demonstrated trajectory. A vanilla behavioral cloning approach trained with the whole dataset leads to worse performance than the human demonstrator: by simply replicating the demonstrator, the cloned behavior is most of the time unable to reach the flag (a minimal sketch of such a baseline is given below). Notice that both NOPG-S and NOPG-D manage to learn a policy that


[Figure 9: plot panels omitted. Return as a function of the sample size for Pendulum-v0 with the uniform grid (left), Pendulum-v0 with the random agent (center) and the Quanser Cart-Pole (right), comparing NOPG-D, NOPG-S, DDPG, TD3 and SAC.]

Fig. 9: Comparison of NOPG, in its deterministic and stochastic versions, to state-of-the-art online algorithms on continuous control tasks: Swing-Up Pendulum with uniform grid sampling (left), Swing-Up Pendulum with the random agent (center) and Cart-Pole stabilization (right). The figures depict the mean and 95% confidence interval over 10 trials. NOPG outperforms the baselines w.r.t. the sample complexity. Note the log scale along the x-axis.

[Figure 10: plot panels omitted. Return as a function of the sample size for Pendulum-v0 (left), the Quanser Cart-Pole (center) and the U-Maze (right), comparing NOPG-D, NOPG-S, BEAR, BRAC, MOPO and MOReL.]

Fig. 10: Comparison of NOPG, in its deterministic and stochastic versions, to state-of-the-art offline algorithms on continuous control tasks: Swing-Up Pendulum with the random agent (left), Cart-Pole stabilization (center) and U-Maze with the D4RL dataset (right). The figures depict the mean and 95% confidence interval over 10 trials. NOPG is competitive with the sample efficiency of the considered baselines. Note the log scale along the x-axis.

surpasses the human operator's performance and reaches the optimal policy with as few as two demonstrated trajectories.
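The vanilla behavioral cloning baseline mentioned above can be sketched as a plain supervised regression from demonstrated states to demonstrated actions. The exact baseline implementation is not specified in the text, so the network size and library below are assumptions chosen to mirror the policies used elsewhere in this paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def behavioral_cloning(demonstrations):
    """demonstrations: list of trajectories, each a list of (state, action) pairs."""
    states = np.vstack([s for traj in demonstrations for (s, a) in traj])
    actions = np.concatenate([np.atleast_1d(a) for traj in demonstrations for (s, a) in traj])
    # Supervised fit: simply imitate the (possibly suboptimal) human actions.
    regressor = MLPRegressor(hidden_layer_sizes=(50,), activation="relu", max_iter=2000)
    regressor.fit(states, actions)
    return lambda s: regressor.predict(np.atleast_2d(s))[0]
```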

5.4.5 Test on a Higher-Dimensional Task

All the considered tasks are relatively low dimensional, with the highest dimensionality being 6, accounting for both states and actions, in the U-Maze. We also tested NOPG on the Hopper, using the D4RL dataset. NOPG improves with the number of samples provided, although it does not reach a satisfactory policy (Figure 12). The low performance is probably caused by the poor scaling of nonparametric methods to high dimensions, and by the typically large amount of samples needed due to the complexity of the considered task.

5.5 Computational Complexity

Our nonparametric approach involves a layer of computation that is unusual in classic deep reinforcement learning solutions. In particular, the construction of the matrix P^γ_π and the inversion of Λ_π can be expensive. In the following, we analyze the computational resources required by NOPG. In particular, we use different sizes of the Pendulum-v0 dataset and investigate the time required to compute one iteration of Algorithm 1. As stated in Section 3.2, the iteration time grows quadratically w.r.t. the number of samples contained in the dataset. The computational cost can be lowered by reducing the number of non-zero elements of the matrix P^γ_π (Figure 13). However, while our algorithm considers all the samples at every iteration, classic deep reinforcement learning algorithms usually consider only a fixed amount of data (a "mini-batch"), and their per-iteration cost is therefore constant.
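The row-wise sparsification mentioned above (keeping only k non-zero entries per row of P^γ_π, cf. Figure 13 and Appendix 2.3) can be sketched as follows; this is an illustration of the idea under the stated assumptions, not the paper's exact implementation, and the renormalization choice is ours.

```python
import numpy as np
from scipy import sparse

def sparsify_rows(P_dense, k):
    """Keep the k largest entries of each row of P_dense and renormalize them,
    returning a CSR matrix whose per-iteration multiplication cost scales with n * k."""
    n = P_dense.shape[0]
    rows, cols, vals = [], [], []
    for i in range(n):
        top = np.argpartition(P_dense[i], -k)[-k:]      # indices of the k largest entries
        w = P_dense[i, top]
        w = w * (P_dense[i].sum() / w.sum())            # assumption: preserve the row mass
        rows.extend([i] * k); cols.extend(top); vals.extend(w)
    return sparse.csr_matrix((vals, (rows, cols)), shape=P_dense.shape)
```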

6 CONCLUSION AND FUTURE WORK

In this paper, we presented and analyzed an off-policy gradient technique, the Nonparametric Off-policy Policy Gradient (NOPG) [21]. Our estimator overcomes the main issues of existing off-policy gradient estimation techniques: on the one hand, in contrast to semi-gradient approaches, it delivers a full-gradient estimate; on the other hand, it avoids


[Figure 11: plot panels omitted. Left: return as a function of the number of trajectories for NOPG-D, NOPG-S, Behavioral Cloning and the human baseline on Mountain Car. Right: an example human demonstration, the optimized trajectory, the behavioral cloning trajectory and the goal in the (x, ẋ) plane.]

Fig. 11: With a small amount of data, NOPG is able to reach a policy that surpasses the human demonstrator (dashed line) in the mountain car environment. Depicted are the mean and 95% confidence interval over 10 trials (left). An example of a human-demonstrated trajectory and the corresponding optimized version obtained with NOPG (right). Although the human trajectories in the dataset are suboptimal, NOPG converges to an optimal solution.

[Figure 12: plot panel omitted. Return as a function of the sample size for NOPG-D and NOPG-S on Hopper.]

Fig. 12: Performance of NOPG on Hopper-v0, using the D4RL dataset.

the high variance of importance sampling by phrasing the problem with nonparametric techniques. The empirical analysis clearly showed a better gradient estimate in terms of bias, variance, and direction. Our experiments also showed that our method has high sample efficiency and that our algorithm can be behavioral-agnostic and cope with unstructured data.

However, our algorithm, being built on nonparametric techniques, suffers from the curse of dimensionality. Future work aims to mitigate this problem. We plan to investigate better sparsification techniques combined with an adaptive bandwidth. The promising properties of the proposed gradient estimate can, in the future, be carried over to parametric inference, to extend our approach to the more versatile deep-learning setting.

ACKNOWLEDGMENTS

The research is financially supported by the Bosch-Forschungsstiftung program and the European Union’s

[Figure 13: plot panel omitted. Iteration time [s] as a function of the number of samples, for k = 10, 20, 30, 40, 50 non-zero elements per row.]

Fig. 13: For a fixed number k of non-zero elements per row of the transition matrix, the iteration time grows quadratically w.r.t. the number of samples. A lower k requires a shorter iteration time.

Horizon 2020 research and innovation program under grantagreement #640554 (SKILLS4ROBOTS).

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-Level Control Through Deep Reinforcement Learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015. [Online]. Available: http://www.nature.com/articles/nature14236

[2] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” in Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 1856–1865.

[3] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, “Trust Region Policy Optimization,” in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 1889–1897.

[4] D. Ernst, P. Geurts, and L. Wehenkel, “Tree-Based Batch Mode Reinforcement Learning,” Journal of Machine Learning Research, vol. 6, no. Apr, pp. 503–556, 2005. [Online]. Available: http://www.jmlr.org/papers/v6/ernst05a.html

[5] M. Riedmiller, “Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method,” in European Conference on Machine Learning, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2005, pp. 317–328.

[6] L. Baird, “Residual Algorithms: Reinforcement Learning with Function Approximation,” Machine Learning Proceedings, pp. 30–37, 1995.

[7] T. Lu, D. Schuurmans, and C. Boutilier, “Non-Delusional Q-learning and Value-Iteration,” in Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018, pp. 9949–9959. [Online]. Available: http://papers.nips.cc/paper/8200-non-delusional-q-learning-and-value-iteration.pdf

[8] E. Imani, E. Graves, and M. White, “An Off-Policy Policy Gradient Theorem Using Emphatic Weightings,” in Advances in Neural Information Processing Systems, 2018, pp. 96–106.

[9] T. Degris, M. White, and R. S. Sutton, “Off-Policy Actor-Critic,” arXiv:1205.4839 [cs], May 2012. [Online]. Available: http://arxiv.org/abs/1205.4839

[10] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic Policy Gradient Algorithms,” in Proceedings of the 31st International Conference on Machine Learning, 2014.

[11] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous Control with Deep Reinforcement Learning,” in International Conference on Learning Representations, 2016. [Online]. Available: http://arxiv.org/abs/1509.02971

[12] S. Fujimoto, D. Meger, and D. Precup, “Off-Policy Deep Reinforcement Learning without Exploration,” in Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 2052–2062. [Online]. Available: http://proceedings.mlr.press/v97/fujimoto19a/fujimoto19a.pdf


[13] A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative Q-Learning for Offline Reinforcement Learning,” arXiv preprint arXiv:2006.04779, 2020.

[14] A. Kumar, J. Fu, G. Tucker, and S. Levine, “Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction,” arXiv preprint arXiv:1906.00949, 2019.

[15] Y. Wu, G. Tucker, and O. Nachum, “Behavior Regularized Offline Reinforcement Learning,” arXiv preprint arXiv:1911.11361, 2019.

[16] C. R. Shelton, “Policy Improvement for POMDPs Using Normalized Importance Sampling,” in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, ser. UAI'01. Morgan Kaufmann Publishers Inc., 2001, pp. 496–503. [Online]. Available: http://dl.acm.org/citation.cfm?id=2074022.2074083

[17] N. Meuleau, L. Peshkin, and K.-E. Kim, “Exploration in Gradient-Based Reinforcement Learning,” Massachusetts Institute of Technology, Tech. Rep., 2001. [Online]. Available: https://dspace.mit.edu/handle/1721.1/6076

[18] L. Peshkin and C. R. Shelton, “Learning from Scarce Experience,” in Proceedings of the Nineteenth International Conference on Machine Learning, 2002. [Online]. Available: http://arxiv.org/abs/cs/0204043

[19] T. Yu, G. Thomas, L. Yu, S. Ermon, J. Zou, S. Levine, C. Finn, and T. Ma, “MOPO: Model-based Offline Policy Optimization,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2020.

[20] R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims, “MOReL: Model-based Offline Reinforcement Learning,” arXiv preprint arXiv:2005.05951, 2020.

[21] S. Tosatto, J. Carvalho, H. Abdulsamad, and J. Peters, “A Nonparametric Off-Policy Policy Gradient,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), S. Chiappa and R. Calandra, Eds., Palermo, Italy, 2020.

[22] M. White, “Unifying Task Specification in Reinforcement Learning,” in Proceedings of the 34th International Conference on Machine Learning. JMLR.org, 2017, pp. 3742–3750.

[23] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy Gradient Methods for Reinforcement Learning with Function Approximation,” in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.

[24] R. J. Williams, “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning,” Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.

[25] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, 1992. [Online]. Available: https://doi.org/10.1007/BF00992698

[26] D. Ormoneit and S. Sen, “Kernel-Based Reinforcement Learning,” Machine Learning, vol. 49, no. 2, pp. 161–178, 2002. [Online]. Available: https://doi.org/10.1023/A:1017928328829

[27] X. Xu, D. Hu, and X. Lu, “Kernel-Based Least Squares Policy Iteration for Reinforcement Learning,” IEEE Transactions on Neural Networks, vol. 18, no. 4, pp. 973–992, 2007.

[28] Y. Engel, S. Mannor, and R. Meir, “Reinforcement Learning with Gaussian Processes,” in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 201–208.

[29] G. Taylor and R. Parr, “Kernelized Value Function Approximation for Reinforcement Learning,” in Proceedings of the 26th International Conference on Machine Learning, ser. ICML '09. ACM, 2009, pp. 1017–1024. [Online]. Available: http://doi.acm.org/10.1145/1553374.1553504

[30] O. B. Kroemer and J. R. Peters, “A Non-Parametric Approach to Dynamic Programming,” in Advances in Neural Information Processing Systems. Curran Associates, Inc., 2011, pp. 1719–1727. [Online]. Available: http://papers.nips.cc/paper/4182-a-non-parametric-approach-to-dynamic-programming.pdf

[31] F. Borrelli, A. Bemporad, and M. Morari, Predictive Control for Linear and Hybrid Systems. Cambridge University Press, Jun. 2017.

[32] E. A. Nadaraya, “On Estimating Regression,” Theory of Probability & Its Applications, vol. 9, no. 1, pp. 141–142, 1964.

[33] G. S. Watson, “Smooth Regression Analysis,” Sankhya: The Indian Journal of Statistics, Series A, pp. 359–372, 1964.

[34] J. Fan, “Design-Adaptive Nonparametric Regression,” Journal of the American Statistical Association, vol. 87, no. 420, pp. 998–1004, 1992.

[35] L. Wasserman, All of Nonparametric Statistics. Springer, 2006.

[36] S. Tosatto, R. Akrour, and J. Peters, “An Upper Bound of the Bias of Nadaraya-Watson Kernel Regression under Lipschitz Assumptions,” arXiv preprint arXiv:2001.10972, 2020.

[37] J. Peters, K. Mulling, and Y. Altun, “Relative Entropy Policy Search,” in Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

[38] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[39] A. M. Metelli, M. Papini, F. Faccio, and M. Restelli, “Policy Optimization via Importance Sampling,” in Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018, pp. 5442–5454. [Online]. Available: http://papers.nips.cc/paper/7789-policy-optimization-via-importance-sampling.pdf

[40] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models,” in Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018, pp. 4754–4765.

[41] A. Antos, R. Munos, and C. Szepesvari, “Fitted Q-Iteration in Continuous Action-Space MDPs,” in Neural Information Processing Systems, 2007.

[42] O. Kroemer, E. Ugur, E. Oztop, and J. Peters, “A Kernel-Based Approach to Direct Action Perception,” in International Conference on Robotics and Automation. IEEE, 2012, pp. 2605–2610.

[43] J. Baxter and P. L. Bartlett, “Infinite-Horizon Policy-Gradient Estimation,” Journal of Artificial Intelligence Research, vol. 15, pp. 319–350, 2001.

[44] S. Kakade, “A Natural Policy Gradient,” in Proceedings of the 14th International Conference on Neural Information Processing Systems, 2001, pp. 1531–1538.

[45] J. Peters and S. Schaal, “Natural Actor-Critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.

[46] A. B. Owen, Monte Carlo Theory, Methods and Examples, 2013.

[47] R. S. Sutton, A. R. Mahmood, and M. White, “An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2603–2631, 2016.

[48] Q. Liu, L. Li, Z. Tang, and D. Zhou, “Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation,” in Advances in Neural Information Processing Systems, 2018, pp. 5356–5366.

[49] Y. Liu, A. Swaminathan, A. Agarwal, and E. Brunskill, “Off-Policy Policy Gradient with State Distribution Correction,” arXiv:1904.08473, 2019. [Online]. Available: http://arxiv.org/abs/1904.08473

[50] O. Nachum, B. Dai, I. Kostrikov, Y. Chow, L. Li, and D. Schuurmans, “AlgaeDICE: Policy Gradient from Arbitrary Experience,” arXiv:1912.02074v1, 2019.

[51] A. Argenson and G. Dulac-Arnold, “Model-Based Offline Planning,” in Proceedings of the 9th International Conference on Learning Representations, 2020.

[52] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch, “Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control,” arXiv preprint arXiv:1811.01848, 2018.

[53] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv:1606.01540, 2016. [Online]. Available: http://arxiv.org/abs/1606.01540

[54] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine, “D4RL: Datasets for Deep Data-Driven Reinforcement Learning,” arXiv preprint arXiv:2004.07219, 2020.

[55] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A Physics Engine for Model-Based Control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.

[56] C. R. Shelton, “Policy Improvement for POMDPs Using Normalized Importance Sampling,” arXiv preprint arXiv:1301.2310, 2013.

[57] R. Y. Rubinstein and D. P. Kroese, Simulation and the Monte Carlo Method. John Wiley & Sons, 2016, vol. 10.

[58] T. Jie and P. Abbeel, “On a Connection Between Importance Sampling and the Likelihood Ratio Policy Gradient,” in Advances in Neural Information Processing Systems, 2010, pp. 1000–1008.

[59] P. Thomas, “Bias in Natural Actor-Critic Algorithms,” in International Conference on Machine Learning, 2014, pp. 441–448.


[60] C. Nota and P. S. Thomas, “Is the Policy Gradient a Gradient?” in Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems, 2020.

[61] C. D'Eramo, D. Tateo, A. Bonarini, M. Restelli, and J. Peters, “MushroomRL: Simplifying Reinforcement Learning Research,” arXiv preprint arXiv:2001.01102, 2020. [Online]. Available: https://github.com/MushroomRL/mushroom-rl

[62] J. Kober and J. R. Peters, “Policy Search for Motor Primitives in Robotics,” in Advances in Neural Information Processing Systems, 2009, pp. 849–856.

Samuele Tosatto received his Ph.D. in Computer Science from the Technical University of Darmstadt in 2020. He previously obtained his M.Sc. at the Polytechnic University of Milan. He is currently a post-doctoral fellow at the University of Alberta, working in the Reinforcement Learning and Artificial Intelligence group. His research interests center around reinforcement learning, with a focus on its application to real robotic systems.

Joao Carvalho is currently a Ph.D. student at the Intelligent Autonomous Systems group of the Technical University of Darmstadt. Previously, he completed an M.Sc. degree in Computer Science at the Albert-Ludwigs-Universität Freiburg, and studied Electrical and Computer Engineering at the Instituto Superior Técnico of the University of Lisbon. His research is focused on learning algorithms for control and robotics.

Jan Peters is a full professor (W3) for Intelligent Autonomous Systems at the Computer Science Department of the Technical University of Darmstadt. He has been a senior research scientist and group leader at the Max Planck Institute for Intelligent Systems, where he headed the interdepartmental Robot Learning Group. Jan Peters has received the Dick Volz Best 2007 US Ph.D. Thesis Runner-Up Award, the Robotics: Science & Systems Early Career Spotlight, the INNS Young Investigator Award, and the IEEE Robotics & Automation Society's Early Career Award, as well as numerous best paper awards. In 2015, he received an ERC Starting Grant and in 2019 he was appointed as an IEEE Fellow.


APPENDIX

1 SUPPORT TO THE THEORETICAL ANALYSIS

1.1 Existence and Uniqueness of the Fixed Point of the Nonparametric Bellman Equation

In the following, we discuss the existence and uniqueness of the solution of the nonparametric Bellman equation (NPBE). To do so, we first notice that the solution must be a linear combination of the responsibility vector ε_π(s), and therefore any solution of the NPBE must be bounded. We also show that such a bound is ±R_max/(1 − γ_c). We use this bound to show that the solution must be unique. Subsequently, we prove Theorem 1.

Proposition 1 (Space of the Solution). The solution of the NPBE is ε_π^⊺(s) q with q ∈ R^n, where n is the number of support points.

Informal Proof: The NPBE is

V_π(s) = ε_π^⊺(s) ( r + ∫_S φ_γ(s′) V_π(s′) ds′ ).    (12)

Both r and ∫_S φ_γ(s′) V_π(s′) ds′ are constant w.r.t. s, therefore V_π(s) must be linear in ε_π(s).

A first consequence of Proposition 1 is that, since ε_π(s) is a vector of kernels, V_π(s) is bounded. In Theorem 1 we state that the solution of the NPBE is V*_π(s) = ε_π^⊺(s) Λ_π^{-1} r. It is trivial to show that this is a valid solution of the NPBE; however, its uniqueness is non-trivial. In order to prove it, we first show that if a solution to the NPBE exists, then it must be bounded by [−R_max/(1 − γ_c), R_max/(1 − γ_c)], where R_max = max_i |r_i|. Note that this nice property is not common for other policy-evaluation algorithms (e.g., Neural Fitted Q-Iteration [5]).

Proposition 2 (Bound of the NPBE). If V_π : S → R is a solution to the NPBE, then |V_π(s)| ≤ R_max/(1 − γ_c).

Proof. Suppose, by contradiction, that a function f(s) is a solution of the NPBE and that ∃ z ∈ S : f(z) = R_max/(1 − γ_c) + ε with ε > 0. Since the solution of the NPBE must be bounded, we can further assume without any loss of generality that f(s) ≤ f(z). Then,

R_max/(1 − γ_c) + ε = ε_π^⊺(z) r + ε_π^⊺(z) ∫_S φ_γ(s′) f(s′) ds′.

Since by assumption the previous equation must be fulfilled, ε_π(z) is a stochastic vector, and |ε_π^⊺(z) r| ≤ R_max, we have

| R_max/(1 − γ_c) + ε − ε_π^⊺(z) ∫_S φ_γ(s′) f(s′) ds′ | ≤ R_max.    (13)

However, noticing also that 0 ≤ φ^{π,γ}_i(s) ≤ γ_c, we have

R_max/(1 − γ_c) + ε − ε_π^⊺(z) ∫_S φ_γ(s′) f(s′) ds′ ≥ R_max/(1 − γ_c) + ε − γ_c max_s f(s)
 ≥ R_max/(1 − γ_c) + ε − γ_c R_max/(1 − γ_c) = R_max + ε,

which is in contradiction with Equation 13. A completely symmetric proof can be derived assuming by contradiction that ∃ z ∈ S : f(z) = −R_max/(1 − γ_c) − ε and f(s) ≥ f(z).

Proposition 3. If r is bounded by R_max and f* : S → R satisfies the NPBE, then there is no other function f : S → R for which ∃ z ∈ S with |f*(z) − f(z)| > 0.

Proof. Suppose, by contradiction, that there exists a function g : S → R such that f*(s) + g(s) satisfies Equation 12, and furthermore assume that ∃ z : g(z) ≠ 0. Note that, since f* : S → R is a solution of the NPBE,

∫_S ε_π^⊺(s) φ_γ(s′) f*(s′) ds′ ∈ R,    (14)

and similarly

∫_S ε_π^⊺(s) φ_γ(s′) (f*(s′) + g(s′)) ds′ ∈ R.    (15)

The existence of the integrals in Equations 14 and 15 implies

∫_S ε_π^⊺(s) φ_γ(s′) g(s′) ds′ ∈ R.    (16)


Note that

|g(s)| = | f*(s) − (f*(s) + g(s)) |
 = | f*(s) − ε_π^⊺(s) ( r + ∫_S φ_γ(s′) (f*(s′) + g(s′)) ds′ ) |
 = | ε_π^⊺(s) ( r + ∫_S φ_γ(s′) f*(s′) ds′ ) − ε_π^⊺(s) ( r + ∫_S φ_γ(s′) (f*(s′) + g(s′)) ds′ ) |
 = | ε_π^⊺(s) ∫_S φ_γ(s′) g(s′) ds′ |.

Using Jensen's inequality,

|g(s)| ≤ ε_π^⊺(s) ∫_S φ_γ(s′) |g(s′)| ds′.

Since both f* and f* + g are bounded by R_max/(1 − γ_c), we have |g(s)| ≤ A = 2 R_max/(1 − γ_c), and therefore

|g(s)| ≤ ε_π^⊺(s) ∫_S φ_γ(s′) |g(s′)| ds′    (17)
 ≤ A ε_π^⊺(s) ∫_S φ_γ(s′) ds′
 ≤ γ_c A.

We can iterate this reasoning, now posing |g(s)| ≤ γ_c A, and eventually we obtain |g(s)| ≤ 0, which is in contradiction with the assumption made.

Proof of Theorem 1

Proof. Saying that V*_π is a solution of Equation 12 is equivalent to saying that

V*_π(s) − ε_π^⊺(s) ( r + ∫_S φ_γ(s′) V*_π(s′) ds′ ) = 0   ∀ s ∈ S.

We can verify this by simple algebraic manipulation:

V*_π(s) − ε_π^⊺(s) ( r + ∫_S φ_γ(s′) V*_π(s′) ds′ )
 = ε_π^⊺(s) Λ_π^{-1} r − ε_π^⊺(s) ( r + ∫_S φ_γ(s′) ε_π^⊺(s′) Λ_π^{-1} r ds′ )
 = ε_π^⊺(s) ( Λ_π^{-1} r − r − ∫_S φ_γ(s′) ε_π^⊺(s′) Λ_π^{-1} r ds′ )
 = ε_π^⊺(s) ( ( I − ∫_S φ_γ(s′) ε_π^⊺(s′) ds′ ) Λ_π^{-1} r − r )
 = ε_π^⊺(s) ( Λ_π Λ_π^{-1} r − r )
 = 0.    (18)

Since Equation 12 has (at least) one solution, Proposition 3 guarantees that this solution (V*_π) is unique.
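Although the proof above concerns the continuous nonparametric operator, the same fixed-point algebra can be checked numerically on a finite analogue in which Λ_π = I − P^γ_π and the closed-form solution is q_π = Λ_π^{-1} r. The small example below is only an illustration of this discrete analogue (random sub-stochastic matrix, arbitrary rewards), not of the full NPBE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma_c = 5, 0.95
P = rng.random((n, n))
P = gamma_c * P / P.sum(axis=1, keepdims=True)   # sub-stochastic analogue of P^gamma_pi
r = rng.uniform(-1.0, 1.0, size=n)

Lambda = np.eye(n) - P                 # finite analogue of Lambda_pi
q = np.linalg.solve(Lambda, r)         # closed-form value vector

assert np.allclose(q, r + P @ q)       # the Bellman fixed point holds
# Bound analogous to Proposition 2: |q| <= R_max / (1 - gamma_c)
assert np.max(np.abs(q)) <= np.max(np.abs(r)) / (1.0 - gamma_c) + 1e-9
```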

1.2 Bias of the Nonparametric Bellman Equation

In this section, we want to show the findings of Theorem 3. To do so, we introduce the infinite-samples extension of theNPBE.


Proposition 4. Suppose we have a dataset of infinite samples, in particular one sample for each state-action pair of the state-action space. In the limit of infinite samples, the NPBE defined in Definition 2 with a dataset lim_{n→∞} D_n collected under a distribution β on the state-action space and MDP M converges to

V_D(s) = lim_{n→∞} ∫_A [ Σ_{i=1}^n ψ_i(s) ϕ_i(a) ( r_i + γ_i ∫_S φ_i(s′) V_π(s′) ds′ ) / Σ_{j=1}^n ψ_j(s) ϕ_j(a) ] π(a|s) da
 = ∫_A [ lim_{n→∞} (1/n) Σ_{i=1}^n ψ_i(s) ϕ_i(a) ( r_i + γ_i ∫_S φ_i(s′) V_π(s′) ds′ ) / lim_{n→∞} (1/n) Σ_{j=1}^n ψ_j(s) ϕ_j(a) ] π(a|s) da
 = ∫_A [ ∫_{S×A} ψ(s, z) ϕ(a, b) ( R(z, b) + ∫_S φ(s′, z′) p_γ(z′|b, z) V_π(s′) ds′ ) β(z, b) dz db / ∫_{S×A} ψ(s, z) ϕ(a, b) β(z, b) dz db ] π(a|s) da.    (19)

If we require the process generating the samples to be a non-degenerate distribution, and −R_max ≤ R ≤ R_max, Propositions 1-3 remain valid. Furthermore, from Equation 19 we infer that E_D[V_D(s)] = V_D(s) (since D is an infinite dataset, it does not matter if we re-sample it: the resulting value function will always be the same).

Proof of Theorem 3. To keep the notation uncluttered, let us introduce

ε(s, a, z, b) = ψ(s, z) ϕ(a, b) β(z, b) / ∫_{S×A} ψ(s, z) ϕ(a, b) β(z, b) dz db.    (20)

We want to bound

V(s) − V*(s) = ∫_A ( ∫ ε(s, a, z, b) R(z, b) dz db + ∫_{S×A} ε(s, a, z, b) ∫_S φ(s′, z′) p_γ(z′|b, z) V(s′) ds′ dz db − R(s, a) − ∫_S V*(s′) p_γ(s′|s, a) ds′ ) π(a|s) da

⟹ |V(s) − V*(s)| ≤ max_a | ∫ ε(s, a, z, b) (R(z, b) − R(s, a)) dz db |    (21)
 + max_a | ∫_{S×A} ε(s, a, z, b) ∫_S p_γ(z′|b, z) ( φ(s′, z′) V(s′) − V*(s′) ) ds′ dz db |
 ≤ max_a | ∫ ε(s, a, z, b) (R(z, b) − R(s, a)) dz db |    (22)
 + γ_c max_a | ∫_{S×A} ε(s, a, z, b) ∫_S p(z′|b, z) ( φ(s′, z′) V(s′) − V*(s′) ) ds′ dz db |
 = A_Bias + γ_c B_Bias,    (23)

where A_Bias denotes the first term and B_Bias the second one (without the γ_c factor).

Term A_Bias is the bias of Nadaraya-Watson kernel regression, as observed in [36]; therefore Theorem 2 applies:

A_Bias = L_R Σ_{k=1}^d h_k [ Π_{i≠k} e^{L_β² h_i² / 2} (1 + erf(h_i L_β / √2)) ] [ 1/√(2π) + L_β h_k e^{L_β² h_k² / 2} (1 + erf(h_k L_β / √2)) ]
         / [ Π_{i=1}^d e^{L_β² h_i² / 2} (1 − erf(h_i L_β / √2)) ],


where h = [h_ψ, h_ϕ] and d = d_s + d_a. For the second term,

B_Bias ≤ max_a | ∫_{S×A} ψ(s, z) ϕ(a, b) ( ∫_{S×S} V(z′) φ(z′, s′) p(s′|s, a) ds′ dz′ − ∫_S V*(s′) p(s′|s, a) ds′ ) β(z, b) dz db / ∫_{S×A} ψ(s, z) ϕ(a, b) β(z, b) dz db |
 = max_a | ∫_{S×A} ψ(s, z) ϕ(a, b) ( ∫_S ∫_S ( V(z′) φ(z′, s′) − V*(s′) ) p(s′|s, a) ds′ dz′ ) β(z, b) dz db / ∫_{S×A} ψ(s, z) ϕ(a, b) β(z, b) dz db |
 ≤ max_{a, s′} | ∫_{S×A} ψ(s, z) ϕ(a, b) ( ∫_S V(z′) φ(z′, s′) dz′ − V*(s′) ) β(z, b) dz db / ∫_{S×A} ψ(s, z) ϕ(a, b) β(z, b) dz db |
 = max_{a, s′} | ( ∫_{S×A} ψ(s, z) ϕ(a, b) β(z, b) dz db / ∫_{S×A} ψ(s, z) ϕ(a, b) β(z, b) dz db ) ( ∫_S V(z′) φ(z′, s′) dz′ − V*(s′) ) |
 = max_{s′} | ∫_S V(z′) φ(z′, s′) dz′ − V*(s′) |
 = max_{s′} | ∫_S V(s′ + δ) φ(s′ + δ, s′) dδ − V*(s′) |.    (24)

Note that

φ(s′ + δ, s′) = Π_{i=1}^{d_s} exp( −δ_i² / (2 h_{φ,i}²) ) / √(2π h_{φ,i}²),

thus, using the Lipschitz inequality,

max_{s′} | ∫_S V(s′ + δ) φ(s′ + δ, s′) dδ − V*(s′) |
 ≤ max_{s′} | V(s′) − V*(s′) | + ∫_S L_V ( Σ_{i=1}^{d_s} |δ_i| ) Π_{i=1}^{d_s} exp( −δ_i² / (2 h_{φ,i}²) ) / √(2π h_{φ,i}²) dδ
 = max_{s′} | V(s′) − V*(s′) | + L_V Σ_{k=1}^{d_s} ( Π_{i≠k} ∫_{−∞}^{+∞} exp( −δ_i² / (2 h_{φ,i}²) ) / √(2π h_{φ,i}²) dδ_i ) ∫_{−∞}^{+∞} |δ_k| exp( −δ_k² / (2 h_{φ,k}²) ) / √(2π h_{φ,k}²) dδ_k
 = max_{s′} | V(s′) − V*(s′) | + 2 L_V Σ_{k=1}^{d_s} ∫_0^{+∞} δ_k exp( −δ_k² / (2 h_{φ,k}²) ) / √(2π h_{φ,k}²) dδ_k
 = max_{s′} | V(s′) − V*(s′) | + L_V Σ_{k=1}^{d_s} h_{φ,k} / √(2π),

which means that

| V(s) − V*(s) | ≤ A_Bias + γ_c ( max_{s′} | V(s′) − V*(s′) | + L_V Σ_{k=1}^{d_s} h_{φ,k} / √(2π) ).


Since both V(s) and V*(s) are bounded by −R_max/(1 − γ_c) and R_max/(1 − γ_c), we have |V(s) − V*(s)| ≤ 2 R_max/(1 − γ_c), thus

| V(s) − V*(s) | ≤ A_Bias + γ_c ( max_{s′} | V(s′) − V*(s′) | + L_V Σ_{k=1}^{d_s} h_{φ,k} / √(2π) )    (25)
 ≤ A_Bias + γ_c ( 2 R_max / (1 − γ_c) + L_V Σ_{k=1}^{d_s} h_{φ,k} / √(2π) )
⟹ | V(s) − V*(s) | ≤ A_Bias + γ_c ( A_Bias + γ_c ( 2 R_max / (1 − γ_c) + L_V Σ_{k=1}^{d_s} h_{φ,k} / √(2π) ) + L_V Σ_{k=1}^{d_s} h_{φ,k} / √(2π) )    [using Equation (25)]
⟹ | V(s) − V*(s) | ≤ Σ_{t=0}^{∞} γ_c^t ( A_Bias + γ_c L_V Σ_{k=1}^{d_s} h_{φ,k} / √(2π) )    [iterating Equation (25)]
⟹ | V(s) − V*(s) | ≤ (1 / (1 − γ_c)) ( A_Bias + γ_c L_V Σ_{k=1}^{d_s} h_{φ,k} / √(2π) ).

2 SUPPORT TO THE EMPIRICAL ANALYSIS

2.1 Gradient Analysis

The parameters used for the LQG are

A = [1.2, 0; 0, 1.1],  B = [1, 0; 0, 1],  Q = [1, 0; 0, 1],  R = [0.1, 0; 0, 0.1],  Σ = [1, 0; 0, 1],  s_0 = [−1, −1].

The discount factor is γ = 0.9, and the length of the episodes is 50 steps. The parameters of the optimization policy are θ = [−0.6, −0.8] and the off-policy parameters are θ′ = [−0.35, −0.5]. The confidence intervals have been computed using bootstrapped percentile intervals. The size of the bootstrapped dataset varies from plot to plot (usually from 1000 to 5000 different seeds), and the confidence intervals have been computed using 10000 bootstraps. We used this method instead of the more classic standard error (using a χ² or a t-distribution) because, due to the importance sampling, our samples are often highly non-Gaussian and heavy-tailed. The bootstrapping method relies on fewer assumptions, and its confidence intervals were more precise in this case.
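As a minimal sketch of the bootstrapped percentile intervals described above (the number of resamples matches the 10000 quoted in the text; the statistic and data layout are placeholders):

```python
import numpy as np

def bootstrap_percentile_ci(samples, statistic=np.mean, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval; robust to heavy-tailed, non-Gaussian samples."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    stats = np.array([statistic(rng.choice(samples, size=samples.size, replace=True))
                      for _ in range(n_boot)])
    return np.percentile(stats, 100 * alpha / 2), np.percentile(stats, 100 * (1 - alpha / 2))
```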

2.2 Policy Improvement Analysis

We use a policy encoded as a neural network with parameters θ. A deterministic policy is encoded by a network a = f_θ(s). A stochastic policy is encoded as a Gaussian distribution whose parameters are determined by a neural network with two outputs, the mean and the covariance. In this case we denote by f_θ(s) the slice of the output corresponding to the mean and by g_θ(s) the part of the output corresponding to the covariance.
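A minimal NumPy sketch of this parameterization, with a tanh-squashed mean f_θ(s) and a sigmoid-squashed standard deviation g_θ(s) as in the configuration tables below, is given here; the weights, layer sizes and action scale are placeholders, not the trained parameters.

```python
import numpy as np

def gaussian_policy(s, W1, b1, W2, b2, action_scale=2.0):
    """One hidden layer (ReLU); the output is split into a mean head and a std head."""
    h = np.maximum(0.0, W1 @ s + b1)                 # hidden layer, e.g. 50 ReLU units
    out = W2 @ h + b2                                # 2 * action_dim outputs
    mu, sig_raw = np.split(out, 2)
    mu = action_scale * np.tanh(mu)                  # f_theta(s): bounded mean
    sigma = 1.0 / (1.0 + np.exp(-sig_raw))           # g_theta(s): sigmoid std
    return mu + sigma * np.random.randn(*mu.shape)   # sample an action
```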

NOPG can be described with the following hyper-parameters:

• dataset sizes: number of samples contained in the dataset used for training
• discount factor γ: discount factor of the infinite-horizon MDP
• state ~h_factor: constant used to decide the bandwidths for the state space
• action ~h_factor: constant used to decide the bandwidths for the action space
• policy: parameterization of the policy
• policy output: how the output of the policy is encoded
• learning rate: the learning rate and the gradient ascent algorithm used
• N^MC_π (NOPG-S): number of samples drawn to compute the integral ε_π(s) with Monte Carlo sampling
• N^MC_φ: number of samples drawn to compute the integral over the next state ∫ φ(s′) ds′
• N^MC_µ0: number of samples drawn to compute the integral over the initial distribution ∫ V_π(s) µ_0(s) ds
• policy updates: number of policy updates before returning the optimized policy

A few considerations about the NOPG parameters. If N^MC_φ = 1, we use the mean of the kernel φ as a sample to approximate the integral over the next state. When optimizing a stochastic policy represented by a Gaussian distribution, we set and linearly decay the variance over the policy optimization procedure. The kernel bandwidths are computed in two steps: first, we find the best bandwidth for each dimension of the state and action spaces using cross-validation; second, we multiply each bandwidth by an empirical constant factor (~h_factor). This second step is important to guarantee that the state and action spaces do not have zero density. For instance, in a continuous-action environment, when sampling actions from a uniform grid we have to guarantee that the space between the grid points has some density. The problem of estimating the bandwidth in kernel density estimation is well studied, but it needs to be adapted to the problem at hand, especially with a low number of samples. We found this approach to work well for our experiments, but it can be further improved.
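As a hedged sketch of this two-step bandwidth choice (per-dimension cross-validation followed by multiplication with ~h_factor), one could use scikit-learn's kernel density tools; the candidate grid and the treatment of each dimension in isolation are assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def select_bandwidths(X, h_factor):
    """X: (n_samples, n_dims). Returns one bandwidth per dimension, scaled by h_factor."""
    bandwidths = []
    for d in range(X.shape[1]):
        grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                            {"bandwidth": np.logspace(-2, 1, 20)}, cv=5)
        grid.fit(X[:, d:d + 1])                        # cross-validate each dimension separately
        bandwidths.append(grid.best_params_["bandwidth"])
    return h_factor * np.array(bandwidths)             # widen to avoid zero-density regions
```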

2.2.1 Pendulum with Uniform Dataset

Tables 3 and 4 describe the hyper-parameters used to run the experiment shown in the first plot of Figure 9.

Dataset generation: The datasets have been generated using a grid over the state-action space (θ, θ̇, u), where θ and θ̇ are respectively the angle and the angular velocity of the pendulum, and u is the applied torque. Table 3 enumerates the different datasets used.

#θ   #θ̇   #u   Sample size
10   10   2    200
15   15   2    450
20   20   2    800
25   25   2    1250
30   30   2    1800
40   40   2    3200

TABLE 3: Pendulum uniform grid dataset configurations. The table shows the level of discretization for each dimension of the state space (#θ and #θ̇) and of the action space (#u). Each line corresponds to a uniformly sampled dataset, where θ ∈ [−π, π], θ̇ ∈ [−8, 8] and u ∈ [−2, 2]. The entries under the state and action dimensions indicate how many linearly spaced states or actions are queried from the corresponding intervals. The Cartesian product of states and actions is taken to generate the state-action pairs used to query the environment transitions. The rightmost column indicates the total number of resulting samples.
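The Cartesian-product construction of Table 3 can be sketched as follows; the environment query function `step_fn` is a placeholder.

```python
import numpy as np
from itertools import product

def uniform_grid_dataset(n_theta, n_theta_dot, n_u, step_fn):
    """Build a grid dataset as in Table 3; step_fn(s, a) -> (r, s') queries the simulator."""
    thetas = np.linspace(-np.pi, np.pi, n_theta)
    theta_dots = np.linspace(-8.0, 8.0, n_theta_dot)
    torques = np.linspace(-2.0, 2.0, n_u)
    data = []
    for th, thd, u in product(thetas, theta_dots, torques):
        s, a = np.array([th, thd]), np.array([u])
        r, s_next = step_fn(s, a)
        data.append((s, a, r, s_next))
    return data  # e.g. 10 * 10 * 2 = 200 transitions for the first row of Table 3
```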

Algorithm details: The configurations used for NOPG-D and NOPG-S are listed in Table 4.

NOPG: discount factor γ: 0.99; state ~h_factor: 1.0 1.0 1.0; action ~h_factor: 50.0; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: 2 tanh(f_θ(s)) (NOPG-D), µ = 2 tanh(f_θ(s)), σ = sigmoid(g_θ(s)) (NOPG-S); learning rate: 10⁻² with the ADAM optimizer; N^MC_π (NOPG-S): 15; N^MC_φ: 1; N^MC_µ0: not applicable (fixed initial state); policy updates: 1.5 · 10³.

TABLE 4: NOPG configuration for the Pendulum uniform grid experiment.

2.2.2 Pendulum with Random Agent

The following configurations show the hyper-parameters used to generate the second plot from the left in Figure 9.

NOPG: dataset sizes: 10², 5·10², 10³, 1.5·10³, 2·10³, 3·10³, 5·10³, 7·10³, 9·10³, 10⁴; discount factor γ: 0.99; state ~h_factor: 1.0 1.0 1.0; action ~h_factor: 25.0; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: 2 tanh(f_θ(s)) (NOPG-D), µ = 2 tanh(f_θ(s)), σ = sigmoid(g_θ(s)) (NOPG-S); learning rate: 10⁻² with the ADAM optimizer; N^MC_π (NOPG-S): 10; N^MC_φ: 1; N^MC_µ0: not applicable (fixed initial state); policy updates: 2 · 10³.

SAC: discount factor γ: 0.99; rollout steps: 500; actor: neural network parameterized by θ_actor, 1 hidden layer, 50 units, ReLU activations; actor output: 2 tanh(u), u ∼ N(·|µ = f_θactor(s), σ = g_θactor(s)); actor learning rate: 10⁻³ with the ADAM optimizer; critic: neural network parameterized by θ_critic, 2 hidden layers, 50 units, ReLU activations; critic output: f_θcritic(s, a); critic learning rate: 5 · 10⁻³ with the ADAM optimizer; max replay size: 5 · 10⁵; initial replay size: 128; batch size: 64; soft update: τ = 5 · 10⁻³; policy updates: 2.5 · 10⁵.

DDPG / TD3: discount factor γ: 0.99; rollout steps: 500; actor: neural network parameterized by θ_actor, 1 hidden layer, 50 units, ReLU activations; actor output: 2 tanh(f_θactor(s)); actor learning rate: 10⁻³ with the ADAM optimizer; critic: neural network parameterized by θ_critic, 2 hidden layers, 50 units, ReLU activations; critic output: f_θcritic(s, a); critic learning rate: 10⁻² with the ADAM optimizer; max replay size: 5 · 10⁵; initial replay size: 128; batch size: 64; soft update: τ = 5 · 10⁻³; policy updates: 2.5 · 10⁵.

PWIS: dataset sizes: 10², 5·10², 10³, 2·10³, 5·10³, 7.5·10³, 10⁴, 1.2·10⁴, 1.5·10⁴, 2·10⁴, 2.5·10⁴; discount factor γ: 0.99; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: µ = 2 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); learning rate: 10⁻² with the ADAM optimizer; policy updates: 2 · 10³.

BEAR: dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: µ = 2 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); learning rate: 10⁻⁴; policy updates: 10³.

BRAC (dual): dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: µ = 2 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); learning rate: 10⁻³ with ADAM; policy updates: 5 · 10⁴; batch size: ≤ 256; soft update: τ = 5 · 10⁻³.

MOPO: dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: µ = 2 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); learning rate: 3 · 10⁻⁴ with ADAM; number of epochs: 5 · 10²; batch size: 256; soft update: τ = 5 · 10⁻³; BNN hidden dims: 64; BNN max epochs: 100; BNN ensemble size: 7.

MOReL: dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: µ = 2 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); step size: 0.02; number of iterations: 10³; dynamics hidden dims: (128, 128), ReLU activations; dynamics learning rate: 0.001; dynamics batch size: 256; dynamics fit epochs: 25; dynamics number of models: 4.

TABLE 5: Algorithm configurations for the Pendulum random data experiment.

2.2.3 Cart-pole with Random Agent

The following configurations show the hyper-parameters used to generate the third plot in Figure 9.

NOPG: dataset sizes: 10², 2.5·10², 5·10², 10³, 1.5·10³, 2.5·10³, 3·10³, 5·10³, 6·10³, 8·10³, 10⁴; discount factor γ: 0.99; state ~h_factor: 1.0 1.0 1.0; action ~h_factor: 20.0; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: 5 tanh(f_θ(s)) (NOPG-D), µ = 5 tanh(f_θ(s)), σ = sigmoid(g_θ(s)) (NOPG-S); learning rate: 10⁻² with the ADAM optimizer; N^MC_π (NOPG-S): 10; N^MC_φ: 1; N^MC_µ0: 15; policy updates: 2 · 10³.

SAC: discount factor γ: 0.99; rollout steps: 10000; actor: neural network parameterized by θ_actor, 1 hidden layer, 50 units, ReLU activations; actor output: 5 tanh(u), u ∼ N(·|µ = f_θactor(s), σ = g_θactor(s)); actor learning rate: 10⁻³ with the ADAM optimizer; critic: neural network parameterized by θ_critic, 2 hidden layers, 50 units, ReLU activations; critic output: f_θcritic(s, a); critic learning rate: 5 · 10⁻³ with the ADAM optimizer; max replay size: 5 · 10⁵; initial replay size: 128; batch size: 64; soft update: τ = 5 · 10⁻³; policy updates: 2.5 · 10⁵.

DDPG / TD3: discount factor γ: 0.99; rollout steps: 10000; actor: neural network parameterized by θ_actor, 1 hidden layer, 50 units, ReLU activations; actor output: 5 tanh(f_θactor(s)); actor learning rate: 10⁻³ with the ADAM optimizer; critic: neural network parameterized by θ_critic, 1 hidden layer, 50 units, ReLU activations; critic output: f_θcritic(s, a); critic learning rate: 10⁻² with the ADAM optimizer; soft update: τ = 10⁻³; policy updates: 2 · 10⁵.

PWIS: dataset sizes: 10², 5·10², 10³, 2·10³, 3.5·10³, 5·10³, 8·10³, 10⁴, 1.5·10⁴, 2·10⁴, 2.5·10⁴; discount factor γ: 0.99; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: µ = 5 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); learning rate: 10⁻³ with the ADAM optimizer; policy updates: 2 · 10³.

BEAR: dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: µ = 5 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); learning rate: 10⁻⁴; policy updates: 10³.

BRAC (dual): dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: µ = 5 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); learning rate: 10⁻³ with ADAM; policy updates: 5 · 10⁴; batch size: ≤ 256; soft update: τ = 5 · 10⁻³.

MOPO: dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: µ = 2 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); learning rate: 3 · 10⁻⁴ with ADAM; number of epochs: 5 · 10²; batch size: 256; soft update: τ = 5 · 10⁻³; BNN hidden dims: 32; BNN max epochs: 100; BNN ensemble size: 7.

MOReL: dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: µ = 2 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); step size: 0.02; number of iterations: 500; dynamics hidden dims: (128, 128), ReLU activations; dynamics learning rate: 0.001; dynamics batch size: 256; dynamics fit epochs: 25; dynamics number of models: 4.

TABLE 6: Algorithm configurations for the Cart-Pole random data experiment.

2.2.4 U-Maze with D4RL

NOPG: dataset sizes: 10², 3.5·10², 5·10², 6.5·10², 8·10², 10³; discount factor γ: 0.99; state ~h_factor: 2.0; action ~h_factor: 5.0; policy: neural network parameterized by θ, 2 hidden layers, 64 and 32 units, ReLU activations; policy output: 5 tanh(f_θ(s)) (NOPG-D), µ = 5 tanh(f_θ(s)), σ = sigmoid(g_θ(s)) (NOPG-S); learning rate: 3 · 10⁻⁴ with the ADAM optimizer; N^MC_π (NOPG-S): 1; N^MC_φ: 1; N^MC_µ0: 50; policy updates: 5 · 10³.

BEAR: dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, 2 hidden layers, 64 and 32 units, ReLU activations; policy output: µ = 5 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); learning rate: 10⁻⁴; policy updates: 10³.

BRAC (dual): dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, 2 hidden layers, 64 and 32 units, ReLU activations; policy output: µ = 5 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); learning rate: 10⁻³ with ADAM; policy updates: 5 · 10⁴; batch size: ≤ 256; soft update: τ = 5 · 10⁻³.

MOPO: dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, 2 hidden layers, 64 and 32 units, ReLU activations; policy output: µ = 5 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); learning rate: 3 · 10⁻⁴ with ADAM; number of epochs: 5 · 10²; batch size: 256; soft update: τ = 5 · 10⁻³; BNN hidden dims: 64; BNN max epochs: 100; BNN ensemble size: 7.

MOReL: dataset sizes: 10², 2·10², 5·10², 10³, 2·10³, 5·10³, 10⁴, 2·10⁴, 5·10⁴, 10⁵; discount factor γ: 0.99; policy: neural network parameterized by θ, (64, 32) units, ReLU activations; policy output: µ = 5 tanh(f_θ(s)), σ = sigmoid(g_θ(s)); step size: 0.005; number of iterations: 500; dynamics hidden dims: (32, 32), ReLU activations; dynamics learning rate: 0.001; dynamics batch size: 256; dynamics fit epochs: 25; dynamics number of models: 4.

TABLE 7: Algorithm configurations for the U-Maze experiment with the D4RL dataset.

2.2.5 Hopper with D4RL

NOPG: dataset: hopper-expert-v2; dataset sizes: 3 · 10³, 5 · 10³, 10⁴; discount factor γ: 0.99; state ~h_factor: 4.0; action ~h_factor: 10.0; policy: neural network parameterized by θ, 2 hidden layers, 256 and 256 units, ReLU activations; policy output: tanh(f_θ(s)) (NOPG-D), µ = tanh(f_θ(s)), σ = sigmoid(g_θ(s)) (NOPG-S); learning rate: 3 · 10⁻⁴ with the ADAM optimizer; N^MC_π (NOPG-S): 1; N^MC_φ: 1; N^MC_µ0: 500; policy updates: 5 · 10³.

2.2.6 Mountain Car with Human Demonstrator

The dataset (10 trajectories) for the experiment in Figure 11 has been generated by a human demonstrator and is available in the provided source code.

NOPG: discount factor γ: 0.99; state ~h_factor: 1.0 1.0; action ~h_factor: 50.0; policy: neural network parameterized by θ, 1 hidden layer, 50 units, ReLU activations; policy output: tanh(f_θ(s)) (NOPG-D), µ = tanh(f_θ(s)), σ = sigmoid(g_θ(s)) (NOPG-S); learning rate: 10⁻² with the ADAM optimizer; N^MC_π (NOPG-S): 15; N^MC_φ: 1; N^MC_µ0: 15; policy updates: 1.5 · 10³.

TABLE 9: NOPG configuration for the Mountain Car experiment.

2.3 Computational and Memory Complexity

Here we detail the computational and memory complexity of NOPG. We denote by N^MC_µ0 the number of Monte-Carlo samples for expectations under the initial state distribution, by N^MC_π the number of samples for expectations under the stochastic policy (for a deterministic policy it is 1), and by N^MC_φ the number of samples for the expectations of each entry of P^γ_π. We keep the main constants throughout the analysis and drop constant factors, e.g. for the normalization of a matrix row, which involves summing up and dividing each element of the row.

• Constructing the vector ε_π,0 takes O(N^MC_µ0 N^MC_π n) time. Storing ε_π,0 occupies O(N^MC_µ0 N^MC_π n) memory.

• We compute P^γ_π row by row, sparsifying each row by selecting its largest k elements, which amounts to a complexity of O(N^MC_φ N^MC_π n² log k) with k ≪ n, since the cost of processing one row is O(N^MC_φ N^MC_π n log k), e.g. with a max-heap, and there are n rows. Storing P^γ_π needs O(N^MC_φ N^MC_π n k) memory.

• We solve the linear systems r = Λ_π q_π and ε_π,0 = Λ_π^⊺ µ_π for q_π and µ_π using the conjugate gradient method (a sketch of this step is given after this list). Both computations take O(√δ (k + 1) n), where δ is the condition number of Λ_π after sparsification and (k + 1) n is the number of non-zero elements; the plus one comes from the computation of Λ_π, since subtracting P^γ_π from the identity matrix adds n non-zero elements. The conjugate gradient method is especially advantageous when using sparse matrices. As a side note, computing the condition number δ is computationally intensive, but there exist methods to compute an upper bound.

• In computing the surrogate loss for the gradient, the cost of the vector-vector multiplication (∂/∂θ) ε_π,0^⊺ q_π is O(N^MC_µ0 N^MC_π n), and the vector-(sparse matrix)-vector multiplication µ_π^⊺ ((∂/∂θ) P^γ_π) q_π is O(N^MC_φ N^MC_π n² k), thus totaling O(N^MC_µ0 N^MC_π n + N^MC_φ N^MC_π n² k). Assuming the number of policy parameters M to be much lower than the number of samples, M ≪ n, we ignore the gradient computation, since even when using finite differences we would have O(M) ≪ O(n).
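As referenced in the linear-system item above, here is a minimal SciPy sketch of the sparse solves; the paper reports using the conjugate gradient method, while this illustration uses BiCGSTAB because Λ_π is in general not symmetric (which plain CG formally requires). Matrix and vector names are placeholders.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import bicgstab

def solve_value_and_distribution(P_gamma_pi, r, eps_pi0):
    """Solve Lambda_pi q = r and Lambda_pi^T mu = eps_pi0, with Lambda_pi = I - P_gamma_pi."""
    n = P_gamma_pi.shape[0]
    Lambda = sparse.identity(n, format="csr") - P_gamma_pi   # about (k + 1) n non-zero entries
    q, info_q = bicgstab(Lambda, r)                          # value-related vector q_pi
    mu, info_mu = bicgstab(Lambda.T.tocsr(), eps_pi0)        # state-distribution vector mu_pi
    if info_q != 0 or info_mu != 0:
        raise RuntimeError("iterative solver did not converge")
    return q, mu
```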

Even though the Monte-Carlo sample terms are fixed constants, usually set to 1, we keep them to emphasize that the policy parameters appear inside each entry of ε_π,0 and P^γ_π, and thus these terms must be carried along until the gradient is computed with automatic differentiation. Since modern frameworks for gradient computation, such as TensorFlow or PyTorch, build a (static or dynamic) computational graph to backpropagate gradients, we cannot simply ignore these constants in terms of time and memory complexity. The only exception is the computation of q_π and µ_π with the conjugate gradient: since we do not need to backpropagate through this matrix, we drop the Monte-Carlo terms there, because Λ_π is evaluated at the current policy parameters, leading to a matrix represented by n × k elements.

Taking all costs into account, we conclude that the computational complexity of NOPG per policy update is

O(N^MC_µ0 N^MC_π n) [ε_π,0] + O(N^MC_φ N^MC_π n² log k) [P^γ_π] + O(√δ (k + 1) n) [conj. grad. q_π, µ_π] + O(N^MC_µ0 N^MC_π n) [(∂/∂θ) ε_π,0^⊺ q_π] + O(N^MC_φ N^MC_π n² k) [µ_π^⊺ (∂/∂θ) P^γ_π q_π]
 = O(N^MC_µ0 N^MC_π n) + O(N^MC_φ N^MC_π n² (k + log k)) + O(√δ (k + 1) n).

The memory complexity is

O(N^MC_µ0 N^MC_π n) [ε_π,0] + O(N^MC_φ N^MC_π n k) [P^γ_π] + O(n) [q_π] + O(n) [µ_π].

The quantities in square brackets indicate the source of the corresponding term, as described in the previous paragraphs. Hence, after dropping task-specific constants, the algorithm implementing NOPG has close to quadratic computational complexity and linear memory complexity with respect to the number of samples n, per policy update.

The quantities in “underbrace” indicate the source of the complexities as described in the previous paragraphs. Hence,after dropping task-specific constants, the algorithm to implement NOPG has close to quadratic computational complexityand linear memory complexity with respect to the number of samples n, per policy update.