
Prioritized Sequence Experience Replay

Marc Brittain * 1 Josh Bertram * 1 Xuxi Yang * 1 Peng Wei 1

Abstract

Experience replay is widely used in deep reinforcement learning algorithms and allows agents to remember and learn from experiences from the past. In an effort to learn more efficiently, researchers proposed prioritized experience replay (PER), which samples important transitions more frequently. In this paper, we propose Prioritized Sequence Experience Replay (PSER), a framework for prioritizing sequences of experience in an attempt to both learn more efficiently and to obtain better performance. We compare the performance of PER and PSER sampling techniques in a tabular Q-learning environment and in DQN on the Atari 2600 benchmark. We prove theoretically that PSER is guaranteed to converge faster than PER and empirically show PSER substantially improves upon PER.

1. Introduction

Reinforcement learning is a powerful technique to solve sequential decision making problems. Advances in deep learning applied to reinforcement learning resulted in the DQN algorithm (Mnih et al., 2015), which uses a neural network to represent the state-action value. With experience replay and a target network, DQN achieved state-of-the-art performance in the Atari 2600 benchmark and other domains at the time.

While the performance of deep reinforcement learning algorithms can be above human-level in certain applications, the amount of effort required to train these models is staggering, both in terms of data samples required and wall-clock time needed to perform the training. This is because reinforcement learning algorithms learn control tasks via trial and error, much like a child learning to ride a bicycle (Sutton & Barto, 1998). In gaming environments, experience is reasonably inexpensive to acquire, but trials of real world control tasks often involve time and resources we wish not to waste.

*Equal contribution. ¹Department of Aerospace Engineering, Iowa State University, Ames, USA. Correspondence to: Marc Brittain <[email protected]>.

Alternatively, the number of trials might be limited due to wear and tear of the system, making data-efficiency critical (Gal, 2016). In these cases where simulations are not available or where acquiring samples requires significant effort or expense, it becomes necessary to utilize the acquired data more efficiently for better generalization.

As an important component in deep reinforcement learning algorithms, experience replay has been shown to both provide uncorrelated data to train a neural network and to significantly improve the data efficiency (Lin, 1992; Wang et al., 2016a; Zhang & Sutton, 2017). In general, experience replay can reduce the amount of experience required to learn at the expense of more computation and memory (Schaul et al., 2016).

There are various sampling strategies to sample transitions from the experience replay memory. The original primary purpose of the experience replay memory was to decorrelate the input passed into the neural net, and therefore the original sampling strategy was uniform sampling. Prioritized experience replay (PER) (Schaul et al., 2016) demonstrated that the agent can learn more effectively from some transitions than from others. By sampling important transitions within the replay memory more often at each training step, PER makes experience replay more efficient and effective than uniform sampling.

In this paper we propose an extension to PER that we term Prioritized Sequence Experience Replay (PSER), which not only assigns high sampling priority to important transitions, but also increases the priorities of previous transitions leading to the important transitions. To motivate our approach, we use the ‘Blind Cliffwalk’ environment introduced in Schaul et al. (2016). To evaluate our results, we use the DQN algorithm (Mnih et al., 2015) with PSER and PER to provide a fair comparison of the sampling strategy on the final performance of the algorithm on the Atari 2600 benchmark. We also prove theoretically that PSER converges faster than PER. Our experimental and theoretical results show using PSER substantially improves upon the performance of PER in both the Blind Cliffwalk environment and the Atari 2600 benchmark.

arXiv:1905.12726v2 [cs.LG] 19 Feb 2020


2. Related Work

2.1. DQN and its extensions

With the DQN algorithm described in Mnih et al. (2015), deep learning and reinforcement learning were successfully combined by using a deep neural network to approximate the state-action values, where the input of the neural network is the current state s in the form of pixels representing the game screen, and the output is the state-action values corresponding to different actions (i.e., Q-values). It is known that neural networks may be unstable and diverge when applying non-linear approximators in reinforcement learning (RL) algorithms (Sutton & Barto, 1998). DQN uses experience replay and target networks to address the instability issues. At each time step, based on the current state, the agent selects an action based on some policy (i.e., ε-greedily) with respect to the action values, and adds a transition (st, at, rt, st+1) to a replay memory. The neural network is then optimized using stochastic gradient descent to minimize the squared TD error of the transitions sampled from the replay memory. The gradient of the loss is back-propagated only into the parameters of the online network, and a target network is updated from the online network periodically.
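As a rough, illustrative sketch of this training loop (not the authors' implementation), the following Python snippet shows a DQN-style update with a uniform replay memory and a periodically synchronized target network; the network architecture, optimizer settings, and environment interface are placeholder assumptions.

import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

obs_dim, n_actions, gamma = 4, 2, 0.99
online = QNetwork(obs_dim, n_actions)
target = QNetwork(obs_dim, n_actions)
target.load_state_dict(online.state_dict())            # target network starts as a copy of the online network
optimizer = torch.optim.RMSprop(online.parameters(), lr=2.5e-4)
replay = deque(maxlen=100_000)                          # uniform replay memory of (s, a, r, s', done) tuples

def train_step(batch_size: int = 32):
    # Assumes the replay memory already holds at least batch_size transitions collected epsilon-greedily.
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    q = online(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                               # bootstrap from the frozen target network
        y = r.float() + gamma * (1.0 - done.float()) * target(s2.float()).max(dim=1).values
    loss = nn.functional.mse_loss(q, y)                 # squared TD error
    optimizer.zero_grad()
    loss.backward()                                     # gradients flow only into the online network
    optimizer.step()
    # Periodically: target.load_state_dict(online.state_dict())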

Many extensions to DQN have been proposed to improve its performance. Double Q-learning (Van Hasselt et al., 2016) was proposed to address the overestimation due to the action selection using the online network. Prioritized experience replay (PER) (Schaul et al., 2016) was proposed to replay important experience transitions more frequently, enabling the agent to learn more efficiently. Dueling networks (Wang et al., 2016b) is a neural network architecture which can learn state and advantage values, which is shown to stabilize learning. Using multi-step targets (Sutton & Barto, 1998) instead of a single reward is also shown to lead to faster learning. Distributional RL (Bellemare et al., 2017) was proposed to learn the distribution of the returns instead of the expected return to more effectively capture the information contained in the value function. Noisy DQN (Fortunato et al., 2018) proposed another exploration technique by adding parametric noise to the network weights.

Rainbow (Hessel et al., 2018) combined the six variants mentioned above into one agent, achieving better data efficiency and performance on the Atari 2600 benchmark and leading to a new state-of-the-art at the time. Through the ablation procedure described in the paper, the contribution of each component was isolated. Distributed prioritized replay (Horgan et al., 2018) utilized a massively parallel approach to show that with enough scaling a new state-of-the-art score can be achieved, but at a cost of orders of magnitude more data. (Typical amounts of frames used for Atari 2600 benchmark games are 200 million frames. The distributed prioritized replay paper has a faster wall-clock time execution, but orders of magnitude more frames were required.) A comprehensive survey of deep reinforcement learning algorithms, including other extensions of DQN, can be found in Li (2018).

2.2. Experience replay

Experience replay has played an important role in providing uncorrelated data for the online neural network training of deep reinforcement learning algorithms (Mnih et al., 2015; Lillicrap et al., 2015). There are also studies into how experience replay can influence the performance of deep reinforcement learning algorithms (de Bruin et al., 2015; Zhang & Sutton, 2017).

In the experience replay of DQN, the observation sequences are stored in the replay memory and sampled uniformly for the training of the neural network in order to remove the correlations in the data. However, this uniform sampling strategy ignores the importance of each transition and is shown to be inefficient for learning (Schaul et al., 2016).

It is well-known that model-based planning algorithms such as value iteration can be made more efficient by prioritizing updates in an appropriate order. Based on this idea, prioritized sweeping (Moore & Atkeson, 1993; Andre et al., 1998) was proposed to update the states with the largest Bellman error, which can speed up the learning for state-based and compact (using function approximator) representations of the model and the value function. Similar to prioritized sweeping, prioritized experience replay (PER) (Schaul et al., 2016) assigns priorities to each transition in the experience replay memory based on the TD error (Sutton & Barto, 1998) in model-free deep reinforcement learning algorithms, which is shown to improve the learning efficiency tremendously compared with uniform sampling from the experience replay memory. There are also several other proposed methods trying to improve the sample efficiency of deep reinforcement learning algorithms. Lee et al. (2018) proposed a sampling technique which updates the transitions backward from a whole episode. Karimpanal & Bouffanais (2018) proposed an approach to select appropriate transition sequences to accelerate the learning.

In another recent study by Zhong et al. (2017), the authors investigate the use of back-propagating a reward stimulus to previous transitions. In our work, we follow the methodology set forth in Schaul et al. (2016) to provide a more general approach by using the current TD error as the priority signal and introduce techniques we found critical to maximize performance.

The approach proposed in this paper is an extension of PER. While we assign a priority to a transition in the replay memory, we also propagate this priority information to previous transitions in an efficient manner, and experimental results on the Atari 2600 benchmark show that our algorithm substantially improves upon PER sampling.

Figure 1: The Blind Cliffwalk environment. At each state there are 2 available actions (correct action and wrong action). The agent has to learn to take the correct action at each state to reach the final reward.

3. Prioritized Sequence Experience Replay

Using a replay memory has been shown to stabilize the neural network (Mnih et al., 2015), but the uniform sampling technique is shown to be inefficient (Schaul et al., 2016). To improve the sampling efficiency, Schaul et al. (2016) proposed prioritized experience replay (PER), which uses the last observed TD error to make more effective use of the replay memory for learning. In this paper we propose an extension of PER named Prioritized Sequence Experience Replay (PSER), which can also take advantage of information about the trajectory by propagating the priorities back through the sequence of transitions.

3.1. A motivating example

To motivate and understand the potential benefits of PSER, we implemented four different agents in the artificial ‘Blind Cliffwalk’ environment introduced in Schaul et al. (2016), shown in Figure 1. With only n states and two actions, the environment requires an exponential number of random steps until the first non-zero reward; to be precise, the chance that a random sequence of actions will lead to the reward is 2^{-n}. The first agent replays transitions uniformly at random from the experience, while the second agent invokes an oracle to prioritize transitions, which greedily selects the transition that maximally reduces the global loss in its current state (in hindsight, after the parameter update). The last two agents use the PER and PSER sampling techniques, respectively. In this chain environment with sparse reward, there is only one non-zero reward, located at the end of the chain, marked in green and labelled R = 1. The agent must learn the correct sequence of actions in order to reach the goal state and collect the reward. At any state along the chain, the action that leads to the next state varies – sometimes a1, sometimes a2 – to prevent the agent from adopting a trivial solution (e.g., always take action a1). Any incorrect action results in a terminal state and the agent starts back at the beginning of the chain. For the details of the experiment setup, see the appendix.
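For concreteness, the following is a minimal sketch of such a Blind Cliffwalk environment (our own illustrative reconstruction; the state encoding, the per-state randomization of which action is correct, and the interface are assumptions, not the authors' code).

import random

class BlindCliffwalk:
    """Chain of n states; only the full sequence of correct actions reaches R = 1."""

    def __init__(self, n_states: int, seed: int = 0):
        self.n = n_states
        rng = random.Random(seed)
        # Which of the two actions is "correct" is randomized per state, so a fixed
        # action (e.g., always a1) is never a trivial solution.
        self.correct = [rng.randint(0, 1) for _ in range(n_states)]
        self.state = 0

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int):
        if action != self.correct[self.state]:
            return 0, 0.0, True                 # wrong action: terminal, restart from the start
        if self.state == self.n - 1:
            return 0, 1.0, True                 # correct action at the last state: the single reward
        self.state += 1
        return self.state, 0.0, False

env = BlindCliffwalk(n_states=16)
state, done = env.reset(), False
while not done:                                  # a random policy succeeds with probability 2^-n
    state, reward, done = env.step(random.randint(0, 1))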

To provide some early intuition of the benefits of PSER, we compare performance on the Blind Cliffwalk between PSER and PER. We track the mean squared error between the ground truth Q-value and the Q-learning result every 100 iterations. Better performance in this experiment means that the loss curve more closely matches the oracle. The PER paper demonstrated that by prioritizing transitions based on the TD error, improvements in performance were obtained over uniform sampling. Our results show that by prioritizing the transition with TD error and decaying a portion of this priority to previous transitions, further improvements are obtained with earlier convergence as compared to PER. We show results of this Blind Cliffwalk environment with 16 states in Figure 2 and also show how the initialization of the transition's priority (max priority or small non-zero priority, ε) in the replay memory affects convergence speed. We find that PSER consistently outperforms PER in this problem.

Examining the curves in Figure 2 more closely, there is an initial period where PSER is comparable to both uniform and PER. This is due to all samples initially having the same priority in the replay memory, which results in uniform sampling. This uniform sampling continues until the goal transition is sampled from the replay memory and a non-zero TD error is encountered.

At this point, how the agent updates the priority of this transition in the replay memory leads to the divergence in performance between the algorithms. Uniform sampling continues to sample transitions with equal probability from the replay memory. PER updates the priority of the one transition in the memory, but all other transitions in the memory are still chosen uniformly, which still results in inefficient sampling, as we need to wait until the transition preceding the goal state is sampled. PSER capitalizes on the high TD error that was received and decays a portion of the new priority of the goal state to the preceding states to encourage sampling of the states that led to the goal state. It is clear from Figure 2 that by decaying the priority of the high TD error states, we can encourage faster convergence to the true Q-value in an intuitive and effective way.

To support this intuition, we offer the following theorem describing the convergence rates of PER and PSER (due to space restrictions, the proof can be found in the Appendix).

Theorem 1 Consider the Blind Cliffwalk environment with n states. If we set the learning rate of the asynchronous Q-learning algorithm to 1, then, with the replay memory pre-filled with state transitions by exhaustively executing all 2^n possible action sequences, the expected number of steps for the Q-learning algorithm to converge with the PER sampling strategy is represented by:

\mathbb{E}_{PER,n}[N] = 1 + (2^{n+1} - 2)\left(1 - \frac{1}{2^{n-1}}\right)    (1)

and the expected number of steps for the Q-learning algorithm to converge with the PSER sampling strategy with decay coefficient ρ is

\mathbb{E}_{PSER,n}[N] \le \frac{n}{1-\rho} - \frac{\rho - \rho^{n+1}}{(1-\rho)^2}    (2)

Figure 2: Comparison of convergence speed for a PSER, PER, uniform, and oracle agent in the Blind Cliffwalk environment with 16 states, with all transitions initialized with (a) max priority or (b) ε priority. PSER shows improved convergence speed as compared to PER and uniform in all cases. The shaded area represents the 68% confidence interval from 10 independent runs with different seeds.

In Figure 3 we plot the expected number of iterations until convergence from the result of Theorem 1, from which we can see that, for the Q-learning algorithm, the PSER sampling strategy theoretically converges faster.
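As a quick numerical illustration of these quantities (our own check, not part of the paper), the snippet below evaluates Equation 1 and the upper bound of Equation 2; the value ρ = 0.4 is an assumption borrowed from the hyperparameters reported later.

def expected_steps_per(n: int) -> float:
    # Equation 1: expected steps for Q-learning with PER to converge.
    return 1 + (2 ** (n + 1) - 2) * (1 - 1 / 2 ** (n - 1))

def expected_steps_pser_bound(n: int, rho: float = 0.4) -> float:
    # Equation 2: upper bound on the expected steps with PSER.
    return n / (1 - rho) - (rho - rho ** (n + 1)) / (1 - rho) ** 2

for n in (4, 8, 12, 16):
    print(n, expected_steps_per(n), round(expected_steps_pser_bound(n), 2))
# PER grows roughly like 2^(n+1), while the PSER bound grows linearly in n.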

3.2. Prioritized sequence decay

In this subsection we formally define the concept of the prioritized sequence and decaying priority backward in time within the replay memory.

Formally, the priority of transitions will be decayed as follows. Suppose in one episode we have a trajectory of transitions T0 to Tn−1 (Ti = (si, ai, ri, si+1)) stored in the experience replay memory with priorities p = (p0, p1, · · · , pn−1). If the agent observes a new transition Tn = (sn, an, rn, sn+1), we first calculate its priority pn based on its TD error, similar to the PER algorithm:

\delta = r_n + \gamma \max_a Q_{target}(s_{n+1}, a) - Q(s_n, a_n)    (3)

p_n = |\delta| + \epsilon,    (4)

where ε is a small positive constant to allow transitions with zero TD error a small probability of being resampled.

As in Schaul et al. (2016), according to the calculated priority, the probability of sampling transition i is:

P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha},    (5)
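A small sketch of Equations 4–5 is shown below (illustrative only; a full implementation would typically store priorities in a sum tree for efficient sampling, as in Schaul et al. (2016)).

import numpy as np

def priorities_from_td(td_errors: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    return np.abs(td_errors) + eps                 # p_n = |delta| + epsilon   (Eq. 4)

def sampling_probs(priorities: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    scaled = priorities ** alpha
    return scaled / scaled.sum()                   # P(i) = p_i^alpha / sum_k p_k^alpha   (Eq. 5)

td_errors = np.array([1.2, -0.3, 0.0, 0.05])       # placeholder TD errors for four transitions
p = priorities_from_td(td_errors)
idx = np.random.choice(len(p), p=sampling_probs(p))  # index of the transition to replay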

where the exponent α determines how much prioritization is used. We then decay the priority exponentially (with decay coefficient ρ) to the previous transitions stored in the replay memory for that episode and apply a max operator in an effort to preserve any previous priority assigned to the decayed transitions:

p_{n-1} = \max\{p_n \cdot \rho, p_{n-1}\}
p_{n-2} = \max\{p_n \cdot \rho^2, p_{n-2}\}
p_{n-3} = \max\{p_n \cdot \rho^3, p_{n-3}\}
\cdots    (6)

We refer to this decay strategy as the MAX variant. One other potential way to decay the priority is to simply add the decayed priority pn · ρ^i to the previous priority pn−i assigned to the transition, which we refer to as the ADD variant. Note that in the ADD variant we keep the priority less than the max priority when decaying the priority backwards, to avoid overflow issues:

p_{n-1} = \min\{p_n \cdot \rho + p_{n-1}, \max_n p_n\}
p_{n-2} = \min\{p_n \cdot \rho^2 + p_{n-2}, \max_n p_n\}
p_{n-3} = \min\{p_n \cdot \rho^3 + p_{n-3}, \max_n p_n\}
\cdots    (7)
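The following sketch applies the two decay variants over a finite window (our own illustration; the `priorities` array holding the current episode's transition priorities and the index `n` of the newly prioritized transition are assumptions about how the memory is organized).

import numpy as np

def decay_priorities(priorities: np.ndarray, n: int, p_new: float,
                     rho: float = 0.4, W: int = 5, variant: str = "MAX") -> np.ndarray:
    priorities = priorities.copy()
    p_max = priorities.max()
    for i in range(1, min(W, n) + 1):
        decayed = p_new * rho ** i
        if variant == "MAX":
            # Eq. 6: keep the larger of the decayed priority and the existing one.
            priorities[n - i] = max(decayed, priorities[n - i])
        else:
            # Eq. 7 (ADD): accumulate, capped at the current maximum priority.
            priorities[n - i] = min(decayed + priorities[n - i], p_max)
    return priorities

priorities = np.full(10, 1e-4)
priorities[9] = 0.8                                # a surprising transition with a large TD error
priorities = decay_priorities(priorities, n=9, p_new=0.8)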

Figure 4 illustrates this for a case where a priority decay at transition T7 was calculated, and then another priority decay occurs at transition T13. Without the max operator applied, the priority p7 for transition T7 in the replay memory would be set to ρ · p8, where p8 is the priority for transition T8.

Figure 3: From Theorem 1, the expected number of iterations until convergence given the number of states in the Blind Cliffwalk, where lower values along the y-axis mean convergence occurs earlier.

Figure 4: A max operator is used to prevent the priority decay due to T13 from overwriting a previously calculated priority decay due to T7.

Here we note that, as the priority is decayed, we expect that after some number of updates the decayed priority pn−k is negligible and further decay is therefore wasted computation. We therefore define a window of size W over which we allow the priority pn to be decayed, after which we stop. We arbitrarily selected a threshold of 1% of pn as a cutoff for when the decayed priority becomes negligible. We then compute the window size W based on the value of the hyperparameter ρ as follows:

p_n \cdot \rho^W \le 0.01\, p_n    (8)

W \le \frac{\ln 0.01}{\ln \rho}.    (9)
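As a worked example of this cutoff (using the ρ value chosen later and the 1% threshold stated above):

import math

def decay_window(rho: float, cutoff: float = 0.01) -> int:
    # Largest integer W satisfying rho**W <= cutoff (Equations 8-9).
    return math.floor(math.log(cutoff) / math.log(rho))

print(decay_window(0.4))   # 5, matching the W = 5 used in the experiments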

Through the above formulation for PSER, we identified an issue, which we termed “priority collapse”, during the decay process. Suppose for a given environment, PSER has already decayed the priority backward for the “surprising” transition, which we will call Ti. Let's assume that currently all of the Q-values are 0 and we sampled a transition in the replay memory, Ti−2, that led to Ti. From Equation 3 and Equation 4, the priority for transition Ti−2 would drop to ε. The result is that a priority sequence that was recently decayed has almost no effect, as it is almost guaranteed to be eliminated at the next sampling. When this happens to multiple states we term this “priority collapse”, and the potential benefits of PSER are eliminated, making it nearly equivalent to traditional PER.

In order to prevent this catastrophic “priority collapse”, we introduce a parameter η which forces the priority to decrease slowly. When updating the priority of a sampled transition in the replay memory, we want to maintain a portion of its previous priority to prevent it from decreasing too quickly:

p_i \leftarrow \max(|\delta| + \epsilon, \eta \cdot p_i),    (10)

where i here refers to the index of the sampled transition within the replay memory. Without this decay parameter η, we experimentally found PSER to have no significant benefit over PER, which confirms our intuition.
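A sketch of this guarded update is given below (illustrative; `priorities` and `idx` stand for the replay memory's priority array and the sampled transition's index).

def update_priority(priorities: list, idx: int, td_error: float,
                    eta: float = 0.7, eps: float = 1e-4) -> None:
    # Eq. 10: retain a fraction eta of the old priority so a freshly decayed
    # sequence is not knocked straight back to eps when its TD error is near zero.
    priorities[idx] = max(abs(td_error) + eps, eta * priorities[idx])

priorities = [0.32, 0.8, 0.13]          # e.g., a decayed sequence leading to a surprising transition
update_priority(priorities, idx=0, td_error=0.0)   # stays at 0.7 * 0.32 instead of collapsing to eps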

Our belief is that the decay parameter η provides time for the Bellman update process to propagate information about the TD error through the sequence and for the neural network to more readily learn an appropriate Q-value approximation.

3.3. Annealing the bias

As discussed in Schaul et al. (2016), prioritized replay introduces bias because it changes the sampling distribution of the replay memory. To correct the bias, Schaul et al. (2016) introduced importance-sampling (I.S.) weights defined as follows:

w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta},    (11)

where N is the size of the replay memory and P(i) is the probability of sampling transition i. Non-uniform probabilities are fully compensated for if β = 1. In PSER, we adapt the I.S. weights to correct for the sampling bias. The full algorithm is presented in Algorithm 1 in the Appendix.
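A short sketch of this correction, normalized by the maximum weight as in Algorithm 1, is shown below (the sampling probabilities are placeholder values).

import numpy as np

def is_weights(probs: np.ndarray, beta: float = 0.5) -> np.ndarray:
    N = len(probs)
    w = (1.0 / (N * probs)) ** beta      # Eq. 11: w_i = (1/N * 1/P(i))^beta
    return w / w.max()                   # scale so the largest weight is 1, as in Algorithm 1

probs = np.array([0.5, 0.3, 0.15, 0.05])   # sampling probabilities P(i) of a minibatch
weights = is_weights(probs)                # multiply each sampled TD-error update by its weight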

4. Experimental Methods

4.1. Evaluation methodology

We used the Arcade Learning Environment (Bellemare et al., 2013) to evaluate the performance of our proposed algorithm. We follow the same training and evaluation procedures of Hessel et al. (2018); Mnih et al. (2015); Van Hasselt et al. (2016). We calculate the average score during training every 1M frames in the environment. After every 1M frames, we then stop training and evaluate the agent's performance for 500K frames. We also truncate the episode lengths to 108K frames (or 30 minutes of simulated play) as in Van Hasselt et al. (2016); Hessel et al. (2018). In the results section, we report the mean and median human normalized scores of PSER and PER in the Atari 2600 benchmark, and in the appendix we provide full learning curves for all games in the no-op starts testing regime.

Figure 5: Relative performance of prioritized sequence experience replay (PSER) to prioritized experience replay (PER) in all 55 Atari 2600 benchmark games where human scores are available. 0% on the vertical axis implies equivalent performance; positive numbers represent the cases where PSER performed better; negative numbers represent the cases where PSER performed worse.

4.2. Hyperparameter tuning

DQN has a number of different hyperparameters that can be tuned. To provide a comparison with our baseline DQN agent, we used the hyperparameters that are provided in Mnih et al. (2015) for the DQN agent formulation (see the Appendix for more details).

Our PSER implementation also has hyperparameters that require tuning. Due to the large amount of time it takes to run the full 200M frames for the DQN tests (multiple days), we used a coordinate descent approach to tune the PSER parameters for a subset of the Atari 2600 benchmark. In the coordinate descent approach, we define a set of different values to test for each parameter. Then, holding all other parameters constant, we tune one parameter until the best result is obtained. We then fix this tuned parameter, move to the next parameter, and repeat this process until all parameters have been tuned. While this does not test every combination of parameter values, it greatly reduces the hyperparameter search space and proved to provide good results.
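The search can be sketched as follows (illustrative; `evaluate` is a hypothetical stand-in for training an agent with a candidate setting on the Atari subset and returning its score).

def coordinate_descent(search_space: dict, evaluate) -> dict:
    # Start from the first candidate value of every hyperparameter.
    best = {name: values[0] for name, values in search_space.items()}
    for name, values in search_space.items():          # tune one hyperparameter at a time
        scores = {v: evaluate({**best, name: v}) for v in values}
        best[name] = max(scores, key=scores.get)        # fix the best value, then move on
    return best

search_space = {"W": [5, 10, 20], "rho": [0.4, 0.65, 0.8], "eta": [0, 0.3, 0.5, 0.7]}
# best = coordinate_descent(search_space, evaluate=run_training_subset)  # hypothetical evaluator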

The hyperparameters obtained during the hyperparameter search were used for all Atari 2600 benchmark results reported in this paper. Hyperparameters were not tuned for each game so as to better measure how the algorithm generalizes over the whole suite of the Atari 2600 benchmark. We found the best results were obtained with W = 5, ρ = 0.4, and η = 0.7.

5. Analysis

In this section, we analyze the main experimental results using the Atari 2600 benchmark available within the OpenAI Gym environment (Brockman et al., 2016). We show that by adding PSER to the DQN agent we can achieve substantial improvement in performance as compared to PER.

5.1. Baselines

We compared PSER to PER using the version of DQN described in Mnih et al. (2015). This way we can provide a fair comparison by minimally modifying the algorithm to attribute any performance differences to the sampling strategy. Both DQN agents used identical hyperparameters that we list in the Appendix.

Figure 6: Ablation study performed on six Atari 2600 games (BeamRider, Breakout, Pong, Qbert, Seaquest, and SpaceInvaders). We show the full learning curves from the evaluation period that occurs following each 1M frames of training. Scores are normalized by the maximum and minimum value recorded across all ablations for each game. The legend is read as SamplingStrategy_InitialPrioritization_DecayScheme; for example, PSER_CurrentTD_MAX corresponds to the learning curve for PSER with CurrentTD initial prioritization and the MAX decay strategy. The curves shown are PSER_CurrentTD_ADD, PSER_CurrentTD_MAX, PSER_MaxPrio_ADD, PSER_MaxPrio_MAX, PER_CurrentTD, and PER_MaxPrio. Results are smoothed with a 10M frame rolling average to improve clarity.

5.2. Comparison with baselines

Figure 5 shows the relative performance of prioritized sequence experience replay and prioritized replay for the 55 Atari 2600 games for which human scores are available. We compute the human normalized scores for PER and PSER following the methodology in Schaul et al. (2016), which we repeat in the Appendix for clarity. A comparison of all 60 games showing the percent improvement of PSER over PER is also available in the Appendix.

We can see from Figure 5 that PSER leads to substantial improvements over PER. In the games where PSER outperformed PER, we can see that the range of relative difference is much larger as compared to the games where PER outperformed PSER. For PSER, 8 games achieved a relative difference of over 100%, as compared to 3 for PER.

In Table 1 we compare the final evaluation performance of PSER and PER on the Atari 2600 benchmark by calculating the median and mean human normalized scores (see the appendix for the learning curves of all Atari games). PSER achieves a median score of 109% and a mean score of 832% in the no-ops regime, significantly improving upon PER.

Table 1: Median and mean human normalized scores of the best agent snapshot across 55 Atari games for which human scores are available.

SAMPLING STRATEGY    MEDIAN    MEAN
PSER                 109%      832%
PER                  88%       607%

5.3. Ablation Study

To understand how the initial priority assigned to a transition interacts with prioritized sampling, we conducted additional experiments to evaluate the performance.

There are two variants of initial priority assignment that we considered in our ablation study. First, in Mnih et al. (2015); Van Hasselt et al. (2016); Hessel et al. (2018), transitions are added to the replay memory with the maximum priority ever seen. Second, in Horgan et al. (2018), transitions are added with priority calculated from the current TD error of the online model. We refer to these variants as MaxPrio and CurrentTD, respectively.

In each ablation study, we test combinations of the following parameters: a) the prioritized sampling strategy (PSER, PER), b) the initial priority assignment (MaxPrio, CurrentTD), and c) the decay scheme¹ (MAX, ADD), as described in (6) and (7).

Figure 6 compares the performance across six Atari 2600 games. We can see that the choice of the initial priority assignment doesn't appear to lead to a substantial difference in the initial learning speed or performance for both PSER and PER in each game except Seaquest. In Seaquest, we observe that the CurrentTD variant leads to faster learning in the initial 75M frames, but begins to hurt performance throughout the remainder of training, potentially due to over-fitting.

We also find that the MAX decay strategy led to better performance than the ADD decay strategy. Intuitively, to help encourage the Bellman update process from states with high TD error, it makes sense to decay the priority exponentially backwards instead of adding the priorities together.

5.4. Learning Speed

Each agent is run on a single GPU, and the learning speed for each variant varies depending on the game. For a full 200 million frames of training, this corresponded to approximately 5-10 days of computation time depending on the hardware used². We found that the learning speed of PSER is comparable to PER when a small decay coefficient value is used. As this value increases, there is an increase in the computation time due to the larger decay window.

6. Discussion

We have demonstrated that PSER achieves substantial performance increases in both the Blind Cliffwalk environment and the Atari 2600 benchmark.

While performing this analysis, we tested different configurations of PSER and discovered phenomena that we did not expect. Most important was the priority collapse issue described in Section 3.2. By introducing the parameter η to maintain a portion of a transition's previous priority, we prevent the altered priorities created by PSER from quickly reverting to the priorities assigned by PER. We believe that there are two processes inherent in deep reinforcement learning: the Bellman update process inherent in all reinforcement learning and Markov Decision Processes, and the neural network gradient descent update process. Both processes are very slow and require many samples to converge. We hypothesize that keeping the previous transitions' priorities elevated in the replay memory results in additional Bellman updates for the sequence with valuable information. While this speeds up the Bellman update process, it also serves to provide the neural net with better targets, which improves the overall convergence rate.

¹ The decay scheme is unique to PSER, so this is not tested for PER.

² We adapted the Dopamine (Castro et al., 2018) codebase for PSER and PER to compare performance.

When running experiments on the Atari 2600 benchmark, we needed to choose a fixed hyperparameter set for a fair comparison between PSER and PER. However, the games in the Atari 2600 benchmark vary in how reward is obtained and how long the delay is between action and reward. Even though we achieved substantial improvement over PER with a fixed decay window, allowing the decay window to vary for each game may lead to better performance. One approach is to introduce an adaptive decay window based on the magnitude of the TD error. We leave the investigation of an adaptive decay window to future work.

It remains unclear whether the MaxPrio or CurrentTD initial priority assignment should be used when adding new transitions to the replay memory. For the Blind Cliffwalk experiments, we found that the MaxPrio approach delayed convergence to the true Q-value as compared to CurrentTD. However, on Atari, we found the MaxPrio approach to be more effective. Intuitively, adding transitions to the replay memory with the current TD error makes sense to encourage the agent to initially sample these high priority transitions sooner. Adding with max priority should result in an artificially high priority for most new transitions. We hypothesize that this may be related to the priority collapse problem, where these artificially high priorities temporarily allow better information flow during the learning process.

We chose to implement PSER on top of DQN primarily for the purpose of enabling a fair comparison in the experiments between PER and PSER, but combining PSER with other algorithms is an interesting direction for future work. For example, PSER can also be used with other off-policy algorithms such as Double Q-learning and Rainbow.

7. Conclusion

In this paper we introduced Prioritized Sequence Experience Replay (PSER), a novel framework for prioritizing sequences of transitions to learn both more efficiently and more effectively. This method shows substantial performance improvements over PER and uniform sampling in the Blind Cliffwalk environment, and we show theoretically that PSER is guaranteed to converge faster than PER. We also demonstrate the performance benefits of PSER in the Atari 2600 benchmark, with PSER outperforming PER in 40 out of 60 Atari games. We show that an improved ability for information to flow during the training process can lead to faster convergence as well as increased performance, potentially leading to increased data efficiency for deep reinforcement learning problems.


References

Andre, D., Friedman, N., and Parr, R. Generalized prioritized sweeping. In Advances in Neural Information Processing Systems, pp. 1001–1007, 1998.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Castro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. Dopamine: A research framework for deep reinforcement learning. CoRR, abs/1812.06110, 2018. URL http://arxiv.org/abs/1812.06110.

de Bruin, T., Kober, J., Tuyls, K., and Babuška, R. The importance of experience replay database composition in deep reinforcement learning. In Deep Reinforcement Learning Workshop, NIPS, 2015.

Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., et al. Noisy networks for exploration. International Conference on Learning Representations, 2018.

Gal, Y. Uncertainty in deep learning. University of Cambridge, 2016.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. Association for the Advancement of Artificial Intelligence, 2018.

Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., Van Hasselt, H., and Silver, D. Distributed prioritized experience replay. International Conference on Learning Representations, 2018.

Karimpanal, T. G. and Bouffanais, R. Experience replay using transition sequences. Frontiers in Neurorobotics, 12:32, 2018.

Lee, S. Y., Choi, S., and Chung, S.-Y. Sample-efficient deep reinforcement learning via episodic backward update. arXiv preprint arXiv:1805.12375, 2018.

Li, Y. Deep reinforcement learning. CoRR, abs/1810.06339, 2018. URL http://arxiv.org/abs/1810.06339.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Moore, A. W. and Atkeson, C. G. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1):103–130, 1993.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. International Conference on Learning Representations, 2016.

Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In AAAI, volume 2, pp. 5, Phoenix, AZ, 2016.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016a.

Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pp. 1995–2003, 2016b.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Zhang, S. and Sutton, R. S. A deeper look at experience replay. arXiv preprint arXiv:1712.01275, 2017.

Zhong, Y., Wang, B., and Wang, Y. Reward backpropagation prioritized experience replay. 2017.


Appendix

1. Theorem

We define the Blind Cliffwalk as the following Markov Decision Process (MDP). The state space of this MDP consists of n different states: {s1, s2, · · · , sn}. At each state, the agent has two actions to choose from, {a1, a2}, and here we assume a1 is the correct action and a2 is the wrong action. The correct action takes the agent to the next state and the wrong action takes the agent back to the initial state s1:

T(s_i, a_j) = \begin{cases} s_{i+1}, & \text{for } i \in \{1, \cdots, n-1\},\ j = 1 \\ s_1, & \text{otherwise} \end{cases}    (12)

The reward function is defined such that the agent can get a positive reward r only from taking the correct action at state sn:

R(s_i, a_j) = \begin{cases} 1, & \text{if } i = n \text{ and } j = 1 \\ 0, & \text{otherwise} \end{cases}    (13)

The Q-learning algorithm (Watkins & Dayan, 1992) estimates the state-action value function (for discounted return) as follows:

Q_{t+1}(s, a) = (1 - \alpha_t) Q_t(s, a) + \alpha_t \left( R(s, a) + \gamma \max_{b \in U(s')} Q_t(s', b) \right)    (14)

where s' is the state reached from state s when performing action a at time t, and α_t is the learning rate of the Q-learning algorithm at time t.

In this paper we consider an asynchronous Q-learning process which updates a single entry at each step, with different sampling strategies (PER and PSER) from the experience replay memory.

Since the Blind Cliffwalk environment is a deterministic world, we can set the learning rate of the Q-learning algorithm to 1, which means after one update we can get the accurate Q-value. Next we present the convergence speed of the Q-learning algorithm with the PER and PSER sampling strategies, showing the PSER sampling strategy can help the Q-learning algorithm converge much faster than the PER sampling strategy.

Theorem 1 Consider the Blind Cliffwalk environment with n states. If we set the learning rate of the asynchronous Q-learning algorithm in Equation 14 to 1, then, with the replay memory pre-filled with state transitions by exhaustively executing all 2^n possible action sequences, the expected number of steps for the Q-learning algorithm to converge with the PER sampling strategy is

\mathbb{E}_{PER,n}[N] = 1 + (2^{n+1} - 2)\left(1 - \frac{1}{2^{n-1}}\right)    (15)

and the expected number of steps for the Q-learning algorithm to converge with the PSER sampling strategy with decay coefficient ρ is

\mathbb{E}_{PSER,n}[N] \le \begin{cases} \frac{n(n+1)}{2}, & \text{if } \rho = 0.5 \\ \frac{n}{1-2\rho} - \frac{2\rho - (2\rho)^{n+1}}{(1-2\rho)^2}, & \text{otherwise} \end{cases}    (16)

Proof. We first define the “Q-interval” for the Q-learning process, then we show that after n Q-intervals the Q-learning algorithm is guaranteed to converge to the true Q-value, and finally we calculate the expected number of steps of each Q-interval for the PER and PSER sampling strategies.

Here we define a “Q-interval” to be an interval in which every state-action pair (s, a) is tried at least once. Without loss of generality, we initialize the Q-value for each state-action pair to be 0 (i.e., Q̂_0(s, a) = 0, ∀s, a), and we denote the Q-value of the Q-learning algorithm after the ith Q-interval as Q̂_i(s, a).

Then we show by induction that the Q-learning algorithm will converge to the true Q-value after n Q-intervals. From value iteration we know the true Q-value function has the following form:

Q^*(s_i, a_j) = \begin{cases} \gamma^{n-i}, & \text{for } j = 1 \\ 0, & \text{otherwise} \end{cases}    (17)

The base case is that after the first Q-interval we have

\hat{Q}_1(s, a) = Q^*(s, a), \quad \forall s \in \{s_n\},\ a \in \{a_1, a_2\}.    (18)

This is true since we have

\hat{Q}_1(s_n, a_1) = R(s_n, a_1) + \gamma \max_{a'} \hat{Q}_0(s_1, a') = 1 = \gamma^{n-n}    (19)

and

\hat{Q}_1(s_n, a_2) = R(s_n, a_2) + \gamma \max_{a'} \hat{Q}_0(s_1, a') = 0.    (20)

Assume that after the ith Q-interval we have

\hat{Q}_i(s, a) = Q^*(s, a), \quad \forall s \in \{s_{n-i+1}, \cdots, s_n\},\ a \in \{a_1, a_2\}.    (21)

Then after the (i+1)th Q-interval, we have

\hat{Q}_{i+1}(s_{n-i}, a_1) = R(s_{n-i}, a_1) + \gamma \max_{a'} \hat{Q}_i(s_{n-i+1}, a') = 0 + \gamma \times \max_{a'} Q^*(s_{n-i+1}, a') = \gamma \times \max\{\gamma^{i-1}, 0\} = \gamma^i = Q^*(s_{n-i}, a_1)    (22)

and

\hat{Q}_{i+1}(s_{n-i}, a_2) = R(s_{n-i}, a_2) + \gamma \max_{a'} \hat{Q}_0(s_1, a') = 0.    (23)

Also, for s ∈ {s_{n-i+1}, · · · , s_n}, the values Q̂_{i+1}(s, a) will not change, since

\hat{Q}_{i+1}(s, a) = R(s, a) + \gamma \max_{a'} \hat{Q}_i(s', a') = R(s, a) + \gamma \max_{a'} Q^*(s', a') = Q^*(s, a),    (24)

where we use the Bellman optimality equation.

Thus from Equations 22, 23, and 24 we conclude that after the nth Q-interval,

\hat{Q}_n(s, a) = Q^*(s, a), \quad \forall s \in \{s_1, \cdots, s_n\},\ a \in \{a_1, a_2\}.    (25)

Finally, we calculate the expected number of steps of each interval for the PER and PSER sampling strategies. For a fair comparison between the PER and PSER sampling strategies, the replay memory for sampling is first filled with state transitions by exhaustively executing all 2^n possible sequences of actions until termination (in random order); in this way the total number of state transitions will be 2^{n+1} − 2. While initializing the replay memory, each transition is assigned a priority equal to the TD error of the transition. After the initialization, only the state transition (s_n, a_1, r, s_1) has priority 1 and all the remaining state transitions have priority 0 (here we keep the priorities at 0 instead of a small number ε for simplicity). When performing the Q-learning iteration, the state transitions are sampled according to their priorities.

In fact, from the induction process we can see that the expected number of steps in the ith Q-interval (denoted N_i) equals the expected number of steps for the state-action pair (s_{n-i+1}, a_1) to be sampled. We use this insight to calculate E[N_i].

For the PER sampling strategy,

\mathbb{E}[N_1] = 1, \qquad \mathbb{E}[N_i] = \frac{2^{n+1} - 2}{2^{i-1}} \ \text{ for } i \in \{2, \cdots, n\}.    (26)

The first equation follows immediately from the fact that only the transition (s_n, a_1, r, s_1) has non-zero priority (whose priority after updating drops to 0). The second equation follows from the fact that there are 2^{i-1} state transitions (s_{n-i+1}, a_1, r, s_{n-i}). After the (i−1)th Q-interval, all transitions have equal priority and the probability for transition (s_{n-i+1}, a_1, r, s_{n-i}) to get sampled equals

p = \frac{2^{i-1}}{2^{n+1} - 2}.    (27)

Thus the expected number of steps for the PER sampling strategy to converge equals

\mathbb{E}_{PER,n}[N] = \sum_{i=1}^{n} \mathbb{E}[N_i] = 1 + \frac{2^{n+1} - 2}{2} + \cdots + \frac{2^{n+1} - 2}{2^{n-1}} = 1 + (2^{n+1} - 2)\left(1 - \frac{1}{2^{n-1}}\right).    (28)

We can see that as n → ∞, E_{PER,n}[N] → 2^{n+1}, which indicates the number of steps to convergence will grow exponentially.

Next we consider the PSER sampling strategy. When initializing the replay memory, the PSER sampling strategy will assign the state transition (s_n, a_1, r, s_1) priority 1 according to its TD error. Then PSER decays the priority backwards according to the decay coefficient ρ ∈ (0, 1), so that transition (s_{n-1}, a_1, r, s_n) has priority ρ and transition (s_{n-i+1}, a_1, r, s_{n-i}) has priority ρ^{i-1}. Thus in the first Q-interval, let N_1 denote the expected number of steps to sample transition (s_n, a_1, r, s_1), and let A_k denote the event that transition (s_{n-1}, a_1, r, s_n) gets sampled at the kth sample, where k ∈ (0, ∞); then we have

\mathbb{E}[N_1] = \sum_{k=0}^{\infty} P(A_k)\, k = P(A_1) \times 1 + P(A_2) \times 2 + \cdots \le p \times 1 + (1-p)p \times 2 + \cdots = 1/p = \sum_{i=1}^{n} \rho^{n-i},    (29)

where p = P(A_1) = 1 / \sum_{i=1}^{n} \rho^{n-i} and the inequality is due to the fact that if we sample other state transitions, their priority will drop and the probability of transition (s_{n-1}, a_1, r, s_n) getting sampled will increase. Similarly, we have

\mathbb{E}[N_i] \le \frac{\sum_{k=i-1}^{n-1} \rho^k}{\rho^{i-1}} = \sum_{k=0}^{n-i} \rho^k.    (30)

Thus, we have

\mathbb{E}_{PSER,n}[N] = \sum_{i=1}^{n} \mathbb{E}[N_i] \le \sum_{i=1}^{n} \sum_{k=0}^{n-i} \rho^k = \frac{n}{1-\rho} - \frac{\rho - \rho^{n+1}}{(1-\rho)^2}.    (31)

We can see that as n → ∞, E_{PSER,n}[N] grows only linearly with n. Therefore the PSER sampling strategy converges much faster than PER in the Blind Cliffwalk. □

Next we show that for the PSER sampling strategy, the expected number of steps in one Q-interval is fewer than for the PER sampling strategy. In fact, calculating the expected steps for each Q-interval is intractable since the priorities keep changing throughout the sampling process, so here we consider the expected steps in one Q-interval for any given priorities.

2. Blind Cliffwalk Experiment

For the Blind Cliffwalk experiments, we use a tabular Q-learning setup with four different experience replay schemes, where the Q-values are represented using a tabular look-up table.

For the tabular Q-learning algorithm, the replay memory of the agent is first filled by exhaustively executing all 2^n possible sequences of actions until termination (in random order). This guarantees that exactly one sequence will succeed and hit the final reward, and all others will fail with zero reward. The replay memory contains all the relevant experience (the total number of transitions is 2^{n+1} − 2) at the frequency that it would be encountered when acting online with a random behavior policy.

After generating all the transitions in the replay memory, the agent will next select a transition from the replay memory to learn from at each time step. For each transition, the agent first computes its TD error using:

\delta_t := R_t + \gamma_t \max_a Q(S_t, a) - Q(S_{t-1}, A_{t-1})    (32)

and updates the parameters using stochastic gradient ascent:

\theta \leftarrow \theta + \eta \cdot \delta_t \cdot \nabla_\theta Q \big|_{S_{t-1}, A_{t-1}} = \theta + \eta \cdot \delta_t \cdot \phi(S_{t-1}, A_{t-1})    (33)
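As an illustration of Equations 32–33 (our own sketch, with one-hot features so the linear gradient step reduces to a tabular update; the terminal-state handling is simplified):

import numpy as np

def q_learning_update(theta: np.ndarray, phi, transition, gamma: float, step_size: float):
    s_prev, a_prev, r, s_next = transition
    q_prev = theta @ phi(s_prev, a_prev)
    # Eq. 32: one-step TD error bootstrapped from the next state's greedy value.
    delta = r + gamma * max(theta @ phi(s_next, a) for a in (0, 1)) - q_prev
    # Eq. 33: gradient step; with one-hot features this touches a single table entry.
    theta = theta + step_size * delta * phi(s_prev, a_prev)
    return theta, delta

n = 16
phi = lambda s, a: np.eye(2 * n)[2 * s + a]          # one-hot feature for the (state, action) pair
theta = np.zeros(2 * n)
theta, delta = q_learning_update(theta, phi, (14, 1, 0.0, 15), gamma=1 - 1 / n, step_size=0.25)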

The four different replaying schemes we will be using here are uniform, oracle, PER, and PSER. For the uniform replaying scheme, the agent will randomly select the transition from the replay memory uniformly. For the oracle replaying scheme, the agent will greedily select the transition that maximally reduces the global loss (in hindsight, after the parameter update). For the PER replaying scheme, the agent will first set the priorities of all transitions to either 0 or 1. Then, after each update, the agent will assign a new priority to the sampled transition using:

p = |\delta| + \epsilon    (34)

and the probability of sampling transition i is

P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}    (35)

where δ is the TD error for the sampled transition, which can be calculated from Equation 32, α = 0.5, and ε = 0.0001. For the PSER replaying scheme, we first calculate the priority as in Equation 34 and propagate the priority back 5 steps:

p_{n-1} = \max\{\rho^1 p_n, p_{n-1}\}
p_{n-2} = \max\{\rho^2 p_n, p_{n-2}\}
p_{n-3} = \max\{\rho^3 p_n, p_{n-3}\}
\cdots    (36)

Then the agent will sample transitions from the replay memory with probability based on Equation 35.

For this experiment, we vary the size of the problem (the number of states n) from 13 to 16. The discount factor is set to γ = 1 − 1/n, which keeps values on approximately the same scale independently of n. This allows us to use a fixed step-size of η = 1/4 in all experiments.

We track the mean squared error (MSE) between the ground truth Q-value and the Q-learning result every 100 iterations. Better performance in this experiment means that the loss curve more closely matches the oracle. The PER paper demonstrated that PER improves performance over uniform sampling. Our results show that PSER further improves the results with much faster and earlier convergence as compared to PER. We show results varying the state-space size from 13 to 16 and find that PSER consistently outperforms PER in this problem, as shown in Figure 7.

3. Evaluation Methodology

In our Atari experiments, our primary baseline for comparison was Prioritized Experience Replay (PER). For each implementation we used the standard DQN algorithm without any additional modifications to provide a fair comparison between the different sampling techniques. All of the hyperparameters for DQN were the same between the PSER and PER implementations. The hyperparameters are shown in Table 4.

3.1. Hyperparameters

In selecting our final set of hyperparameters for PSER, we tested a range of different values over a subset of Atari games. Table 2 lists the range of values that were tried for each parameter and Table 3 lists the chosen parameters. To obtain the final set of parameters, two parameters were held constant while we tuned one; then we fixed the tuned parameter with the best performance and tuned the next. This greatly reduced the search space of parameters and led to a set of parameters that performed well.

Figure 7: Results of the Blind Cliffwalk environment comparing the number of iterations until convergence to the true Q-value among the PSER, PER, uniform, and oracle agents. Panels (a)–(h) show 13, 14, 15, and 16 states, each with all transitions initialized with either max priority or ε priority. In each case, PSER further improves upon the performance of PER and leads to faster convergence to the true Q-value.


Table 2: PSER hyperparameters tested in experiments.

    HYPERPARAMETER          RANGE OF VALUES
    DECAY WINDOW W          5, 10, 20
    DECAY COEFFICIENT ρ     0.4, 0.65, 0.8
    PREVIOUS PRIORITY η     0, 0.3, 0.5, 0.7

Table 3: Finalized PSER hyperparameters.

    PARAMETER               VALUE
    DECAY WINDOW W          5
    DECAY COEFFICIENT ρ     0.4
    PREVIOUS PRIORITY η     0.7

Table 4: DQN hyperparameters.

    PARAMETER                        VALUE
    MINIBATCH SIZE                   32
    MIN HISTORY TO START LEARNING    50K FRAMES
    RMSPROP LEARNING RATE            0.00025
    RMSPROP GRADIENT MOMENTUM        0.95
    EXPLORATION ε                    1.0 → 0.01
    EVALUATION ε                     0.001
    TARGET NETWORK PERIOD            10K FRAMES
    RMSPROP ε                        1.0 × 10−5
    PRIORITIZATION TYPE              PROPORTIONAL
    PRIORITIZATION EXPONENT α        0.5
    PRIORITIZATION I.S. β            0.5

3.2. Normalization

The normalized score for the Atari 2600 games is calculated as in (Schaul et al., 2016):

score_normalized = (score_agent − score_random) / |score_human − score_random|.    (37)

We have listed the reported Human and Random scores that were used for normalization in Table 5.

For the ablation study presented in Figure 6, we normalized the results based on the maximum and minimum values achieved during the ablation study for each game.
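For reference, Equation (37) can be implemented directly; the helper below is an illustrative sketch, not part of the released code.

    def normalized_score(score_agent, score_random, score_human):
        # Human-normalized score as in Equation (37):
        # 0 corresponds to random play, 1 to human-level play.
        return (score_agent - score_random) / abs(score_human - score_random)

    # Example with the Alien row of Table 5 (PSER column):
    # normalized_score(4297.9, 227.8, 7127.7) ~= 0.59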

4. Pseudocode

Algorithm 1 lists the pseudocode for the PSER algorithm. Note that this pseudocode closely follows the pseudocode from (Schaul et al., 2016); the difference is in how we update the priorities of transitions in the replay memory.

5. Full results

We now present the full results on the Atari benchmark. Figure 8 shows the detailed learning curves for all 60 Atari games using the no-op starts testing regime. These learning curves are smoothed with a moving average over 10M frames to improve readability. For each Atari game, DQN with PER and DQN with PSER are presented.
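As a sketch of this smoothing step, assuming evaluation scores are logged at a fixed frame interval (the interval and function name are illustrative):

    import numpy as np

    def smooth_curve(scores, frames_per_point, window_frames=10_000_000):
        # Trailing moving average over a window of `window_frames` frames,
        # where `scores` are scores logged every `frames_per_point` frames.
        window = max(1, window_frames // frames_per_point)
        kernel = np.ones(window) / window
        return np.convolve(scores, kernel, mode="valid")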

Figure 9 shows the percent change of PSER baselined against PER. Here we present the results from all 60 Atari games, as the scores are not human-normalized. The percent change is calculated as follows:

Percent Change = (score_PSER − score_PER) / score_PER.    (38)
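A one-line helper corresponding to Equation (38) (the name is illustrative):

    def percent_change(score_pser, score_per):
        # Relative change of PSER with respect to PER (Equation (38)), in percent.
        return 100.0 * (score_pser - score_per) / score_per

    # Example with the Asterix row of Table 5:
    # percent_change(32766.5, 7806.0) ~= 319.8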

Table 5 presents a breakdown of the best scores achieved by each algorithm on all 55 Atari games for which human scores were available. Bolded entries within each row highlight the higher-performing result between PSER and PER.


Algorithm 1 Prioritized Sequence Experience Replay (PSER)

Input: minibatch k, step-size ξ, replay period K and size N, exponents α and β, budget T, decay window W, decay coefficient ρ, previous priority η.
Initialize replay memory H = ∅, ∆ = 0, p1 = 1
Observe S0 and choose A0 ∼ πθ(S0)
for t = 1 to T do
    Observe St, Rt, γt
    Store transition (St−1, At−1, Rt−1, γt, St) in H with maximal priority pt = max_{i<t} pi
    if t ≡ 0 mod K then
        for j = 1 to k do
            Sample transition j ∼ P(j) = pj^α / Σi pi^α
            Compute importance-sampling weight wj = (N · P(j))^−β / maxi wi
            Compute TD-error δj = Rj−1 + γj maxa Qtarget(Sj, a) − Q(Sj−1, Aj−1)
            Update transition priority pj ← max{|δj| + ε, η · pj}
            Accumulate weight-change ∆ ← ∆ + wj · δj · ∇θ Q(Sj−1, Aj−1)
            for l = 1 to W do
                Update priority of the transition l steps backward: pj−l ← max{(|δj| + ε) · ρ^l, pj−l}
            end for
        end for
        Update weights θ ← θ + ξ · ∆, reset ∆ = 0
        From time to time copy weights into target network θtarget ← θ
    end if
    Choose action At ∼ πθ(St)
end for
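For concreteness, the PSER priority update at the heart of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration assuming a proportional-prioritization buffer whose priorities are stored in insertion order; it is not the authors' released implementation.

    def pser_priority_update(priorities, j, td_error,
                             eta=0.7, rho=0.4, window=5, eps=1e-6):
        # Update the sampled transition's priority and propagate a decayed
        # priority backward over the previous `window` transitions, as in
        # Algorithm 1. `priorities` holds the buffer priorities in insertion
        # order and `j` is the index of the sampled transition.
        boost = abs(td_error) + eps
        # p_j <- max{|delta_j| + eps, eta * p_j}
        priorities[j] = max(boost, eta * priorities[j])
        for l in range(1, window + 1):
            if j - l < 0:
                break
            # p_{j-l} <- max{(|delta_j| + eps) * rho^l, p_{j-l}}
            priorities[j - l] = max(boost * rho ** l, priorities[j - l])
        return priorities

With the finalized values W = 5, ρ = 0.4, and η = 0.7 from Table 3, a large TD error at transition j boosts the five preceding priorities by factors 0.4, 0.16, 0.064, 0.0256, and 0.01024 of |δj| + ε, so the sequence leading to a surprising transition is sampled more often.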


[Figure 8: sixty per-game learning-curve panels (score versus millions of frames); see the caption below.]

Figure 8: Learning curves for DQN with PSER (orange) and DQN with PER (blue) for all 60 games of the Atari 2600 benchmark. Each curve corresponds to a single training run over 200 million unique frames, with a moving average smoothed over 10 million frames for clarity.


[Figure 9: bar chart of the percent change of PSER relative to PER for each of the 60 games, with games ordered left to right from IceHockey and Tennis (largest decreases) to Zaxxon and Robotank (largest increases); vertical axis ranging from 0% to 800%.]

Figure 9: Percent change of PSER relative to PER across all 60 Atari 2600 benchmark games. 0% on the vertical axis implies equivalent performance; positive values represent the cases where PSER performed better; negative values represent the cases where PSER performed worse.


Table 5: No-op starts evaluation regime. Here we report the raw scores across all games, averaged over 200 evaluation episodes, from the agent snapshot that obtained the highest score during training. PER and PSER were evaluated using the DQN algorithm described in (Mnih et al., 2015).

    GAME                 RANDOM      HUMAN       PER         PSER
    AIRRAID              -           -           8,660.8     10,504.2
    ALIEN                227.8       7,127.7     2,724.1     4,297.9
    AMIDAR               5.8         1,719.5     364.2       1,351.9
    ASSAULT              222.4       742.0       7,761.3     6,758.1
    ASTERIX              210.0       8,503.3     7,806.0     32,766.5
    ASTEROIDS            719.1       47,388.7    905.6       1,566.8
    ATLANTIS             12,850.0    29,028.1    810,043.0   848,064.5
    BANKHEIST            14.2        753.1       894.5       1,091.5
    BATTLEZONE           2,360.0     37,187.5    26,215.0    39,195.0
    BEAMRIDER            363.9       16,926.5    24,100.2    30,548.9
    BERZERK              123.7       2,630.4     618.5       1,228.0
    BOWLING              23.1        160.7       46.7        41.7
    BOXING               0.1         12.1        92.8        99.9
    BREAKOUT             1.7         30.5        355.1       429.1
    CARNIVAL             -           -           5,560.0     6,086.5
    CENTIPEDE            2,090.9     12,017.0    6,192.1     6,542.0
    CHOPPERCOMMAND       811.0       7,387.8     746.5       1,317.5
    CRAZYCLIMBER         10,780.5    35,829.4    104,080.5   140,918.0
    DEMONATTACK          152.1       1,971.0     22,711.9    74,366.0
    DOUBLEDUNK           -18.6       -16.4       20.6        13.7
    ELEVATORACTION       -           -           47,825.5    75.0
    ENDURO               0.0         860.5       753.0       901.4
    FISHINGDERBY         -91.7       -38.7       13.1        36.3
    FREEWAY              0.0         29.6        0.0         0.0
    FROSTBITE            65.2        4,334.7     3,501.1     1,162.2
    GOPHER               257.6       2,412.5     4,446.8     17,524.7
    GRAVITAR             173.0       3,351.4     1,569.5     918.8
    HERO                 1,027.0     30,826.4    15,678.6    20,447.5
    ICEHOCKEY            -11.2       0.9         7.3         -2.8
    JAMESBOND            29.0        302.8       3,908.8     1,572.0
    JOURNEYESCAPE        -           -           7,423.0     3,898.5
    KANGAROO             52.0        3,035.0     12,150.5    15,051.0
    KRULL                1,598.0     2,665.5     8,189.1     8,436.6
    KUNGFUMASTER         258.5       22,736.3    14,673.5    28,658.0
    MONTEZUMAREVENGE     0.0         4,753.3     0.0         0.0
    MSPACMAN             307.3       6,951.6     4,875.3     3,834.3
    NAMETHISGAME         2,292.3     8,049.0     6,398.2     7,370.2
    PHOENIX              761.4       7,242.6     12,465.5    15,228.2
    PITFALL              -229.4      6,463.7     -8.8        0.0
    PONG                 -20.7       14.6        21.0        21.0
    POOYAN               -           -           3,802.2     6,013.2
    PRIVATEEYE           24.9        69,571.3    253.0       247.0
    QBERT                163.9       13,455.0    11,463.1    15,396.1
    RIVERRAID            1,338.5     17,118.0    9,684.4     8,169.9
    ROADRUNNER           11.5        7,845.0     41,578.5    51,851.0
    ROBOTANK             2.2         11.9        5.9         52.7
    SEAQUEST             68.4        42,054.7    8,547.4     10,375.2
    SKIING               -17,098.1   -4,336.9    -8,343.3    -9,807.1
    SOLARIS              1,236.3     12,326.7    1,331.7     1,253.2
    SPACEINVADERS        148.0       1,668.7     1,774.0     6,754.8
    STARGUNNER           664.0       10,250.0    15,672.0    35,448.5
    TENNIS               -23.8       -8.3        23.7        0.0
    TIMEPILOT            3,568.0     5,229.2     7,545.0     9,033.5
    TUTANKHAM            11.4        167.6       223.3       180.9
    UPNDOWN              533.4       11,693.2    10,786.2    12,098.9
    VENTURE              0.0         1,187.5     1,115.0     152.5
    VIDEOPINBALL         16,256.9    17,667.9    232,144.6   340,562.7
    WIZARDOFWOR          563.5       4,756.5     1,674.0     8,644.5
    YARSREVENGE          3,092.9     54,576.9    19,538.9    28,049.8
    ZAXXON               32.5        9,173.3     790.0       6,207.5