
Proximal Policy Optimization with Mixed Distributed Training

1st Zhenyu Zhang
School of Engineering and Science, Shanghai University, Shanghai, China
[email protected]

2nd Xiangfeng Luo
School of Engineering and Science, Shanghai University, Shanghai, China
[email protected]

3rd Shaorong Xie
School of Engineering and Science, Shanghai University, Shanghai, China
[email protected]

4th Jianshu Wang
School of Engineering and Science, Shanghai University, Shanghai, China
[email protected]

5th Wei Wang
School of Engineering and Science, Shanghai University, Shanghai, China
[email protected]

6th Yang Li
School of Engineering and Science, Shanghai University, Shanghai, China
[email protected]

Abstract—Instability and slowness are two main problems in deep reinforcement learning. Even though proximal policy optimization (PPO) is the state of the art, it still suffers from these two problems. We introduce an improved algorithm based on PPO, mixed distributed proximal policy optimization (MDPPO), and show that it can accelerate and stabilize the training process. In our algorithm, multiple different policies train simultaneously, and each of them controls several identical agents that interact with environments. Actions are sampled by each policy separately as usual, but the trajectories for the training process are collected from all agents instead of only one policy. We find that if we choose some auxiliary trajectories elaborately to train policies, the algorithm is more stable and quicker to converge, especially in environments with sparse rewards.

Index Terms—reinforcement learning, distributed reinforcement learning, Unity

I. INTRODUCTION

Function approximation is now becoming a standard method to represent any policy or value function in reinforcement learning, instead of a tabular form [25]. An emerging trend of function approximation is to combine deep learning with much more complicated neural networks, such as convolutional neural networks (DQN [14]), in large-scale or continuous state spaces. However, while DQN solves problems with high-dimensional state spaces, it can only handle discrete action spaces. Policy-gradient-like [26] or actor-critic style [11] algorithms directly model and train the policy. The output of a policy network is usually a continuous Gaussian distribution. However, compared with traditional lookup tables or linear function approximation, nonlinear function approximations cannot be guaranteed to converge [2] due to their nonconvex optimization. Thus the performance of algorithms using a nonlinear value function or policy is very sensitive to the initial values of the network weights.

Although PPO, which we mainly discuss in this paper, is the state of the art, it is built on the actor-critic framework. According to [17], actor-critic is to some extent a kind of generative adversarial network, which has a very tricky disadvantage of instability during the training process. Combining nonlinear function approximations makes training even worse. Thus, the speed of convergence varies greatly, or divergence may happen, even if two policies have the same hyperparameters and environments, which makes developers confused about whether the reason is the algorithm itself, the hyperparameters, or just the initial values of the neural network weights. So, for now, instability and slowness to convergence are still two significant problems in deep reinforcement learning with nonlinear function approximation [3] and actor-critic style architectures.

In this paper, we propose a mixed distributed training approach which is more stable and quicker to converge than standard PPO. We set multiple policies with different initial parameters to control several agents at the same time. Decisions that affect the environments can only be made by each agent, but all trajectories are gathered and allocated to policies for training after elaborate selection. We define two selection criteria: (i) complete trajectories, where an agent successfully completes a task, like reaching a goal, or lasts until the end of an episode, and (ii) auxiliary trajectories, where an agent doesn't complete a task but the cumulative reward is much higher than in other trajectories. In previous works, [23], [16], [8], [5] only distribute agents, not policies. [13] or RL with evolution strategies like [18] distribute policies, but they only choose and dispatch the best policy with the highest cumulative reward. So, this work is the first, to our knowledge, to distribute both policies and agents and to mix trajectories to stabilize and accelerate RL training.

For simplicity of implementation, we evaluated our approach on the Unity platform with the Unity ML-Agents Toolkit [10], which can easily handle parallel physical scenes. We tested four Unity scenarios: Simple Roller, 3D Ball, Obstacle Roller and Simple Boat. Our algorithm performs very well in all scenarios with random initial weights. The source code can be accessed at https://github.com/BlueFisher/RL-PPO-with-Unity.


II. RELATED WORK

Several works have been done to solve the problem of instability. In deep Q networks (DQN [15]), a separate neural network is used to generate the target Q value in the online Q-learning update, making the training process much more stable. The experience replay technique stores the agent's transitions at each time step in a data set and samples a mini-batch from it during the training process, in order to break the correlation between transitions, since the input of neural networks is preferably independent and identically distributed. Due to the maximization step, conventional DQN is affected by an overestimation bias; Double Q-learning [27] addresses this overestimation by decoupling the selection of the action from its evaluation. Prioritized experience replay [20] improves data efficiency by replaying important transitions more often. The dueling network architecture [28] helps by predicting the advantage function instead of the state-action value function. Deep deterministic policy gradient (DDPG [12]) applies both experience replay and target networks to the deterministic policy gradient [24], and uses two different target networks to represent the deterministic policy and the Q value. However, in the setting of stochastic policy gradients, almost all algorithms based on the actor-critic style [11] have the problem of divergence, because the difficulty of getting both actor and critic to converge makes the whole algorithm even more unstable. Trust region policy optimization (TRPO [21]) and proximal policy optimization (PPO [23], [7]) try to bound the update of parameters between the policies before and after the update, so that the extent of optimization is not so large that it goes out of control.

In terms of accelerating the training process, the general reinforcement learning architecture (Gorila [16]), asynchronous advantage actor-critic (A3C [13]) and distributed PPO (DPPO [7]) make full use of multi-core processors and distributed systems. Multiple agents act in their own copies of the environment simultaneously. The gradients are computed separately and sent to a central parameter brain, which updates a central copy of the model asynchronously. The updated parameters are then sent back to every agent at a specific frequency. Distributed prioritized experience replay [8] overcomes the disadvantage that reinforcement learning cannot efficiently utilize GPU computing resources, but it can only be applied to algorithms that use a replay buffer, like DQN or DDPG. Besides, a distributed system is not easy or economical to implement. Reinforcement learning with unsupervised auxiliary tasks (UNREAL [9]) accelerates training by adding some additional tasks to learn. However, these tasks mainly focus on problems with graphical state input, predicting the change of pixels, and are difficult to transfer to simple problems with vector state spaces.

III. BACKGROUND

In reinforcement learning, algorithms based on the policy gradient provide an outstanding paradigm for continuous action space problems. The purpose of all such algorithms is to maximize the cumulative expected reward L = E[∑t r(st, at)]. The most commonly used gradient of the objective function L with a baseline can be written in the following form:

∇θL = ∫S µ(s) ∫A ∇θπθ(a|s)A(s, a) = E[∇θ log πθ(a|s)A(s, a)]    (1)

where S denotes the set of all states and A denotes the set of all actions respectively, µ(s) is an on-policy distribution over states, π is a stochastic policy that maps state s ∈ S to action a ∈ A, and A is an advantage function.

TRPO replaces the objective function L with a surrogate objective, which has a constraint on the extent of the update from an old policy to a new one. Specifically,

maximizeθ  E[ (πθ(a|s)/πθold(a|s)) A(s, a) ]
subject to  E[ KL[πθold(·|s), πθ(·|s)] ] ≤ δ    (2)

where πθold is an old policy that generates actions but whose parameters are fixed during the training process. πθ is the policy that we try to optimize, but not too much, so an upper limit δ is added to constrain the KL divergence between πθ and πθold.

PPO can be treated as an approximate but much simpler version of TRPO. It roughly clips the ratio between πθ and πθold and changes the surrogate objective function into the form

LCLIPθ = E[min(r(θ)A, clip(r(θ), 1 − ε, 1 + ε)A)]    (3)

where r(θ) = πθ(a|s)/πθold(a|s) and ε is the clipping bound. Note that PPO here is an actor-critic style algorithm, where the actor is the policy πθ and the critic is the advantage function A.
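To make (3) concrete, the following is a minimal NumPy sketch of the clipped surrogate for a batch of samples (function and variable names are ours, not from the paper's released code):

```python
import numpy as np

def clipped_surrogate(pi_new, pi_old, adv, eps=0.2):
    """Per-sample PPO objective of Eq. (3): min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    r = pi_new / pi_old                                       # probability ratio r(theta)
    return np.minimum(r * adv, np.clip(r, 1.0 - eps, 1.0 + eps) * adv)

# tiny usage example: the mean of these values is what the optimizer maximizes
pi_new = np.array([0.30, 0.05])
pi_old = np.array([0.25, 0.20])
adv    = np.array([1.0, -0.5])
print(clipped_surrogate(pi_new, pi_old, adv).mean())
```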

Intuitively, the advantage function represents how much better a state-action pair is than the average value of the current state, i.e., A(s, a) = Q(s, a) − V(s). The most widely used technique for computing the advantage function is generalized advantage estimation (GAE [22]). One popular form that can easily be applied to PPO or any policy-gradient-like algorithm is

At(st, at) = rt + γrt+1 + · · · + γ^(T−t+1) rT−1 + γ^(T−t) V(sT) − V(st)    (4)

where T denotes the maximum length of a trajectory, not the terminal time step of a complete task, and γ is a discount factor. If the episode terminates, we only need to set V(sT) to zero, without bootstrapping, which becomes At = Gt − V(st), where Gt is the discounted return following time t as defined in [25]. An alternative option is to use a more generalized version of GAE:

At(st, at) = δt + (γλ)δt+1 + · · · + (γλ)^(T−t+1) δT−1    (5)

where δt = rt + γV(st+1) − V(st), which is known as the TD error. Note that (5) reduces to (4) when λ = 1, which has high variance, or to δt when λ = 0, which introduces bias.
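As an illustration, this sketch computes the estimator (5) from a trajectory of rewards and value estimates, with λ = 1 recovering (4); it assumes a bootstrap value of zero for terminated episodes, as described above (names are ours):

```python
import numpy as np

def advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation, Eq. (5); lam=1 recovers Eq. (4)."""
    T = len(rewards)
    values = np.append(values, bootstrap_value)           # V(s_0), ..., V(s_T)
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD errors delta_t
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):                          # discounted sum of deltas
        gae = deltas[t] + gamma * lam * gae
        adv[t] = gae
    return adv

rewards = np.array([-0.01, -0.01, 1.0])
values  = np.array([0.1, 0.2, 0.5])
print(advantages(rewards, values, bootstrap_value=0.0))   # terminated episode: V(s_T) = 0
```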


Furthermore, in DPPO, both data collection and gradient computation are distributed. All workers share one centralized network and synchronize the computed gradients and updated parameters.

We only need to approximate V̂θv(s) instead of Q̂(s, a) to compute the approximation of the advantage function, where δ̂t = rt + γV̂θv(st+1) − V̂θv(st). So, in addition to optimizing the surrogate objective, we have to minimize the loss function

LVθv = (rt + γrt+1 + · · · + γ^(T−t+1) rT−1 + γ^(T−t) V̂θvold(sT) − V̂θv(st))²    (6)

where V̂θvold is the fixed value function which is used to compute δ̂t.
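A corresponding sketch of the n-step target and squared loss in (6), where the fixed value function V̂θvold is represented simply by a frozen bootstrap value (again, names are ours):

```python
import numpy as np

def value_loss(rewards, values, old_bootstrap_value, gamma=0.99):
    """Mean squared error between the return target of Eq. (6) and V(s_t)."""
    T = len(rewards)
    returns = np.zeros(T)
    running = old_bootstrap_value               # the gamma^(T-t) * V_old(s_T) term
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return np.mean((returns - values) ** 2)

rewards = np.array([-0.01, -0.01, 1.0])
values  = np.array([0.1, 0.2, 0.5])
print(value_loss(rewards, values, old_bootstrap_value=0.0))
```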

IV. MIXED DISTRIBUTED PROXIMAL POLICY OPTIMIZATION

We distribute both policies and agents, in contrast to DPPO where only agents are distributed. We set N policies p1, · · · , pN that have exactly the same architectures of policy network and value function network, as well as the same hyperparameters. Each policy, like a command center, controls a group of M identical agents ei1, · · · , eiM, where i denotes the serial number of the policy that the group is related to. So, N × M environments run in parallel, interacting with N × M agents divided into N groups. Each agent eij interacts with its own shared policy pi, sending states and receiving actions. When all agents have finished a complete episode s0, a0, r0, s1, · · · , sT, aT, rT, sT+1, each policy is fed not only with trajectories from the group of agents that it controls but also with a bunch of elaborately chosen trajectories from other groups that may accelerate and stabilize the training process. Figure 1 illustrates how decision and training data flow.

We define two sorts of trajectories to be used for mixed training.

Complete trajectories, where an agent successfully completes a task. We simply divide tasks into two categories: (i) tasks with a specific goal, like letting an unmanned vehicle reach a designated target point, or making a robotic arm grab an object, and (ii) tasks without a goal, where the agent tries to make the trajectory as long as possible until the maximum episode time is reached, like keeping a balance ball on a wobbling plate.

Auxiliary trajectories, where an agent doesn't complete a task but the cumulative reward is much higher than in other trajectories. In many real-world scenarios, extrinsic rewards are extremely sparse. An unmanned surface vehicle may get no reward or cannot reach the target point for a long time in the beginning. However, we can utilize the trajectories that have the highest cumulative rewards to encourage all policies to learn from the currently best performance.

We mix complete trajectories and auxiliary trajectories and allocate them to every policy during the training process. So each policy computes gradients with its usual and mixed data altogether to update its policy network and value function independently.
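A minimal sketch of this selection step, assuming each trajectory is stored as a dict with a `complete` flag and a `cumulative_reward` field (both field names are our assumption), and using the top-40% auxiliary rule from Section V:

```python
def select_mixed_trajectories(trajectories, auxiliary_ratio=0.4):
    """Pick 'complete' trajectories plus the top fraction ranked by cumulative reward."""
    complete = [t for t in trajectories if t["complete"]]
    ranked = sorted(trajectories, key=lambda t: t["cumulative_reward"], reverse=True)
    auxiliary = ranked[: max(1, int(len(ranked) * auxiliary_ratio))]
    # de-duplicate while keeping order: a complete trajectory may also rank highly
    seen, mixed = set(), []
    for t in complete + auxiliary:
        if id(t) not in seen:
            seen.add(id(t))
            mixed.append(t)
    return mixed
```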

Algorithm 1: Mixed Distributed Proximal Policy Optimization (MDPPO)

for iteration = 1, 2, · · · do
    for policy pi = p1, · · · , pN do
        for environment eij = ei1, · · · , eiM do
            Run policy πθi_old in environment eij for T timesteps and compute the advantage function.
            Store trajectories and cumulative rewards in dataset Di.
        end
    end
    Compute "auxiliary" and "complete" trajectories from D1, · · · , DN and store them in D⁻1, · · · , D⁻N.
    for policy pi = p1, · · · , pN do
        Optimize LCLIPi (or LCLIP+i) and LVi respectively, or the combined LCLIP+Vi (or LCLIP++Vi), with mixed transitions from Di and D⁻1, · · · , D⁻i−1, D⁻i+1, · · · , D⁻N.
    end
end

Note that in practice, mixed transitions from other policies may cause the denominator of the ratio r(θ) in (3) to be zero if πθold(a|s) is so small that it exceeds the floating-point range of the computer. Gradients then cannot be computed correctly, resulting in the "nan" problem when using deep learning frameworks like TensorFlow [1]. We introduce four solutions: (i) simply removing all transitions that make πθold(a|s) < ε before training, (ii) adding a small number to the denominator, such that r(θ) = πθ(a|s)/(πθold(a|s) + ε), to make sure it has a lower bound, (iii) rewriting the ratio in exponential form exp(πθ(a|s) − πθold(a|s)) without modifying the clipping bound, and (iv), which we used in our experiments, rewriting the ratio in subtraction form instead of fractional form and limiting the update of the new policy to the range [−ε, ε]:

LCLIP+θ = E[min(r(θ)A, clip(r(θ), −ε, ε)A)],  where r(θ) = πθ(a|s) − πθold(a|s)    (7)

We no longer need to worry about the "nan" problem since there is no denominator in (7).
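For comparison, a sketch of the fractional ratio of (3) next to the subtraction form of (7); the default clipping bounds follow Table III, and the function names are ours:

```python
import numpy as np

def surrogate_fractional(pi_new, pi_old, adv, eps=0.2):
    r = pi_new / pi_old                                   # may overflow if pi_old is near zero
    return np.minimum(r * adv, np.clip(r, 1 - eps, 1 + eps) * adv)

def surrogate_subtraction(pi_new, pi_old, adv, eps=0.02):
    r = pi_new - pi_old                                   # Eq. (7): no denominator, no "nan" risk
    return np.minimum(r * adv, np.clip(r, -eps, eps) * adv)
```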

Furthermore, there are two common forms of loss functions and network architectures: (i) training separate LV and LCLIP, as in standard actor-critic style algorithms, with completely different weights for the policy and value function networks, and (ii) letting the policy and value function networks share a part of the same weights [23]. In the latter case, LV and LCLIP should be combined into one loss function, LCLIP+V = LCLIP − LV, to make sure the update of the shared weights is stable.
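A small sketch of how the combined objective for shared weights could be formed, with the optional entropy bonus discussed next (the entropy coefficient follows Table III; names are ours):

```python
def combined_objective(l_clip, l_value, entropy=0.0, entropy_coef=0.001):
    """L^{CLIP+V} = L^CLIP - L^V (to be maximized); the entropy bonus is optional in MDPPO."""
    return l_clip - l_value + entropy_coef * entropy
```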

Note that the entropy term LS in [23] only acts as a regularizer, rather than maximizing the entropy as in [6]. We find that in our approach the entropy term is optional, because a part of the data used to update each policy is actually generated by other, different policies, which is the reason that MDPPO can encourage exploration.



Fig. 1: MDPPO: There are N policies in the figure. Each policy controls M agents (red lines) and gives them the decisions for how to interact with the environments (blue and green lines). The data for computing gradients flow through the gray lines, where each policy gets updated not only by the data generated from the agents that it controls, but also from other groups of agents.

Although all policies' architectures and hyperparameters are identical, the initial weights of the neural networks are totally different, which is the source of instability in standard PPO but gives MDPPO a more powerful ability to explore.

Algorithm 1 illustrates mixed distributed proximal policy optimization (MDPPO).

In contrast to DPPO, where the computation of gradients is distributed to every agent and the gradients are then aggregated for the central policy update, in our approach, to take full advantage of GPU resources, agents are only used to generate data, and the transition data are transferred to the policy that controls them. So gradients are computed centrally within each policy. In order to break the correlation between transitions, we shuffle all transitions before training.

A. MDPPO with Separated Critic

Although the label of the red box in Figure 1 is shown as "policy", the data flowing through the gray lines are actually used to update both the policy and the value function, which are usually called the actor and the critic in actor-critic style algorithms. We find that having N different value function networks is not really necessary. We want more actors, and we mix their data to some extent to encourage exploration and increase diversity, which may speed up the training process and make it more stable. However, the critic only needs as many state-reward transitions as possible, trying to make the approximation of the states' values more accurate. Besides, the loss function LVt = (G(st) − V̂(st))² is irrelevant to the policy or action. So, we can separate the value function network from Figure 1 and feed all transitions to one centralized critic. Figure 2 shows the separated value function, which is extracted from the policies in Figure 1.

Algorithm 2 illustrates MDPPO with separated critic, where the separated value function is denoted as V^SEP.

Algorithm 2: Mixed Distributed Proximal Policy Optimization with Separated Critic (MDPPOSC)

for iteration = 1, 2, · · · do
    for policy pi = p1, · · · , pN do
        for environment eij = ei1, · · · , eiM do
            Run policy πθi_old in environment eij for T timesteps and compute the advantage function.
            Store trajectories and cumulative rewards in dataset Di.
        end
    end
    Compute "auxiliary" and "complete" trajectories from D1, · · · , DN and store them in D⁻1, · · · , D⁻N.
    Minimize LV^SEP with transitions from D1, · · · , DN.
    for policy pi = p1, · · · , pN do
        Optimize LCLIPi (or LCLIP+i) with mixed transitions from Di and D⁻1, · · · , D⁻i−1, D⁻i+1, · · · , D⁻N.
    end
end

However, under the circumstances of sparse positive rewards, the value function may be fed with a large number of useless transitions generated by a bad policy at the beginning, which will cause the critic to converge to a local optimum in a very short time and make it hard to jump back out, since the algorithm would have learned a terrible policy under the guidance of a terrible advantage function. In practice, we propose two tricks: (i) reducing the transitions fed for training; we only use a part of the data, usually 1/N, to train the value function, and (ii) taking some ideas from prioritized experience replay [19], but simply feeding only transitions whose TD error is greater than a threshold. We also let the threshold decay exponentially, because we would like the value function to explore more possibilities at the beginning but converge to the real optimum at the end.
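A sketch of the second trick, assuming each transition carries a precomputed TD error and the threshold decays exponentially with the training step (the decay rate and field names are our assumptions):

```python
def filter_by_td_error(transitions, step, base_threshold=0.2, decay=0.999):
    """Keep only transitions whose |TD error| exceeds an exponentially decaying threshold."""
    threshold = base_threshold * (decay ** step)
    return [tr for tr in transitions if abs(tr["td_error"]) > threshold]
```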



Fig. 2: MDPPO with Separated Critic (MDPPOSC): The architecture is the same as in Figure 1, except for the separated value function. The data flowing through the gray lines are now fed not only to the N policies but also to the centralized value function (gray box), which takes all transitions generated by the N × M agents without filtering.

V. EXPERIMENTS

We tested our approach in four environments, Simple Roller, 3D Ball, Obstacle Roller and USV, on the Unity platform with the Unity ML-Agents Toolkit [10]. Distributed experiment environments are difficult and expensive to build in the physical world. Even on one computer, it is still not easy to implement a parallel system that has to manage multiple agents, threads and physical simulations. However, as a game engine, Unity can provide us with fast and distributed simulation. All we need to do is copy an environment several times; Unity will run all environments in parallel without any extra coding. Besides, Unity has a powerful physics engine, which makes it easy to implement an environment that obeys physical laws without implementing details such as gravity or collision detection. Simple Roller and 3D Ball are two standard example environments in the Unity ML-Agents Toolkit. Obstacle Roller is an obstacle avoidance environment based on Simple Roller, which will be introduced in detail in Section V-A. USV is a custom target tracking environment on the water and will also be introduced in Section V-A. The details of the experimental hyperparameters are listed in the Appendix.

A. Environments Setup

Simple Roller (Figure 3a) is a tutorial project of the Unity ML-Agents Toolkit. The agent roller's goal is to hit the target as quickly as possible while avoiding falling off the floor plane. The state space size is 8, representing the relative position between agent and target, the agent's distance to the edges of the floor plane, and the agent's velocity. The action space size is 2, controlling the forces in the x and z directions that are applied to the roller. We set the initial state of each episode to be the terminal state of the previous episode if the agent in the previous episode hit the target. Otherwise, we reset the initial state. Hence, a little oscillation around the optimal reward is normal.

3D Ball (Figure 3b) is an example project of the Unity ML-Agents Toolkit, where we try to control a platform agent to catch a falling ball and keep it balanced on the platform. The state space size is 8, representing the agent's rotation, the relative position between ball and agent, and the agent's velocity. The action space size is 2, controlling the agent's rotation along the x and z axes.

There is a hard version of 3D Ball, 3D Ball Hard, which has the same environment setting and goal except for the state information. The state space size is only 6, with the velocity of the ball removed. So the agent platform is only aware of part of the environment, and the training process becomes a partially observed MDP problem. The agent has to take the current and a bunch of past observations as the input state of the policy and value function neural networks. We stacked 9 observations, as recommended by ML-Agents, so the state space size was actually 54.

Obstacle Roller (Figure 3c) is a custom obstacle avoidance environment based on Simple Roller. We added a white ball between the target and the agent roller and gave it a static policy, trying to prevent the agent from hitting the target. If the agent roller hit the yellow target, we rewarded it 1. If it fell or hit the obstacle, we rewarded it −1. Otherwise, we gave it a time penalty of −0.01, as in Simple Roller. However, besides the time penalty, to tackle the problem of the sparse extrinsic reward signal, we added a 0.01α reward every step, where α ∈ [−1, 1] represents whether the agent's velocity direction points toward the target. The initial state of an episode was the last state of the previous episode unless the agent fell off the platform or hit the obstacle. We tried to make the roller stay on the platform as long as possible while avoiding hitting the obstacle.


TABLE I: Physical coefficients of the USV

max angle                   35 degrees
rotation speed              50 degrees/s
max speed                   35 m/s
max RPM                     800
min RPM                     6000
max thrust                  5000
reverse thrust coefficient  0.5

USV (Figure 3d) is a target tracking environment, just like Simple Roller. The agent is an unmanned surface vessel sailing on the water, trying to hit the white cuboid target while avoiding sailing out of a circular area. The target appears randomly in the area after the USV reaches it. We developed a realistic double-rudder, twin-engine USV and an ocean surface effect, taking full advantage of the Unity physics system. Compared to the simple 2D environment in [4], our environment exhibits a more realistic physics engine (gravity, water friction, drift), graphics (water reflection, illumination, etc.) and collision system. The USV model has realistic thrust and rudder designs compared with a real-world USV. Table I shows the main physical coefficients of the USV. For simplicity of training, the state space size is 12, representing the agent's position, velocity and heading angle, and the target's position. The actions control the USV's thrust and steering. In order to tackle the sparse extrinsic reward, the time penalty of the reward function is −0.01 + 0.005rv + 0.005rd, where rv and rd represent whether the USV's velocity and heading directions are toward the target. So the goal of the USV is to reach the target as soon as possible while keeping its heading angle and velocity direction toward the target as much as possible.
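As an illustration of this shaped time penalty, assuming rv and rd are the cosines of the angles between the velocity/heading directions and the direction to the target (our interpretation; the paper only states that they indicate whether those directions point toward the target):

```python
import numpy as np

def usv_time_penalty(velocity, heading, position, target):
    """Shaped per-step reward: -0.01 + 0.005*r_v + 0.005*r_d."""
    to_target = target - position
    to_target = to_target / (np.linalg.norm(to_target) + 1e-8)
    r_v = float(np.dot(velocity / (np.linalg.norm(velocity) + 1e-8), to_target))
    r_d = float(np.dot(heading / (np.linalg.norm(heading) + 1e-8), to_target))
    return -0.01 + 0.005 * r_v + 0.005 * r_d
```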

B. Our Experiments

Simple Roller. We first tested Simple Roller with 5 policies, i.e., N = 5. Each policy was modeled as a multivariate Gaussian distribution with a mean vector and a diagonal covariance matrix, parameterized by θ = {θµ ∈ R², θΣ ∈ R²}. We set 20 distributed agents and environments in all, i.e., each policy controlled 4 agents and M = 4. The "complete" trajectories were simply the transitions in which the roller hit the target. As for "auxiliary" trajectories, we chose the top 40% of transitions with the greatest cumulative rewards. Figure 4 shows our four experiments: MDPPO, MDPPO with subtraction ratio, and standard PPO with four and with twenty distributed agents. We divided them into two groups, one with shared weights and the other without.
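A minimal NumPy sketch of such a diagonal Gaussian policy head, standing in for the actual policy network (names are ours):

```python
import numpy as np

def sample_action(mean, log_std, rng=None):
    """Sample a 2-D action from a diagonal Gaussian and return its log-probability."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.exp(log_std)
    action = mean + std * rng.standard_normal(mean.shape)
    log_prob = -0.5 * np.sum(((action - mean) / std) ** 2
                             + 2 * log_std + np.log(2 * np.pi))
    return action, log_prob

action, log_prob = sample_action(mean=np.zeros(2), log_std=np.full(2, -0.5))
```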

Both MDPPO and MDPPO with subtraction ratio outperformed standard PPO. The time to convergence was shorter, and, most importantly, MDPPO was much more stable. All five policies converged almost simultaneously and had similar performance. In contrast, only a few of the policies optimized by standard PPO converged, which shows the instability of PPO. Besides, the speed of convergence of the PPO policies varied greatly: some converged as quickly as MDPPO, but most were much slower or did not converge at all. We could not find a pattern relating the time to convergence to the different initial parameters for now. Table II compares the four algorithms in terms of time and convergence.

Fig. 3: Four environments where we tested our algorithms: (a) Simple Roller, (b) 3D Ball, (c) Obstacle Roller, (d) USV. Note that the environment of 3D Ball Hard is the same as 3D Ball, except that the ball's velocity is removed from the state.

TABLE II: Count and time to convergence of different algorithms

Algorithm       Parameters   Policies   Converged   Episode*
MDPPO           separated    5          5           200 - 250
MDPPO Sub       separated    5          5           125 - 750
PPO 4 agents    separated    5          0           N/A
PPO 20 agents   separated    10         6           350 - 2k+**
MDPPO           shared       5          5           150 - 250
MDPPO Sub       shared       5          5           125 - 400
PPO 4 agents    shared       5          2           450 - 2k+**
PPO 20 agents   shared       10         9           150 - 2k+**
MDPPOSC         —            5          5           400 - 400
MDPPOSC Sub     —            5          5           200 - 400

* There is no strict criterion for time to convergence, so the episode numbers here are quite subjective, with an error within the range [−20, 20].
** We ran these algorithms for 2000 episodes. They showed a trend of convergence around the 2000th episode but did not actually converge to the optimal reward. We use "2k+" because we believe these algorithms may converge in later episodes.

We also tested the performance of our approach with different combinations of N and M, with the total number of agents set to 20, i.e., N × M = 20. Figure 5a shows the comparison between five combinations. We found that with decreasing N the algorithm becomes more unstable, especially when N = 1, i.e., standard PPO. Besides, even the best-performing run was slower to converge than any of the configurations with N > 1. However, it was also not always good to increase N: in this experiment, a larger N might indeed accelerate the training process, but it would also make the algorithm much more unstable.

Fig. 4: MDPPO (red) and MDPPO with subtraction ratio (green) were tested with 5 policies, each controlling 4 agents. Yellow and blue lines are standard PPO with 4 agents and 20 agents respectively. PPO with 4 agents was tested 5 times; PPO with 20 agents was tested 10 times. The lighter color in the background shows the best and worst performance of each algorithm; the narrower the background, the more stable the algorithm. (a) MDPPO with separated weights; (b) MDPPO with shared weights.

Finally, we tested MDPPO with a separated critic. Simple Roller is actually not a sparse positive reward environment, so we did not set any TD-error threshold. Figure 5b illustrates the results of the algorithm with the two ratio forms. Both showed similarly good performance.

3D Ball. Similar to the Simple Roller experiments, we tested MDPPO and PPO with separated and shared weights, and MDPPOSC. However, 3D Ball has a different reward structure, so the criterion for "complete" trajectories had to be modified. Unlike Simple Roller, where we defined "complete" trajectories as those in which the agent hit the target, there is no such specific goal or target in 3D Ball. What the agent tries to do is keep the ball on the platform as long as it can, until the maximum step is reached. Thus, we modified the definition of "complete" trajectories to the ones that lasted the longest, which we set to 500 steps. Figure 6 shows the results of 3D Ball. Due to the simplicity and full observability of the environment, the performance of our algorithms was quite similar to PPO, but still showed a small improvement in stability.

Fig. 5: (a) Comparison between combinations of N and M: five combinations of the number of policies N and the number of agents per policy M, showing that with fewer policies the algorithm becomes more unstable and slower to converge; however, too many policies may also cause instability, so a moderate number of policies is the better choice in our experiments. (b) MDPPOSC: MDPPO with a separated critic; the performance of the two ratio forms is similar and both outperform standard PPO.

Figure 7 shows that both MDPPO and MDPPOSC outperform PPO on 3D Ball Hard. Although speed of convergence was not our algorithms' advantage in this environment, they showed strong stability compared to PPO. All five policies converged quickly and almost simultaneously.

Obstacle Roller. Due to the sparse extrinsic reward in this scenario, we set the TD-error threshold to 0.2 in MDPPOSC to reduce the large number of negative transitions that may lead the value function to converge to a local optimum too soon, which may cause the agent to learn a terrible policy under the guidance of a false advantage.

Fig. 6: The results of 3D Ball: (a) MDPPO (separated), (b) MDPPO (shared), (c) MDPPOSC.

Fig. 7: The results of 3D Ball Hard: (a) MDPPO (separated), (b) MDPPO (shared), (c) MDPPOSC.

The results of our experiments are shown in Figure 8. Both MDPPO and MDPPOSC had better performance and were more stable, and all policies performed similarly. In contrast, standard PPO, even when training with 20 agents in parallel, could not perform well and had much larger uncertainty: most initial parameters led to very poor policies.

Fig. 8: The results of Obstacle Roller: (a) MDPPO (separated), (b) MDPPO (shared), (c) MDPPOSC.

USV. Figure 9 shows the results of the USV environment. MDPPOSC did not perform well in the USV scenario: only 2 of the 5 fractional-form policies and 4 of the 5 subtraction-form policies converged. However, MDPPO showed robust performance, especially MDPPO with separated weights. Although it is very difficult for a person to control a USV sailing on a dynamic water surface to hit a small target, the USV learned when to speed up or slow down and when to rudder to the right or left according to the new position of the target, so that it could head to the target as soon as possible. We held a match between human control and our trained policy to see which could complete the task faster. The trained USV won: it not only reached the target faster, but also went out of the border less often.

Fig. 9: The results of USV: (a) MDPPO (separated), (b) MDPPO (shared), (c) MDPPOSC.

VI. CONCLUSION

Slowness and instability are two major problems in reinforcement learning, which make it difficult to directly apply RL algorithms to real-world environments. To tackle these two problems, we presented Mixed Distributed Proximal Policy Optimization (MDPPO) and Mixed Distributed Proximal Policy Optimization with Separated Critic (MDPPOSC) in this paper and made three contributions: (i) we proposed a method that distributes not only agents but also policies, where multiple policies train simultaneously and each of them controls multiple identical agents; (ii) transitions are mixed from all policies and fed to each policy for training after elaborate selection, and we defined complete trajectories and auxiliary trajectories to accelerate the training process; (iii) we formulated a practical MDPPO algorithm based on the Unity platform with the Unity ML-Agents Toolkit to simplify the implementation of a distributed reinforcement learning system. Our experiments indicate that MDPPO is much more robust to random initial neural network weights and can accelerate the training process compared with standard PPO. However, because of the limitations of our training equipment, we could not test our algorithms on a real distributed system. Also, the implementation of such a system is quite complex, and the data transmission and mixing may slow down training. We leave this engineering problem for future work.

REFERENCES

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, and others. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
[2] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.


[3] Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, and Csaba Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1204–1212. Curran Associates, Inc.
[4] Sam Michael Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pages 433–440. IFAAMAS, 2012.
[5] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
[6] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. 2018.
[7] Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin Riedmiller, and David Silver. Emergence of locomotion behaviours in rich environments.
[8] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay.
[9] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks.
[10] Arthur Juliani, Vincent-Pierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange. Unity: A general platform for intelligent agents.
[11] Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014.
[12] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.
[13] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning.
[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning.
[15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
[16] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, and others. Massively parallel methods for deep reinforcement learning.
[17] David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actor-critic methods.
[18] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
[19] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay.
[20] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[21] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization.
[22] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation.
[23] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.
[24] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms.
[25] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press.
[26] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[27] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[28] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

APPENDIX

TABLE III: General hyperparameters for MDPPO and MDPPOSC

batch size                   2048
epoch size                   10
entropy coefficient          0.001
epsilon                      0.2
epsilon (subtraction form)   0.02
lambda                       0.99
auxiliary transitions ratio  0.4
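For convenience, the same hyperparameters collected into a Python config dict (a sketch; the key names are ours and do not correspond to the released code):

```python
MDPPO_CONFIG = {
    "batch_size": 2048,
    "epoch_size": 10,
    "entropy_coefficient": 0.001,
    "epsilon": 0.2,                    # clipping bound for the fractional ratio, Eq. (3)
    "epsilon_subtraction_form": 0.02,  # clipping bound for the subtraction ratio, Eq. (7)
    "lambda": 0.99,                    # GAE lambda
    "auxiliary_transitions_ratio": 0.4,
}
```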