Deep RL in continuous action spaces (ETH Zürich)
TRANSCRIPT
Deep RL in continuous action spaces
On-policy setup
-- Samriddhi Jain
Discrete action environments
● Atari, Board games
Continuous action environments
● MuJoCo (Multi-Joint dynamics with Contact) environments
Continuous action environments
● Simulated goal-based tasks for the Fetch and ShadowHand robots
https://openai.com/blog/ingredients-for-robotics-research/
Handling continuous actions
State → Policy Network → Distribution Parameters → Sampling → Control
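The flow above can be sketched in code (a minimal NumPy illustration; the linear "network", the 3-dim state, 2-dim action, and fixed log-std are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_network(state, W_mu, b_mu, log_std):
    """Map a state to the parameters of a diagonal Gaussian over
    continuous actions: a mean per action dimension, plus a std."""
    mu = state @ W_mu + b_mu      # distribution parameters
    std = np.exp(log_std)         # state-independent std, for simplicity
    return mu, std

def sample_action(mu, std):
    """Sampling step: draw the control from the Gaussian."""
    return mu + std * rng.standard_normal(mu.shape)

# Hypothetical 3-dim state and 2-dim continuous action space.
state = np.array([0.1, -0.3, 0.5])
W_mu, b_mu = np.zeros((3, 2)), np.zeros(2)
log_std = np.log(0.5) * np.ones(2)

mu, std = policy_network(state, W_mu, b_mu, log_std)
action = sample_action(mu, std)   # the control sent to the environment
```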
Policy Gradients: review
Pitfalls:
● Sample inefficiency
Policy Gradients: step size and convergence
Policy Gradients: pitfalls
● Relation between policy space and model parameter space?
Research works covered
● Approximately Optimal Approximate Reinforcement Learning, Kakade and
Langford 2002
● Trust Region Policy Optimization, Schulman et al. 2015
● Proximal Policy Optimization Algorithms, Schulman et al. 2017
Surrogate Loss
● How to improve sample efficiency?
○ Reuse trajectories from other policies via importance sampling
● The gradients of the two objectives are the same at the current policy
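A sketch of the importance-sampled surrogate (hypothetical names; log-probabilities stand in for the policy densities):

```python
import numpy as np

def surrogate_loss(logp_new, logp_old, advantages):
    """Surrogate objective: E[(pi_new / pi_old) * A], estimated on
    trajectories collected under the old policy."""
    ratio = np.exp(logp_new - logp_old)
    return np.mean(ratio * advantages)

# At the current policy (logp_new == logp_old) the ratio is 1, so the
# surrogate equals the plain objective and their gradients coincide.
logp_old = np.array([-1.2, -0.7, -2.0])
adv = np.array([0.5, -0.2, 1.1])
same = surrogate_loss(logp_old, logp_old, adv)
```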
Trust Region Methods
Search in a trusted region
https://reference.wolfram.com/language/tutorial/UnconstrainedOptimizationTrustRegionMethods.html
https://optimization.mccormick.northwestern.edu/index.php/Trust-region_methods
Line search vs Trust region search
New loss
● Constrained:
● Penalty:
In practice, 𝛽 is harder to tune than δ.
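Written out (a reconstruction in standard TRPO notation, since the slide's equations did not survive extraction):

```latex
% Constrained form:
\max_\theta \; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
  \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, A(s,a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_s\!\left[ D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\right)\right] \le \delta

% Penalty form:
\max_\theta \; \mathbb{E}
  \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, A(s,a)
  \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\right)\right]
```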
Minorize-Maximization (MM) theorem
By Kakade and Langford (2002)
Local approximation
Sample based estimation
Natural Policy Gradients
● Approximate using a Taylor expansion:
The loss reduces to its first-order term and the constraint to its second-order term
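Concretely (a reconstruction in standard natural-gradient notation, with g the gradient of the surrogate and H the KL Hessian, i.e. the Fisher information matrix, at θ_old):

```latex
L(\theta) \approx g^\top (\theta - \theta_{\text{old}}), \qquad
\bar{D}_{\mathrm{KL}}(\theta_{\text{old}}, \theta) \approx
  \tfrac{1}{2}\, (\theta - \theta_{\text{old}})^\top H\, (\theta - \theta_{\text{old}})

% Solving the resulting quadratic program gives the natural gradient step:
\theta = \theta_{\text{old}} + \sqrt{\frac{2\delta}{g^\top H^{-1} g}}\; H^{-1} g
```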
Natural Policy Gradients
Computationally very expensive: inverting the N×N Fisher information matrix is O(N³)
Truncated Natural Policy Gradient
● Solutions:
○ Use a Kronecker-factored approximation of the curvature: ACKTR
○ Use conjugate gradients to compute H⁻¹g without explicitly inverting H: Truncated Natural Policy Gradient
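A minimal conjugate-gradient sketch (generic textbook CG, not the authors' implementation; the Hessian-vector-product callback `hvp` is a hypothetical stand-in for the Fisher-vector product):

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Solve H x = g for x = H^{-1} g using only Hessian-vector
    products, never forming or inverting H itself."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - H x (x = 0 initially)
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy check with an explicit SPD matrix standing in for the FIM.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: H @ v, g)
```

In TRPO the `hvp` callback is implemented with two backward passes through the KL divergence, so only matrix-vector products are ever materialized.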
Line search for TRPO
● With the quadratic approximations, the updated policy may violate the KL constraint
● Solution: enforce the constraint with a backtracking line search that decays the step size exponentially
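A sketch of the backtracking step (hypothetical helper names; `surrogate` and `kl` stand in for surrogate improvement and KL divergence as functions of the proposed step):

```python
import numpy as np

def backtracking_line_search(step, surrogate, kl, delta,
                             decay=0.8, max_backtracks=10):
    """Shrink the natural-gradient step exponentially until it both
    improves the surrogate and satisfies the KL constraint."""
    for i in range(max_backtracks):
        scaled = (decay ** i) * step
        if surrogate(scaled) > 0.0 and kl(scaled) <= delta:
            return scaled
    return np.zeros_like(step)   # reject the update entirely

# Toy 1-D check: improvement is linear in the step, "KL" quadratic.
step = np.array([1.0])
found = backtracking_line_search(
    step,
    surrogate=lambda s: float(s[0]),   # any positive step improves
    kl=lambda s: float(s[0] ** 2),     # grows quadratically
    delta=0.5,
)
```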
Trust Region Policy Optimization
Results
Issues with TRPO
● Doesn’t work well with CNNs and RNNs
● Scalability
● Complexity
Proximal Policy Optimization (PPO)
● Same motivation as TRPO
● Uses first order derivative solutions
● Designed to be simpler to implement
● Two versions:
○ PPO Penalty
○ PPO Clip
https://openai.com/blog/openai-baselines-ppo/
PPO adaptive KL penalty
● Considers the unconstrained objective
● Updates 𝛽_k between iterations:
○ If d < d_targ / 1.5, 𝛽_k ← 𝛽_k / 2
○ If d > d_targ × 1.5, 𝛽_k ← 𝛽_k × 2
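The update rule above, as code (variable names are assumptions):

```python
def update_beta(beta, d, d_targ):
    """Adaptive-KL rule: halve beta when the measured KL d is well
    below target (penalty too strong), double it when well above."""
    if d < d_targ / 1.5:
        beta = beta / 2
    elif d > d_targ * 1.5:
        beta = beta * 2
    return beta
```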
PPO Clip objective
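The clipped objective from the cited PPO paper, L^CLIP = E[min(r_t A_t, clip(r_t, 1−ε, 1+ε) A_t)] with r_t = π_new/π_old, sketched in NumPy (a minimal illustration, not the slide's exact content):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """min(r*A, clip(r, 1-eps, 1+eps)*A): clipping removes the
    incentive to push the ratio r outside [1-eps, 1+eps]."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))

# ratio = 2 with positive advantage: the gain is capped at 1 + eps.
capped = ppo_clip_objective(np.log(2.0) * np.ones(1), np.zeros(1),
                            np.ones(1))
```

Taking the min makes the bound pessimistic: a large ratio with negative advantage is never clipped away, only over-optimistic updates are.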
Surrogate objectives
PPO actor pseudocode
Sample efficient version of PPO
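A runnable toy version of such an actor loop (NOT the slide's pseudocode: the 1-D Gaussian policy, the reward −a², and every hyperparameter are assumptions chosen so the sketch runs stand-alone):

```python
import numpy as np

rng = np.random.default_rng(0)

def ppo_actor_loop(n_iters=2, horizon=8, epochs=3, eps=0.2, lr=0.1):
    """PPO actor skeleton: collect a batch under the current policy,
    then reuse the same batch for several clipped-surrogate epochs."""
    mu = 1.0  # policy parameter: mean of a unit-variance action Gaussian
    for _ in range(n_iters):
        # 1. Rollout with the old policy; freeze its log-probs.
        actions = mu + rng.standard_normal(horizon)
        rewards = -actions ** 2                  # toy reward
        adv = rewards - rewards.mean()           # crude baseline
        logp_old = -0.5 * (actions - mu) ** 2
        # 2. Several gradient epochs on the same batch (sample reuse).
        for _ in range(epochs):
            logp = -0.5 * (actions - mu) ** 2
            ratio = np.exp(logp - logp_old)
            clipped = np.clip(ratio, 1 - eps, 1 + eps)
            # Gradient of min(r*A, clip(r)*A) w.r.t. mu; the clipped
            # branch contributes zero gradient.
            active = ratio * adv <= clipped * adv
            grad = np.mean(active * ratio * adv * (actions - mu))
            mu = mu + lr * grad
    return mu
```

The sample-efficiency gain over vanilla policy gradients comes from the inner loop: each batch is used for several updates instead of one.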
Results
Discussion
● Implementation matters
Implementation Matters in Deep RL: A Case Study on PPO and TRPO, https://openreview.net/forum?id=r1etN1rtPB
References
● Trust Region Policy Optimization, Schulman et al. 2015 (https://arxiv.org/pdf/1502.05477.pdf)
● Constrained Policy Optimization, Achiam et al. 2017 (https://arxiv.org/pdf/1705.10528.pdf)
● Proximal Policy Optimization Algorithms, Schulman et al. 2017 (https://arxiv.org/pdf/1707.06347.pdf)
● Spinning Up in Deep RL (https://spinningup.openai.com/en/latest/index.html)
● UC Berkeley Deep RL course (http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf)
● Deep RL Bootcamp (https://sites.google.com/view/deep-rl-bootcamp/lectures)
● Deep RL Series (https://medium.com/@jonathan_hui/rl-trust-region-policy-optimization-trpo-part-2-f51e3b2e373a)
● https://towardsdatascience.com/the-pursuit-of-robotic-happiness-how-trpo-and-ppo-stabilize-policy-gradient-methods-545784094e3b
Questions?