TRANSCRIPT
Reinforcement Learning
Lecture 8: Policy gradient methods
Chris G. Willcocks, Durham University
Lecture overview
Lecture covers chapter 13 in Sutton & Barto [1] and examples from David Silver [2]
1. Policy-based methods: definition, characteristics, deterministic vs stochastic policies
2. Policy gradients: gradient-based estimator, Monte Carlo REINFORCE
3. Actor-critic methods: definition, algorithm, extensions
Policy-based methods introduction
Definition: policy-based methods
Last week, we used a function approximator to estimate the value function:
v(s, w) ≈ v_π(s),
and for control we estimated Q:
q(s, a, w) ≈ q_π(s, a).
This week we will estimate policies directly:
π_θ(a|s) = P(a | s, θ)
Given a state, what’s the distribution over actions?
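As a concrete illustration (a sketch, not code from the lecture), here is a minimal PyTorch parameterised policy for a discrete action space: a small network maps a state to a categorical (softmax) distribution over actions. The class name, layer sizes and hidden width are arbitrary assumptions.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state s to the distribution π_θ(a|s) over discrete actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        # softmax over the logits gives P(a | s, θ)
        return torch.distributions.Categorical(logits=logits)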
Example: what’s the optimal policy?
Policy-based methods characteristics
Policy-based RL characteristics
This approach has the following advantages:
• Can be more efficient than calculating the value function
• Better convergence guarantees
• Effective in high-dimensional or continuous action spaces
• Can learn stochastic policies
And the following disadvantages:
• Converges on a local rather than the global optimum
• Inefficient policy evaluation with high variance
Example: continuous action spaces
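For the continuous case, a common parameterisation (matching the Gaussian option mentioned in the gradient-estimator slide later) is a Gaussian policy whose mean comes from the network. A minimal sketch; the state-independent learned log standard deviation is just one possible design choice, and all names are placeholders.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """π_θ(a|s) = N(μ_θ(s), σ²) for continuous actions."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        # learned log standard deviation, shared across states (a design choice)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.mu(state)
        return torch.distributions.Normal(mean, self.log_std.exp())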
Policy-based methods deterministic vs stochastic policies
Example: deterministic vs stochastic policies
Deterministic policy for feature vectors describing the walls around a state:
a = π(s, θ)
Stochastic policy:
a ∼ π(s, θ)
Example from [2].
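In code the distinction is simply taking the most probable action versus sampling one. A small sketch, assuming the categorical PolicyNetwork from the earlier sketch and some state tensor:

dist = policy(state)        # π(·|s, θ)

# deterministic: a = π(s, θ), always the single most probable action
a_det = dist.probs.argmax()

# stochastic: a ∼ π(s, θ), so identical (aliased) feature vectors need not
# always receive the same action
a_stoch = dist.sample()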
Policy gradients gradient estimators
Definition: gradient estimators
While we could optimise θ for non-differentiable functions using approaches such as genetic algorithms or hill climbing, ideally we want to use a gradient-based estimator:
L^{PG}(θ) = E_t[ ∇_θ log π_θ(a_t|s_t) A_t ],
where A_t is an estimate of the ‘advantage’ (the difference between the return and the state value; you could also replace A_t with q(s, a), at the cost of higher variance). The expectation E_t is an empirical average over a finite batch of samples [3]. Typically π follows a categorical distribution (softmax), or a Gaussian for continuous action spaces.
Therefore we empirically follow the gradient that maximises the likelihood of the actions that give the most advantage.
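A hedged PyTorch sketch of this estimator: given a batch of states, the actions that were taken, and advantage estimates, the surrogate objective below has the gradient above, so a standard optimiser minimising its negative follows the policy gradient. How the advantages are computed is left open; all names are placeholders.

import torch

def policy_gradient_loss(policy, states, actions, advantages):
    # L^{PG}(θ) = E_t[ log π_θ(a_t|s_t) · A_t ], negated for a minimiser
    dist = policy(states)                  # batched π_θ(·|s_t)
    log_probs = dist.log_prob(actions)     # log π_θ(a_t|s_t)
    # advantages are treated as fixed samples: no gradient flows through A_t
    return -(log_probs * advantages.detach()).mean()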
Policy gradients Monte Carlo REINFORCE
Definition: Monte Carlo REINFORCE
REINFORCE estimates the return in the previous equation using a Monte Carlo estimate [4]:
• Initialise arbitrary parameters θ
• Iteratively sample episodes
• Calculate the complete return from each step
• For each step, update θ in the direction of the gradient times the sampled return
Algorithm: Monte Carlo REINFORCE
PyTorch example:
# initialise θ with random values
π = PolicyNetwork(θ)
while True:
    # sample an episode following π
    S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T ∼ π
    for t in range(T):
        G_t ← Σ_{k=t+1}^{T} γ^{k−t−1} R_k
        θ ← θ + α γ^t G_t ∇_θ ln π(A_t|S_t, θ)
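Below is a minimal runnable REINFORCE sketch in the spirit of this algorithm. It assumes the gymnasium package with its CartPole-v1 environment and the PolicyNetwork sketch from earlier; the hyperparameters are arbitrary, and (as many implementations do) it drops the γ^t factor in front of the update.

import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
policy = PolicyNetwork(state_dim=4, n_actions=2)    # from the earlier sketch
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    # sample one episode S_0, A_0, R_1, ..., R_T following π
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = policy(torch.as_tensor(state, dtype=torch.float32))
        action = dist.sample()
        state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    # complete return G_t = Σ_{k=t+1}^{T} γ^{k−t−1} R_k for every step
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # gradient ascent on Σ_t G_t ln π(A_t|S_t, θ), via a negated loss
    loss = -(torch.stack(log_probs) * returns).sum()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()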
Actor-critic methods introduction
Definition: actor-critic methods
We combine policy gradients with action-value function approximation, using two models that may (optionally) share parameters:
• We use a critic to estimate the Q values: q_w(s, a) ≈ q_{π_θ}(s, a)
• We use an actor to update the policy parameters θ in the direction suggested by the critic.
Example: Actor critic
Image from freecodecamp.org
Actor-critic methods algorithm
Definition: actor-critic
Putting this together, actor-critic methods use an approximate policy gradient to adjust the actor policy in the direction that maximises the reward according to the critic:
θ ← θ + α ∇_θ log π_θ(a|s) q_w(s, a)
Algorithm: Actor-Critic (PyTorch)
# initialise s, θ, w randomly
# sample a ∼ π_θ(a|s)
for t in range(T):
    sample r_t and s′ from environment(s, a)
    sample a′ ∼ π_θ(a′|s′)
    θ ← θ + α q_w(s, a) ∇_θ ln π_θ(a|s)     # update actor
    δ_t = r_t + γ q_w(s′, a′) − q_w(s, a)   # TD error
    w ← w + α δ_t ∇_w q_w(s, a)             # update critic
    a ← a′, s ← s′
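A hedged PyTorch sketch of one step of this loop. It assumes a hypothetical QNetwork critic that outputs one Q-value per discrete action (not defined in the lecture), the categorical actor from the earlier sketch, and optimisers in place of the explicit α-step update rules.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Critic q_w(s, ·): one Q-value per discrete action (a hypothetical helper)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      s, a, r, s_next, a_next, gamma=0.99):
    """One update; states are float tensors, actions are scalar long tensors."""
    q_sa = critic(s)[a]                               # q_w(s, a)
    q_next = critic(s_next)[a_next].detach()          # q_w(s′, a′), no gradient through the target
    td_error = (r + gamma * q_next - q_sa).detach()   # δ_t, treated as a constant

    # actor: θ ← θ + α q_w(s, a) ∇_θ ln π_θ(a|s)
    actor_loss = -(q_sa.detach() * actor(s).log_prob(a))
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # critic: w ← w + α δ_t ∇_w q_w(s, a)   (semi-gradient TD(0) update)
    critic_loss = -(td_error * q_sa)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()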
Extensions
Extensions
This lecture has introduced the foundations; hopefully you now have a good platform to read about the extensions.
1. Recommended further study (papers & code)
2. Recommended further study (theory & STAR)
Recommended extensions include:
• Advantage actor-critic (A3C & A2C) [5]
• Experience replay & prioritised replay [6]
• Proximal policy optimisation [3]
• Rainbow (combining extensions) [7]
Take Away Points
Summary
In summary:
• Policy gradients open up many new extensions
• Choose extensions that reduce variance to stabilise training
• Consider regularisation to encourage exploration
• Going off-policy gives better exploration
• It’s possible for the actor and critic to share some lower-layer parameters, but be careful about it
• Experience replay can increase sample efficiency (where simulation is expensive)
References I
[1] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction (second edition). Available online. MIT Press, 2018.
[2] David Silver. Reinforcement Learning lectures. https://www.davidsilver.uk/teaching/. 2015.
[3] John Schulman et al. “Proximal policy optimization algorithms”. In: arXiv preprint arXiv:1707.06347 (2017).
[4] Ronald J Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine Learning 8.3-4 (1992), pp. 229–256.
[5] Volodymyr Mnih et al. “Asynchronous methods for deep reinforcement learning”. In: International Conference on Machine Learning. 2016, pp. 1928–1937.
[6] Ziyu Wang et al. “Sample efficient actor-critic with experience replay”. In: arXiv preprint arXiv:1611.01224 (2016).
References II
[7] Matteo Hessel et al. “Rainbow: Combining improvements in deep reinforcement learning”. In: arXiv preprint arXiv:1710.02298 (2017).
[8] Shixiang Gu et al. “Q-prop: Sample-efficient policy gradient with an off-policy critic”.In: arXiv preprint arXiv:1611.02247 (2016).