
Page 1

Advanced Policy Gradients

CS 285: Deep Reinforcement Learning, Decision Making, and Control

Sergey Levine

Page 2

Class Notes

1. Homework 2 due today (11:59 pm)!
• Don’t be late!

2. Homework 3 comes out this week
• Start early! Q-learning takes a while to run

Page 3

Today’s Lecture

1. Why does policy gradient work?

2. Policy gradient is a type of policy iteration

3. Policy gradient as a constrained optimization

4. From constrained optimization to natural gradient

5. Natural gradients and trust regions

• Goals:
• Understand the policy iteration view of policy gradient

• Understand how to analyze policy gradient improvement

• Understand what natural gradient does and how to use it

Page 4

Recap: policy gradients

generate samples (i.e., run the policy)

fit a model to estimate return

improve the policy

“reward to go”

can also use function approximation here
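For reference, a sketch of the reward-to-go estimator in the course's usual notation (a reconstruction, not copied from the slide):

\[
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\,\hat{Q}_{i,t},
\qquad
\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'},a_{i,t'}).
\]

Replacing \(\hat{Q}_{i,t}\) with an advantage estimate \(\hat{A}(s_{i,t},a_{i,t}) = \hat{Q}_{i,t} - \hat{V}_\phi(s_{i,t})\) from a learned value function is the function-approximation variant noted above.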

Page 5

Why does policy gradient work?

generate samples (i.e., run the policy)

fit a model to estimate return

improve the policy

look familiar?

Page 6

Policy gradient as policy iteration

Page 7

Policy gradient as policy iteration: importance sampling
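A sketch of the identity these two slides develop (the improvement of the new policy equals the expected advantage of the old policy under the new policy's trajectory distribution, with importance sampling applied to the action distribution):

\[
J(\theta') - J(\theta)
= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\!\left[\sum_t \gamma^t A^{\pi_\theta}(s_t, a_t)\right]
= \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)}\!\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\,\gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right].
\]

The inner expectation is now under the old policy \(\pi_\theta\), which we can sample from; only the state distribution \(p_{\theta'}(s_t)\) still depends on the new policy.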

Page 8

Ignoring distribution mismatch??

why do we want this to be true?

is it true? and when?
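Concretely, the approximation in question replaces the new policy's state distribution with the old one:

\[
J(\theta') - J(\theta) \;\approx\; \bar{A}(\theta') \;=\; \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)}\!\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\,\gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right].
\]

We want this to be true because \(\bar{A}(\theta')\) can be estimated and differentiated with respect to \(\theta'\) using only samples from the current policy; the rest of the lecture asks when the approximation is valid.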

Page 9

Bounding the distribution change

seem familiar?

not a great bound, but a bound!
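The bound being referenced, in the same style of argument as the imitation learning analysis: if the policies are close in total variation, the state marginals cannot drift too far per time step.

\[
\text{If } \;|\pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t)| \le \epsilon \;\text{ for all } s_t,
\quad\text{then}\quad
|p_{\theta'}(s_t) - p_\theta(s_t)| \le 2\,\epsilon\, t .
\]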

Page 10

Bounding the distribution change

Proof based on: Schulman, Levine, Moritz, Jordan, Abbeel. “Trust Region Policy Optimization.”

Page 11

Bounding the objective value
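A sketch of the step this slide covers: using the state-marginal bound, any expectation under the new policy's state distribution is lower-bounded by the same expectation under the old one.

\[
\mathbb{E}_{s_t \sim p_{\theta'}(s_t)}[f(s_t)] \;\ge\; \mathbb{E}_{s_t \sim p_{\theta}(s_t)}[f(s_t)] \;-\; 2\,\epsilon\, t \,\max_{s_t} f(s_t).
\]

Applying this to the per-time-step summand of the improvement objective shows that optimizing under \(p_\theta\) optimizes a lower bound on \(J(\theta') - J(\theta)\), with an error term that is linear in \(\epsilon\) (and polynomial in the horizon), so it vanishes as the policies get close.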

Page 12

Where are we at so far?

Page 13

Break

Page 14

A more convenient bound

KL divergence has some very convenient properties that make it much easier to approximate!
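The convenient property is Pinsker's inequality: KL divergence upper-bounds the total variation distance, and KL expectations are easy to estimate and differentiate from samples.

\[
|\pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t)| \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}\!\big(\pi_{\theta'}(a_t \mid s_t)\,\|\,\pi_\theta(a_t \mid s_t)\big)},
\qquad
D_{\mathrm{KL}}(p\,\|\,q) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right].
\]

So bounding the KL divergence bounds the total variation distance used on the previous slides.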

Page 15

How do we optimize the objective?
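The resulting constrained problem, sketched in the lecture's notation:

\[
\theta' \leftarrow \arg\max_{\theta'} \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\!\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t\mid s_t)}\!\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\,\gamma^t A^{\pi_\theta}(s_t,a_t)\right]\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s_t \sim p_\theta(s_t)}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta'}(\cdot\mid s_t)\,\|\,\pi_\theta(\cdot\mid s_t)\big)\right] \le \epsilon .
\]

For small enough \(\epsilon\), this is guaranteed to improve \(J(\theta') - J(\theta)\).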

Page 16

How do we enforce the constraint?

can do this incompletely (for a few grad steps)
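One way to enforce the constraint, which this slide alludes to, is dual gradient descent on the Lagrangian; a sketch:

\[
\mathcal{L}(\theta', \lambda) = \bar{A}(\theta') - \lambda\Big(\mathbb{E}_{s_t \sim p_\theta}\big[D_{\mathrm{KL}}(\pi_{\theta'} \,\|\, \pi_\theta)\big] - \epsilon\Big).
\]

1. Maximize \(\mathcal{L}(\theta', \lambda)\) with respect to \(\theta'\) (this is the step that can be done incompletely, with just a few gradient steps).
2. Update \(\lambda \leftarrow \lambda + \alpha\,(D_{\mathrm{KL}} - \epsilon)\): raise the penalty if the constraint is violated, lower it otherwise.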

Page 17

How (else) do we optimize the objective?

Use a first-order Taylor approximation for the objective (i.e., linearization)
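Linearizing the objective around \(\theta' = \theta\) gives:

\[
\theta' \leftarrow \arg\max_{\theta'} \nabla_\theta \bar{A}(\theta)^{\top} (\theta' - \theta)
\quad \text{s.t.} \quad
\mathbb{E}_{s_t \sim p_\theta}\big[D_{\mathrm{KL}}(\pi_{\theta'} \,\|\, \pi_\theta)\big] \le \epsilon .
\]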

Page 18

How do we optimize the objective?

(see policy gradient lecture for derivation)

exactly the normal policy gradient!
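Why "exactly the normal policy gradient": evaluating the gradient of the importance-sampled objective at \(\theta' = \theta\) makes the ratio equal to one, leaving the standard estimator.

\[
\nabla_{\theta'} \bar{A}(\theta')\Big|_{\theta'=\theta}
= \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\!\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t\mid s_t)}\!\left[\gamma^t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, A^{\pi_\theta}(s_t,a_t)\right]\right]
= \nabla_\theta J(\theta).
\]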

Page 19

Can we just use the gradient then?

Page 20

Can we just use the gradient then?

not the same!

second order Taylor expansion
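The mismatch in a nutshell: plain gradient ascent solves the linearized problem inside a Euclidean ball around the current parameters, while the KL constraint, expanded to second order around \(\theta' = \theta\), is a quadratic form in the Fisher information matrix; these are not the same constraint set.

\[
\|\theta' - \theta\|^2 \le \epsilon \;\;\text{(gradient ascent)}
\qquad \text{vs.} \qquad
D_{\mathrm{KL}}(\pi_{\theta'} \,\|\, \pi_\theta) \approx \tfrac{1}{2}(\theta' - \theta)^{\top} \mathbf{F}\, (\theta' - \theta),
\quad
\mathbf{F} = \mathbb{E}_{\pi_\theta}\!\big[\nabla_\theta \log \pi_\theta(a\mid s)\, \nabla_\theta \log \pi_\theta(a\mid s)^{\top}\big].
\]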

Page 21

Can we just use the gradient then?

natural gradient
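Solving the linearized objective under the quadratic KL constraint gives the natural gradient update:

\[
\theta' = \theta + \alpha\, \mathbf{F}^{-1} \nabla_\theta J(\theta),
\qquad
\alpha = \sqrt{\frac{2\epsilon}{\nabla_\theta J(\theta)^{\top} \mathbf{F}^{-1} \nabla_\theta J(\theta)}},
\]

where choosing \(\alpha\) this way makes the second-order KL approximation hit the constraint \(\epsilon\) exactly; the classic natural policy gradient instead treats \(\alpha\) as a tunable learning rate.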

Page 22

Is this even a problem in practice?

(image from Peters & Schaal 2008)

Essentially the same problem as this:

(figure from Peters & Schaal 2008)

Page 23

Practical methods and notes

• Natural policy gradient
• Generally a good choice to stabilize policy gradient training

• See this paper for details:
• Peters, Schaal. Reinforcement learning of motor skills with policy gradients.

• Practical implementation: requires efficient Fisher-vector products, a bit non-trivial to do without computing the full matrix (see the sketch after this list)
• See: Schulman et al. Trust region policy optimization

• Trust region policy optimization

• Just use the IS objective directly
• Use regularization to stay close to old policy

• See: Proximal policy optimization
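A minimal sketch of the Fisher-vector product trick mentioned in the list above, written with PyTorch double backprop; the function names, damping term, and CG iteration count are illustrative choices, not taken from the slides.

import torch

def fisher_vector_product(kl, params, v, damping=1e-2):
    # Computes (F + damping*I) v, where F is the Fisher information matrix,
    # as the Hessian-vector product of the mean KL divergence between the
    # current policy and a detached copy of itself (their Hessians coincide
    # at the current parameters). F is never formed explicitly.
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * v).sum()
    hvp = torch.autograd.grad(grad_v, params, retain_graph=True)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v

def conjugate_gradient(fvp_fn, g, iters=10, tol=1e-10):
    # Approximately solves F x = g using only Fisher-vector products.
    x = torch.zeros_like(g)
    r = g.clone()
    p = g.clone()
    rs_old = r.dot(r)
    for _ in range(iters):
        Fp = fvp_fn(p)
        alpha = rs_old / (p.dot(Fp) + 1e-12)
        x = x + alpha * p
        r = r - alpha * Fp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

Usage sketch: with g the flattened policy gradient and kl the mean KL between the current policy and a detached copy of it on a batch of states, conjugate_gradient(lambda v: fisher_vector_product(kl, params, v), g) returns an approximate natural gradient direction F^{-1} g.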

Page 24

generate samples (i.e., run the policy)

fit a model to estimate return

improve the policy

Review

• Policy gradient = policy iteration

• Optimize advantage under new policy state distribution

• Using old policy state distribution optimizes a bound, if the policies are close enough

• Results in constrained optimization problem

• First order approximation to objective = gradient ascent

• Regular gradient ascent has the wrong constraint, use natural gradient

• Practical algorithms

• Natural policy gradient

• Trust region policy optimization