
Page 1: Value Function Methods (rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-7.pdf)

Value Function Methods

CS 285: Deep Reinforcement Learning, Decision Making, and Control

Sergey Levine

Page 2

Class Notes

1. Homework 2 is due in one week (next Monday)

2. Remember to start forming final project groups and writing your proposal!
  • Proposal due 9/25, this Wednesday!

Page 3

Today’s Lecture

1. What if we just use a critic, without an actor?

2. Extracting a policy from a value function

3. The Q-learning algorithm

4. Extensions: continuous actions, improvements

• Goals:
  • Understand how value functions give rise to policies
  • Understand the Q-learning algorithm
  • Understand practical considerations for Q-learning

Page 4

Recap: actor-critic

[diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

Page 5

Can we omit policy gradient completely?

forget policies, let’s just do this!

[diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

Page 6

Policy iteration

High level idea:

[diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

how to do this?
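A sketch of the two alternating steps, in standard policy iteration notation (V^π and the advantage A^π are as defined in the actor-critic lecture; how to evaluate A^π is the question the next slides answer):

$$ \text{evaluate:}\quad A^{\pi}(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'\mid s,a)}\!\left[V^{\pi}(s')\right] - V^{\pi}(s) $$

$$ \text{improve:}\quad \pi'(a\mid s) = \mathbb{1}\!\left[a = \arg\max_{a} A^{\pi}(s,a)\right] $$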

Page 7

Dynamic programming

[figure: tabular example with a value estimate in each state (values between 0.2 and 0.7)]

just use the current estimate here
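The update behind "just use the current estimate here" is the bootstrapped policy evaluation recursion for the tabular, known-dynamics case; a sketch in the usual notation:

$$ V^{\pi}(s) \leftarrow \mathbb{E}_{a\sim\pi(a\mid s)}\!\left[ r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'\mid s,a)}\!\left[ V^{\pi}(s') \right] \right] $$

For a deterministic policy π(s), this simplifies to V^π(s) ← r(s, π(s)) + γ E_{s'}[V^π(s')], with the current table of estimates plugged in on the right-hand side.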

Page 8

Policy iteration with dynamic programming

[diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

[figure: tabular example with a value estimate in each state (values between 0.2 and 0.7)]
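A minimal tabular sketch of this combination, assuming a known reward array R[s, a] and transition tensor P[s, a, s'] (both hypothetical names), with bootstrapped sweeps for evaluation and a greedy argmax for improvement:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, n_outer=100, n_eval_sweeps=50):
    """Tabular policy iteration. P: [S, A, S] transition probs, R: [S, A] rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    pi = np.zeros(S, dtype=int)                    # deterministic policy: state -> action
    for _ in range(n_outer):
        # policy evaluation: bootstrapped sweeps using the current estimate of V
        for _ in range(n_eval_sweeps):
            V = R[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ V)
        # policy improvement: greedy with respect to the evaluated values
        Q = R + gamma * np.einsum('sap,p->sa', P, V)
        pi = np.argmax(Q, axis=1)
    return V, pi
```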

Page 9

Even simpler dynamic programming

approximates the new value!

[diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]
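The simplification on this slide: since the improved policy is the greedy argmax, its value at each state is just the max over actions of the bootstrapped estimate, so an explicit policy can be skipped entirely. A sketch of the resulting (tabular) value iteration update:

$$ V(s) \leftarrow \max_{a}\left( r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'\mid s,a)}\!\left[ V(s') \right] \right) $$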

Page 10

Fitted value iteration

[diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

curse of dimensionality
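With a large or continuous state space, a table over all states is infeasible (the curse of dimensionality), so the value is represented by a parametric model V_φ and fit by regression. A sketch of the two alternating steps, using the same max-over-actions target as above:

$$ y_i \leftarrow \max_{a_i}\left( r(s_i,a_i) + \gamma\,\mathbb{E}\!\left[ V_{\phi}(s_i') \right] \right), \qquad \phi \leftarrow \arg\min_{\phi}\; \tfrac{1}{2}\sum_i \left\| V_{\phi}(s_i) - y_i \right\|^2 $$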

Page 11

What if we don’t know the transition dynamics?

need to know outcomes for different actions!

Back to policy iteration…

can fit this using samples
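The max in the previous step requires knowing the outcome of every action from the same state, which is what the dynamics provide. Evaluating Q^π instead of V^π removes that requirement: the expectation over the next state can be approximated with the single sampled s' in each observed transition. A sketch:

$$ Q^{\pi}(s,a) \leftarrow r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'\mid s,a)}\!\left[ Q^{\pi}\!\big(s', \pi(s')\big) \right] \;\approx\; r(s_i,a_i) + \gamma\, Q^{\pi}\!\big(s_i', \pi(s_i')\big) $$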

Page 12

Can we do the “max” trick again?

doesn’t require simulation of actions!

+ works even for off-policy samples (unlike actor-critic)

+ only one network, no high-variance policy gradient

- no convergence guarantees for non-linear function approximation (more on this later)

forget policy, compute value directly

can we do this with Q-values also, without knowing the transitions?

Page 13

Fitted Q-iteration
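The algorithm box on this slide did not survive extraction; below is a minimal sketch of batch fitted Q-iteration under common assumptions, using a hypothetical regressor interface q_func.fit(states, actions, targets) / q_func.predict(states, actions) and a fixed dataset of transitions:

```python
import numpy as np

def fitted_q_iteration(q_func, dataset, n_actions, gamma=0.99, k_iters=10):
    """Batch fitted Q-iteration on a fixed dataset of (s, a, r, s') transitions.

    q_func is a hypothetical regressor exposing .fit(s, a, y) and .predict(s, a).
    """
    s, a, r, s_next = dataset                        # arrays of length N
    for _ in range(k_iters):
        # 1. compute targets y_i = r_i + gamma * max_a' Q(s'_i, a')
        q_next = np.stack(
            [q_func.predict(s_next, np.full(len(r), a2)) for a2 in range(n_actions)],
            axis=1)                                   # shape [N, n_actions]
        y = r + gamma * q_next.max(axis=1)
        # 2. regress Q(s_i, a_i) onto the fixed targets y_i
        q_func.fit(s, a, y)
    return q_func
```

Note that the dataset never changes inside the loop; this is the property the next slides use when asking why the method is off-policy.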

Page 14

Review

[diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

• Value-based methods
  • Don't learn a policy explicitly
  • Just learn a value or Q-function
  • If we have a value function, we have a policy

• Fitted Q-iteration

Page 15

Break

Page 16

Why is this algorithm off-policy?

dataset of transitions: the target r_i + γ max_{a'} Q_φ(s_i', a') depends only on the stored (s_i, a_i, s_i', r_i), not on the policy that collected it, so the same data can be reused as the policy changes

Fitted Q-iteration

Page 17

What is fitted Q-iteration optimizing?

most guarantees are lost when we leave the tabular case (e.g., when we use neural network function approximation)
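In the usual notation, step 2 of each iteration minimizes the (empirical) Bellman error over the distribution β of sampled state-action pairs; a sketch:

$$ \varepsilon = \tfrac{1}{2}\,\mathbb{E}_{(s,a)\sim\beta}\!\left[ \Big( Q_{\phi}(s,a) - \big[ r(s,a) + \gamma\max_{a'} Q_{\phi}(s',a') \big] \Big)^{2} \right] $$

If ε = 0, then Q_φ equals the optimal Q-function Q* (and the corresponding greedy policy is optimal); this is what the tabular convergence argument relies on.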

Page 18

Online Q-learning algorithms

[diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

off policy, so many choices here!
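A sketch of the standard online Q-learning step (one transition per update), onto which the three boxes above map:

$$ \begin{aligned} &1.\ \text{take an action } a_i \text{ and observe } (s_i, a_i, s_i', r_i) \\ &2.\ y_i = r_i + \gamma\max_{a'} Q_{\phi}(s_i', a') \\ &3.\ \phi \leftarrow \phi - \alpha\,\frac{dQ_{\phi}}{d\phi}(s_i,a_i)\,\big( Q_{\phi}(s_i,a_i) - y_i \big) \end{aligned} $$

Step 1 is where "off policy, so many choices here" applies: the action used to collect data need not come from the greedy policy, which is the topic of the next slide.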

Page 19

Exploration with Q-learning

We’ll discuss exploration in more detail in a later lecture!

“epsilon-greedy”

final policy: the greedy argmax with respect to the learned Q-function

why is acting with this greedy policy a bad idea for step 1 (collecting data)?

“Boltzmann exploration”
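A minimal sketch of the two exploration rules named above, assuming q_values is the vector Q_φ(s, ·) for the current state (epsilon and temperature are hypothetical parameter names):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a uniformly random action, else the greedy one."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(logits - logits.max())        # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))
```

Either rule addresses the question above: acting purely greedily during data collection keeps revisiting whatever currently looks best, so the agent may never gather the experience needed to correct an inaccurate Q estimate.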

Page 20

Review

[diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

• Value-based methods
  • Don't learn a policy explicitly
  • Just learn a value or Q-function
  • If we have a value function, we have a policy

• Fitted Q-iteration
  • Batch mode, off-policy method

• Q-learning
  • Online analogue of fitted Q-iteration

Page 21

Value function learning theory

[figure: tabular example with a value estimate in each state (values between 0.2 and 0.7)]

Page 22

Value function learning theory

[figure: tabular example with a value estimate in each state (values between 0.2 and 0.7)]
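The tabular result on these two slides, stated in standard notation: define the Bellman backup operator B; it is a contraction in the infinity norm and the optimal value V* is its fixed point, so repeated backups converge. A sketch:

$$ (\mathcal{B}V)(s) = \max_{a}\left( r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'\mid s,a)}\!\left[ V(s') \right] \right), \qquad \left\| \mathcal{B}V - \mathcal{B}\bar{V} \right\|_{\infty} \le \gamma\left\| V - \bar{V} \right\|_{\infty} $$

Since BV* = V*, each backup shrinks the distance to V* by at least a factor of γ in the infinity norm.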

Page 23

Non-tabular value function learning

Page 24

Non-tabular value function learning

Conclusions:
• value iteration converges (tabular case)
• fitted value iteration does not converge
  • not in general
  • often not in practice
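The reason the guarantee breaks, in the usual notation: fitted value iteration composes the backup B with a projection Π onto the set Ω of value functions representable by the approximator. A sketch:

$$ V \leftarrow \Pi\mathcal{B}V, \qquad \Pi V = \arg\min_{V'\in\Omega}\; \sum_{s}\left\| V'(s) - V(s) \right\|^{2} $$

B is a contraction in the infinity norm and Π is a contraction in the L2 norm, but the composition ΠB is not in general a contraction in either norm, so the iteration can move away from V* and even diverge.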

Page 25

What about fitted Q-iteration?

Applies also to online Q-learning

Page 26

But… it’s just regression!

Q-learning is not gradient descent!

no gradient through target value
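One way to see the point in code: a minimal PyTorch-style sketch (q_net is an assumed Q-network module mapping a batch of states to per-action values) in which the target is explicitly treated as a constant, exactly as the Q-learning update does. Because the target also depends on φ but receives no gradient, the update is not the true gradient of any fixed objective:

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, s, a, r, s_next, gamma=0.99):
    """Semi-gradient Q-learning loss: no gradient flows through the target value."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q_phi(s_i, a_i)
    with torch.no_grad():                                       # target treated as a constant
        target = r + gamma * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)
```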

Page 27

A sad corollary

An aside regarding terminology

Page 28

Review

[diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

• Value iteration theory
  • Linear operator for backup
  • Linear operator for projection
  • Backup is a contraction
  • Value iteration converges

• Convergence with function approximation
  • Projection is also a contraction
  • Projection + backup is not a contraction
  • Fitted value iteration does not converge in general

• Implications for Q-learning
  • Q-learning, fitted Q-iteration, etc. do not converge with function approximation

• But we can make it work in practice!
  • Sometimes: tune in next time