Page 1: Lecture 5: Value Function Approximation

Lecture 5: Value Function Approximation

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2019

The value function approximation structure for today closely follows much of David Silver’s Lecture 6.

Page 2: Lecture 5: Value Function Approximation

Table of Contents

1 Introduction

2 VFA for Prediction

3 Control using Value Function Approximation

Page 3: Lecture 5: Value Function Approximation

Class Structure

Last time: Control (making decisions) without a model of how the world works

This time: Value function approximation

Next time: Deep reinforcement learning

Page 4: Lecture 5: Value Function Approximation

Last time: Model-Free Control

Last time: how to learn a good policy from experience

So far, have been assuming we can represent the value function or state-action value function as a vector/matrix

Tabular representation

Many real world problems have enormous state and/or action spaces

Tabular representation is insufficient

Page 5: Lecture 5: Value Function Approximation

Recall: Reinforcement Learning Involves

Optimization

Delayed consequences

Exploration

Generalization

Page 6: Lecture 5: Value Function Approximation

Today: Focus on Generalization

Optimization

Delayed consequences

Exploration

Generalization

Page 7: Lecture 5: Value Function Approximation

Table of Contents

1 Introduction

2 VFA for Prediction

3 Control using Value Function Approximation

Page 8: Lecture 5: Value Function Approximation

Value Function Approximation (VFA)

Represent a (state-action/state) value function with a parameterized function instead of a table

[Diagram: state s and parameters w are input to a function that outputs V̂(s; w)]

[Diagram: state s, action a, and parameters w are input to a function that outputs Q̂(s, a; w)]

Page 9: Lecture 5: Value Function Approximation

Motivation for VFA

Don’t want to have to explicitly store or learn for every single state a

Dynamics or reward model

Value

State-action value

Policy

Want a more compact representation that generalizes across states, or across states and actions

Page 10: Lecture 5: Value Function Approximation

Benefits of Generalization

Reduce memory needed to store (P, R)/V/Q/π

Reduce computation needed to compute (P, R)/V/Q/π

Reduce experience needed to find a good (P, R)/V/Q/π

Page 11: Lecture 5: Value Function Approximation

Value Function Approximation (VFA)

Represent a (state-action/state) value function with a parameterized function instead of a table

[Diagram: state s and parameters w are input to a function that outputs V̂(s; w)]

[Diagram: state s, action a, and parameters w are input to a function that outputs Q̂(s, a; w)]

Which function approximator?

Page 12: Lecture 5: Value Function Approximation

Function Approximators

Many possible function approximators including

Linear combinations of features

Neural networks

Decision trees

Nearest neighbors

Fourier / wavelet bases

In this class we will focus on function approximators that are differentiable (Why?)

Two very popular classes of differentiable function approximators

Linear feature representations (Today)

Neural networks (Next lecture)

Page 13: Lecture 5: Value Function Approximation

Review: Gradient Descent

Consider a function J(w) that is a differentiable function of a parameter vector w

Goal is to find the parameter w that minimizes J

The gradient of J(w) is ∇_w J(w) = (∂J(w)/∂w_1, ∂J(w)/∂w_2, …, ∂J(w)/∂w_n)ᵀ, the vector of partial derivatives
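A minimal sketch of gradient descent on a differentiable objective; the quadratic J(w) = ‖Aw − b‖² and the data below are my own toy example, not from the slides:

```python
import numpy as np

# Toy illustration (not from the slides): gradient descent on the quadratic
# objective J(w) = ||Aw - b||^2, whose gradient is grad J(w) = 2 A^T (Aw - b).
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # hypothetical data
b = np.array([1.0, 2.0, 3.0])

def J(w):
    return np.sum((A @ w - b) ** 2)

def grad_J(w):
    return 2.0 * A.T @ (A @ w - b)

w = np.zeros(2)
alpha = 0.01                                # step size
for _ in range(2000):
    w = w - 0.5 * alpha * grad_J(w)         # Delta w = -(1/2) * alpha * grad J(w)

print(w, J(w))                              # J(w) should be near its minimum (0 here)
```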

Page 14: Lecture 5: Value Function Approximation

Table of Contents

1 Introduction

2 VFA for Prediction

3 Control using Value Function Approximation

Page 15: Lecture 5: Value Function Approximation

Value Function Approximation for Policy Evaluation with an Oracle

First assume we could query any state s and an oracle would return the true value V^π(s)

The objective is to find the best approximate representation of V^π given a particular parameterized function

Page 16: Lecture 5: Value Function Approximation

Stochastic Gradient Descent

Goal: Find the parameter vector w that minimizes the loss between a true value function V^π(s) and its approximation V̂(s; w) as represented with a particular function class parameterized by w.

Generally use mean squared error and define the loss as

J(w) = E_π[(V^π(s) − V̂(s; w))²]

Can use gradient descent to find a local minimum

Δw = −(1/2) α ∇_w J(w)

Stochastic gradient descent (SGD) samples the gradient:

Δw = α (V^π(s) − V̂(s; w)) ∇_w V̂(s; w)

In expectation the SGD update is the same as the full gradient update

Page 17: Lecture 5: Value Function Approximation

Model Free VFA Policy Evaluation

Don’t actually have access to an oracle to tell us the true V^π(s) for any state s

Now consider how to do model-free value function approximation for prediction / evaluation / policy evaluation without a model

Page 18: Lecture 5: Value Function Approximation

Model Free VFA Prediction / Policy Evaluation

Recall model-free policy evaluation (Lecture 3)

Following a fixed policy π (or had access to prior data)

Goal is to estimate V^π and/or Q^π

Maintained a lookup table to store estimates of V^π and/or Q^π

Updated these estimates after each episode (Monte Carlo methods) or after each step (TD methods)

Now: in value function approximation, change the estimate update step to include fitting the function approximator

Page 19: Lecture 5: Value Function Approximation

Feature Vectors

Use a feature vector to represent a state s

x(s) = (x_1(s), x_2(s), …, x_n(s))ᵀ
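A small illustrative feature map (my own example; the polynomial choice is an assumption, and any fixed-length vector of features works):

```python
import numpy as np

# Hypothetical example: a feature vector x(s) for a 1-D continuous state,
# built from polynomial features [1, s, s^2, ..., s^(n-1)].
def x(s: float, n: int = 4) -> np.ndarray:
    return np.array([s ** j for j in range(n)])

print(x(0.5))   # [1.    0.5   0.25  0.125]
```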

Page 20: Lecture 5: Value Function Approximation

Linear Value Function Approximation for Prediction With An Oracle

Represent a value function (or state-action value function) for a particular policy with a weighted linear combination of features

V̂(s; w) = ∑_{j=1}^{n} x_j(s) w_j = x(s)ᵀw

Objective function is

J(w) = E_π[(V^π(s) − V̂(s; w))²]

Recall weight update is

Δw = −(1/2) α ∇_w J(w)

Update is: Δw = α (V^π(s) − x(s)ᵀw) x(s)

Update = step-size × prediction error × feature value
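A minimal sketch of the oracle-based SGD update above, Δw = α (V^π(s) − x(s)ᵀw) x(s). The feature map, the `oracle_V` function, and the state distribution are illustrative stand-ins, not part of the lecture:

```python
import numpy as np

# Sketch of the oracle-based SGD update: Delta w = alpha * (V_pi(s) - x(s)^T w) * x(s).
# `x`, `oracle_V`, and the state distribution are hypothetical stand-ins.
def x(s: float) -> np.ndarray:
    return np.array([1.0, s, s ** 2])            # example linear features

def oracle_V(s: float) -> float:
    return 3.0 - 2.0 * s + 0.5 * s ** 2          # pretend true V^pi(s)

rng = np.random.default_rng(0)
w = np.zeros(3)
alpha = 0.1
for _ in range(5000):
    s = rng.uniform(-1.0, 1.0)                   # sample a state
    v_hat = x(s) @ w                             # V_hat(s; w) = x(s)^T w
    w += alpha * (oracle_V(s) - v_hat) * x(s)    # SGD step on the squared error

print(w)                                         # should approach [3.0, -2.0, 0.5]
```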

Page 21: Lecture 5: Value Function Approximation

Monte Carlo Value Function Approximation

Return G_t is an unbiased but noisy sample of the true expected return V^π(s_t)

Therefore can reduce MC VFA to doing supervised learning on a set of (state, return) pairs: ⟨s_1, G_1⟩, ⟨s_2, G_2⟩, …, ⟨s_T, G_T⟩

Substitute G_t for the true V^π(s_t) when fitting the function approximator

Concretely, when using linear VFA for policy evaluation:

Δw = α (G_t − V̂(s_t; w)) ∇_w V̂(s_t; w)
   = α (G_t − V̂(s_t; w)) x(s_t)
   = α (G_t − x(s_t)ᵀw) x(s_t)

Note: G_t may be a very noisy estimate of the true return

Page 22: Lecture 5: Value Function Approximation

MC Linear Value Function Approximation for Policy Evaluation

1: Initialize w = 0, k = 1
2: loop
3:    Sample k-th episode (s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}, …, s_{k,L_k}) given π
4:    for t = 1, …, L_k do
5:       if first visit to s in episode k then
6:          G_t(s) = ∑_{j=t}^{L_k} r_{k,j}
7:          Update weights: w = w + α (G_t(s) − x(s)ᵀw) x(s)
8:       end if
9:    end for
10:   k = k + 1
11: end loop
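A runnable sketch of the algorithm above on a made-up deterministic 3-state chain; the environment, one-hot features, and step size are my own illustrative choices:

```python
import numpy as np

# First-visit MC with linear VFA on a made-up deterministic 3-state chain
# (states 0,1,2; reward 1 per step; episode ends after state 2). One-hot
# features make this equivalent to the tabular case.
N_STATES, GAMMA, ALPHA = 3, 1.0, 0.01

def x(s: int) -> np.ndarray:
    return np.eye(N_STATES)[s]                   # one-hot features

def sample_episode():
    states = list(range(N_STATES))
    rewards = [1.0] * N_STATES
    return states, rewards

w = np.zeros(N_STATES)
for _ in range(500):                             # loop over episodes k
    states, rewards = sample_episode()
    returns, G = [0.0] * len(states), 0.0
    for t in reversed(range(len(states))):       # compute returns G_t backwards
        G = rewards[t] + GAMMA * G
        returns[t] = G
    seen = set()
    for t, s in enumerate(states):
        if s in seen:                            # first-visit check
            continue
        seen.add(s)
        w += ALPHA * (returns[t] - x(s) @ w) * x(s)   # MC gradient step

print(w)                                         # should approach V = [3, 2, 1]
```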

Page 23: Lecture 5: Value Function Approximation

Baird (1995)-Like Example with MC Policy Evaluation¹

MC update: Δw = α (G_t − x(s_t)ᵀw) x(s_t)

With small probability, s_7 goes to the terminal state; x(s_7)ᵀ = [0 0 0 0 0 0 1 2]

¹ Figure from Sutton and Barto 2018

Page 24: Lecture 5: Value Function Approximation

Convergence Guarantees for Linear Value Function Approximation for Policy Evaluation: Preliminaries

The Markov Chain defined by an MDP with a particular policy will eventually converge to a probability distribution over states d(s)

d(s) is called the stationary distribution over states of π

∑_s d(s) = 1

d(s) satisfies the following balance equation:

d(s′) = ∑_s ∑_a π(a|s) p(s′|s, a) d(s)
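A small sketch of the balance equation in code, on a made-up 2-state, 2-action MDP; iterating the equation from any starting distribution converges to d:

```python
import numpy as np

# Made-up 2-state, 2-action MDP: iterate the balance equation
# d(s') = sum_s sum_a pi(a|s) p(s'|s,a) d(s) until d stops changing.
P = np.array([                        # p(s'|s,a), shape (S, A, S')
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
pi = np.array([[0.5, 0.5],            # pi(a|s), shape (S, A)
               [0.3, 0.7]])

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) p(s'|s,a)
P_pi = np.einsum("sa,sap->sp", pi, P)

d = np.ones(2) / 2                    # start from the uniform distribution
for _ in range(1000):
    d = d @ P_pi                      # one application of the balance equation
print(d, d.sum())                     # stationary distribution; sums to 1
```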

Page 25: Lecture 5: Value Function Approximation

Convergence Guarantees for Linear Value Function Approximation for Policy Evaluation²

Define the mean squared error of a linear value function approximation for a particular policy π relative to the true value as

MSVE(w) = ∑_{s∈S} d(s) (V^π(s) − V̂^π(s; w))²

where

d(s): stationary distribution of π in the true decision process

V̂^π(s; w) = x(s)ᵀw, a linear value function approximation

² Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. 1997. https://web.stanford.edu/~bvr/pubs/td.pdf

Page 26: Lecture 5: Value Function Approximation

Convergence Guarantees for Linear Value Function Approximation for Policy Evaluation¹

Define the mean squared error of a linear value function approximation for a particular policy π relative to the true value as

MSVE(w) = ∑_{s∈S} d(s) (V^π(s) − V̂^π(s; w))²

where

d(s): stationary distribution of π in the true decision process

V̂^π(s; w) = x(s)ᵀw, a linear value function approximation

Monte Carlo policy evaluation with VFA converges to the weights w_MC, which have the minimum mean squared error possible:

MSVE(w_MC) = min_w ∑_{s∈S} d(s) (V^π(s) − V̂^π(s; w))²

¹ Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. 1997. https://web.stanford.edu/~bvr/pubs/td.pdf

Page 27: Lecture 5: Value Function Approximation

Batch Monte Carlo Value Function Approximation

May have a set of episodes from a policy π

Can analytically solve for the best linear approximation that minimizes mean squared error on this data set

Let G(s_i) be an unbiased sample of the true expected return V^π(s_i)

arg min_w ∑_{i=1}^{N} (G(s_i) − x(s_i)ᵀw)²

Take the derivative and set to 0

w = (XᵀX)⁻¹XᵀG

where G is a vector of all N returns, and X is a matrix of the features of each of the N states x(s_i)

Note: not making any Markov assumptions
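A sketch of the closed-form batch solution, using fabricated data purely for illustration; in practice one would prefer `np.linalg.solve` or `lstsq` over forming an explicit matrix inverse:

```python
import numpy as np

# Closed-form batch solution w = (X^T X)^{-1} X^T G on fabricated data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # features of 100 visited states
true_w = np.array([1.0, -2.0, 0.5])
G = X @ true_w + 0.1 * rng.normal(size=100)   # noisy sampled returns

# Solving the normal equations is more stable than forming an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ G)
print(w)                                      # close to [1.0, -2.0, 0.5]
```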

Page 28: Lecture 5: Value Function Approximation

Recall: Temporal Difference Learning w/ Lookup Table

Uses bootstrapping and sampling to approximate V π

Updates V^π(s) after each transition (s, a, r, s′):

V^π(s) = V^π(s) + α (r + γ V^π(s′) − V^π(s))

Target is r + γ V^π(s′), a biased estimate of the true value V^π(s)

Represent value for each state with a separate table entry

Page 29: Lecture 5: Value Function Approximation

Temporal Difference (TD(0)) Learning with Value Function Approximation

Uses bootstrapping and sampling to approximate the true V^π

Updates estimate V^π(s) after each transition (s, a, r, s′):

V^π(s) = V^π(s) + α (r + γ V^π(s′) − V^π(s))

Target is r + γ V^π(s′), a biased estimate of the true value V^π(s)

In value function approximation, target is r + γ V̂^π(s′; w), a biased and approximated estimate of the true value V^π(s)

3 forms of approximation: sampling, bootstrapping, and function approximation

Page 30: Lecture 5: Value Function Approximation

Temporal Difference (TD(0)) Learning with Value Function Approximation

In value function approximation, target is r + γ V̂^π(s′; w), a biased and approximated estimate of the true value V^π(s)

Can reduce doing TD(0) learning with value function approximation to supervised learning on a set of data pairs:

⟨s_1, r_1 + γ V̂^π(s_2; w)⟩, ⟨s_2, r_2 + γ V̂(s_3; w)⟩, …

Find weights to minimize mean squared error

J(w) = E_π[(r_j + γ V̂^π(s_{j+1}; w) − V̂(s_j; w))²]

Page 31: Lecture 5: Value Function Approximation

Temporal Difference (TD(0)) Learning with Value Function Approximation

In value function approximation, target is r + γ V̂^π(s′; w), a biased and approximated estimate of the true value V^π(s)

Supervised learning on a different set of data pairs: ⟨s_1, r_1 + γ V̂^π(s_2; w)⟩, ⟨s_2, r_2 + γ V̂(s_3; w)⟩, …

In linear TD(0):

Δw = α (r + γ V̂^π(s′; w) − V̂^π(s; w)) ∇_w V̂^π(s; w)
   = α (r + γ V̂^π(s′; w) − V̂^π(s; w)) x(s)
   = α (r + γ x(s′)ᵀw − x(s)ᵀw) x(s)

Page 32: Lecture 5: Value Function Approximation

TD(0) Linear Value Function Approximation for Policy Evaluation

1: Initialize w = 0, k = 1
2: loop
3:    Sample tuple (s_k, a_k, r_k, s_{k+1}) given π
4:    Update weights: w = w + α (r + γ x(s′)ᵀw − x(s)ᵀw) x(s)
5:    k = k + 1
6: end loop
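A runnable sketch of the TD(0) linear-VFA loop above, reusing the same made-up 3-state chain as the earlier MC sketch; the environment and features are illustrative assumptions:

```python
import numpy as np

# TD(0) with linear (one-hot) VFA on the same made-up 3-state chain:
# reward 1 per step, move right, terminate after state 2.
N_STATES, GAMMA, ALPHA = 3, 1.0, 0.05

def x(s: int) -> np.ndarray:
    return np.eye(N_STATES)[s]

def step(s: int):
    s_next = s + 1
    return 1.0, s_next, s_next >= N_STATES    # reward, next state, done

w = np.zeros(N_STATES)
for _ in range(2000):                         # loop over episodes
    s, done = 0, False
    while not done:
        r, s_next, done = step(s)
        v_next = 0.0 if done else x(s_next) @ w               # terminal value is 0
        w += ALPHA * (r + GAMMA * v_next - x(s) @ w) * x(s)   # TD(0) step
        s = s_next

print(w)                                      # should approach V = [3, 2, 1]
```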

Page 33: Lecture 5: Value Function Approximation

Baird Example with TD(0) On-Policy Evaluation¹

TD update: Δw = α (r + γ x(s′)ᵀw − x(s)ᵀw) x(s)

¹ Figure from Sutton and Barto 2018

Page 34: Lecture 5: Value Function Approximation

Convergence Guarantees for Linear Value Function Approximation for Policy Evaluation

Define the mean squared error of a linear value function approximation for a particular policy π relative to the true value as

MSVE(w) = ∑_{s∈S} d(s) (V^π(s) − V̂^π(s; w))²

where

d(s): stationary distribution of π in the true decision process

V̂^π(s; w) = x(s)ᵀw, a linear value function approximation

TD(0) policy evaluation with VFA converges to weights w_TD, which are within a constant factor of the minimum mean squared error possible:

MSVE(w_TD) ≤ (1/(1 − γ)) min_w ∑_{s∈S} d(s) (V^π(s) − V̂^π(s; w))²

Page 35: Lecture 5: Value Function Approximation

Check Your Understanding

Monte Carlo policy evaluation with VFA converges to the weights w_MC, which have the minimum mean squared error possible:

MSVE(w_MC) = min_w ∑_{s∈S} d(s) (V^π(s) − V̂^π(s; w))²

TD(0) policy evaluation with VFA converges to weights w_TD, which are within a constant factor of the minimum mean squared error possible:

MSVE(w_TD) ≤ (1/(1 − γ)) min_w ∑_{s∈S} d(s) (V^π(s) − V̂^π(s; w))²

If the VFA is a tabular representation (one feature for each state), what is the MSVE for MC and TD?

Page 36: Lecture 5: Value Function Approximation

Convergence Rates for Linear Value Function Approximation for Policy Evaluation

Does TD or MC converge faster to a fixed point?

Not (to my knowledge) definitively understood

Practically, TD learning often converges faster to its value function approximation fixed point

Page 37: Lecture 5: Value Function Approximation

Table of Contents

1 Introduction

2 VFA for Prediction

3 Control using Value Function Approximation

Page 38: Lecture 5: Value Function Approximation

Control using Value Function Approximation

Use value function approximation to represent state-action values: Q̂^π(s, a; w) ≈ Q^π(s, a)

Interleave

Approximate policy evaluation using value function approximation

Perform ε-greedy policy improvement

Can be unstable. Generally involves intersection of the following:

Function approximation

Bootstrapping

Off-policy learning
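A minimal sketch of the ε-greedy policy improvement step above; `q_hat` stands in for any approximate state-action value function, and the dummy Q below is hypothetical:

```python
import numpy as np

# Epsilon-greedy action selection from an approximate Q; `q_hat` is any
# callable returning Q_hat(s, a; w). The dummy Q below is hypothetical.
def epsilon_greedy(q_hat, s, n_actions: int, eps: float, rng) -> int:
    if rng.random() < eps:
        return int(rng.integers(n_actions))            # explore
    q_values = [q_hat(s, a) for a in range(n_actions)]
    return int(np.argmax(q_values))                    # exploit

rng = np.random.default_rng(0)
q_hat = lambda s, a: float(a == 1)                     # pretend action 1 is best
print(epsilon_greedy(q_hat, s=0, n_actions=3, eps=0.1, rng=rng))
```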

Page 39: Lecture 5: Value Function Approximation

Action-Value Function Approximation with an Oracle

Q̂^π(s, a; w) ≈ Q^π(s, a)

Minimize the mean-squared error between the true action-value function Q^π(s, a) and the approximate action-value function:

J(w) = E_π[(Q^π(s, a) − Q̂^π(s, a; w))²]

Use stochastic gradient descent to find a local minimum

−(1/2) ∇_w J(w) = E_π[(Q^π(s, a) − Q̂^π(s, a; w)) ∇_w Q̂^π(s, a; w)]

Δw = −(1/2) α ∇_w J(w)

Stochastic gradient descent (SGD) samples the gradient

Page 40: Lecture 5: Value Function Approximation

Linear State Action Value Function Approximation with an Oracle

Use features to represent both the state and action

x(s, a) = (x_1(s, a), x_2(s, a), …, x_n(s, a))ᵀ

Represent state-action value function with a weighted linear combination of features

Q̂(s, a; w) = x(s, a)ᵀw = ∑_{j=1}^{n} x_j(s, a) w_j

Stochastic gradient descent update:

∇_w J(w) = ∇_w E_π[(Q^π(s, a) − Q̂^π(s, a; w))²]
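One possible way to build state-action features (my own construction, not prescribed by the slides): give each action its own copy of the state features, so Q̂(s, a; w) = x(s, a)ᵀw stays linear in w:

```python
import numpy as np

# One way to build x(s, a): give each action its own copy of the state
# features, so Q_hat(s, a; w) = x(s, a)^T w remains linear in w.
N_ACTIONS = 2

def x_sa(s: float, a: int) -> np.ndarray:
    state_feats = np.array([1.0, s, s ** 2])           # example state features
    action_onehot = np.eye(N_ACTIONS)[a]
    return np.outer(action_onehot, state_feats).ravel()

w = np.zeros(3 * N_ACTIONS)
print(x_sa(0.5, 1), x_sa(0.5, 1) @ w)                  # features and Q_hat(s, a; w)
```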

Page 41: Lecture 5: Value Function Approximation

Incremental Model-Free Control Approaches

Similar to policy evaluation, the true state-action value function for a state is unknown, so substitute a target value

In Monte Carlo methods, use a return G_t as a substitute target

Δw = α (G_t − Q̂(s_t, a_t; w)) ∇_w Q̂(s_t, a_t; w)

For SARSA instead use a TD target r + γ Q̂(s′, a′; w), which leverages the current function approximation value

Δw = α (r + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)

Page 42: Lecture 5: Value Function Approximation

Incremental Model-Free Control Approaches

Similar to policy evaluation, the true state-action value function for a state is unknown, so substitute a target value

In Monte Carlo methods, use a return G_t as a substitute target

Δw = α (G_t − Q̂(s_t, a_t; w)) ∇_w Q̂(s_t, a_t; w)

For SARSA instead use a TD target r + γ Q̂(s′, a′; w), which leverages the current function approximation value

Δw = α (r + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)

For Q-learning instead use a TD target r + γ max_{a′} Q̂(s′, a′; w), which leverages the max of the current function approximation value

Δw = α (r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)
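A runnable sketch of the Q-learning update above with linear (here one-hot, i.e. tabular) features on a made-up 2-state, 2-action problem; the environment, features, and hyperparameters are illustrative assumptions:

```python
import numpy as np

# Q-learning with linear (one-hot, i.e. tabular) features on a made-up
# 2-state, 2-action problem: action 1 moves to the other state and pays +1.
N_STATES, N_ACTIONS = 2, 2
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1
rng = np.random.default_rng(0)

def x_sa(s: int, a: int) -> np.ndarray:
    feat = np.zeros(N_STATES * N_ACTIONS)
    feat[s * N_ACTIONS + a] = 1.0
    return feat

def env_step(s: int, a: int):
    return (1.0, 1 - s) if a == 1 else (0.0, s)        # reward, next state

w = np.zeros(N_STATES * N_ACTIONS)
s = 0
for _ in range(5000):
    # epsilon-greedy behaviour policy from the current Q_hat
    if rng.random() < EPS:
        a = int(rng.integers(N_ACTIONS))
    else:
        a = int(np.argmax([x_sa(s, b) @ w for b in range(N_ACTIONS)]))
    r, s_next = env_step(s, a)
    td_target = r + GAMMA * max(x_sa(s_next, b) @ w for b in range(N_ACTIONS))
    w += ALPHA * (td_target - x_sa(s, a) @ w) * x_sa(s, a)   # Q-learning step
    s = s_next

print(w.reshape(N_STATES, N_ACTIONS))   # per-state Q_hat; action 1 should dominate
```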

Page 43: Lecture 5: Value Function Approximation

Convergence of TD Methods with VFA

TD with value function approximation is not following the gradient of an objective function

Informally, updates involve doing an (approximate) Bellman backup followed by a best-effort fit of the underlying value function to a particular feature representation

Bellman operators are contractions, but value function approximationfitting can be an expansion

Page 44: Lecture 5: Value Function Approximation

Challenges of Off-Policy Control: Baird Example¹

Behavior policy and target policy are not identical

Value can diverge

¹ Figure from Sutton and Barto 2018

Page 45: Lecture 5: Value Function Approximation

Convergence of Control Methods with VFA

Algorithm           | Tabular | Linear VFA | Nonlinear VFA
Monte-Carlo Control |         |            |
Sarsa               |         |            |
Q-learning          |         |            |

Page 46: Lecture 5: Value Function Approximation

Hot Topic: Off-Policy Function Approximation Convergence

Extensive work in better TD-style algorithms with value function approximation, some with convergence guarantees: see Chapter 11 of Sutton & Barto

Exciting recent work on batch RL that can converge with nonlinear VFA (Dai et al. ICML 2018): uses primal-dual optimization

An important issue is not just whether the algorithm converges, but what solution it converges to

Critical choices: objective function and feature representation

Page 47: Lecture 5: Value Function Approximation

Linear Value Function Approximation³

³ Figure from Sutton and Barto 2018

Page 48: Lecture 5: Value Function Approximation

What You Should Understand

Be able to implement TD(0) and MC on-policy evaluation with linear value function approximation

Be able to define what TD(0) and MC on-policy evaluation with linear VFA are converging to, and when this solution has zero error and when it has non-zero error

Be able to implement Q-learning, SARSA, and MC control algorithms

List the 3 issues that can cause instability and describe the problems qualitatively: function approximation, bootstrapping, and off-policy learning

Page 49: Lecture 5: Value Function Approximation

Class Structure

Last time: Control (making decisions) without a model of how the world works

This time: Value function approximation

Next time: Deep reinforcement learning
