Lecture 10: Policy Gradient III & Midterm Review


Logistics

Midterm we will be in two rooms

The room you are assigned to depends on the first letter of your SUID (Stanford email handle, e.g. jdoe@stanford.edu)

Gates B1 (a-e inclusive)

Cubberley Auditorium (f-z)

Lecture 10: Policy Gradient III & Midterm Review¹

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2019

Additional reading: Sutton and Barto 2018 Chp. 13

¹ With many policy gradient slides from or derived from David Silver, John Schulman, and Pieter Abbeel

Class Structure

Last time: Policy Search

This time: Policy Search & Midterm Review

Next time: Midterm

Recall: Policy-Based RL

Policy search: directly parametrize the policy

π_θ(s, a) = P[a | s; θ]

Goal is to find a policy π with the highest value function V^π

Focus on policy gradient methods

"Vanilla" Policy Gradient Algorithm

Initialize policy parameter θ, baseline b
for iteration = 1, 2, · · · do
    Collect a set of trajectories by executing the current policy
    At each timestep t in each trajectory τ^i, compute
        Return G^i_t = Σ_{t′=t}^{T−1} r^i_{t′}, and
        Advantage estimate A^i_t = G^i_t − b(s_t)
    Re-fit the baseline, by minimizing Σ_i Σ_t ||b(s_t) − G^i_t||²
    Update the policy, using a policy gradient estimate g,
        which is a sum of terms ∇_θ log π(a_t | s_t, θ) A_t
        (Plug g into SGD or ADAM)
end for
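To make the loop above concrete, here is a minimal Python sketch of the same algorithm on an invented toy problem; the environment (toy_step), the tabular softmax policy, and all hyperparameters are illustrative assumptions, not part of the lecture.

# Minimal sketch of the "vanilla" policy gradient loop above, with a tabular
# softmax policy and a state-dependent baseline. Everything here is a toy
# stand-in for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon, gamma = 4, 2, 10, 1.0

def toy_step(s, a):
    """Illustrative dynamics: action 1 moves right; reward 1 at the last state."""
    s_next = min(s + a, n_states - 1)
    return s_next, float(s_next == n_states - 1)

def policy_probs(theta, s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

theta = np.zeros((n_states, n_actions))   # policy parameters
b = np.zeros(n_states)                    # baseline b(s)
alpha = 0.1                               # step size

for iteration in range(200):
    grads, returns_by_state = [], [[] for _ in range(n_states)]
    for _ in range(10):                             # collect trajectories under the current policy
        s, traj = 0, []
        for t in range(horizon):
            p = policy_probs(theta, s)
            a = rng.choice(n_actions, p=p)
            s_next, r = toy_step(s, a)
            traj.append((s, a, r))
            s = s_next
        G = 0.0
        for s_t, a_t, r_t in reversed(traj):        # returns G_t and advantages A_t = G_t - b(s_t)
            G = r_t + gamma * G
            A = G - b[s_t]
            grad_log = -policy_probs(theta, s_t)    # grad of log pi(a_t|s_t) for a softmax policy
            grad_log[a_t] += 1.0
            g = np.zeros_like(theta)
            g[s_t] = grad_log * A
            grads.append(g)
            returns_by_state[s_t].append(G)
    theta += alpha * np.mean(grads, axis=0)         # SGD step on the policy gradient estimate
    b = np.array([np.mean(G_s) if G_s else b_s      # re-fit baseline: minimizes sum ||b(s) - G||^2
                  for G_s, b_s in zip(returns_by_state, b)])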

Choosing the Target

G^i_t is an estimate of the value function at s_t from a single rollout

Unbiased but high variance

Reduce variance by introducing bias using bootstrapping and function approximation

Just like we saw for TD vs. MC, and for value function approximation

Estimate of V/Q is done by a critic

Actor-critic methods maintain an explicit representation of the policy and the value function, and update both

A3C (Mnih et al. ICML 2016) is a very popular actor-critic method
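To illustrate the bias/variance trade-off above, here is a small hedged sketch contrasting the Monte Carlo advantage with a one-step bootstrapped (critic-based) advantage; the value estimate V and the function names are hypothetical, not from the slides.

# Two ways to form the advantage estimate A_t used in the policy gradient,
# given one trajectory and a learned value estimate V (indexable by state).
def mc_advantage(rewards, states, V, gamma=0.99):
    """Monte Carlo: A_t = G_t - V(s_t). Unbiased, but high variance."""
    G, advantages = 0.0, []
    for r, s in zip(reversed(rewards), reversed(states)):
        G = r + gamma * G
        advantages.append(G - V[s])
    return advantages[::-1]

def td_advantage(rewards, states, next_states, V, gamma=0.99):
    """One-step bootstrap: A_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    Biased if V is imperfect, but lower variance (actor-critic style)."""
    return [r + gamma * V[s2] - V[s1]
            for r, s1, s2 in zip(rewards, states, next_states)]

# Tiny usage example with an arbitrary value table
V = {0: 0.0, 1: 0.5, 2: 1.0}
print(mc_advantage([0.0, 1.0], [0, 1], V, gamma=1.0))            # [1.0, 0.5]
print(td_advantage([0.0, 1.0], [0, 1], [1, 2], V, gamma=1.0))    # [0.5, 1.5]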

"Vanilla" Policy Gradient Algorithm

Initialize policy parameter θ, baseline b
for iteration = 1, 2, · · · do
    Collect a set of trajectories by executing the current policy
    At each timestep t in each trajectory τ^i, compute
        Target R^i_t, and
        Advantage estimate A^i_t = R^i_t − b(s_t)
    Re-fit the baseline, by minimizing Σ_i Σ_t ||b(s_t) − R^i_t||²
    Update the policy, using a policy gradient estimate g,
        which is a sum of terms ∇_θ log π(a_t | s_t, θ) A_t
        (Plug g into SGD or ADAM)
end for

Policy Gradient Methods with Auto-Step-Size Selection

Can we automatically ensure the updated policy π′ has value greater than or equal to the prior policy π: V^{π′} ≥ V^{π}?

Consider this for the policy gradient setting, and hope to address this by modifying the step size

Objective Function

Goal: find policy parameters that maximize the value function¹

V(θ) = E_{π_θ} [ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ; π_θ ]

where s_0 ∼ μ(s_0), a_t ∼ π(a_t | s_t), s_{t+1} ∼ P(s_{t+1} | s_t, a_t)

Have access to samples from the current policy π_{θ_old} (parameterized by θ_old)

Want to predict the value of a different policy (off-policy learning!)

¹ For today we will primarily consider discounted value functions

Objective Function

Goal: find policy parameters that maximize the value function¹

V(θ) = E_{π_θ} [ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ; π_θ ]

where s_0 ∼ μ(s_0), a_t ∼ π(a_t | s_t), s_{t+1} ∼ P(s_{t+1} | s_t, a_t)

Express the expected return of another policy π̃ in terms of the advantage over the original policy π:

V(θ̃) = V(θ) + E_{π_θ̃} [ Σ_{t=0}^{∞} γ^t A_π(s_t, a_t) ] = V(θ) + Σ_s μ_{π̃}(s) Σ_a π̃(a|s) A_π(s, a)

where μ_{π̃}(s) is defined as the discounted weighted frequency of states under policy π̃ (similar to the Imitation Learning lecture)

We know the advantage A_π and π̃

But we can't compute the above because we don't know μ_{π̃}, the state distribution under the new proposed policy

¹ For today we will primarily consider discounted value functions
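For reference, the discounted state visitation frequency used above can be written out explicitly; this is the standard definition (e.g., as in the TRPO paper), stated here for completeness rather than copied from the slide:

μ_{π̃}(s) = Σ_{t=0}^{∞} γ^t P(s_t = s | π̃), with s_0 ∼ μ(s_0)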

Table of Contents

1 Updating the Parameters Given the Gradient: Local Approximation

2 Updating the Parameters Given the Gradient: Trust Regions

3 Updating the Parameters Given the Gradient: TRPO Algorithm

Local approximation

Can we remove the dependency on the discounted visitation frequencies under the new policy?

Substitute in the discounted visitation frequencies under the current policy to define a new objective function:

L_π(π̃) = V(θ) + Σ_s μ_π(s) Σ_a π̃(a|s) A_π(s, a)

Note that L_{π_θ0}(π_θ0) = V(θ_0)

The gradient of L is identical to the gradient of the value function, evaluated at θ_0: ∇_θ L_{π_θ0}(π_θ)|_{θ=θ0} = ∇_θ V(θ)|_{θ=θ0}

Conservative Policy Iteration

Is there a bound on the performance of a new policy obtained by optimizing the surrogate objective?

Consider mixture policies that blend between an old policy and a different policy:

π_new(a|s) = (1 − α) π_old(a|s) + α π′(a|s)

In this case we can guarantee a lower bound on the value of the new π_new:

V^{π_new} ≥ L_{π_old}(π_new) − (2εγ / (1 − γ)²) α²

where ε = max_s | E_{a∼π′(a|s)} [A_π(s, a)] |


Find the Lower-Bound in General Stochastic Policies

Would like to similarly obtain a lower bound on the potential performance for general stochastic policies (not just mixture policies)

Recall L_π(π̃) = V(θ) + Σ_s μ_π(s) Σ_a π̃(a|s) A_π(s, a)

Theorem

Let D^{max}_{TV}(π_1, π_2) = max_s D_{TV}(π_1(·|s), π_2(·|s)). Then

V^{π_new} ≥ L_{π_old}(π_new) − (4εγ / (1 − γ)²) (D^{max}_{TV}(π_old, π_new))²

where ε = max_{s,a} |A_π(s, a)|.

Note that D_{TV}(p, q)² ≤ D_{KL}(p, q) for probability distributions p and q. Then the above theorem immediately implies that

V^{π_new} ≥ L_{π_old}(π_new) − (4εγ / (1 − γ)²) D^{max}_{KL}(π_old, π_new)

where D^{max}_{KL}(π_1, π_2) = max_s D_{KL}(π_1(·|s), π_2(·|s))


Guaranteed Improvement¹

Goal is to compute a policy that maximizes the objective function defining the lower bound:

M_i(π) = L_{π_i}(π) − (4εγ / (1 − γ)²) D^{max}_{KL}(π_i, π)

V^{π_{i+1}} ≥ L_{π_i}(π_{i+1}) − (4εγ / (1 − γ)²) D^{max}_{KL}(π_i, π_{i+1}) = M_i(π_{i+1})

V^{π_i} = M_i(π_i)

V^{π_{i+1}} − V^{π_i} ≥ M_i(π_{i+1}) − M_i(π_i)

So as long as the new policy π_{i+1} is equal to or an improvement over the old policy π_i with respect to the lower bound, we are guaranteed to monotonically improve!

The above is a type of Minorization-Maximization (MM) algorithm

¹ L_π(π̃) = V(θ) + Σ_s μ_π(s) Σ_a π̃(a|s) A_π(s, a)

Guaranteed Improvement¹

V^{π_new} ≥ L_{π_old}(π_new) − (4εγ / (1 − γ)²) D^{max}_{KL}(π_old, π_new)

Figure: Source: John Schulman, Deep Reinforcement Learning, 2014

¹ L_π(π̃) = V(θ) + Σ_s μ_π(s) Σ_a π̃(a|s) A_π(s, a)

Table of Contents

1 Updating the Parameters Given the Gradient: Local Approximation

2 Updating the Parameters Given the Gradient: Trust Regions

3 Updating the Parameters Given the Gradient: TRPO Algorithm


Optimization of Parameterized Policies¹

Goal is to optimize

max_θ  L_{θ_old}(θ_new) − (4εγ / (1 − γ)²) D^{max}_{KL}(θ_old, θ_new) = L_{θ_old}(θ_new) − C · D^{max}_{KL}(θ_old, θ_new)

where C is the penalty coefficient

In practice, if we used the penalty coefficient recommended by the theory above, C = 4εγ / (1 − γ)², the step sizes would be very small

New idea: Use a trust region constraint on step sizes (Schulman, Levine, Abbeel, Jordan, & Moritz, ICML 2015). Do this by imposing a constraint on the KL divergence between the new and old policy:

max_θ  L_{θ_old}(θ)

subject to  D^{s∼μ_θold}_{KL}(θ_old, θ) ≤ δ

This uses the average KL instead of the max (the max requires the KL to be bounded at all states and yields an impractical number of constraints)

¹ L_π(π̃) = V(θ) + Σ_s μ_π(s) Σ_a π̃(a|s) A_π(s, a)
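To make the constraint concrete, here is a minimal hedged sketch that estimates the average KL divergence between the old and new policies over sampled states, for discrete-action (categorical) policies; the array shapes, names, and example numbers are illustrative assumptions, not the lecture's code.

# Estimate the trust region quantity
#   E_{s ~ mu_theta_old}[ D_KL( pi_theta_old(.|s) || pi_theta(.|s) ) ]
# from a batch of sampled states, for categorical policies.
import numpy as np

def average_kl(old_probs, new_probs, eps=1e-8):
    """old_probs, new_probs: arrays of shape (n_sampled_states, n_actions)."""
    log_ratio = np.log((old_probs + eps) / (new_probs + eps))
    kl_per_state = np.sum(old_probs * log_ratio, axis=1)   # D_KL at each sampled state
    return kl_per_state.mean()                             # Monte Carlo average over states

# Toy usage: a candidate update to theta would only be acceptable if this
# average KL is at most delta.
old = np.array([[0.9, 0.1], [0.5, 0.5]])
new = np.array([[0.8, 0.2], [0.6, 0.4]])
delta = 0.01
print(average_kl(old, new), average_kl(old, new) <= delta)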

From Theory to Practice

Prior objective:

max_θ  L_{θ_old}(θ)

subject to  D^{s∼μ_θold}_{KL}(θ_old, θ) ≤ δ

where L_{θ_old}(θ) = V(θ_old) + Σ_s μ_{θ_old}(s) Σ_a π(a|s, θ) A_{θ_old}(s, a)

Don't know the discounted visitation weights nor the true advantage function

Instead do the following substitutions:

Σ_s μ_{θ_old}(s) [ . . . ] → (1 / (1 − γ)) E_{s∼μ_θold} [ . . . ],

From Theory to Practice

Next substitution:

Σ_a π_θ(a|s_n) A_{θ_old}(s_n, a) → E_{a∼q} [ (π_θ(a|s_n) / q(a|s_n)) A_{θ_old}(s_n, a) ]

where q is some sampling distribution over the actions and s_n is a particular sampled state.

This second substitution uses importance sampling to estimate the desired sum, enabling the use of an alternate sampling distribution q (other than the new policy π_θ).
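A minimal sketch of this importance-sampling estimator, assuming categorical policies and hypothetical variable names (not code from the lecture):

# Importance-sampled estimate of  sum_a pi_theta(a|s_n) * A_theta_old(s_n, a)
# using actions a_n drawn from a sampling distribution q(.|s_n).
import numpy as np

def surrogate_estimate(pi_theta_probs, q_probs, advantages):
    """All inputs are 1-D arrays aligned over sampled (s_n, a_n) pairs:
    pi_theta_probs[n] = pi_theta(a_n|s_n), q_probs[n] = q(a_n|s_n),
    advantages[n] = estimated A_theta_old(s_n, a_n)."""
    ratios = pi_theta_probs / q_probs          # importance weights
    return np.mean(ratios * advantages)        # Monte Carlo estimate of the surrogate

# In the standard setup described on the next slide, q is simply pi_theta_old,
# so the weight becomes the familiar ratio pi_theta(a|s) / pi_theta_old(a|s).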

From Theory to Practice

Third substitution:

A_{θ_old} → Q_{θ_old}

Note that these 3 substitutions do not change the solution to the above optimization problem (for the third, A_{θ_old}(s, a) = Q_{θ_old}(s, a) − V_{θ_old}(s), and the V_{θ_old}(s) term only adds a quantity that does not depend on θ)


Selecting the Sampling Policy

Optimize

max_θ  E_{s∼μ_θold, a∼q} [ (π_θ(a|s) / q(a|s)) Q_{θ_old}(s, a) ]

subject to  E_{s∼μ_θold} [ D_{KL}(π_{θ_old}(·|s), π_θ(·|s)) ] ≤ δ

Standard approach: the sampling distribution q(a|s) is simply π_old(a|s)

For the vine procedure, see the paper

Figure: Trust Region Policy Optimization, Schulman et al., 2015

Searching for the Next Parameter

Use a linear approximation to the objective function and a quadratic approximation to the constraint

Constrained optimization problem

Use conjugate gradient descent

Reference: Trust Region Policy Optimization, Schulman et al., 2015

Table of Contents

1 Updating the Parameters Given the Gradient: Local Approximation

2 Updating the Parameters Given the Gradient: Trust Regions

3 Updating the Parameters Given the Gradient: TRPO Algorithm

Practical Algorithm: TRPO

1: for iteration = 1, 2, . . . do
2:     Run policy for T timesteps or N trajectories
3:     Estimate advantage function at all timesteps
4:     Compute policy gradient g
5:     Use CG (with Hessian-vector products) to compute F⁻¹g, where F is the Fisher information matrix
6:     Do line search on surrogate loss and KL constraint
7: end for

Reference: Trust Region Policy Optimization, Schulman et al., 2015
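As an illustration of the conjugate gradient step (line 5), here is a minimal hedged sketch that solves F x = g using only matrix-vector products with F; the fisher_vector_product callback and the toy matrix are illustrative assumptions, not the lecture's or the paper's code.

# Solve F x = g (i.e., compute x = F^{-1} g) by conjugate gradient, given only
# a function that returns F v for any vector v (a Fisher-vector product).
import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)          # current solution estimate
    r = g.copy()                  # residual g - F x (x starts at 0)
    p = g.copy()                  # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = r_dot / (p @ Fp)            # step size along p
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p     # next conjugate direction
        r_dot = new_r_dot
    return x

# Toy usage with an explicit symmetric positive definite matrix standing in for
# the Fisher matrix; real implementations form F v via automatic differentiation.
F = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, 1.0])
print(conjugate_gradient(lambda v: F @ v, g), np.linalg.solve(F, g))  # should approximately agree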

Practical Algorithm: TRPO

Applied to

Locomotion controllers in 2D

Figure: Trust Region Policy Optimization, Schulman et al, 2015

Atari games with pixel input

Reference: Trust Region Policy Optimization, Schulman et al., 2015

TRPO Results

Figure: Trust Region Policy Optimization, Schulman et al, 2015

TRPO Results

Figure: Trust Region Policy Optimization, Schulman et al, 2015

TRPO Summary

Policy gradient approach

Uses surrogate optimization function

Automatically constrains the weight update to a trusted region, to approximate where the first order approximation is valid

Empirically consistently does well

Very influential: 350+ citations since it was introduced a few years ago

Common Template of Policy Gradient Algorithms

1: for iteration = 1, 2, . . . do
2:     Run policy for T timesteps or N trajectories
3:     At each timestep in each trajectory, compute target Q^π(s_t, a_t) and baseline b(s_t)
4:     Compute estimated policy gradient g
5:     Update the policy using g, potentially constrained to a local region
6: end for

Policy Gradient Summary

Extremely popular and useful set of approaches

Can input prior knowledge in the form of specifying the policy parameterization

You should be very familiar with REINFORCE and the policy gradient template on the prior slide

Understand where different estimators can be slotted in (and implications for bias/variance)

Don't have to be able to derive or remember the specific formulas in TRPO for approximating the objectives and constraints

Will have the opportunity to practice with these ideas in homework 3

Class Structure

Last time: Policy Search

This time: Policy Search & Midterm review

Next time: Midterm
