
Page 1:

Reinforcement Learning China Summer School

Model-based Reinforcement Learning

Prof. Weinan Zhang

John Hopcroft Center, Shanghai Jiao Tong University

July 30, 2020

Page 2:

Overall Pathway of DRL

• Deep reinforcement learning has achieved appealing successes: Atari, AlphaGo, DOTA 2, AlphaStar

Slide from Sergey Levine. http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-1.pdf

Page 3:

Overall Pathway of DRL

• Deep reinforcement learning has achieved appealing successes: Atari, AlphaGo, DOTA 2, AlphaStar
• But DRL has very low data efficiency: trial-and-error learning for deep networks
• A recent popular direction is model-based RL
  • Build a model $p(s', r \mid s, a)$
  • Train the policy based on the model
  • So that data efficiency can be improved

Page 4:

Interaction between Agent and Environment

• Real environment: state dynamics $p(s' \mid s, a)$ and reward function $r(s, a)$, i.e., $p(s', r \mid s, a)$
• The policy $\pi(a \mid s)$ interacts with the real environment and collects experience data $\{(s, a, r, s')\}$

Page 5:

Interaction between Agent and Env. Model

• Real environment: state dynamics and reward function $p(s', r \mid s, a)$
• Environment model: learned state dynamics $p(s' \mid s, a)$ and reward function $r(s, a)$
• The policy $\pi(a \mid s)$ interacts with the environment model and collects simulated data $\{(s, a, r, s')\}$

Page 6:

Model-free RL vs. Model-based RL (1)

• Model-based RL
  • On-policy learning once the model is learned
  • May not need further real interaction data once the model is learned (batch RL)
  • Typically shows higher sample efficiency than model-free RL
  • Suffers from model compounding error
• Model-free RL
  • The best asymptotic performance
  • Highly suitable for DL architectures with big data
  • Off-policy methods still show instabilities
  • Very low sample efficiency & requires a huge amount of training data

Page 7:

Model-based RL: Blackbox and Whitebox

Model as a Blackbox
• Seamless to policy training algorithms
• The simulation data efficiency may still be low
• E.g., Dyna-Q, MPC, MBPO
• The model only provides sampled transitions $(s, a, r, s')$

Model as a Whitebox
• Offers both data and gradient guidance for value and policy
• High data efficiency
• E.g., MAAC, SVG, PILCO
• The model also provides gradients, e.g., $\frac{\partial V(s')}{\partial s'} \frac{\partial s'}{\partial a}$ with $s' \sim f_\phi(s, a)$

Page 8:

Content

1. Introduction to MBRL from Dyna

2. Shooting methods: RS, PETS, POPLIN

3. Theoretic bounds and methods: SLBO, MBPO & BMPO

4. Backpropagation through paths: SVG and MAAC

Page 9:

Model-based RL

• In the real environment, the policy $\pi(a \mid s)$ collects real trajectory data $\{(s, a, r, s')\}$ used to learn $Q(s, a)$ and $\pi(a \mid s)$
• A model $p(s', r \mid s, a)$, or $p(s' \mid s, a)$ with $r(s, a)$ known, is learned from the real data
• In the virtual environment (the learned model), roll-out trajectory data is generated to further train the value and the policy

Annotations based on Rich Sutton's figure

Page 10:

Model-free RL vs. Model-based RL (2)

• Model-based RL
  • Can efficiently learn the model by supervised learning methods
  • Can reason about model uncertainty
  • Empirically needs fewer interactions with the real environment
  • First learns a model, then constructs a value function: two sources of approximation error
• Model-free RL
  • No need to learn the model
  • Learns the value function (and/or policy) from real experience (on-policy or off-policy), which is costly or even dangerous to collect and, for off-policy methods, may be biased or out-of-date

Page 11:

Q-Planning

• Random-sample one-step tabular Q-planning
  • First, learn a model $p(s', r \mid s, a)$ from experience data
  • Then perform one-step sampling from the model to learn the Q function (a minimal sketch follows)
• Here model learning and reinforcement learning are separate
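To make the procedure concrete, here is a minimal sketch of random-sample one-step tabular Q-planning in Python. The tabular model interface (`model[(s, a)] -> (s_next, r)`), the action set, and the hyper-parameters are illustrative assumptions, not part of the original slides.

```python
import random
from collections import defaultdict

def q_planning(model, actions, num_updates=10000, alpha=0.1, gamma=0.99):
    """Random-sample one-step tabular Q-planning on a learned model."""
    Q = defaultdict(float)                    # Q[(s, a)] -> estimated value
    observed_sa = list(model.keys())          # (s, a) pairs covered by the model
    for _ in range(num_updates):
        s, a = random.choice(observed_sa)     # 1. sample a previously observed (s, a)
        s_next, r = model[(s, a)]             # 2. query the learned model
        # 3. one-step Q-learning update on the simulated transition
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```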

Page 12:

Dyna

• The agent interacts with the real environment to collect real trajectory data $\{(s, a, r, s')\}$
• A model $p(s', r \mid s, a)$, or $p(s' \mid s, a)$ with $r(s, a)$ known, is learned from the real data
• Roll-out trajectory data generated in the virtual environment (the model), together with the real data, is used to learn $Q(s, a)$ and $\pi(a \mid s)$

Annotations based on Rich Sutton's figure

Page 13:

Dyna-Q

Sutton, Richard S. "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming." Machine Learning Proceedings 1990. Morgan Kaufmann, 1990. 216-224.
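For reference, a minimal sketch of tabular Dyna-Q, assuming a generic `env` with `reset()` and `step(a)` returning `(s_next, r, done)` and a finite action list; `n` is the number of planning updates per real step. The interface and hyper-parameters are illustrative, not taken from the slides.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=100, n=10, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}                                    # (s, a) -> (r, s_next), deterministic tabular model
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            # (a) direct RL update from the real transition
            target = r + gamma * max(Q[(s_next, b)] for b in actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (b) model learning from the real transition
            model[(s, a)] = (r, s_next)
            # (c) planning: n simulated one-step updates using the model
            for _ in range(n):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                p_target = pr + gamma * max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (p_target - Q[(ps, pa)])
            s = s_next
    return Q
```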

Page 14:

Dyna-Q on a Simple Maze

Page 15:

Key Questions of Model-based RL

• Does the model really help improve the data efficiency?
• Inevitably, the model is to some extent inaccurate. When should we trust the model?
• How can we properly leverage the model to better train our policy?

Page 16:

Content

1. Introduction to MBRL from Dyna

2. Shooting methods: RS, PETS, POPLIN

3. Theoretic bounds and methods: SLBO, MBPO & BMPO

4. Backpropagation through paths: SVG and MAAC

Page 17:

Shooting Methods

The model can also be used to help decision making when interacting with the environment. For the current state $s_0$:
• Given an action sequence of length T: $[a_0, a_1, a_2, \ldots, a_T]$
• we can sample a trajectory from the model: $[s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T]$
• and then choose the action sequence with the highest estimated return:

$Q(s, a) = \sum_{t=0}^{T} \gamma^t r_t, \qquad \pi(s) = \arg\max_a Q(s, a)$

Page 18:

Random Shooting (RS)

• The action sequences are randomly sampled
• Pros:
  • implementation simplicity
  • lower computational burden (no gradients)
  • no requirement to specify the task horizon in advance
• Cons: high variance; may fail to sample the high-reward actions
• A refinement: the Cross-Entropy Method (CEM)
  • CEM samples actions from a distribution iteratively refitted to the previous action samples that yield high reward (a minimal sketch of both planners follows)

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.

[Figure: performance curves showing the 5%, median, and 95% statistics]
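A minimal sketch of random shooting and its CEM refinement over a learned one-step model. The `model(s, a) -> (s_next, r)` callable, the action bounds in [-1, 1], and the Gaussian search distribution are illustrative assumptions.

```python
import numpy as np

def rollout_return(model, s0, actions, gamma=0.99):
    """Estimated discounted return of one action sequence under the learned model."""
    s, ret = s0, 0.0
    for t, a in enumerate(actions):
        s, r = model(s, a)
        ret += (gamma ** t) * r
    return ret

def random_shooting(model, s0, act_dim, horizon=20, n_samples=500):
    seqs = np.random.uniform(-1.0, 1.0, size=(n_samples, horizon, act_dim))
    returns = [rollout_return(model, s0, seq) for seq in seqs]
    return seqs[int(np.argmax(returns))][0]            # first action of the best sequence

def cem_shooting(model, s0, act_dim, horizon=20, n_samples=500, n_elite=50, iters=5):
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        seqs = np.clip(mean + std * np.random.randn(n_samples, horizon, act_dim),
                       -1.0, 1.0)
        returns = np.array([rollout_return(model, s0, seq) for seq in seqs])
        elites = seqs[np.argsort(returns)[-n_elite:]]  # refit to the high-reward samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]                                     # first action of the refined mean
```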

Page 19:

PETS: Probabilistic Ensembles with Trajectory Sampling

• The probabilistic ensemble (PE) dynamics model is illustrated as an ensemble of two bootstraps
• Bootstrap disagreement far from the data captures epistemic uncertainty: our subjective uncertainty due to a lack of data (model variance)
• Each member is a probabilistic (Gaussian) neural network that captures aleatoric uncertainty (stochastic environment); a minimal sketch of one member follows

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.
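A minimal sketch (PyTorch, with assumed layer sizes) of one probabilistic ensemble member: a Gaussian network with diagonal covariance trained by negative log-likelihood. An ensemble is several such members trained on bootstrap resamples of the data, whose disagreement reflects epistemic uncertainty.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    def __init__(self, state_dim, act_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * state_dim))          # outputs mean and log-variance

    def forward(self, s, a):
        mu, log_var = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return mu, log_var

def nll_loss(model, s, a, s_next):
    # Gaussian NLL with diagonal covariance; captures aleatoric uncertainty.
    mu, log_var = model(s, a)
    inv_var = torch.exp(-log_var)
    return (((mu - s_next) ** 2) * inv_var + log_var).sum(dim=-1).mean()

# Hypothetical dimensions: an ensemble of 5 independently initialized members,
# each fit on its own bootstrap resample of the real transitions.
ensemble = [GaussianDynamics(state_dim=11, act_dim=3) for _ in range(5)]
```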

Page 20:

PETS: Probabilistic Ensembles with Trajectory Sampling

• The trajectory sampling (TS) propagation technique uses the dynamics model to re-sample each particle (with its associated bootstrap) according to its probabilistic prediction at each point in time, up to the horizon T

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.

Page 21:

PETS: Probabilistic Ensembles with Trajectory Sampling

• Planning: at each time step, the MPC algorithm computes an optimal action sequence by sampling multiple sequences, applies the first action of the best sequence, and repeats until the task horizon is reached (a receding-horizon sketch follows)

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.
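A minimal receding-horizon control loop in the same spirit: plan with the CEM sketch above over the learned model, execute only the first action, then replan. The per-step random choice of an ensemble member is only a rough stand-in for the TS propagation scheme, and each member is assumed here to be a callable returning a sampled next state and reward.

```python
import random

def ensemble_step_fn(members):
    """Wrap an ensemble into a one-step model; each member is assumed to be a
    callable (s, a) -> (s_next, r), e.g. sampling from the Gaussian model above
    plus a learned reward head."""
    def step(s, a):
        member = random.choice(members)        # re-pick a bootstrap at every step
        return member(s, a)
    return step

def run_episode(env, members, act_dim, task_horizon=1000):
    s, total_reward = env.reset(), 0.0
    model = ensemble_step_fn(members)
    for _ in range(task_horizon):
        a = cem_shooting(model, s, act_dim)    # plan a whole sequence with CEM...
        s, r, done = env.step(a)               # ...but apply only its first action
        total_reward += r
        if done:
            break
    return total_reward
```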

Page 22:

PETS Algorithm

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.

Page 23:

POPLIN: Model-based Policy Planning

• PETS uses CEM to maintain a distribution of actions yielding high reward at each time step

• Can we better choose the action sequence?

• POPLIN: maintain a policy to sample actions given the current simulated state

Wang, Tingwu, and Jimmy Ba. "Exploring Model-based Planning with Policy Networks." arXiv preprint arXiv:1906.08649 (2019).

Page 24:

POPLIN: Model-based Policy Planning

• POPLIN: maintain a policy to sample actions given the current simulated state

Wang, Tingwu, and Jimmy Ba. "Exploring Model-based Planning with Policy Networks." arXiv preprint arXiv:1906.08649 (2019).

• Two planning spaces: the action level (POPLIN-A) and the policy parameter level (POPLIN-P)

Page 25:

POPLIN Policy Distillation Schemes

• With action sequences ranked & selected by total reward

• Behavior cloning (BC) for POPLIN-A and POPLIN-P (a minimal BC sketch follows this list)

• Generative adversarial network training (GAN) for POPLIN-P

• Setting parameter average (AVG) for POPLIN-P
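As an illustration of the BC scheme mentioned above, a minimal sketch that regresses the policy network onto the first actions of the top-ranked planned sequences. The tensor formats, the deterministic policy network, and the optimizer are illustrative assumptions.

```python
import torch.nn as nn

def bc_distill(policy, optimizer, elite_states, elite_actions, epochs=5):
    """elite_states / elite_actions: tensors of (state, planned first action)
    pairs whose action sequences achieved the highest total reward."""
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        loss = loss_fn(policy(elite_states), elite_actions)  # imitate high-reward planned actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```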

Page 26:

POPLIN Experiments

[Figure: learning curves on Cheetah, Ant, Swimmer, and Acrobot]

Page 27:

POPLIN Experiments

Page 28:

POPLIN vs. PETS

Each planned candidate action trajectory is projected by PCA into a 2D scatter plot. Red areas have high reward; blue points are the actions taken.

Page 29:

Content

1. Introduction to MBRL from Dyna

2. Shooting methods: RS, PETS, POPLIN

3. Theoretic bounds and methods: SLBO, MBPO & BMPO

4. Backpropagation through paths: SVG and MAAC

Page 30:

Any Theoretic Bound for MBRL?

• A set of practical requirements (R1)–(R3) for the theoretic analysis: the policy value in the model, minus a discrepancy term measuring how far the model is from the real environment when playing $\pi$, should lower-bound the policy value in the real environment (the example discrepancy bound given relies on a very strong assumption)

Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.

Page 31:

Meta Algorithm of MBRL

• Remarks
  1. Optimize the lower bound directly (with TRPO)
  2. Jointly optimizing M and $\pi$ encourages the algorithm to choose the most optimistic model among those that can be used to accurately estimate the value function
  3. Fixing M and optimizing only $\pi$ is just model-free RL

Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.

Page 32:

Convergence Theorem & Proof

• Proof: the key steps on the slide follow from requirements (R1) and (R2).

Page 33:

SLBO: Stochastic Lower Bound Optimization

• Model optimization loss (sg: stop gradient; a sketch of a multi-step stop-gradient loss follows)
• Model & policy optimization target

Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.
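A minimal sketch of one plausible multi-step model loss with stop-gradient (the sg(·) above), assuming the model predicts the next state directly; the exact SLBO loss (prediction of state differences, normalization, norm type) differs in its details.

```python
import torch

def multi_step_model_loss(model, states, actions, h=2):
    """states: (T+1, state_dim), actions: (T, act_dim) from one real trajectory."""
    loss, T = 0.0, actions.shape[0]
    for t in range(T - h + 1):
        s_hat = states[t]
        for i in range(h):
            # stop the gradient on the fed-back prediction, as in the sg(.) notation
            s_hat = model(s_hat.detach(), actions[t + i])
            loss = loss + torch.norm(s_hat - states[t + i + 1], dim=-1)
    return loss / (h * (T - h + 1))
```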

Page 34:

SLBO Experiments

Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.

All environments use a maximum horizon of 500 steps

Page 35:

Problems of SLBO

• All environments have a restricted 500-step horizon
• SLBO uses the model to roll out whole trajectories from the start state
• In practice, we may not be able to roll out such long horizons due to the compounding error of the model, particularly when the model is inaccurate
• We need to know whether, and to what extent, the model is inaccurate, and decide how to adaptively perform roll-outs

Page 36:

Bound based on Model & Policy Error

• The discrepancy between the policy value in the real environment and in the environment model grows linearly with the model error $\epsilon_m$ and the policy shift error $\epsilon_\pi$:

$\epsilon_m = \max_t \mathbb{E}_{s \sim \pi_{D,t}} \left[ D_{TV}\left( p(s', r \mid s, a) \,\|\, p_\theta(s', r \mid s, a) \right) \right]$ (model error measured on data sampled by the old policy $\pi_D$)

$\epsilon_\pi = \max_s D_{TV}(\pi \,\|\, \pi_D)$

Janner, Michael, et al. "When to Trust Your Model: Model-Based Policy Optimization." NIPS 2019.

Page 37:

Bound based on Model & Policy Error

• Branched rollout: begin a rollout from a state under the previous policy's state distribution $d_{\pi_D}(s)$ and run k steps according to $\pi$ under the learned model $p_\theta$
• Dyna can be viewed as a special case of branched rollout with k = 1
• This lower bound is maximized when k = 0, corresponding to not using the model at all

Janner, Michael, et al. "When to Trust Your Model: Model-Based Policy Optimization." NIPS 2019.

Page 38:

Bound based on Model & Policy Error

• With a new (first-order) approximation of the model error under the current policy:

$\epsilon_{m'} = \max_t \mathbb{E}_{s \sim \pi_t} \left[ D_{TV}\left( p(s', r \mid s, a) \,\|\, p_\theta(s', r \mid s, a) \right) \right]$

• the bound is rewritten such that the optimal k > 0 if the model error grows sufficiently slowly with the policy shift

Janner, Michael, et al. "When to Trust Your Model: Model-Based Policy Optimization." NIPS 2019.

Page 39:

Empirical Analysis of the Model Error w.r.t. Policy Shift

• Empirically, the optimal k > 0 when the model error grows sufficiently slowly with the policy shift.

Page 40:

MBPO Algorithm

• Remarks
  • Branch out from states on the real trajectories (instead of from $s_0$)
  • The branched rollout length k depends on the model & policy error
  • Soft actor-critic is used to update the policy
  • (A minimal branched-rollout sketch follows)
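A minimal sketch of the branched rollout described above, with assumed buffer, model, and policy interfaces; in MBPO the generated transitions then feed soft actor-critic updates.

```python
def branched_rollout(model, policy, env_buffer, model_buffer, k=5, n_starts=400):
    """Generate k-step model rollouts branched from real states (MBPO-style)."""
    for s in env_buffer.sample_states(n_starts):   # branch from real trajectories
        for _ in range(k):                         # k set by the model & policy error
            a = policy(s)
            s_next, r, done = model.step(s, a)     # learned model, not the real env
            model_buffer.add(s, a, r, s_next, done)
            if done:
                break
            s = s_next
# The policy is then trained off-policy (soft actor-critic in the paper) on
# batches sampled from model_buffer, optionally mixed with real data.
```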

Page 41:

MBPO Experiments (1000-step horizon)

Page 42:

One Step Further based on MBPO

• Idea: bidirectional models for branched rollouts to reduce compounding error
• Paper: Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
• https://arxiv.org/abs/2007.01995

Page 43:

Bidirectional Model

• Key finding: when generating trajectories of the same length, the compounding error of the bidirectional model is less than that of the forward model.

[Figure: forward model rollout vs. bidirectional model rollout]

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 44:

Dynamics Model Learning

• An ensemble of bootstrapped probabilistic networks is used to parameterize both the forward model and the backward model.
• Each probabilistic neural network outputs a Gaussian distribution with diagonal covariance, parameterized by its mean and covariance, and is trained via maximum likelihood over the N real transitions.

[Figure: forward model (s, a) → s'; backward model (a, s') → s]

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 45:

Backward Policy

• In the forward model rollout, actions are selected by the current policy $\pi$.
• To sample trajectories backwards, we need to learn a backward policy $\tilde{\pi}$ that takes actions given the next state.
• The backward policy can be trained by either maximum likelihood estimation:

$L_{MLE}(\phi') = -\sum_{t=0}^{N} \log \tilde{\pi}_{\phi'}(a_t \mid s_{t+1})$

• or by a conditional GAN (GAIL) objective:

$\min_{\tilde{\pi}} \max_D V(D, \tilde{\pi}) = \mathbb{E}_{(a, s') \sim \pi}\left[\log D(a, s')\right] + \mathbb{E}_{s' \sim \pi}\left[\log\left(1 - D(\tilde{\pi}(\cdot \mid s'), s')\right)\right]$

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 46:

Other Components of BMPO

• State sampling strategy: instead of randomly choosing states from the environment replay buffer, we sample high-value states to begin rollouts, according to a Boltzmann distribution (a minimal sampling sketch follows this slide):

$p(s) \propto e^{\beta V(s)}$

• Thus, the agent can learn to reach high-value states through backward rollouts, and also learn to act better after those states through forward rollouts.
• Incorporating MPC to refine the action taken in the real environment:
  • At each time step, candidate action sequences are generated from the current policy and the corresponding trajectories are simulated by the learned model.
  • Then the first action of the sequence that yields the highest accumulated reward is selected:

$a_t = \arg\max_{a_t^{1:N} \sim \pi} \left[ \sum_{t'=t}^{t+H-1} \gamma^{t'-t} r(s_{t'}, a_{t'}) + \gamma^H V(s_{t+H}) \right]$

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
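A minimal sketch of the Boltzmann start-state sampling $p(s) \propto e^{\beta V(s)}$ described above; the buffer and value-function interfaces are illustrative assumptions.

```python
import numpy as np

def sample_start_states(buffer_states, value_fn, n, beta=0.1):
    """Sample n rollout start states with probability proportional to exp(beta * V(s))."""
    values = np.array([value_fn(s) for s in buffer_states])
    logits = beta * (values - values.max())        # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = np.random.choice(len(buffer_states), size=n, p=probs)
    return [buffer_states[i] for i in idx]
# Backward rollouts from these states teach the agent how to reach them;
# forward rollouts teach it how to act well after reaching them.
```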

Page 47:

Overall Algorithm of BMPO

1. When interacting with the environment, the agent uses the model to perform MPC-based action selection

2. Data is stored in Denv (in the figure, state values increase from light red to dark red)

3. High-value states are then sampled from Denv to perform bidirectional model rollouts, which are stored in Dmodel

4. A model-free method (soft actor-critic) is used to train the policy based on the data from Dmodel

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 48:

Theoretical Analysis

• Return lower bounds are derived for both the forward model and the bidirectional model (equations shown on the slide).

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 49:

Comparison with the State of the Art

Compared with state-of-the-art baselines, BMPO (blue) learns faster and achieves better asymptotic performance than previous model-based algorithms that use only a forward model.

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 50:

Model Compounding Error

$\mathrm{Error}_{\mathrm{for}} = \frac{1}{2h} \sum_{i=1}^{2h} \| \hat{s}_i - s_i \|_2^2$

$\mathrm{Error}_{\mathrm{bi}} = \frac{1}{2h} \sum_{i=1}^{h} \left( \| \hat{s}_{h+i} - s_{h+i} \|_2^2 + \| \hat{s}_{h-i} - s_{h-i} \|_2^2 \right)$

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 51:

Content

1. Introduction to MBRL from Dyna

2. Shooting methods: RS, PETS, POPLIN

3. Theoretic bounds and methods: SLBO, MBPO & BMPO

4. Backpropagation through paths: SVG and MAAC

Page 52:

Deterministic Policy Gradient (Review)

• A critic module for state-action value estimation:

$Q_w(s, a) \simeq Q^\pi(s, a), \qquad L(w) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[ \left( Q_w(s, a) - Q^\pi(s, a) \right)^2 \right]$

• With the differentiable critic, the deterministic continuous-action actor can be updated by the deterministic policy gradient theorem (on-policy, an application of the chain rule; a minimal update sketch follows):

$J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi}\left[ Q^\pi(s, a) \right]$

$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi}\left[ \nabla_\theta \pi_\theta(s) \, \nabla_a Q^\pi(s, a) \big|_{a = \pi_\theta(s)} \right]$

D. Silver et al. Deterministic Policy Gradient Algorithms. ICML 2014.
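A minimal sketch of the actor and critic updates implied by the deterministic policy gradient above: the critic's value is backpropagated through the action into the policy parameters. The network classes and optimizers (PyTorch modules) are assumed.

```python
def dpg_actor_update(actor, critic, actor_opt, states):
    actions = actor(states)                          # a = pi_theta(s), deterministic
    actor_loss = -critic(states, actions).mean()     # ascend Q_w via the chain rule
    actor_opt.zero_grad()
    actor_loss.backward()                            # gradient flows through a into theta
    actor_opt.step()

def critic_update(critic, critic_opt, states, actions, q_targets):
    critic_loss = ((critic(states, actions) - q_targets) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
```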

Page 53:

From Deterministic to Stochastic

• For a deterministic environment, i.e., deterministic reward and transition functions, and a deterministic policy, i.e., a function mapping state to action (the corresponding value recursion and its gradients are shown on the slide)

Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.

Page 54:

From Deterministic to Stochastic

• Math: reparameterization of distributions
  • Conditional Gaussian distribution
  • Consider conditional densities whose samples are generated by a deterministic function of an input noise variable, so that the derivative with respect to the parameters is well defined (a minimal sketch follows)

Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.
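A minimal sketch of the reparameterization: a conditional Gaussian sample written as a deterministic function of its parameters and an independent noise variable, so that gradients can flow through the sample.

```python
import torch

def reparameterized_gaussian(mu, log_std):
    eps = torch.randn_like(mu)             # noise drawn independently of the parameters
    return mu + torch.exp(log_std) * eps   # a = mu(s) + sigma(s) * eps, differentiable in mu, sigma

# Example: gradients propagate through the sample back to mu.
mu = torch.zeros(3, requires_grad=True)
sample = reparameterized_gaussian(mu, torch.zeros(3))
sample.sum().backward()                    # mu.grad is now a vector of ones
```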

Page 55:

From Deterministic to Stochastic

• Deterministic version shown for reference
• Stochastic environment, policy, value and gradients

Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.

Page 56:

SVG Algorithm and Experiments

Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.

Page 57:

Model-Augmented Actor Critic

• The objective function can be directly built as a stochastic computation graph, with both stochastic and deterministic variables
• The rollout horizon H makes a tradeoff between the accuracy of the model and of the learned Q function (a minimal sketch of the objective follows)

Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.
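A minimal sketch of such a pathwise objective: an H-step return unrolled through a differentiable model plus a terminal critic value, maximized by backpropagation through the whole path. The `rsample` interfaces and the exact form are assumptions that simplify the actual MAAC objective.

```python
def pathwise_policy_loss(model, policy, critic, s0, H=5, gamma=0.99):
    """H trades off model error (long rollouts) against critic error (short ones)."""
    s, ret = s0, 0.0
    for t in range(H):
        a = policy.rsample(s)              # reparameterized, differentiable action
        s, r = model.rsample(s, a)         # differentiable model step and reward
        ret = ret + (gamma ** t) * r
    ret = ret + (gamma ** H) * critic(s, policy.rsample(s))   # tail value from the critic
    return -ret.mean()                     # minimize the negative objective
```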

Page 58:

Theoretic Bounds

• A large H, i.e., a long rollout, brings a large gradient error caused by the model error

Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.

Page 59:

Theoretic Bounds

• One-step gradient approximation error, split into a dynamics error term and a value function error term
• Policy value lower bound

Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.

Page 60:

MAAC Algorithm

• Model learning: an ensemble of probabilistic models

• Policy Optimization

• Q-function Learning

Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.

Page 61:

Experiments


Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.

Page 62:

References of Backpropagation through Paths

• PILCO: Deisenroth, Marc, and Carl E. Rasmussen. "PILCO: A model-based and data-efficient approach to policy search." ICML 2011.
• SVG: Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.
• MAAC: Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.

Page 63:

Summary of Model-based RL

Model as a Blackbox
• Sample experience data and train the policy in the manner of model-free RL
• Seamless to policy training algorithms
• Easy for rollout planning
• The simulation data efficiency may still be low
• E.g., Dyna-Q, MPC, MBPO

Model as a Whitebox
• The environment model, including transition dynamics and reward function, is a differentiable stochastic mapping
• Offers both data and gradient guidance for value and policy
• High data efficiency
• E.g., MAAC, SVG, PILCO

Page 64:

Future Directions of MBRL

• Environment model learning
  • Learning objectives
  • Environment imitation
  • Complex simulation
  • MBRL with the true model in simulation
• Better understanding of bounds for model usage
  • When to have tighter bounds
  • Rollout length and when to roll out
• Multi-agent MBRL

Page 65:

Thank You! Questions?

Weinan Zhang
Associate Professor
APEX Data & Knowledge Management Lab

John Hopcroft Center for Computer Science

Shanghai Jiao Tong University

http://wnzhang.net