
Page 1:

Reinforcement Learning China Summer School

Model-based Reinforcement Learning

Prof. Weinan Zhang

John Hopcroft Center, Shanghai Jiao Tong University

July 30, 2020

Page 2:

Overall Pathway of DRL

• Deep reinforcement learning has achieved appealing successes: Atari, AlphaGo, DOTA 2, AlphaStar

Slide from Sergey Levine. http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-1.pdf

Page 3:

Overall Pathway of DRL

• Deep reinforcement learning has achieved appealing successes: Atari, AlphaGo, DOTA 2, AlphaStar
• But DRL has very low data efficiency: trial-and-error learning for deep networks
• A recent popular direction is model-based RL
  • Build a model $p(s', r \mid s, a)$
  • Train the policy based on the model
  • So that data efficiency can be improved

Page 4:

Interaction between Agent and Environment

• Real environment: state dynamics $p(s' \mid s, a)$ and reward function $r(s, a)$, i.e., $p(s', r \mid s, a)$
• The policy $\pi(a \mid s)$ interacts with the real environment and collects experience data $\{(s, a, r, s')\}$

Page 5:

Interaction between Agent and Env. Model

• Real environment: state dynamics and reward function $p(s', r \mid s, a)$
• Environment model: learned state dynamics $p(s' \mid s, a)$ and reward function $r(s, a)$
• The policy $\pi(a \mid s)$ interacts with the environment model and collects simulated data $\{(s, a, r, s')\}$

Page 6:

Model-free RL vs. Model-based RL (1)

• Model-based RL
  • On-policy learning once the model is learned
  • May not need further real interaction data once the model is learned (batch RL)
  • Typically shows higher sample efficiency than model-free RL
  • Suffers from model compounding error
• Model-free RL
  • The best asymptotic performance
  • Highly suitable for DL architectures with big data
  • Off-policy methods still show instabilities
  • Very low sample efficiency & requires a huge amount of training data

Page 7:

Model-based RL: Blackbox and Whitebox

Model as a Blackbox
• Seamless to policy training algorithms
• The simulation data efficiency may still be low
• E.g., Dyna-Q, MPC, MBPO
• The model only provides sampled transitions $(s, a, r, s')$

Model as a Whitebox
• Offers both data and gradient guidance for value and policy
• High data efficiency
• E.g., MAAC, SVG, PILCO
• The model also provides gradients, e.g., $\frac{\partial V(s')}{\partial s'} \frac{\partial s'}{\partial a}$ with $s' \sim f_\phi(s, a)$

Page 8:

Content

1. Introduction to MBRL from Dyna

2. Shooting methods: RS, PETS, POPLIN

3. Theoretic bounds and methods: SLBO, MBPO & BMPO

4. Backpropagation through paths: SVG and MAAC

Page 9:

Model-based RL

• In the real environment, the policy $\pi(a \mid s)$ collects real trajectory data $\{(s, a, r, s')\}$ used to learn $Q(s, a)$ and $\pi(a \mid s)$
• A model $p(s', r \mid s, a)$, or $p(s' \mid s, a)$ with $r(s, a)$ known, is learned from the real data
• In the virtual environment (the learned model), roll-out trajectory data is generated to further train the value and the policy

Annotations based on Rich Sutton's figure

Page 10:

Model-free RL vs. Model-based RL (2)

• Model-based RL
  • Can efficiently learn the model by supervised learning methods
  • Can reason about model uncertainty
  • Empirically needs fewer interactions with the real environment
  • First learns a model, then constructs a value function: two sources of approximation error
• Model-free RL
  • No need to learn the model
  • Learns the value function (and/or policy) from real experience (on-policy or off-policy), which is costly or even dangerous to collect and, for off-policy methods, may be biased or out-of-date

Page 11:

Q-Planning

• Random-sample one-step tabular Q-planning
  • First, learn a model $p(s', r \mid s, a)$ from experience data
  • Then perform one-step sampling from the model to learn the Q function (a minimal sketch follows)
• Here model learning and reinforcement learning are separate
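To make the procedure concrete, here is a minimal sketch of random-sample one-step tabular Q-planning in Python. The tabular model interface (`model[(s, a)] -> (s_next, r)`), the action set, and the hyper-parameters are illustrative assumptions, not part of the original slides.

```python
import random
from collections import defaultdict

def q_planning(model, actions, num_updates=10000, alpha=0.1, gamma=0.99):
    """Random-sample one-step tabular Q-planning on a learned model."""
    Q = defaultdict(float)                    # Q[(s, a)] -> estimated value
    observed_sa = list(model.keys())          # (s, a) pairs covered by the model
    for _ in range(num_updates):
        s, a = random.choice(observed_sa)     # 1. sample a previously observed (s, a)
        s_next, r = model[(s, a)]             # 2. query the learned model
        # 3. one-step Q-learning update on the simulated transition
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```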

Page 12:

Dyna

• The agent interacts with the real environment to collect real trajectory data $\{(s, a, r, s')\}$
• A model $p(s', r \mid s, a)$, or $p(s' \mid s, a)$ with $r(s, a)$ known, is learned from the real data
• Roll-out trajectory data generated in the virtual environment (the model), together with the real data, is used to learn $Q(s, a)$ and $\pi(a \mid s)$

Annotations based on Rich Sutton's figure

Page 13:

Dyna-Q

Sutton, Richard S. "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming." Machine Learning Proceedings 1990. Morgan Kaufmann, 1990. 216-224.
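For reference, a minimal sketch of tabular Dyna-Q, assuming a generic `env` with `reset()` and `step(a)` returning `(s_next, r, done)` and a finite action list; `n` is the number of planning updates per real step. The interface and hyper-parameters are illustrative, not taken from the slides.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=100, n=10, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}                                    # (s, a) -> (r, s_next), deterministic tabular model
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            # (a) direct RL update from the real transition
            target = r + gamma * max(Q[(s_next, b)] for b in actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (b) model learning from the real transition
            model[(s, a)] = (r, s_next)
            # (c) planning: n simulated one-step updates using the model
            for _ in range(n):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                p_target = pr + gamma * max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (p_target - Q[(ps, pa)])
            s = s_next
    return Q
```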

Page 14:

Dyna-Q on a Simple Maze

Page 15:

Key Questions of Model-based RL

• Does the model really help improve the data efficiency?
• Inevitably, the model is to some extent inaccurate. When should we trust the model?
• How can we properly leverage the model to better train our policy?

Page 16:

Content

1. Introduction to MBRL from Dyna

2. Shooting methods: RS, PETS, POPLIN

3. Theoretic bounds and methods: SLBO, MBPO & BMPO

4. Backpropagation through paths: SVG and MAAC

Page 17:

Shooting Methods

The model can also be used to help decision making when interacting with the environment. For the current state $s_0$:
• Given an action sequence of length T: $[a_0, a_1, a_2, \ldots, a_T]$
• we can sample a trajectory from the model: $[s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T]$
• and then choose the action sequence with the highest estimated return:

$Q(s, a) = \sum_{t=0}^{T} \gamma^t r_t, \qquad \pi(s) = \arg\max_a Q(s, a)$

Page 18:

Random Shooting (RS)

• The action sequences are randomly sampled
• Pros:
  • implementation simplicity
  • lower computational burden (no gradients)
  • no requirement to specify the task horizon in advance
• Cons: high variance; may fail to sample the high-reward actions
• A refinement: the Cross-Entropy Method (CEM)
  • CEM samples actions from a distribution iteratively refitted to the previous action samples that yield high reward (a minimal sketch of both planners follows)

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.

[Figure: performance curves showing the 5%, median, and 95% statistics]
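A minimal sketch of random shooting and its CEM refinement over a learned one-step model. The `model(s, a) -> (s_next, r)` callable, the action bounds in [-1, 1], and the Gaussian search distribution are illustrative assumptions.

```python
import numpy as np

def rollout_return(model, s0, actions, gamma=0.99):
    """Estimated discounted return of one action sequence under the learned model."""
    s, ret = s0, 0.0
    for t, a in enumerate(actions):
        s, r = model(s, a)
        ret += (gamma ** t) * r
    return ret

def random_shooting(model, s0, act_dim, horizon=20, n_samples=500):
    seqs = np.random.uniform(-1.0, 1.0, size=(n_samples, horizon, act_dim))
    returns = [rollout_return(model, s0, seq) for seq in seqs]
    return seqs[int(np.argmax(returns))][0]            # first action of the best sequence

def cem_shooting(model, s0, act_dim, horizon=20, n_samples=500, n_elite=50, iters=5):
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        seqs = np.clip(mean + std * np.random.randn(n_samples, horizon, act_dim),
                       -1.0, 1.0)
        returns = np.array([rollout_return(model, s0, seq) for seq in seqs])
        elites = seqs[np.argsort(returns)[-n_elite:]]  # refit to the high-reward samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]                                     # first action of the refined mean
```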

Page 19:

PETS: Probabilistic Ensembles with Trajectory Sampling

• The probabilistic ensemble (PE) dynamics model is illustrated as an ensemble of two bootstraps
• Bootstrap disagreement far from the data captures epistemic uncertainty: our subjective uncertainty due to a lack of data (model variance)
• Each member is a probabilistic (Gaussian) neural network that captures aleatoric uncertainty (stochastic environment); a minimal sketch of one member follows

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.
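A minimal sketch (PyTorch, with assumed layer sizes) of one probabilistic ensemble member: a Gaussian network with diagonal covariance trained by negative log-likelihood. An ensemble is several such members trained on bootstrap resamples of the data, whose disagreement reflects epistemic uncertainty.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    def __init__(self, state_dim, act_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * state_dim))          # outputs mean and log-variance

    def forward(self, s, a):
        mu, log_var = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return mu, log_var

def nll_loss(model, s, a, s_next):
    # Gaussian NLL with diagonal covariance; captures aleatoric uncertainty.
    mu, log_var = model(s, a)
    inv_var = torch.exp(-log_var)
    return (((mu - s_next) ** 2) * inv_var + log_var).sum(dim=-1).mean()

# Hypothetical dimensions: an ensemble of 5 independently initialized members,
# each fit on its own bootstrap resample of the real transitions.
ensemble = [GaussianDynamics(state_dim=11, act_dim=3) for _ in range(5)]
```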

Page 20:

PETS: Probabilistic Ensembles with Trajectory Sampling

• The trajectory sampling (TS) propagation technique uses the dynamics model to re-sample each particle (with its associated bootstrap) according to its probabilistic prediction at each point in time, up to the horizon T

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.

Page 21:

PETS: Probabilistic Ensembles with Trajectory Sampling

• Planning: at each time step, the MPC algorithm computes an optimal action sequence by sampling multiple sequences, applies the first action of the best sequence, and repeats until the task horizon is reached (a receding-horizon sketch follows)

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.
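A minimal receding-horizon control loop in the same spirit: plan with the CEM sketch above over the learned model, execute only the first action, then replan. The per-step random choice of an ensemble member is only a rough stand-in for the TS propagation scheme, and each member is assumed here to be a callable returning a sampled next state and reward.

```python
import random

def ensemble_step_fn(members):
    """Wrap an ensemble into a one-step model; each member is assumed to be a
    callable (s, a) -> (s_next, r), e.g. sampling from the Gaussian model above
    plus a learned reward head."""
    def step(s, a):
        member = random.choice(members)        # re-pick a bootstrap at every step
        return member(s, a)
    return step

def run_episode(env, members, act_dim, task_horizon=1000):
    s, total_reward = env.reset(), 0.0
    model = ensemble_step_fn(members)
    for _ in range(task_horizon):
        a = cem_shooting(model, s, act_dim)    # plan a whole sequence with CEM...
        s, r, done = env.step(a)               # ...but apply only its first action
        total_reward += r
        if done:
            break
    return total_reward
```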

Page 22:

PETS Algorithm

Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." NIPS 2018.

Page 23:

POPLIN: Model-based Policy Planning

• PETS uses CEM to maintain a distribution of actions yielding high reward at each time step

• Can we better choose the action sequence?

• POPLIN: maintain a policy to sample actions given the current simulated state

Wang, Tingwu, and Jimmy Ba. "Exploring Model-based Planning with Policy Networks." arXiv preprint arXiv:1906.08649 (2019).

Page 24:

POPLIN: Model-based Policy Planning

• POPLIN: maintain a policy to sample actions given the current simulated state

Wang, Tingwu, and Jimmy Ba. "Exploring Model-based Planning with Policy Networks." arXiv preprint arXiv:1906.08649 (2019).

• Two planning spaces: the action level (POPLIN-A) and the policy parameter level (POPLIN-P)

Page 25:

POPLIN Policy Distillation Schemes

• With action sequences ranked & selected by total reward

• Behavior cloning (BC) for POPLIN-A and POPLIN-P (a minimal BC sketch follows this list)

• Generative adversarial network training (GAN) for POPLIN-P

• Setting parameter average (AVG) for POPLIN-P
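As an illustration of the BC scheme mentioned above, a minimal sketch that regresses the policy network onto the first actions of the top-ranked planned sequences. The tensor formats, the deterministic policy network, and the optimizer are illustrative assumptions.

```python
import torch.nn as nn

def bc_distill(policy, optimizer, elite_states, elite_actions, epochs=5):
    """elite_states / elite_actions: tensors of (state, planned first action)
    pairs whose action sequences achieved the highest total reward."""
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        loss = loss_fn(policy(elite_states), elite_actions)  # imitate high-reward planned actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```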

Page 26:

POPLIN Experiments

[Figure: learning curves on Cheetah, Ant, Swimmer, and Acrobot]

Page 27:

POPLIN Experiments

Page 28:

POPLIN vs. PETS

Each planned candidate action trajectory is projected by PCA into a 2D scatter plot. Red areas have high reward; blue points are the actions taken.

Page 29:

Content

1. Introduction to MBRL from Dyna

2. Shooting methods: RS, PETS, POPLIN

3. Theoretic bounds and methods: SLBO, MBPO & BMPO

4. Backpropagation through paths: SVG and MAAC

Page 30:

Any Theoretic Bound for MBRL?

• A set of practical requirements (R1)–(R3) for the theoretic analysis: the policy value in the model, minus a discrepancy term measuring how far the model is from the real environment when playing $\pi$, should lower-bound the policy value in the real environment (the example discrepancy bound given relies on a very strong assumption)

Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.

Page 31:

Meta Algorithm of MBRL

• Remarks
  1. Optimize the lower bound directly (with TRPO)
  2. Jointly optimizing M and $\pi$ encourages the algorithm to choose the most optimistic model among those that can be used to accurately estimate the value function
  3. Fixing M and optimizing only $\pi$ is just model-free RL

Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.

Page 32:

Convergence Theorem & Proof

• Proof: the key steps on the slide follow from requirements (R1) and (R2).

Page 33:

SLBO: Stochastic Lower Bound Optimization

• Model optimization loss (sg: stop gradient; a sketch of a multi-step stop-gradient loss follows)
• Model & policy optimization target

Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.
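A minimal sketch of one plausible multi-step model loss with stop-gradient (the sg(·) above), assuming the model predicts the next state directly; the exact SLBO loss (prediction of state differences, normalization, norm type) differs in its details.

```python
import torch

def multi_step_model_loss(model, states, actions, h=2):
    """states: (T+1, state_dim), actions: (T, act_dim) from one real trajectory."""
    loss, T = 0.0, actions.shape[0]
    for t in range(T - h + 1):
        s_hat = states[t]
        for i in range(h):
            # stop the gradient on the fed-back prediction, as in the sg(.) notation
            s_hat = model(s_hat.detach(), actions[t + i])
            loss = loss + torch.norm(s_hat - states[t + i + 1], dim=-1)
    return loss / (h * (T - h + 1))
```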

Page 34:

SLBO Experiments

Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." ICLR 2019.

All environments use a maximum horizon of 500 steps

Page 35:

Problems of SLBO

• All environments have a restricted 500-step horizon
• SLBO uses the model to roll out whole trajectories from the start state
• In practice, we may not be able to roll out such long horizons due to the compounding error of the model, particularly when the model is inaccurate
• We need to know whether, and to what extent, the model is inaccurate, and decide how to adaptively perform roll-outs

Page 36:

Bound based on Model & Policy Error

• The discrepancy between the policy value in the real environment and in the environment model grows linearly with the model error $\epsilon_m$ and the policy shift error $\epsilon_\pi$:

$\epsilon_m = \max_t \mathbb{E}_{s \sim \pi_{D,t}} \left[ D_{TV}\left( p(s', r \mid s, a) \,\|\, p_\theta(s', r \mid s, a) \right) \right]$ (model error measured on data sampled by the old policy $\pi_D$)

$\epsilon_\pi = \max_s D_{TV}(\pi \,\|\, \pi_D)$

Janner, Michael, et al. "When to Trust Your Model: Model-Based Policy Optimization." NIPS 2019.

Page 37:

Bound based on Model & Policy Error

• Branched rollout: begin a rollout from a state under the previous policy's state distribution $d_{\pi_D}(s)$ and run k steps according to $\pi$ under the learned model $p_\theta$
• Dyna can be viewed as a special case of branched rollout with k = 1
• This lower bound is maximized when k = 0, corresponding to not using the model at all

Janner, Michael, et al. "When to Trust Your Model: Model-Based Policy Optimization." NIPS 2019.

Page 38:

Bound based on Model & Policy Error

• With a new (first-order) approximation of the model error under the current policy:

$\epsilon_{m'} = \max_t \mathbb{E}_{s \sim \pi_t} \left[ D_{TV}\left( p(s', r \mid s, a) \,\|\, p_\theta(s', r \mid s, a) \right) \right]$

• the bound is rewritten such that the optimal k > 0 if the model error grows sufficiently slowly with the policy shift

Janner, Michael, et al. "When to Trust Your Model: Model-Based Policy Optimization." NIPS 2019.

Page 39:

Empirical Analysis of the Model Error w.r.t. Policy Shift

• Empirically, the optimal k > 0 when the model error grows sufficiently slowly with the policy shift.

Page 40:

MBPO Algorithm

• Remarks
  • Branch out from states on the real trajectories (instead of from $s_0$)
  • The branched rollout length k depends on the model & policy error
  • Soft actor-critic is used to update the policy
  • (A minimal branched-rollout sketch follows)
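A minimal sketch of the branched rollout described above, with assumed buffer, model, and policy interfaces; in MBPO the generated transitions then feed soft actor-critic updates.

```python
def branched_rollout(model, policy, env_buffer, model_buffer, k=5, n_starts=400):
    """Generate k-step model rollouts branched from real states (MBPO-style)."""
    for s in env_buffer.sample_states(n_starts):   # branch from real trajectories
        for _ in range(k):                         # k set by the model & policy error
            a = policy(s)
            s_next, r, done = model.step(s, a)     # learned model, not the real env
            model_buffer.add(s, a, r, s_next, done)
            if done:
                break
            s = s_next
# The policy is then trained off-policy (soft actor-critic in the paper) on
# batches sampled from model_buffer, optionally mixed with real data.
```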

Page 41:

MBPO Experiments (1000-step horizon)

Page 42:

One Step Further based on MBPO

• Idea: bidirectional models for branched rollouts to reduce compounding error
• Paper: Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
• https://arxiv.org/abs/2007.01995

Page 43:

Bidirectional Model

• Key finding: when generating trajectories of the same length, the compounding error of the bidirectional model is less than that of the forward model.

[Figure: forward model rollout vs. bidirectional model rollout]

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 44:

Dynamics Model Learning

• An ensemble of bootstrapped probabilistic networks is used to parameterize both the forward model and the backward model.
• Each probabilistic neural network outputs a Gaussian distribution with diagonal covariance, parameterized by its mean and covariance, and is trained via maximum likelihood over the N real transitions.

[Figure: forward model (s, a) → s'; backward model (a, s') → s]

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 45:

Backward Policy

• In the forward model rollout, actions are selected by the current policy $\pi$.
• To sample trajectories backwards, we need to learn a backward policy $\tilde{\pi}$ that takes actions given the next state.
• The backward policy can be trained by either maximum likelihood estimation:

$L_{MLE}(\phi') = -\sum_{t=0}^{N} \log \tilde{\pi}_{\phi'}(a_t \mid s_{t+1})$

• or by a conditional GAN (GAIL) objective:

$\min_{\tilde{\pi}} \max_D V(D, \tilde{\pi}) = \mathbb{E}_{(a, s') \sim \pi}\left[\log D(a, s')\right] + \mathbb{E}_{s' \sim \pi}\left[\log\left(1 - D(\tilde{\pi}(\cdot \mid s'), s')\right)\right]$

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 46:

Other Components of BMPO

• State sampling strategy: instead of randomly choosing states from the environment replay buffer, we sample high-value states to begin rollouts, according to a Boltzmann distribution (a minimal sampling sketch follows this slide):

$p(s) \propto e^{\beta V(s)}$

• Thus, the agent can learn to reach high-value states through backward rollouts, and also learn to act better after those states through forward rollouts.
• Incorporating MPC to refine the action taken in the real environment:
  • At each time step, candidate action sequences are generated from the current policy and the corresponding trajectories are simulated by the learned model.
  • Then the first action of the sequence that yields the highest accumulated reward is selected:

$a_t = \arg\max_{a_t^{1:N} \sim \pi} \left[ \sum_{t'=t}^{t+H-1} \gamma^{t'-t} r(s_{t'}, a_{t'}) + \gamma^H V(s_{t+H}) \right]$

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.
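A minimal sketch of the Boltzmann start-state sampling $p(s) \propto e^{\beta V(s)}$ described above; the buffer and value-function interfaces are illustrative assumptions.

```python
import numpy as np

def sample_start_states(buffer_states, value_fn, n, beta=0.1):
    """Sample n rollout start states with probability proportional to exp(beta * V(s))."""
    values = np.array([value_fn(s) for s in buffer_states])
    logits = beta * (values - values.max())        # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = np.random.choice(len(buffer_states), size=n, p=probs)
    return [buffer_states[i] for i in idx]
# Backward rollouts from these states teach the agent how to reach them;
# forward rollouts teach it how to act well after reaching them.
```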

Page 47:

Overall Algorithm of BMPO

1. When interacting with the environment, the agent uses the model to perform MPC-based action selection

2. Data is stored in Denv (in the figure, state values increase from light red to dark red)

3. High-value states are then sampled from Denv to perform bidirectional model rollouts, which are stored in Dmodel

4. A model-free method (soft actor-critic) is used to train the policy based on the data from Dmodel

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 48:

Theoretical Analysis

• Return lower bounds are derived for both the forward model and the bidirectional model (equations shown on the slide).

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 49:

Comparison with the State of the Art

Compared with state-of-the-art baselines, BMPO (blue) learns faster and achieves better asymptotic performance than previous model-based algorithms that use only a forward model.

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 50:

Model Compounding Error

$\mathrm{Error}_{\mathrm{for}} = \frac{1}{2h} \sum_{i=1}^{2h} \| \hat{s}_i - s_i \|_2^2$

$\mathrm{Error}_{\mathrm{bi}} = \frac{1}{2h} \sum_{i=1}^{h} \left( \| \hat{s}_{h+i} - s_{h+i} \|_2^2 + \| \hat{s}_{h-i} - s_{h-i} \|_2^2 \right)$

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu. Bidirectional Model-based Policy Optimization. ICML 2020.

Page 51:

Content

1. Introduction to MBRL from Dyna

2. Shooting methods: RS, PETS, POPLIN

3. Theoretic bounds and methods: SLBO, MBPO & BMPO

4. Backpropagation through paths: SVG and MAAC

Page 52:

Deterministic Policy Gradient (Review)

• A critic module for state-action value estimation:

$Q_w(s, a) \simeq Q^\pi(s, a), \qquad L(w) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[ \left( Q_w(s, a) - Q^\pi(s, a) \right)^2 \right]$

• With the differentiable critic, the deterministic continuous-action actor can be updated by the deterministic policy gradient theorem (on-policy, an application of the chain rule; a minimal update sketch follows):

$J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi}\left[ Q^\pi(s, a) \right]$

$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi}\left[ \nabla_\theta \pi_\theta(s) \, \nabla_a Q^\pi(s, a) \big|_{a = \pi_\theta(s)} \right]$

D. Silver et al. Deterministic Policy Gradient Algorithms. ICML 2014.
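A minimal sketch of the actor and critic updates implied by the deterministic policy gradient above: the critic's value is backpropagated through the action into the policy parameters. The network classes and optimizers (PyTorch modules) are assumed.

```python
def dpg_actor_update(actor, critic, actor_opt, states):
    actions = actor(states)                          # a = pi_theta(s), deterministic
    actor_loss = -critic(states, actions).mean()     # ascend Q_w via the chain rule
    actor_opt.zero_grad()
    actor_loss.backward()                            # gradient flows through a into theta
    actor_opt.step()

def critic_update(critic, critic_opt, states, actions, q_targets):
    critic_loss = ((critic(states, actions) - q_targets) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
```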

Page 53:

From Deterministic to Stochastic

• For a deterministic environment, i.e., deterministic reward and transition functions, and a deterministic policy, i.e., a function mapping state to action (the corresponding value recursion and its gradients are shown on the slide)

Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.

Page 54:

From Deterministic to Stochastic

• Math: reparameterization of distributions
  • Conditional Gaussian distribution
  • Consider conditional densities whose samples are generated by a deterministic function of an input noise variable, so that the derivative with respect to the parameters is well defined (a minimal sketch follows)

Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.
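A minimal sketch of the reparameterization: a conditional Gaussian sample written as a deterministic function of its parameters and an independent noise variable, so that gradients can flow through the sample.

```python
import torch

def reparameterized_gaussian(mu, log_std):
    eps = torch.randn_like(mu)             # noise drawn independently of the parameters
    return mu + torch.exp(log_std) * eps   # a = mu(s) + sigma(s) * eps, differentiable in mu, sigma

# Example: gradients propagate through the sample back to mu.
mu = torch.zeros(3, requires_grad=True)
sample = reparameterized_gaussian(mu, torch.zeros(3))
sample.sum().backward()                    # mu.grad is now a vector of ones
```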

Page 55:

From Deterministic to Stochastic

• Deterministic version shown for reference
• Stochastic environment, policy, value and gradients

Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.

Page 56:

SVG Algorithm and Experiments

Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.

Page 57:

Model-Augmented Actor Critic

• The objective function can be directly built as a stochastic computation graph, with both stochastic and deterministic variables
• The rollout horizon H makes a tradeoff between the accuracy of the model and of the learned Q function (a minimal sketch of the objective follows)

Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.
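A minimal sketch of such a pathwise objective: an H-step return unrolled through a differentiable model plus a terminal critic value, maximized by backpropagation through the whole path. The `rsample` interfaces and the exact form are assumptions that simplify the actual MAAC objective.

```python
def pathwise_policy_loss(model, policy, critic, s0, H=5, gamma=0.99):
    """H trades off model error (long rollouts) against critic error (short ones)."""
    s, ret = s0, 0.0
    for t in range(H):
        a = policy.rsample(s)              # reparameterized, differentiable action
        s, r = model.rsample(s, a)         # differentiable model step and reward
        ret = ret + (gamma ** t) * r
    ret = ret + (gamma ** H) * critic(s, policy.rsample(s))   # tail value from the critic
    return -ret.mean()                     # minimize the negative objective
```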

Page 58:

Theoretic Bounds

• A large H, i.e., a long rollout, brings a large gradient error caused by the model error

Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.

Page 59:

Theoretic Bounds

• One-step gradient approximation error, split into a dynamics error term and a value function error term
• Policy value lower bound

Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.

Page 60:

MAAC Algorithm

• Model learning: an ensemble of probabilistic models

• Policy Optimization

• Q-function Learning

Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.

Page 61:

Experiments


Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.

Page 62:

References of Backpropagation through Paths

• PILCO: Deisenroth, Marc, and Carl E. Rasmussen. "PILCO: A model-based and data-efficient approach to policy search." ICML 2011.
• SVG: Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." NIPS 2015.
• MAAC: Ignasi Clavera, Yao Fu, Pieter Abbeel. Model-Augmented Actor Critic: Backpropagation through paths. ICLR 2020.

Page 63:

Summary of Model-based RL

Model as a Blackbox
• Sample experience data and train the policy in the manner of model-free RL
• Seamless to policy training algorithms
• Easy for rollout planning
• The simulation data efficiency may still be low
• E.g., Dyna-Q, MPC, MBPO

Model as a Whitebox
• The environment model, including transition dynamics and reward function, is a differentiable stochastic mapping
• Offers both data and gradient guidance for value and policy
• High data efficiency
• E.g., MAAC, SVG, PILCO

Page 64:

Future Directions of MBRL

• Environment model learning
  • Learning objectives
  • Environment imitation
  • Complex simulation
  • MBRL with the true model in simulation
• Better understanding of bounds for model usage
  • When to have tighter bounds
  • Rollout length and when to roll out
• Multi-agent MBRL

Page 65:

Thank You! Questions?

Weinan Zhang
Associate Professor
APEX Data & Knowledge Management Lab

John Hopcroft Center for Computer Science

Shanghai Jiao Tong University

http://wnzhang.net