CS885 Reinforcement Learning
Lecture 9: May 30, 2018
Model-based RL [SutBar] Chap 8
Pascal Poupart, University of Waterloo


Page 2: Outline

• Model-based RL
• Dyna
• Monte Carlo Tree Search


Page 3: Model-free RL

• No explicit transition or reward models
– Q-learning: value-based method
– Policy gradient: policy-based method
– Actor-critic: policy- and value-based method

[figure: agent-environment loop; the agent updates its policy/value function and sends actions, the environment returns states and rewards]

Page 4: Model-based RL

• Learn explicit transition and/or reward model
– Plan based on the model
– Benefit: increased sample efficiency
– Drawback: increased complexity

[figure: agent-environment loop with a model; the agent updates a model from states and rewards, plans with the model, and updates its policy/value function]

Page 5: Maze Example

[figure: 3x4 grid world with terminal rewards +1 at (4,3) and -1 at (4,2); arrows in the remaining cells show a policy]

Observed episodes:
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3) +1
(1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (3,2) → (3,3) → (4,3) +1
(1,1) → (2,1) → (3,1) → (3,2) → (4,2) -1

γ = 1; reward is -0.04 in non-terminal states

From these episodes: P((2,3)|(1,3),r) = 2/3 and P((1,2)|(1,3),r) = 1/3

We need to learn all the transition probabilities!

Use this information in value iteration:
V*(s) = max_a R(s,a) + γ Σ_{s'} Pr(s'|s,a) V*(s')
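To make the counting concrete, here is a minimal Python sketch that recovers the two estimates above from the three episodes (the hand-entered `observed` list and the name `p_hat` are illustrative, not from the slides):

from collections import defaultdict

# Transitions out of state (1,3) under action 'r' (right), read off the
# three episodes above: (2,3) is reached twice, (1,2) once.
observed = [((1, 3), 'r', (2, 3)),
            ((1, 3), 'r', (1, 2)),
            ((1, 3), 'r', (2, 3))]

n_sa = defaultdict(int)     # n(s, a)
n_sas = defaultdict(int)    # n(s, a, s')
for s, a, s2 in observed:
    n_sa[(s, a)] += 1
    n_sas[(s, a, s2)] += 1

def p_hat(s2, s, a):
    """Maximum-likelihood estimate Pr(s'|s,a) = n(s,a,s') / n(s,a)."""
    return n_sas[(s, a, s2)] / n_sa[(s, a)]

print(p_hat((2, 3), (1, 3), 'r'))  # 0.666... = 2/3
print(p_hat((1, 2), (1, 3), 'r'))  # 0.333... = 1/3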

Page 6: Model-based RL

• Idea: at each step
– Execute action
– Observe resulting state and reward
– Update transition and/or reward model
– Update policy and/or value function


Page 7: Model-based RL (with Value Iteration)

ModelBasedRL(s)
Repeat
  Select and execute a; observe s' and r
  Update counts: n(s,a) ← n(s,a) + 1, n(s,a,s') ← n(s,a,s') + 1
  Update transition: Pr(s'|s,a) ← n(s,a,s') / n(s,a)  ∀s'
  Update reward: R(s,a) ← [r + (n(s,a) - 1) R(s,a)] / n(s,a)
  Solve: V*(s) = max_a R(s,a) + γ Σ_{s'} Pr(s'|s,a) V*(s')  ∀s
  s ← s'
Until convergence of V*
Return V*
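Below is a minimal tabular Python sketch of this loop. It assumes small discrete state and action sets and a Gym-style environment whose `step(a)` returns `(s', r, done)`; these interface details are illustrative assumptions, not part of the slides.

import numpy as np

def model_based_rl(env, n_states, n_actions, gamma=0.95,
                   episodes=200, eps=0.1, vi_iters=50):
    n_sa  = np.zeros((n_states, n_actions))            # n(s, a)
    n_sas = np.zeros((n_states, n_actions, n_states))  # n(s, a, s')
    r_sum = np.zeros((n_states, n_actions))            # running reward sums
    T = np.full((n_states, n_actions, n_states), 1.0 / n_states)
    R = np.zeros((n_states, n_actions))
    V = np.zeros(n_states)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy with respect to the model-based Q-values
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(R[s] + gamma * T[s] @ V))
            s2, r, done = env.step(a)
            # update counts, then re-estimate the model
            n_sa[s, a] += 1; n_sas[s, a, s2] += 1; r_sum[s, a] += r
            visits = np.maximum(n_sa, 1)               # avoid divide-by-zero
            T = n_sas / visits[:, :, None]
            R = r_sum / visits
            # solve (approximately) with a few sweeps of value iteration
            for _ in range(vi_iters):
                V = (R + gamma * np.einsum('sax,x->sa', T, V)).max(axis=1)
            s = s2
    return V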


Page 8: Complex models

• Use function approximation for the transition and reward models
– Linear model: Pr(s'|s,a) = N(s' | W φ(s,a), Σ)
– Non-linear models:
  • Stochastic (e.g., Gaussian process): Pr(s'|s,a) = N(s' | μ(s,a), σ²(s,a))
  • Deterministic (e.g., neural network): s' = f(s,a) = NN(s,a)
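As one concrete instance of the deterministic neural-network case, here is a minimal PyTorch sketch that fits s' = f(s, a) by regression on observed transitions; the dimensions and the training setup are illustrative assumptions.

import torch
import torch.nn as nn

state_dim, action_dim = 4, 2   # illustrative dimensions

# s' = f(s, a): a small MLP mapping a (state, action) pair to the next state
f = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64),
    nn.ReLU(),
    nn.Linear(64, state_dim),
)
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

def model_update(s, a, s_next):
    """One regression step on ||f(s, a) - s'||^2 for a batch of transitions."""
    pred = f(torch.cat([s, a], dim=-1))
    loss = nn.functional.mse_loss(pred, s_next)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()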


Page 9: Partial Planning

• In complex models, fully optimizing the policy or value function at each time step is intractable

• Consider partial planning instead:
– A few steps of Q-learning
– A few steps of policy gradient


Page 10: Model-based RL (with Q-learning)

ModelBasedRL(s)
Repeat
  Select and execute a; observe s' and r
  Update transition: θ_T ← θ_T - α_T [T(s,a) - s'] ∇_{θ_T} T(s,a)
  Update reward: θ_R ← θ_R - α_R [R(s,a) - r] ∇_{θ_R} R(s,a)
  Repeat a few times:
    sample ŝ, â arbitrarily
    δ ← R(ŝ,â) + γ max_{â'} Q(T(ŝ,â), â') - Q(ŝ,â)
    Update Q: θ_Q ← θ_Q + α_Q δ ∇_{θ_Q} Q(ŝ,â)
  s ← s'
Until convergence of Q
Return Q
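For intuition, here is the inner planning loop in tabular form, where the slide's gradient updates reduce to table updates. `T_hat` and `R_hat` are count-based model estimates as in the value-iteration variant; all names are illustrative, and every (s, a) pair is assumed visited so each row of `T_hat` is a proper distribution.

import numpy as np

def planning_steps(Q, T_hat, R_hat, gamma=0.95, alpha=0.1, k=10):
    """k Q-learning updates driven by model samples instead of real steps."""
    n_states, n_actions = R_hat.shape
    for _ in range(k):
        s = np.random.randint(n_states)     # sample (s, a) arbitrarily
        a = np.random.randint(n_actions)
        s2 = np.random.choice(n_states, p=T_hat[s, a])  # simulate s' from the model
        delta = R_hat[s, a] + gamma * Q[s2].max() - Q[s, a]
        Q[s, a] += alpha * delta
    return Q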


Page 11: Partial Planning vs Replay Buffer

• The previous algorithm is very similar to model-free Q-learning with a replay buffer

• Instead of updating the Q-function with samples from a replay buffer, generate samples from the model

• Replay buffer:
– Simple, real samples, no generalization to other state-action pairs

• Partial planning with a model:
– Complex, simulated samples, generalization to other state-action pairs (can help or hurt)


Page 12: Dyna

• Learn explicit transition and/or reward model
– Plan based on the model

• Learn directly from real experience

[figure: Dyna architecture; states and rewards update both the model and the policy/value function directly, and the agent also plans with the model]

Page 13: Dyna-Q

Dyna-Q(s)
Repeat
  Select and execute a; observe s' and r
  Update transition: θ_T ← θ_T - α_T [T(s,a) - s'] ∇_{θ_T} T(s,a)
  Update reward: θ_R ← θ_R - α_R [R(s,a) - r] ∇_{θ_R} R(s,a)
  δ ← r + γ max_{a'} Q(s',a') - Q(s,a)
  Update Q: θ_Q ← θ_Q + α_Q δ ∇_{θ_Q} Q(s,a)
  Repeat a few times:
    sample ŝ, â arbitrarily
    δ ← R(ŝ,â) + γ max_{â'} Q(T(ŝ,â), â') - Q(ŝ,â)
    Update Q: θ_Q ← θ_Q + α_Q δ ∇_{θ_Q} Q(ŝ,â)
  s ← s'
Return Q
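A minimal tabular Dyna-Q sketch in the classic [SutBar] Chap 8 form; a deterministic lookup-table model stands in for the slide's gradient-trained model, and the Gym-style `env` interface is an assumption.

import random
from collections import defaultdict

def dyna_q(env, n_actions, episodes=100, alpha=0.1,
           gamma=0.95, eps=0.1, plan_steps=10):
    Q = defaultdict(float)    # Q[(s, a)]
    model = {}                # (s, a) -> (r, s', done): deterministic model

    def greedy(s):
        return max(range(n_actions), key=lambda b: Q[(s, b)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.randrange(n_actions) if random.random() < eps else greedy(s)
            s2, r, done = env.step(a)
            # direct RL: Q-learning update from the real transition
            target = r + (0.0 if done else gamma * Q[(s2, greedy(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # model learning
            model[(s, a)] = (r, s2, done)
            # planning: Q-learning updates from simulated transitions
            for _ in range(plan_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * Q[(ps2, greedy(ps2))])
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q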

Page 14: Dyna-Q

• Task: reach G from S


Page 15: Planning from current state

• Instead of planning at arbitrary states, plan from the current state
– This helps improve the next action
• Monte Carlo Tree Search


Page 16: Tree Search

[figure: expectimax search tree rooted at s1; decision nodes branch on actions a and b, chance nodes branch on sampled next states s2 through s21, each labeled with its transition probability]

Page 17: Tractable Tree Search

• Combine 3 ideas:
– Leaf nodes: approximate leaf values with the value of a default policy π:
  Q*(s,a) ≈ Q^π(s,a) ≈ (1/n(s,a)) Σ_{k=1}^{n(s,a)} G_k
– Chance nodes: approximate the expectation by sampling from the transition model:
  Q*(s,a) ≈ R(s,a) + γ (1/n(s,a)) Σ_{s' ~ Pr(s'|s,a)} V(s')
– Decision nodes: expand only the most promising actions:
  a* = argmax_a Q(s,a) + c √(2 ln n(s) / n(s,a))   and   V*(s) = Q(s,a*)
• Resulting algorithm: Monte Carlo Tree Search


Page 18: Computer Go

• Oct 2015: AlphaGo (Monte Carlo Tree Search + Deep RL)

Page 19: Monte Carlo Tree Search (with upper confidence bound)

UCT(s0)
  create root node0 with state(node0) ← s0
  while within computational budget do
    node_l ← TreePolicy(node0)
    Δ ← DefaultPolicy(state(node_l))
    Backup(node_l, Δ)
  return action(BestChild(node0, 0))

TreePolicy(node)
  while node is nonterminal do
    if node is not fully expanded do
      return Expand(node)
    else
      node ← BestChild(node, c)
  return node


Page 20: Monte Carlo Tree Search (continued)

Expand(node)
  choose a ∈ untried actions of state(node)
  add a new child node' to node with state(node') ← f(state(node), a)
  return node'

BestChild(node, c)
  return argmax_{node' ∈ children(node)} Q(node') + c √(2 ln n(node) / n(node'))

DefaultPolicy(node)
  while node is not terminal do
    sample a ~ π(a | state(node))
    s' ← f(state(node), a)          [f is a deterministic transition]
  return R(s, a)


Page 21: Monte Carlo Tree Search (continued)

Backup(node, Δ)                       [single player]
  while node is not null do
    Q(node) ← [n(node) Q(node) + Δ] / [n(node) + 1]
    n(node) ← n(node) + 1
    node ← parent(node)

BackupMinMax(node, Δ)                 [two players, adversarial]
  while node is not null do
    Q(node) ← [n(node) Q(node) + Δ] / [n(node) + 1]
    n(node) ← n(node) + 1
    Δ ← -Δ
    node ← parent(node)
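Putting the four routines together, here is a compact single-player UCT sketch in Python. It assumes a known deterministic simulator given by `actions(s)`, `f(s, a)`, `reward(s, a)` and `terminal(s)`, a uniform-random default policy, undiscounted rollouts, and a non-terminal root; all of these are illustrative choices rather than the slides' exact setup.

import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.n = 0       # visit count n(node)
        self.q = 0.0     # mean value Q(node)

def uct(s0, actions, f, reward, terminal, budget=1000, c=1.0):
    root = Node(s0)
    for _ in range(budget):
        node = tree_policy(root, actions, f, terminal, c)
        delta = default_policy(node.state, actions, f, reward, terminal)
        backup(node, delta)
    return best_child(root, 0).action           # exploit only: c = 0

def tree_policy(node, actions, f, terminal, c):
    while not terminal(node.state):
        tried = {ch.action for ch in node.children}
        untried = [a for a in actions(node.state) if a not in tried]
        if untried:                              # Expand
            a = random.choice(untried)
            child = Node(f(node.state, a), parent=node, action=a)
            node.children.append(child)
            return child
        node = best_child(node, c)               # descend via UCB
    return node

def best_child(node, c):
    return max(node.children,
               key=lambda ch: ch.q + c * math.sqrt(2 * math.log(node.n) / ch.n))

def default_policy(s, actions, f, reward, terminal):
    total = 0.0
    while not terminal(s):                       # random rollout to the end
        a = random.choice(actions(s))
        total += reward(s, a)
        s = f(s, a)
    return total

def backup(node, delta):                         # single-player backup
    while node is not None:
        node.q = (node.n * node.q + delta) / (node.n + 1)
        node.n += 1
        node = node.parent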


Page 22: AlphaGo

Four steps:
1. Supervised learning of policy networks
2. Policy gradient with policy networks
3. Value gradient with value networks
4. Searching with policy and value networks
– Monte Carlo Tree Search variant


Page 23: Search Tree

• At each edge store Q(s,a), P(a|s), N(s,a)
• where N(s,a) is the visit count of (s,a)

[figure: search tree with a sample trajectory highlighted]

Page 24: Simulation

• At each node, select the edge a* that maximizes
  a* = argmax_a Q(s,a) + u(s,a)
• where u(s,a) ∝ P(a|s) / (1 + N(s,a)) is an exploration bonus
• and Q(s,a) mixes the value network with rollout outcomes:
  Q(s,a) = (1/N(s,a)) Σ_i 1_i(s,a) [λ V_θ(s_L) + (1 - λ) z_i]
  where s_L is the leaf state reached in simulation i and
  1_i(s,a) = 1 if (s,a) was visited in simulation i, 0 otherwise
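A minimal sketch of this selection rule, with per-edge statistics kept in a dict. The constant `c_puct` is an assumption; the published AlphaGo bonus also scales with the square root of the parent's total visit count, which is included here.

import math

def select_action(edges, c_puct=1.0):
    """edges: action -> {'Q': mixed value, 'P': prior P(a|s), 'N': visit count}."""
    total = sum(e['N'] for e in edges.values())    # N(s) = sum over b of N(s, b)
    def score(a):
        e = edges[a]
        # exploration bonus u(s,a) ∝ P(a|s) / (1 + N(s,a))
        u = c_puct * e['P'] * math.sqrt(total) / (1 + e['N'])
        return e['Q'] + u
    return max(edges, key=score)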
