![Page 1: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/1.jpg)
Reinforcement Learning or, Learning and Planning with Markov Decision Processes
295 Seminar, Winter 2018
Rina Dechter
Slides will follow David Silver’s course and Sutton and Barto’s book
Goals: To learn together the basics of RL.
Some lectures and classic and recent papers from the literature
Students will be active learners and teachers
Class page
Demo
Detailed demo
295, Winter 2018
![Page 2: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/2.jpg)
Topics
1. Introduction and Markov Decision Processes: Basic concepts. S&B chapters 1, 3. (myslides 2)
2. Planning by Dynamic Programming – Policy Iteration, Value Iteration: S&B chapter 4, (myslides 3)
3. Monte-Carlo (MC) and Temporal Differences (TD): S&B chapters 5 and 6, (myslides 4, myslides 5)
4. Multi-step bootstrapping: S&B chapter 7, (myslides 4, last part, slides 6 Sutton)
5. Bandit algorithms: S&B chapter 2, (myslides 7 , sutton-based)
6. Exploration exploitation. (Slides: silver 9, Brunskill)
7. Planning and learning MCTS: S&B chapter 8, (slides Brunskill)
8. Function approximation: S&B chapters 9, 10, 11, (slides: silver 6, Sutton 9, 10, 11)
9. Policy gradient methods: S&B chapter 13, (slides: silver 7, Sutton 13)
10. Deep RL ???
![Page 3: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/3.jpg)
Resources
• Book: Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
• UCL Course on Reinforcement Learning, David Silver
• Real Life Reinforcement Learning, Emma Brunskill
• Udacity course on Reinforcement Learning: Isbell, Littman and Pryby
![Page 4: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/4.jpg)
![Page 5: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/5.jpg)
Lecture 1: Introduction to Reinforcement Learning
Course Outline, Silver
Part I: Elementary Reinforcement Learning
1 Introduction to RL
2 Markov Decision Processes
3 Planning by Dynamic Programming
4 Model-Free Prediction
5 Model-Free Control
Part II: Reinforcement Learning in Practice
1 Value Function Approximation
2 Policy Gradient Methods
3 Integrating Learning and Planning
4 Exploration and Exploitation
5 Case study - RL in games
![Page 6: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/6.jpg)
Introduction to Reinforcement Learning
Chapter 1 S&B
![Page 7: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/7.jpg)
Reinforcement Learning
Learn a behavior strategy (policy) that maximizes the long-term sum of rewards in an unknown and stochastic environment (Emma Brunskill)
Planning under Uncertainty
Learn a behavior strategy (policy) that maximizes the long-term sum of rewards in a known stochastic environment (Emma Brunskill)
![Page 8: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/8.jpg)
Reinforcement Learning
![Page 9: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/9.jpg)
Agent and Environment
[Figure: agent-environment loop. The agent sends action $A_t$ to the environment and receives observation $O_t$ and reward $R_t$.]
![Page 10: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/10.jpg)
Branches of Machine Learning
[Figure: Machine Learning and its branches: Supervised Learning, Unsupervised Learning and Reinforcement Learning]
![Page 11: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/11.jpg)
Sequential Decision Making
Goal: select actions to maximise total future reward
Actions may have long term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more long-term reward
Examples:
A financial investment (may take months to mature)
Refuelling a helicopter (might prevent a crash in several hours)
Blocking opponent moves (might help winning chances many moves from now)
• My pet project: The academic commitment problem. Given outside requests (committees, reviews, talks, teaching…), what to accept and what to reject today?
![Page 12: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/12.jpg)
![Page 13: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/13.jpg)
Atari Example: Reinforcement Learning
[Figure: agent-environment loop for an Atari game, with action $A_t$, reward $R_t$ and observation $O_t$]
Rules of the game are unknown
Learn directly from interactive game-play
Pick actions on joystick, see pixels and scores
![Page 14: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/14.jpg)
Agent and Environment
[Figure: agent-environment loop with action $A_t$, reward $R_t$ and observation $O_t$]
At each step t the agent:
Executes action $A_t$
Receives observation $O_t$
Receives scalar reward $R_t$
The environment:
Receives action $A_t$
Emits observation $O_{t+1}$
Emits scalar reward $R_{t+1}$
t increments at env. step
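The interaction protocol above can be sketched as a short loop. This is only an illustration: `CoinFlipEnv`, `RandomAgent` and `run` are hypothetical names that do not come from the slides, and the toy reward rule is an assumption chosen to keep the example small.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: reward 1 when the action matches a coin flip."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        # The environment receives A_t and emits O_{t+1} and R_{t+1}.
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        return coin, reward


class RandomAgent:
    """Hypothetical agent that ignores its inputs and acts uniformly at random."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, observation, reward):
        return self.rng.randint(0, 1)


def run(env, agent, num_steps):
    observation, reward = 0, 0.0  # arbitrary initial O_1 and R_1 (assumption)
    history = []
    for _ in range(num_steps):
        action = agent.act(observation, reward)   # agent executes A_t
        history.append((observation, reward, action))
        observation, reward = env.step(action)    # t increments at the env. step
    return history


history = run(CoinFlipEnv(), RandomAgent(), num_steps=5)
```

Each entry of `history` holds one $(O_t, R_t, A_t)$ triple, so the list as a whole plays the role of the history $H_t$.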
![Page 15: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/15.jpg)
Markov Decision Processes
In a nutshell:
Policy: $\pi(s) \to a$
![Page 16: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/16.jpg)
Value and Q Functions
Most of the story in a nutshell:
![Page 17: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/17.jpg)
![Page 18: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/18.jpg)
![Page 19: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/19.jpg)
![Page 20: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/20.jpg)
![Page 21: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/21.jpg)
![Page 22: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/22.jpg)
![Page 23: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/23.jpg)
![Page 24: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/24.jpg)
![Page 25: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/25.jpg)
History and State
The history is the sequence of observations, actions, rewards
$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$
i.e. all observable variables up to time t
i.e. the sensorimotor stream of a robot or embodied agent
What happens next depends on the history:
The agent selects actions
The environment selects observations/rewards
State is the information used to determine what happens next
Formally, state is a function of the history:
$S_t = f(H_t)$
![Page 26: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/26.jpg)
Information State
An information state (a.k.a. Markov state) contains all useful information from the history.
Definition
A state $S_t$ is Markov if and only if
$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$
“The future is independent of the past given the present”
$H_{1:t} \to S_t \to H_{t+1:\infty}$
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
The environment state $S_t$ is Markov
The history $H_t$ is Markov
![Page 27: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/27.jpg)
Major Components of an RL Agent
An RL agent may include one or more of these components:
Policy: agent’s behaviour function
Value function: how good is each state and/or action
Model: agent’s representation of the environment
![Page 28: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/28.jpg)
Policy
A policy is the agent’s behaviour
It is a map from state to action, e.g.
Deterministic policy: $a = \pi(s)$
Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
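The two kinds of policy can be contrasted in a few lines. The states `s0` and `s1`, the probabilities and the helper `sample_action` below are hypothetical, chosen only to illustrate the definitions; the action names reuse N, E, S, W from the maze example.

```python
import random

ACTIONS = ["N", "E", "S", "W"]

# Deterministic policy a = pi(s): a plain lookup from state to action.
deterministic_policy = {"s0": "E", "s1": "S"}

# Stochastic policy pi(a|s): one probability distribution over actions per state.
stochastic_policy = {
    "s0": {"N": 0.1, "E": 0.7, "S": 0.1, "W": 0.1},
    "s1": {"N": 0.25, "E": 0.25, "S": 0.25, "W": 0.25},
}


def sample_action(policy, state, rng):
    """Draw A_t from pi(. | S_t = state)."""
    probs = policy[state]
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
counts = {a: 0 for a in ACTIONS}
for _ in range(10_000):
    counts[sample_action(stochastic_policy, "s0", rng)] += 1
```

With many samples the empirical frequencies in `counts` approach the probabilities listed for `s0`, whereas the deterministic policy returns the same action every time.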
![Page 29: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/29.jpg)
Value Function
Value function is a prediction of future reward
Used to evaluate the goodness/badness of states
And therefore to select between actions, e.g.
$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]$
![Page 30: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/30.jpg)
Model
![Page 31: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/31.jpg)
Maze Example
[Figure: maze grid with Start and Goal cells]
Rewards: -1 per time-step
Actions: N, E, S, W
States: Agent’s location
![Page 32: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/32.jpg)
Maze Example: Policy
[Figure: maze grid from Start to Goal with an arrow in each cell]
Arrows represent policy $\pi(s)$ for each state s
![Page 33: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/33.jpg)
Maze Example: Value Function
[Figure: maze grid from Start to Goal with each cell labelled by its value, from -1 in the cell next to Goal down to -24 in the cell farthest from it]
Numbers represent value $v_\pi(s)$ of each state s
![Page 34: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/34.jpg)
Maze Example: Model
[Figure: maze grid from Start to Goal with cells labelled -1]
Agent may have an internal model of the environment
Dynamics: how actions change the state
Rewards: how much reward from each state
The model may be imperfect
Grid layout represents transition model $\mathcal{P}^a_{ss'}$
Numbers represent immediate reward $\mathcal{R}^a_s$ from each state s (same for all a)
![Page 35: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/35.jpg)
Learning and Planning
Two fundamental problems in sequential decision making
Reinforcement Learning:
The environment is initially unknown
The agent interacts with the environment
The agent improves its policy
Planning:
A model of the environment is known
The agent performs computations with its model (without any external interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering, thought, search
![Page 36: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/36.jpg)
Prediction and Control
Prediction: evaluate the future
Given a policy
Control: optimise the future
Find the best policy
![Page 37: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/37.jpg)
Markov Decision Processes
Chapter 3 S&B
![Page 38: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/38.jpg)
![Page 39: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/39.jpg)
MDPs
• The world is an MDP (combining the agent and the world), giving rise to a trajectory
$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, S_3, \ldots$
• The process is governed by a transition function
• Markov Process (MP)
• Markov Reward Process (MRP)
• Markov Decision Process (MDP)
![Page 40: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/40.jpg)
Lecture 2: Markov Decision Processes
Markov Property
“The future is independent of the past given the present”
Definition
A state $S_t$ is Markov if and only if
$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
![Page 41: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/41.jpg)
State Transition Matrix
where each row of the matrix sums to 1.
![Page 42: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/42.jpg)
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \ldots$ with the Markov property.
Definition
A Markov Process (or Markov Chain) is a tuple $(\mathcal{S}, \mathcal{P})$
$\mathcal{S}$ is a (finite) set of states
$\mathcal{P}$ is a state transition probability matrix,
$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
![Page 43: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/43.jpg)
Example: Student Markov Chain, a transition graph
[Figure: transition graph of the Student Markov Chain over the states Class 1, Class 2, Class 3, Pass, Pub, Facebook and Sleep; edges are labelled with transition probabilities]
![Page 44: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/44.jpg)
Example: Student Markov Chain Episodes
[Figure: Student Markov Chain transition graph]
Sample episodes for Student Markov Chain starting from $S_1 = C1$
$S_1, S_2, \ldots, S_T$
C1 C2 C3 Pass Sleep
C1 FB FB C1 C2 Sleep
C1 C2 C3 Pub C2 C3 Pass Sleep
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
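Episodes like the ones above can be generated by repeatedly sampling the next state from the current state's row of the transition matrix. The sketch below is illustrative only: the table `P` is an assumption, filled in with the transition probabilities of the Student Markov Chain as it appears in David Silver's lecture, since the graph itself did not survive as text here. `sample_episode` is a hypothetical helper name.

```python
import random

# Assumed transition probabilities P[s][s'] for the Student Markov Chain.
P = {
    "C1":    {"C2": 0.5, "FB": 0.5},
    "C2":    {"C3": 0.8, "Sleep": 0.2},
    "C3":    {"Pass": 0.6, "Pub": 0.4},
    "Pass":  {"Sleep": 1.0},
    "Pub":   {"C1": 0.2, "C2": 0.4, "C3": 0.4},
    "FB":    {"C1": 0.1, "FB": 0.9},
    "Sleep": {"Sleep": 1.0},  # absorbing state, treated as terminal
}


def sample_episode(start, rng, terminal="Sleep"):
    """Sample S_1, S_2, ..., S_T until the terminal state is reached."""
    state, episode = start, [start]
    while state != terminal:
        successors = list(P[state])
        weights = [P[state][s] for s in successors]
        # The next state depends only on the current state (Markov property).
        state = rng.choices(successors, weights=weights, k=1)[0]
        episode.append(state)
    return episode


rng = random.Random(0)
episodes = [sample_episode("C1", rng) for _ in range(4)]
for ep in episodes:
    print(" ".join(ep))
```

Storing each row as a dictionary keeps the sparse structure of the graph visible; a dense matrix would carry the same information.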
![Page 45: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/45.jpg)
Example: Student Markov Chain Transition Matrix
[Figure: Student Markov Chain transition graph and the corresponding transition matrix]
![Page 46: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/46.jpg)
Markov Decision Processes
• States: S
• Model: T(s,a,s’) = P(s’|s,a)
• Actions: A(s), A
• Reward: R(s), R(s,a), R(s,a,s’)
• Discount: 𝛾
• Policy: $\pi(s) \to a$
• Utility/Value: sum of discounted rewards.
• We seek an optimal policy that maximizes the expected total (discounted) reward
![Page 47: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/47.jpg)
Example: Student MRP
[Figure: Student Markov Chain transition graph with a reward attached to each state: Class 1, Class 2 and Class 3 R = -2; Pass R = +10; Pub R = +1; Facebook R = -1; Sleep R = 0]
![Page 48: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/48.jpg)
Goals, Returns and Rewards
• The agent’s goal is to maximize the total reward it receives over the long run, not the immediate reward.
• In mazes, the reward is typically -1 for every time step
• Deciding how to associate rewards with states is part of the problem modelling. If T is the final step then the return is: $G_t = R_{t+1} + R_{t+2} + \ldots + R_T$
![Page 49: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/49.jpg)
Return
Definition
The return $G_t$ is the total discounted reward from time-step t.
The discount $\gamma \in [0, 1]$ is the present value of future rewards
The value of receiving reward R after k + 1 time-steps is $\gamma^k R$.
This values immediate reward above delayed reward.
γ close to 0 leads to “myopic” evaluation
γ close to 1 leads to “far-sighted” evaluation
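A small numerical sketch may help show how the discount changes the return. It assumes the return is written as $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$, which matches the expansion used in the Bellman derivation later in these slides. The reward sequence and the name `discounted_return` are hypothetical.

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma**k * R_{t+k+1}, for rewards = [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    # Accumulate backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequence: a single reward of 10 that arrives after a delay.
rewards = [0.0, 0.0, 0.0, 10.0]

myopic = discounted_return(rewards, gamma=0.1)       # delayed reward almost ignored
far_sighted = discounted_return(rewards, gamma=0.9)  # delayed reward mostly kept
undiscounted = discounted_return(rewards, gamma=1.0)
```

Here the reward of 10 arrives after k + 1 = 4 steps, so it contributes $\gamma^3 \cdot 10$ to the return: about 0.01 for γ = 0.1 and about 7.29 for γ = 0.9.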
![Page 50: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/50.jpg)
Why discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future may not be fully represented
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Animal/human behaviour shows preference for immediate reward
It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.
![Page 51: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/51.jpg)
Value Function
The value function $v(s)$ gives the long-term value of state s
Definition
The state value function $v(s)$ of an MRP is the expected return starting from state s
$v(s) = \mathbb{E}[G_t \mid S_t = s]$
![Page 52: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/52.jpg)
Example: Student MRP Returns
Sample returns for Student MRP: starting from $S_1 = C1$ with $\gamma = \frac{1}{2}$
$G_1 = R_2 + \gamma R_3 + \ldots + \gamma^{T-2} R_T$
C1 C2 C3 Pass Sleep
C1 FB FB C1 C2 Sleep
C1 C2 C3 Pub C2 C3 Pass Sleep
C1 FB FB C1 C2 C3 Pub C1 ... FB FB FB C1 C2 C3 Pub C2 Sleep
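The returns of the first three sample episodes can be computed directly from the formula for $G_1$. The reward table `R` below is an assumption: it uses the Student MRP rewards from David Silver's lecture, with the reward collected on leaving each state. `episode_return` is a hypothetical helper name.

```python
# Assumed rewards of the Student MRP, received on leaving each state.
R = {"C1": -2.0, "C2": -2.0, "C3": -2.0, "Pass": 10.0,
     "Pub": 1.0, "FB": -1.0, "Sleep": 0.0}


def episode_return(episode, gamma):
    """G_1 = R_2 + gamma*R_3 + ... + gamma**(T-2) * R_T for an episode S_1..S_T."""
    # R_{k+1} is the reward for leaving S_k, so the terminal state S_T adds nothing.
    return sum(gamma ** k * R[s] for k, s in enumerate(episode[:-1]))


episodes = [
    "C1 C2 C3 Pass Sleep".split(),
    "C1 FB FB C1 C2 Sleep".split(),
    "C1 C2 C3 Pub C2 C3 Pass Sleep".split(),
]
returns = [episode_return(ep, gamma=0.5) for ep in episodes]
```

For the first episode this evaluates $-2 - 2 \cdot \frac{1}{2} - 2 \cdot \frac{1}{4} + 10 \cdot \frac{1}{8} = -2.25$.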
![Page 53: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/53.jpg)
Bellman Equation for MRPs
The value function can be decomposed into two parts:
immediate reward $R_{t+1}$
discounted value of successor state $\gamma v(S_{t+1})$
$$
\begin{aligned}
v(s) &= \mathbb{E}[G_t \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \ldots) \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]
\end{aligned}
$$
![Page 54: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/54.jpg)
Bellman Equation for MRPs (2)
![Page 55: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/55.jpg)
Example: Bellman Equation for Student MRP
[Figure: Student MRP graph with each state annotated by its value: Class 1 -13, Class 2 1.5, Class 3 4.3, Pass 10, Pub 0.8, Facebook -23, Sleep 0]
4.3 = -2 + 0.6*10 + 0.4*0.8
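The check on this slide is a single Bellman backup for Class 3: the immediate reward plus the probability-weighted values of its successors, with γ = 1. A minimal sketch follows, using only the numbers from the check itself; the function name `bellman_backup` is hypothetical.

```python
def bellman_backup(reward, successors, values, gamma=1.0):
    """One-step look-ahead: v(s) = R_s + gamma * sum over s' of P_{ss'} * v(s')."""
    return reward + gamma * sum(p * values[s2] for s2, p in successors.items())


# Successor values and probabilities as they appear in the check on the slide.
values = {"Pass": 10.0, "Pub": 0.8}
v_c3 = bellman_backup(reward=-2.0,
                      successors={"Pass": 0.6, "Pub": 0.4},
                      values=values)
```

The result is 4.32, which the slide reports rounded to 4.3.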
![Page 56: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/56.jpg)
Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices,
$v = \mathcal{R} + \gamma \mathcal{P} v$
where v is a column vector with one entry per state
![Page 57: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/57.jpg)
Solving the Bellman Equation
The Bellman equation is a linear equation
It can be solved directly:
$$
\begin{aligned}
v &= \mathcal{R} + \gamma \mathcal{P} v \\
(I - \gamma \mathcal{P}) v &= \mathcal{R} \\
v &= (I - \gamma \mathcal{P})^{-1} \mathcal{R}
\end{aligned}
$$
Computational complexity is $O(n^3)$ for n states
Direct solution only possible for small MRPs
There are many iterative methods for large MRPs, e.g.
Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning
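As a sketch of the direct solution, the linear system $(I - \gamma \mathcal{P}) v = \mathcal{R}$ can be solved with plain Gaussian elimination, which is where the $O(n^3)$ cost comes from. The Student MRP numbers below are assumptions taken from David Silver's lecture. Sleep is treated as a terminal state with value 0 and left out of the system, so that $I - \mathcal{P}$ stays invertible at γ = 1. `solve_linear` and `solve_mrp` are hypothetical helper names.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (O(n^3))."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def solve_mrp(states, P, R, gamma):
    """Return v = (I - gamma P)^{-1} R as a dict keyed by state."""
    a = [[(1.0 if si == sj else 0.0) - gamma * P[si].get(sj, 0.0)
          for sj in states] for si in states]
    b = [R[s] for s in states]
    return dict(zip(states, solve_linear(a, b)))


# Assumed Student MRP over the non-terminal states only.
states = ["C1", "C2", "C3", "Pass", "Pub", "FB"]
P = {
    "C1": {"C2": 0.5, "FB": 0.5},
    "C2": {"C3": 0.8},                 # the remaining 0.2 goes to terminal Sleep
    "C3": {"Pass": 0.6, "Pub": 0.4},
    "Pass": {},                        # moves to Sleep with probability 1
    "Pub": {"C1": 0.2, "C2": 0.4, "C3": 0.4},
    "FB": {"C1": 0.1, "FB": 0.9},
}
R = {"C1": -2.0, "C2": -2.0, "C3": -2.0, "Pass": 10.0, "Pub": 1.0, "FB": -1.0}

v = solve_mrp(states, P, R, gamma=1.0)
```

Under these assumptions the solution rounds to the values quoted in the Student MRP Bellman example, e.g. about 4.3 for Class 3 and 0.8 for Pub.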
![Page 58: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/58.jpg)
Markov Decision Process
![Page 59: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/59.jpg)
Example: Student MDP
[Figure: Student MDP transition diagram. Actions and rewards shown: Study (R = -2, R = -2, R = +10), Facebook (R = -1, on two arcs), Quit (R = 0), Sleep (R = 0), Pub (R = +1). The Pub action branches to three states with probabilities 0.2, 0.4, 0.4.]
![Page 60: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/60.jpg)
Policies and Value Functions (1)
Definition
A policy π is a distribution over actions given states,
π(a|s) = P[At = a | St = s]
A policy fully defines the behaviour of an agent
MDP policies depend on the current state (not the history)
i.e. policies are stationary (time-independent),
At ∼ π(·|St), ∀t > 0
![Page 61: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/61.jpg)
Policies and Value Functions
![Page 62: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/62.jpg)
Gridworld Example: Prediction
[Figure 3.3: (a) the gridworld, with special transitions A → A' (reward +10) and B → B' (reward +5), and the four actions; (b) the state-value function for the uniform random policy.]
State values in (b), by grid row:
3.3 8.8 4.4 5.3 1.5
1.5 3.0 2.3 1.9 0.5
0.1 0.7 0.7 0.4 -0.4
-1.0 -0.4 -0.4 -0.6 -1.2
-1.9 -1.3 -1.2 -1.4 -2.0
What is the value function for the uniform random policy?
γ = 0.9; solved using Eq. 3.14.
Exercise: show that Eq. 3.14 holds for each state in Figure (b).
Actions: up, down, left, right. Rewards are 0, except that an action that would take the agent off the grid gives reward -1.
From A to A', reward +10. From B to B', reward +5.
Policy: actions are chosen uniformly at random.
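The values in panel (b) can be reproduced with a short evaluation loop. The sketch below is illustrative only: it assumes A = (row 0, col 1), A' = (4, 1), B = (0, 3), B' = (2, 3), which is consistent with the value table, and that an off-grid move leaves the agent where it is.

```python
# Iterative evaluation of the uniform random policy on the 5x5 gridworld (γ = 0.9).
# ASSUMPTION: A = (0, 1) -> A' = (4, 1), B = (0, 3) -> B' = (2, 3); off-grid
# moves leave the state unchanged and give reward -1.

N, GAMMA = 5, 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, move):
    """Return (next_state, reward) for one deterministic transition."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0  # would leave the grid: stay put, reward -1


def evaluate_random_policy(tol=1e-10):
    """Apply the Bellman expectation backup until the values stop changing."""
    v = [[0.0] * N for _ in range(N)]
    while True:
        new_v = [[0.0] * N for _ in range(N)]
        for r in range(N):
            for c in range(N):
                for move in MOVES:
                    (nr, nc), reward = step((r, c), move)
                    new_v[r][c] += 0.25 * (reward + GAMMA * v[nr][nc])
        delta = max(abs(new_v[r][c] - v[r][c])
                    for r in range(N) for c in range(N))
        v = new_v
        if delta < tol:
            return v


v = evaluate_random_policy()
for row in v:
    print(" ".join(f"{x:5.1f}" for x in row))
```

Each converged entry satisfies the one-step relation asked for in the exercise: the value of a cell equals the average, over the four actions, of reward plus 0.9 times the successor's value.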
![Page 63: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/63.jpg)
Value Function, Q Functions
Definition
The state-value function vπ(s) of an MDP is the expected return starting from state s, and then following policy π
vπ(s) = Eπ[Gt | St = s]
Definition
The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π
qπ(s, a) = Eπ[Gt | St = s, At = a]
![Page 64: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/64.jpg)
Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of successor state,
vπ(s) = Eπ[Rt+1 + γvπ(St+1) | St = s]
The action-value function can similarly be decomposed,
qπ(s, a) = Eπ[Rt+1 + γqπ(St+1, At+1) | St = s, At = a]
Expressing the functions recursively translates to a one-step look-ahead.
![Page 65: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/65.jpg)
Bellman Expectation Equation for vπ
![Page 66: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/66.jpg)
Bellman Expectation Equation for qπ
![Page 67: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/67.jpg)
Bellman Expectation Equation for vπ (2)
![Page 68: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/68.jpg)
Bellman Expectation Equation for qπ (2)
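The four slides above carried their equations as images, which did not survive extraction. As a hedged reconstruction in standard notation, the one-step look-ahead relations these titles usually refer to are given below. The symbols \mathcal{R}_s^a (expected immediate reward) and \mathcal{P}_{ss'}^a (transition probability) are assumptions about the slides' notation, not taken from the text.

```latex
\begin{align}
v_\pi(s) &= \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) \\
q_\pi(s, a) &= \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, v_\pi(s') \\
v_\pi(s) &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, v_\pi(s') \Big) \\
q_\pi(s, a) &= \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')
\end{align}
```

The last two lines follow by substituting the first two into each other, which is the two-step look-ahead.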
![Page 69: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/69.jpg)
Optimal Policies and Optimal Value Function
Definition
The optimal state-value function v∗(s) is the maximum value function over all policies
v∗(s) = max_π vπ(s)
The optimal action-value function q∗(s, a) is the maximum action-value function over all policies
q∗(s, a) = max_π qπ(s, a)
The optimal value function specifies the best possible performance in the MDP.
An MDP is “solved” when we know the optimal value function.
![Page 70: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/70.jpg)
Optimal Value Function for Student MDP
[Figure: Student MDP diagram annotated with v∗(s) for γ = 1. State values shown: 6, 8, 10, 6, and 0.]
![Page 71: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/71.jpg)
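Optimal Action-Value Function for Student MDP
[Figure: Student MDP diagram annotated with q∗(s, a) for γ = 1. State values shown: 6, 8, 10, 0, 6.]
Action values shown in the figure:
Facebook (R = -1): q∗ = 5 (on two arcs)
Quit (R = 0): q∗ = 6
Study (R = -2): q∗ = 6
Study (R = -2): q∗ = 8
Study (R = +10): q∗ = 10
Sleep (R = 0): q∗ = 0
Pub (R = +1): q∗ = 8.4 as printed; with the branch probabilities 0.2, 0.4, 0.4 and successor values 6, 8, 10, the one-step look-ahead 1 + 0.2*6 + 0.4*8 + 0.4*10 gives 9.4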
![Page 72: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/72.jpg)
Optimal Policy
Define a partial ordering over policies
π ≥ π' if vπ(s) ≥ vπ'(s), ∀s
Theorem
For any Markov Decision Process
There exists an optimal policy π∗ that is better than or equal to all other policies, π∗ ≥ π, ∀π
All optimal policies achieve the optimal value function, vπ∗(s) = v∗(s)
All optimal policies achieve the optimal action-value function, qπ∗(s, a) = q∗(s, a)
![Page 73: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/73.jpg)
Finding an Optimal Policy
An optimal policy can be found by maximising over q∗(s, a),
π∗(a|s) = 1 if a = argmax_{a∈A} q∗(s, a), and 0 otherwise
There is always a deterministic optimal policy for any MDP
If we know q∗(s, a), we immediately have the optimal policy
![Page 74: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/74.jpg)
Bellman Equation for V* and Q*
[Slide content is an image; panel labels: v∗(s), q∗(s, a)]
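The equations on this slide were images and are not recoverable from the text. As a hedged reference, the Bellman optimality equations in their usual form are written out below. The symbols \mathcal{R}_s^a and \mathcal{P}_{ss'}^a are assumed notation for the expected reward and the transition probability.

```latex
\begin{align}
v_*(s) &= \max_{a \in \mathcal{A}} q_*(s, a) \\
q_*(s, a) &= \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, v_*(s') \\
v_*(s) &= \max_{a \in \mathcal{A}} \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, v_*(s') \Big) \\
q_*(s, a) &= \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \max_{a' \in \mathcal{A}} q_*(s', a')
\end{align}
```

Compared with the expectation equations, the average over the policy's actions is replaced by a maximum, which is what makes the system non-linear.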
![Page 75: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/75.jpg)
Example: Bellman Optimality Equation in Student MDP
[Figure: Student MDP diagram annotated with v∗(s); state values shown: 6, 8, 10, 6, 0. In the highlighted state the two actions are Study (R = -2, successor value 8) and Facebook (R = -1, successor value 6):]
6 = max {-2 + 8, -1 + 6}
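The same check can be run for every state with a small value-iteration loop. This is a sketch under stated assumptions: the state names (FB, C1, C2, C3, End) and the successor of each action are read off the figure, and only the action names, rewards, and the Pub probabilities appear in the slide text.

```python
# Value iteration on the Student MDP (γ = 1).
# ASSUMPTION: state names and successor states are read off the figure.
# Each action maps to (reward, [(probability, next_state), ...]).

MDP = {
    "FB": {"Facebook": (-1.0, [(1.0, "FB")]), "Quit": (0.0, [(1.0, "C1")])},
    "C1": {"Facebook": (-1.0, [(1.0, "FB")]), "Study": (-2.0, [(1.0, "C2")])},
    "C2": {"Sleep": (0.0, [(1.0, "End")]), "Study": (-2.0, [(1.0, "C3")])},
    "C3": {"Study": (10.0, [(1.0, "End")]),
           "Pub": (1.0, [(0.2, "C1"), (0.4, "C2"), (0.4, "C3")])},
    "End": {},  # terminal state, value 0
}


def q_values(v, gamma=1.0):
    """One-step look-ahead: q(s, a) = R + γ Σ P(s'|s,a) v(s')."""
    return {s: {a: r + gamma * sum(p * v[s2] for p, s2 in nxt)
                for a, (r, nxt) in actions.items()}
            for s, actions in MDP.items()}


def value_iteration(gamma=1.0, tol=1e-12):
    """Repeat the Bellman optimality backup until the values stop changing."""
    v = {s: 0.0 for s in MDP}
    while True:
        q = q_values(v, gamma)
        new_v = {s: max(q[s].values()) if q[s] else 0.0 for s in MDP}
        if max(abs(new_v[s] - v[s]) for s in MDP) < tol:
            return new_v, q_values(new_v, gamma)
        v = new_v


v_star, q_star = value_iteration()
print(v_star)
print(q_star["C1"])  # Study vs. Facebook in the highlighted state
print(q_star["C3"])
```

Under these assumptions the highlighted state comes out as max{-2 + 8, -1 + 6} = 6, matching the slide, while the Pub action evaluates to 9.4 rather than the 8.4 printed in the action-value figure.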
![Page 76: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/76.jpg)
Gridworld Example: Control
[Figure 3.6: (a) the gridworld, with special transitions A → A' (reward +10) and B → B' (reward +5); (b) the optimal value function v∗; (c) an optimal policy π∗.]
v∗, by grid row:
22.0 24.4 22.0 19.4 17.5
19.8 22.0 19.8 17.8 16.0
17.8 19.8 17.8 16.0 14.4
16.0 17.8 16.0 14.4 13.0
14.4 16.0 14.4 13.0 11.7
What is the optimal value function over all possible policies?
What is the optimal policy?
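Both questions can be answered numerically with value iteration followed by a greedy read-out. The sketch below assumes γ = 0.9 (as in the prediction example), A = (0, 1) → A' = (4, 1), B = (0, 3) → B' = (2, 3), and that off-grid moves leave the state unchanged with reward -1; the function names are illustrative.

```python
# Value iteration on the 5x5 gridworld, then a greedy policy read-out.
# ASSUMPTION: γ = 0.9; A = (0, 1) -> A' = (4, 1) with +10, B = (0, 3) -> B' = (2, 3)
# with +5; off-grid moves leave the state unchanged with reward -1.

N, GAMMA = 5, 0.9
SPECIAL = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, move):
    """Return (next_state, reward) for one deterministic transition."""
    if state in SPECIAL:
        return SPECIAL[state]
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0


def backup(v, state, move):
    """Value of taking `move` in `state` and then following v."""
    (nr, nc), reward = step(state, move)
    return reward + GAMMA * v[nr][nc]


def optimal_values(tol=1e-10):
    v = [[0.0] * N for _ in range(N)]
    while True:
        new_v = [[max(backup(v, (r, c), m) for m in MOVES.values())
                  for c in range(N)] for r in range(N)]
        delta = max(abs(new_v[r][c] - v[r][c])
                    for r in range(N) for c in range(N))
        v = new_v
        if delta < tol:
            return v


v_star = optimal_values()
# One greedy action per cell (ties are broken by the order of MOVES).
greedy = [[max(MOVES, key=lambda name: backup(v_star, (r, c), MOVES[name]))
           for c in range(N)] for r in range(N)]
for row in v_star:
    print(" ".join(f"{x:5.1f}" for x in row))
```

The value at A can be checked by hand: returning from A' to A takes four moves, so v∗(A) = 10 + 0.9⁵ v∗(A), which gives 10 / (1 − 0.9⁵) ≈ 24.4.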
![Page 77: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/77.jpg)
Solving the Bellman Optimality Equation
Bellman Optimality Equation is non-linear
No closed form solution (in general)
Many iterative solution methods
Value Iteration
Policy Iteration
Q-learning
Sarsa
![Page 78: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/78.jpg)
Planning by Dynamic Programming
Sutton & Barto, Chapter 4
![Page 79: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/79.jpg)
Lecture 3: Planning by Dynamic Programming
Planning by Dynamic Programming
Dynamic programming assumes full knowledge of the MDP
It is used for planning in an MDP
For prediction:
Input: MDP (S, A, P, R, γ) and policy π
or: MRP (S, Pπ, Rπ, γ)
Output: value function vπ
Or for control:
Input: MDP (S, A, P, R, γ)
Output: optimal value function v∗
and: optimal policy π∗
![Page 80: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/80.jpg)
Policy Evaluation (Prediction)
Problem: evaluate a given policy π
Solution: iterative application of Bellman expectation backup
v1 → v2 → ... → vπ
Using synchronous backups,
At each iteration k + 1
For all states s ∈ S
Update vk+1(s) from vk(s'), where s' is a successor state of s
We will discuss asynchronous backups later
Convergence to vπ will be proven at the end of the lecture
![Page 81: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/81.jpg)
Iterative Policy Evaluation
The Bellman expectation equations form a system of |S| simultaneous linear equations in |S| unknowns, which can be solved directly.
In practice, an iterative procedure that runs until a fixed point is reached can be more effective: iterative policy evaluation.
![Page 82: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/82.jpg)
Iterative Policy Evaluation
![Page 83: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/83.jpg)
Evaluating a Random Policy in the Small Gridworld
Undiscounted episodic MDP (γ = 1)
Nonterminal states 1, ..., 14
One terminal state (shown twice as shaded squares)
Actions leading out of the grid leave state unchanged
Reward is −1 until the terminal state is reached
Agent follows uniform random policy
π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25
![Page 84: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/84.jpg)
Iterative Policy Evaluation in Small Gridworld
[Figure: left column, vk for the random policy; right column, greedy policy w.r.t. vk (at k = 0 this is the random policy).]
k = 0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
k = 1
0.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 0.0
k = 2
0.0 -1.7 -2.0 -2.0
-1.7 -2.0 -2.0 -2.0
-2.0 -2.0 -2.0 -1.7
-2.0 -2.0 -1.7 0.0
![Page 85: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/85.jpg)
Iterative Policy Evaluation in Small Gridworld (2)
[Figure: for k = 3, k = 10 and k = ∞, the greedy policy w.r.t. vk is the optimal policy.]
k = 3
0.0 -2.4 -2.9 -3.0
-2.4 -2.9 -3.0 -2.9
-2.9 -3.0 -2.9 -2.4
-3.0 -2.9 -2.4 0.0
k = 10
0.0 -6.1 -8.4 -9.0
-6.1 -7.7 -8.4 -8.4
-8.4 -8.4 -7.7 -6.1
-9.0 -8.4 -6.1 0.0
k = ∞
0.0 -14 -20 -22
-14 -18 -20 -20
-20 -20 -18 -14
-22 -20 -14 0.0
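The tables of vk can be regenerated with a synchronous sweep. A minimal sketch, assuming cells are indexed (row, col) and the two shaded terminal squares are the top-left and bottom-right corners, as the tables suggest; the function names are illustrative.

```python
# Synchronous iterative policy evaluation for the 4x4 small gridworld (γ = 1).
# ASSUMPTION: terminal squares are the top-left and bottom-right corners.

N = 4
TERMINAL = {(0, 0), (N - 1, N - 1)}
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # n, e, s, w


def next_state(state, move):
    r, c = state[0] + move[0], state[1] + move[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else state  # off-grid: unchanged


def sweep(v, gamma=1.0):
    """One Bellman expectation backup of every state under the random policy."""
    new_v = [[0.0] * N for _ in range(N)]
    for r in range(N):
        for c in range(N):
            if (r, c) in TERMINAL:
                continue  # terminal value stays 0
            for move in MOVES:
                nr, nc = next_state((r, c), move)
                new_v[r][c] += 0.25 * (-1.0 + gamma * v[nr][nc])
    return new_v


def evaluate(k):
    """Return vk, starting from v0 = 0."""
    v = [[0.0] * N for _ in range(N)]
    for _ in range(k):
        v = sweep(v)
    return v


for k in (1, 2, 3, 10, 1000):
    print(k, [f"{x:.2f}" for x in evaluate(k)[0]])
```

Note that the exact k = 2 entry next to the terminal square is -1.75, which the slide shows truncated as -1.7.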
![Page 86: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/86.jpg)
Policy Improvement
Given a policy π
Evaluate the policy π
vπ(s) = E[Rt+1 + γRt+2 + ... | St = s]
Improve the policy by acting greedily with respect to vπ
π' = greedy(vπ)
In Small Gridworld improved policy was optimal, π' = π∗
In general, need more iterations of improvement / evaluation
But this process of policy iteration always converges to π∗
![Page 87: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/87.jpg)
Policy Iteration
![Page 88: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/88.jpg)
Policy Iteration
Policy evaluation: Estimate vπ
Iterative policy evaluation
Policy improvement: Generate π' ≥ π
Greedy policy improvement
![Page 89: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/89.jpg)
Policy Improvement
![Page 90: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/90.jpg)
Policy Improvement (2)
If improvements stop,
qπ(s, π'(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)
Then the Bellman optimality equation has been satisfied
vπ(s) = max_{a∈A} qπ(s, a)
Therefore vπ(s) = v∗(s) for all s ∈ S
so π is an optimal policy
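The evaluate / improve loop and its stopping test can be sketched on the small gridworld. This is an illustration under assumptions, not the lecture's code: terminal squares are taken to be the two opposite corners, a policy is stored as a set of equally likely actions per state, and evaluation runs a fixed number of sweeps rather than testing for convergence.

```python
# Policy iteration on the 4x4 small gridworld (γ = 1, reward -1 per step).
# ASSUMPTION: terminal squares are the top-left and bottom-right corners; a
# policy is {state: [equally likely actions]}; evaluation is truncated at a
# fixed number of sweeps.

N = 4
TERMINAL = {(0, 0), (N - 1, N - 1)}
MOVES = {"n": (-1, 0), "e": (0, 1), "s": (1, 0), "w": (0, -1)}
STATES = [(r, c) for r in range(N) for c in range(N) if (r, c) not in TERMINAL]


def q_value(v, state, action):
    """Reward -1 plus the value of the successor (terminal states are worth 0)."""
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    nxt = (r, c) if 0 <= r < N and 0 <= c < N else state
    return -1.0 + v.get(nxt, 0.0)


def evaluate(policy, sweeps=500):
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: sum(q_value(v, s, a) for a in policy[s]) / len(policy[s])
             for s in STATES}
    return v


def greedy(v, tol=1e-9):
    """Keep every action whose value is within tol of the best one."""
    policy = {}
    for s in STATES:
        q = {a: q_value(v, s, a) for a in MOVES}
        best = max(q.values())
        policy[s] = [a for a in MOVES if q[a] >= best - tol]
    return policy


def policy_iteration():
    policy = {s: list(MOVES) for s in STATES}  # start from the uniform random policy
    while True:
        v = evaluate(policy)
        improved = greedy(v)
        if improved == policy:  # improvements have stopped
            return policy, v
        policy = improved


pi_star, v_star = policy_iteration()
print(v_star[(0, 1)], pi_star[(0, 1)])
```

When the loop exits, acting greedily no longer changes the policy, which is the "improvements stop" condition above; the returned values are then minus the number of steps to the nearest terminal square.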
![Page 91: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/91.jpg)
Modified Policy Iteration
Does policy evaluation need to converge to vπ?
Or should we introduce a stopping condition,
e.g. ε-convergence of value function
Or simply stop after k iterations of iterative policy evaluation?
For example, in the small gridworld k = 3 was sufficient to achieve optimal policy
Why not update policy every iteration? i.e. stop after k = 1
This is equivalent to value iteration (next section)
![Page 92: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/92.jpg)
Generalised Policy Iteration
Policy evaluation: Estimate vπ
Any policy evaluation algorithm
Policy improvement: Generate π' ≥ π
Any policy improvement algorithm
![Page 93: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/93.jpg)
Principle of Optimality
Any optimal policy can be subdivided into two components:
An optimal first action A∗
Followed by an optimal policy from successor state S'
Theorem (Principle of Optimality)
A policy π(a|s) achieves the optimal value from state s, vπ(s) = v∗(s), if and only if
For any state s' reachable from s
π achieves the optimal value from state s', vπ(s') = v∗(s')
![Page 94: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/94.jpg)
Deterministic Value Iteration
![Page 95: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/95.jpg)
Value Iteration
![Page 96: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/96.jpg)
Value Iteration
![Page 97: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/97.jpg)
Example: Shortest Path
[Figure: 4×4 grid problem with goal state g in the top-left cell, followed by the successive value iteration estimates V1 to V7.]
V1
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
V2
0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
V3
0 -1 -2 -2
-1 -2 -2 -2
-2 -2 -2 -2
-2 -2 -2 -2
V4
0 -1 -2 -3
-1 -2 -3 -3
-2 -3 -3 -3
-3 -3 -3 -3
V5
0 -1 -2 -3
-1 -2 -3 -4
-2 -3 -4 -4
-3 -4 -4 -4
V6
0 -1 -2 -3
-1 -2 -3 -4
-2 -3 -4 -5
-3 -4 -5 -5
V7
0 -1 -2 -3
-1 -2 -3 -4
-2 -3 -4 -5
-3 -4 -5 -6
![Page 98: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/98.jpg)
Value Iteration
Problem: find optimal policy π
Solution: iterative application of Bellman optimality backup
v1 → v2 → ... → v∗
Using synchronous backups
At each iteration k + 1
For all states s ∈ S
Update vk+1(s) from vk(s')
Convergence to v∗ will be proven later
Unlike policy iteration, there is no explicit policy
Intermediate value functions may not correspond to any policy
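The synchronous update can be traced on the shortest-path grid from the earlier example. The sketch assumes the goal is the top-left cell, every move has reward -1, γ = 1, and off-grid moves leave the state unchanged; the function names are illustrative.

```python
# Synchronous value iteration on the 4x4 shortest-path example.
# ASSUMPTION: goal g is the top-left cell, reward -1 per move, γ = 1, and
# off-grid moves leave the state unchanged.

N = 4
GOAL = (0, 0)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def neighbours(r, c):
    """Successor cell for each of the four moves."""
    for dr, dc in MOVES:
        nr, nc = r + dr, c + dc
        yield (nr, nc) if 0 <= nr < N and 0 <= nc < N else (r, c)


def value_iteration_trace(iterations):
    """Return [V1, V2, ...], where V1 is the all-zero starting estimate."""
    v = [[0] * N for _ in range(N)]
    trace = [v]
    for _ in range(iterations - 1):
        # Every cell is updated from the previous estimate v (synchronous backup).
        v = [[0 if (r, c) == GOAL else
              max(-1 + v[nr][nc] for nr, nc in neighbours(r, c))
              for c in range(N)] for r in range(N)]
        trace.append(v)
    return trace


trace = value_iteration_trace(7)
for k, v in enumerate(trace, start=1):
    print(f"V{k}", v)
```

Each sweep pushes correct values one step further from the goal, so the farthest cell (six moves away) is only settled in V7. The intermediate estimates, such as V3, are not the value function of any policy.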
![Page 99: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/99.jpg)
Value Iteration (2)
![Page 100: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/100.jpg)
Asynchronous Dynamic Programming
DP methods described so far used synchronous backups
i.e. all states are backed up in parallel
Asynchronous DP backs up states individually, in any order
For each selected state, apply the appropriate backup
Can significantly reduce computation
Guaranteed to converge if all states continue to be selected
![Page 101: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/101.jpg)
Asynchronous Dynamic Programming
Three simple ideas for asynchronous dynamic programming:
In-place dynamic programming
Prioritised sweeping
Real-time dynamic programming
![Page 102: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/102.jpg)
In-Place Dynamic Programming
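The body of this slide did not survive extraction. As a hedged illustration of the usual idea (keep a single copy of the value function and let each backup read values already updated in the same sweep), the sketch below compares both variants on the 4×4 shortest-path grid. The pessimistic initial value of -100 and the row-major sweep order are assumptions chosen so the effect is visible.

```python
# In-place vs. synchronous value iteration on the 4x4 shortest-path grid.
# ASSUMPTION: goal in the top-left cell, reward -1 per move, γ = 1, initial
# value -100 everywhere, cells swept in row-major order.

N, GOAL, INIT = 4, (0, 0), -100
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]
CELLS = [(r, c) for r in range(N) for c in range(N)]


def backup(v, cell):
    """Bellman optimality backup for one cell, reading values from v."""
    if cell == GOAL:
        return 0
    best = None
    for dr, dc in MOVES:
        nxt = (cell[0] + dr, cell[1] + dc)
        if nxt not in v:
            nxt = cell  # off-grid move leaves the state unchanged
        cand = -1 + v[nxt]
        best = cand if best is None or cand > best else best
    return best


def solve(in_place):
    """Sweep until nothing changes; return the values and the sweep count."""
    v = {cell: INIT for cell in CELLS}
    sweeps = 0
    while True:
        sweeps += 1
        source = v if in_place else dict(v)  # synchronous reads a frozen copy
        changed = False
        for cell in CELLS:
            new = backup(source, cell)
            changed = changed or new != v[cell]
            v[cell] = new
        if not changed:
            return v, sweeps


v_sync, sweeps_sync = solve(in_place=False)
v_inplace, sweeps_inplace = solve(in_place=True)
print(sweeps_sync, sweeps_inplace)
```

Both variants reach the same values; with this ordering the in-place version needs fewer sweeps because fresh values propagate within a sweep. The gain depends on the sweep order and the initial values, so it should not be read as a general speed-up factor.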
![Page 103: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/103.jpg)
Prioritised Sweeping
![Page 104: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/104.jpg)
Real-Time Dynamic Programming
Idea: only back up states that are relevant to the agent
Use agent’s experience to guide the selection of states
After each time-step St, At, Rt+1
Backup the state St
![Page 105: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/105.jpg)
Full-Width Backups
DP uses full-width backups
For each backup (sync or async)
Every successor state and action is considered
Using knowledge of the MDP transitions and reward function
DP is effective for medium-sized problems (millions of states)
For large problems DP suffers Bellman’s curse of dimensionality
Number of states n = |S| grows exponentially with number of state variables
Even one backup can be too expensive
![Page 106: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/106.jpg)
Sample Backups
In subsequent lectures we will consider sample backups
Using sample rewards and sample transitions (S, A, R, S')
Instead of reward function R and transition dynamics P
Advantages:
Model-free: no advance knowledge of MDP required
Breaks the curse of dimensionality through sampling
Cost of backup is constant, independent of n = |S|
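The contrast between the two kinds of backup can be made concrete for a single state. This is only a sketch: the slide names the idea without giving an update rule, so the sample update below (a step of size alpha toward r + γ v(s')) is one common form chosen for illustration, and all names and numbers are made up.

```python
# Full-width backup vs. sample backup for one state under a fixed policy.
# ASSUMPTION: the sample update is one common form (a step of size alpha
# toward r + γ v(s')); states "s", "x", "y" and all numbers are illustrative.

def full_width_backup(v, transitions, gamma):
    """transitions: every (probability, reward, next_state) from the model."""
    return sum(p * (r + gamma * v[s2]) for p, r, s2 in transitions)


def sample_backup(v, state, reward, next_state, alpha, gamma):
    """Move v[state] toward one sampled target; cost does not depend on |S|."""
    target = reward + gamma * v[next_state]
    return v[state] + alpha * (target - v[state])


v = {"s": 0.0, "x": 1.0, "y": 3.0}
model = [(0.5, 1.0, "x"), (0.5, 0.0, "y")]

exact = full_width_backup(v, model, gamma=1.0)  # needs P and R
estimate = sample_backup(v, "s", 1.0, "x", alpha=0.5, gamma=1.0)  # needs one (S, A, R, S')
print(exact, estimate)
```

The full-width backup loops over every successor listed in the model, while the sample backup touches exactly one observed transition, which is why its cost stays constant as the state space grows.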
![Page 107: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/107.jpg)
Approximate Dynamic Programming
![Page 108: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/108.jpg)
Csaba slides
![Page 109: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/109.jpg)
![Page 110: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/110.jpg)
![Page 111: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/111.jpg)
![Page 112: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/112.jpg)
![Page 113: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/113.jpg)
![Page 114: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/114.jpg)
![Page 115: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/115.jpg)
![Page 116: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/116.jpg)
![Page 117: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/117.jpg)
![Page 118: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/118.jpg)
Lecture 3: Planning by Dynamic Programming
Contraction Mapping: Value Function ∞-Norm
We will measure the distance between state-value functions u and v by the ∞-norm,
i.e. the largest difference between state values:
$$\|u - v\|_\infty = \max_{s \in \mathcal{S}} |u(s) - v(s)|$$
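As a small illustration of the ∞-norm distance, the sketch below represents each value function as a dictionary from state to value. The state names and numbers are made up for the example and do not come from the slides.

```python
def sup_norm_distance(u, v):
    """Return max over states s of |u(s) - v(s)|.

    u and v are dicts mapping the same set of states to floats.
    """
    if u.keys() != v.keys():
        raise ValueError("u and v must be defined on the same states")
    return max(abs(u[s] - v[s]) for s in u)


# Hypothetical value functions on three states.
u = {"s0": 1.0, "s1": 4.0, "s2": -2.0}
v = {"s0": 1.5, "s1": 1.0, "s2": -2.0}

distance = sup_norm_distance(u, v)
print(distance)  # 3.0, attained at state s1
```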
![Page 119: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/119.jpg)
Contraction Mapping: Contraction Mapping Theorem
Theorem (Contraction Mapping Theorem)
For any metric space V that is complete (i.e. closed) under an operator T(v), where T is a γ-contraction,
repeated application of T converges to a unique fixed point,
at a linear convergence rate of γ.
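The theorem can be seen at work on a toy case. The sketch below assumes a scalar operator T(x) = γx + 1 with γ = 0.5, chosen only for illustration; it is a γ-contraction on the real line with fixed point 1/(1 − γ) = 2. Each application of T shrinks the distance to the fixed point by exactly the factor γ.

```python
GAMMA = 0.5  # contraction factor (assumed for this example)


def T(x):
    """A toy gamma-contraction on the real line: |T(x) - T(y)| = GAMMA * |x - y|."""
    return GAMMA * x + 1.0


fixed_point = 1.0 / (1.0 - GAMMA)  # solves x = GAMMA * x + 1

x = 10.0  # arbitrary starting point
errors = []
for _ in range(10):
    errors.append(abs(x - fixed_point))
    x = T(x)

# Ratio of successive errors: the linear convergence rate.
ratios = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
print(ratios)
print(x)
```

The same argument applies to value functions, with the absolute value replaced by the ∞-norm.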
![Page 120: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/120.jpg)
![Page 121: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/121.jpg)
Contraction Mapping: Convergence of Iterative Policy Evaluation and Policy Iteration
The Bellman expectation operator $T^\pi$ has a unique fixed point.
$v_\pi$ is a fixed point of $T^\pi$ (by the Bellman expectation equation).
By the contraction mapping theorem:
Iterative policy evaluation converges on $v_\pi$.
Policy iteration converges on $v_*$.
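A minimal sketch of iterative policy evaluation follows. It assumes the Bellman expectation operator has the matrix form $T^\pi(v) = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v$, and uses a made-up two-state Markov reward process (the rewards, transitions, and γ = 0.9 are illustrative only). State 1 is absorbing with zero reward, so the exact answer is $v_\pi = (1/0.55,\ 0)$.

```python
def policy_evaluation_backup(v, R, P, gamma):
    """One application of T^pi: v -> R + gamma * P v (lists as vectors/matrices)."""
    n = len(v)
    return [R[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]


def sup_norm(u, v):
    """Largest absolute difference between two value vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


# Hypothetical reward vector and transition matrix under a fixed policy pi.
R = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]
gamma = 0.9

v = [0.0, 0.0]  # any starting point converges to the same fixed point
for sweep in range(1000):
    v_next = policy_evaluation_backup(v, R, P, gamma)
    change = sup_norm(v_next, v)
    v = v_next
    if change < 1e-12:  # stop once successive iterates barely move
        break

print(v)
```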
![Page 122: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/122.jpg)
Contraction Mapping: Bellman Optimality Backup is a Contraction
Define the Bellman optimality backup operator $T^*$,
$$T^*(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a + \gamma \mathcal{P}^a v \right)$$
This operator is a γ-contraction, i.e. it brings value functions closer together by at least a factor of γ (similar to the previous proof),
$$\|T^*(u) - T^*(v)\|_\infty \le \gamma \|u - v\|_\infty$$
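The inequality can be spot-checked numerically. The sketch below builds a random MDP (the sizes, seed, and reward range are arbitrary assumptions, and `random_mdp` is a helper invented for this example), applies the optimality backup to random pairs of value functions, and checks that the ∞-norm distance shrinks by at least the factor γ. This is a sanity check, not a proof.

```python
import random


def bellman_optimality_backup(v, R, P, gamma):
    """T*(v)(s) = max over actions a of R[a][s] + gamma * sum_t P[a][s][t] * v[t]."""
    n_states = len(v)
    return [
        max(
            R[a][s] + gamma * sum(P[a][s][t] * v[t] for t in range(n_states))
            for a in range(len(R))
        )
        for s in range(n_states)
    ]


def sup_norm(u, v):
    return max(abs(x - y) for x, y in zip(u, v))


def random_mdp(rng, n_states, n_actions):
    """Hypothetical helper: random rewards and row-stochastic transition matrices."""
    R = [[rng.uniform(-1, 1) for _ in range(n_states)] for _ in range(n_actions)]
    P = []
    for _ in range(n_actions):
        rows = []
        for _ in range(n_states):
            weights = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(weights)
            rows.append([w / total for w in weights])  # each row sums to 1
        P.append(rows)
    return R, P


rng = random.Random(0)
gamma = 0.9
R, P = random_mdp(rng, n_states=5, n_actions=3)

contraction_held = True
for _ in range(1000):
    u = [rng.uniform(-10, 10) for _ in range(5)]
    v = [rng.uniform(-10, 10) for _ in range(5)]
    lhs = sup_norm(bellman_optimality_backup(u, R, P, gamma),
                   bellman_optimality_backup(v, R, P, gamma))
    rhs = gamma * sup_norm(u, v)
    if lhs > rhs + 1e-9:  # small slack for floating-point rounding
        contraction_held = False

print(contraction_held)
```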
![Page 123: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/123.jpg)
Contraction Mapping: Convergence of Value Iteration
The Bellman optimality operator $T^*$ has a unique fixed point.
$v_*$ is a fixed point of $T^*$ (by the Bellman optimality equation).
By the contraction mapping theorem, value iteration converges on $v_*$.
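Value iteration is then just repeated application of the optimality backup. The sketch below uses a made-up deterministic MDP with two states and two actions ("stay" and "move") and γ = 0.5; none of these come from the slides. Solving the Bellman optimality equation by hand gives $v_* = (3, 4)$: in state 1 staying earns 2 forever, so $v_*(1) = 2/(1 - 0.5) = 4$, and in state 0 moving earns $1 + 0.5 \cdot 4 = 3$.

```python
GAMMA = 0.5

# Hypothetical deterministic MDP: NEXT_STATE[s][a] and REWARD[s][a].
NEXT_STATE = {0: {"stay": 0, "move": 1},
              1: {"stay": 1, "move": 0}}
REWARD = {0: {"stay": 0.0, "move": 1.0},
          1: {"stay": 2.0, "move": 0.0}}


def optimality_backup(v):
    """One application of T*: back up every state with the best action."""
    return {
        s: max(REWARD[s][a] + GAMMA * v[NEXT_STATE[s][a]] for a in NEXT_STATE[s])
        for s in NEXT_STATE
    }


v = {0: 0.0, 1: 0.0}  # arbitrary initial value function
for _ in range(60):   # error shrinks by a factor GAMMA per sweep
    v = optimality_backup(v)

# Greedy policy with respect to the converged values.
policy = {
    s: max(NEXT_STATE[s], key=lambda a: REWARD[s][a] + GAMMA * v[NEXT_STATE[s][a]])
    for s in NEXT_STATE
}

print(v)
print(policy)
```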
![Page 124: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/124.jpg)
![Page 125: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/125.jpg)
![Page 126: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/126.jpg)
![Page 127: Reinforcement Learning or, Learning and Planning with ...dechter/courses/ics-295/winter-2018/slides/class1.pdfReinforcement Learning or, Learning and Planning with Markov Decision](https://reader033.vdocuments.mx/reader033/viewer/2022042105/5e845d5916b1b77adf2e016f/html5/thumbnails/127.jpg)