Planning in Markov Decision Processes
Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon School of Computer Science
Lecture 3, CMU 10703
Markov Decision Process (MDP)

A Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$
• $\mathcal{S}$ is a finite set of states
• $\mathcal{A}$ is a finite set of actions
• $T$ is a state transition probability function, $T(s' \mid s, a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
• $r$ is a reward function, $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
• $\gamma \in [0, 1]$ is a discount factor
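As a concrete illustration, a finite MDP can be stored directly as arrays. This is a minimal sketch assuming NumPy; the 2-state, 2-action MDP below is made up for illustration, not taken from the slides:

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP, stored as arrays (numbers made up):
# T[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
T = np.array([[[0.9, 0.1],    # action 0 from state 0
               [0.2, 0.8]],   # action 0 from state 1
              [[0.1, 0.9],    # action 1 from state 0
               [0.7, 0.3]]])  # action 1 from state 1
# r[s, a] = E[R_{t+1} | S_t = s, A_t = a]
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9  # discount factor

# Each (a, s) row of T must be a probability distribution over s'.
assert np.allclose(T.sum(axis=2), 1.0)
```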
Solving MDPs

• Prediction: Given an MDP $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$ and a policy $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$, find the state and action value functions.
• Optimal control: Given an MDP $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$, find the optimal policy (a.k.a. the planning problem). Compare with the learning problem, where information about the rewards/dynamics is missing.
• We still consider finite MDPs (finite $\mathcal{S}$ and $\mathcal{A}$) with known dynamics!
Outline
• Policy iteration
• Value iteration
• Linear programming
• Asynchronous DP
Policy Evaluation

Policy evaluation: for a given policy $\pi$, compute the state value function

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$$

where $v_\pi$ is implicitly given by the Bellman equation

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \right)$$

a system of $|\mathcal{S}|$ simultaneous equations.
MDPs to MRPs

An MDP under a fixed policy $\pi$ becomes a Markov Reward Process (MRP):

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \right)
= \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s, a) + \gamma \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s')
= r^\pi_s + \gamma \sum_{s' \in \mathcal{S}} T^\pi_{s's}\, v_\pi(s')$$

where $r^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s, a)$ and $T^\pi_{s's} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, T(s' \mid s, a)$.
Back Up Diagram

MDP

[Backup diagram: state $s$ with value $v_\pi(s)$ branches over actions $a$ and rewards $r$ to successor states $s'$ with values $v_\pi(s')$.]
Back Up Diagram

MDP vs. MRP

[Backup diagram: under the induced MRP, state $s$ with value $v_\pi(s)$ transitions directly to successor states $s'$ with values $v_\pi(s')$.]

$$v_\pi(s) = r^\pi_s + \gamma \sum_{s' \in \mathcal{S}} T^\pi_{s's}\, v_\pi(s')$$
Matrix Form

The Bellman expectation equation can be written concisely using the induced MRP as

$$v_\pi = r^\pi + \gamma T^\pi v_\pi$$

with direct solution

$$v_\pi = (I - \gamma T^\pi)^{-1} r^\pi$$

of complexity $O(N^3)$, where $N = |\mathcal{S}|$.
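The direct solution is one linear solve. A minimal sketch assuming NumPy, on a hypothetical 2-state, 2-action MDP with made-up numbers:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers made up for illustration).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma = 0.9
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi[s, a]: equiprobable policy

# Induced MRP: r_pi[s] = sum_a pi(a|s) r(s,a); T_pi[s,s'] = sum_a pi(a|s) T(s'|s,a)
r_pi = (pi * r).sum(axis=1)
T_pi = np.einsum('sa,asp->sp', pi, T)

# Direct O(N^3) solution: v_pi = (I - gamma T_pi)^{-1} r_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * T_pi, r_pi)

# The solution satisfies the Bellman expectation equation.
assert np.allclose(v_pi, r_pi + gamma * T_pi @ v_pi)
```

In practice one calls `solve` rather than forming the inverse explicitly; both are cubic in $N$, which motivates the iterative methods that follow.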
Iterative Methods: Recall the Bellman Equation

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \right)$$
Iterative Methods: Backup Operation

Given an expected value function at iteration $k$, we back up the expected value function at iteration $k+1$:

$$v^{[k+1]}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v^{[k]}(s') \right)$$

[Backup diagram: state $s$ with value $v^{[k+1]}(s)$ branches over actions $a$ and rewards $r$ to successor states $s'$ with values $v^{[k]}(s')$.]
Iterative Methods: Sweep

A sweep consists of applying the backup operation $v \to v'$ for all the states in $\mathcal{S}$. Applying the backup operator iteratively:

$$v^{[0]} \to v^{[1]} \to v^{[2]} \to \dots \to v_\pi$$
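The sweep above can be sketched in a few lines. This assumes NumPy and reuses the same made-up 2-state, 2-action MDP as earlier examples:

```python
import numpy as np

# Hypothetical MDP (made-up numbers) and an equiprobable policy.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma = 0.9
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi[s, a]

def backup(v):
    """One synchronous sweep: v[k] -> v[k+1] over all states."""
    # q[s, a] = r(s, a) + gamma * sum_s' T(s'|s, a) v(s')
    q = r + gamma * np.einsum('asp,p->sa', T, v)
    return (pi * q).sum(axis=1)

v = np.zeros(2)          # v[0]
for _ in range(500):     # v[0] -> v[1] -> ... -> v_pi
    v = backup(v)

# At convergence, v is a fixed point of the backup operator.
assert np.allclose(v, backup(v))
```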
A Small Gridworld

• An undiscounted episodic task ($\gamma = 1$)
• Nonterminal states: 1, 2, …, 14
• Terminal state: one, shown as a shaded square
• Actions that would take the agent off the grid leave the state unchanged
• Reward is $-1$ until the terminal state is reached
Iterative Policy Evaluation

Same gridworld, with policy $\pi$ the equiprobable random policy.

[Figure: successive estimates $v^{[k]}$ for the random policy.]
Contraction Mapping Theorem

An operator $F$ on a normed vector space $\mathcal{X}$ is a $\gamma$-contraction, for $0 < \gamma < 1$, provided for all $x, y \in \mathcal{X}$

$$\|F(x) - F(y)\| \le \gamma \|x - y\|$$

Theorem (Contraction mapping). For a $\gamma$-contraction $F$ in a complete normed vector space $\mathcal{X}$:
• iterating $F$ converges to a unique fixed point in $\mathcal{X}$
• at a linear convergence rate $\gamma$

Remark. In general we only need a metric (vs. normed) space.
Value Function Space

• Consider the vector space $\mathcal{V}$ over value functions $v(s)$
• There are $|\mathcal{S}|$ dimensions
• Each point in this space fully specifies a value function
• Does the Bellman backup bring value functions closer in this space?
• If so, the backup must converge to a unique solution
Value Function $\infty$-Norm

• We will measure the distance between state-value functions $u$ and $v$ by the $\infty$-norm
• i.e. the largest difference between state values:

$$\|u - v\|_\infty = \max_{s \in \mathcal{S}} |u(s) - v(s)|$$
Bellman Expectation Backup is a Contraction

• Define the Bellman expectation backup operator $F_\pi(v) = r^\pi + \gamma T^\pi v$
• This operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$:

$$\|F_\pi(u) - F_\pi(v)\|_\infty = \|(r^\pi + \gamma T^\pi u) - (r^\pi + \gamma T^\pi v)\|_\infty = \gamma \|T^\pi (u - v)\|_\infty \le \gamma \|u - v\|_\infty$$

where the last step uses that each row of $T^\pi$ sums to 1, so $\|T^\pi w\|_\infty \le \|w\|_\infty$.
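The contraction property is easy to check numerically. A sketch assuming NumPy, using a randomly generated (hypothetical) induced MRP rather than any MDP from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 4, 0.9

# Hypothetical induced MRP: a random row-stochastic T_pi and random rewards.
T_pi = rng.random((n, n))
T_pi /= T_pi.sum(axis=1, keepdims=True)
r_pi = rng.standard_normal(n)

def F(v):
    # Bellman expectation backup: F_pi(v) = r_pi + gamma * T_pi v
    return r_pi + gamma * T_pi @ v

u, v = rng.standard_normal(n), rng.standard_normal(n)
lhs = np.abs(F(u) - F(v)).max()    # ||F(u) - F(v)||_inf
rhs = gamma * np.abs(u - v).max()  # gamma * ||u - v||_inf
assert lhs <= rhs + 1e-12          # the backup contracts by at least gamma
```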
Convergence of Iterative Policy Evaluation and Policy Iteration

• The Bellman expectation operator $F_\pi$ has a unique fixed point
• $v_\pi$ is a fixed point of $F_\pi$ (by the Bellman expectation equation)
• By the contraction mapping theorem,
• iterative policy evaluation converges on $v_\pi$
Policy Improvement

• Suppose we have computed $v_\pi$ for a deterministic policy $\pi$
• For a given state $s$, would it be better to take an action $a \ne \pi(s)$?
• It is better to switch to action $a$ for state $s$ if and only if $q_\pi(s, a) > v_\pi(s)$
• And we can compute $q_\pi(s, a)$ from $v_\pi$ by:

$$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s')$$
Policy Improvement Cont.

• Do this for all states to get a new policy $\pi' \ge \pi$ that is greedy with respect to $v_\pi$:

$$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \arg\max_a \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \right)$$

• What if the policy is unchanged by this?
• Then the policy must be optimal!
Policy Iteration

Alternate policy evaluation (E) with policy improvement / "greedification" (I):

$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \dots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$$
Policy Iteration

1. Initialization
   $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$

2. Policy Evaluation
   Repeat
     $\Delta \leftarrow 0$
     For each $s \in \mathcal{S}$:
       $v \leftarrow V(s)$
       $V(s) \leftarrow \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, V(s') \right)$
       $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
   until $\Delta < \theta$ (a small positive number)

3. Policy Improvement
   policy-stable $\leftarrow$ true
   For each $s \in \mathcal{S}$:
     $a \leftarrow \pi(s)$
     $\pi(s) \leftarrow \arg\max_a \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, V(s') \right)$
     If $a \ne \pi(s)$, then policy-stable $\leftarrow$ false
   If policy-stable, then stop and return $V$ and $\pi$; else go to 2

Figure 4.3 (Sutton & Barto, Chapter 4, Dynamic Programming): Policy iteration (using iterative policy evaluation) for $v_*$. This algorithm has a subtle bug, in that it may never terminate if the policy continually switches between two or more policies that are equally good. The bug can be fixed by adding additional flags, but it makes the pseudocode so ugly that it is not worth it. :-)
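The pseudocode above can be sketched directly in NumPy. The 2-state, 2-action MDP is hypothetical (made-up numbers), and the evaluation step runs to a tolerance `theta` as in step 2:

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma = 0.9

def policy_iteration(T, r, gamma, theta=1e-8):
    n = r.shape[0]
    V = np.zeros(n)
    pi = np.zeros(n, dtype=int)            # deterministic policy pi[s]
    while True:
        # 2. Policy evaluation (iterative, to accuracy theta)
        delta = np.inf
        while delta >= theta:
            V_new = r[np.arange(n), pi] + gamma * np.einsum(
                'sp,p->s', T[pi, np.arange(n)], V)
            delta = np.abs(V_new - V).max()
            V = V_new
        # 3. Policy improvement: greedify with respect to V
        q = r + gamma * np.einsum('asp,p->sa', T, V)
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):     # policy stable -> optimal
            return V, pi
        pi = pi_new

V_star, pi_star = policy_iteration(T, r, gamma)

# The result satisfies the Bellman optimality equation.
q = r + gamma * np.einsum('asp,p->sa', T, V_star)
assert np.allclose(V_star, q.max(axis=1), atol=1e-6)
```

Note that ties in `argmax` are broken deterministically (lowest index), which sidesteps the non-termination bug mentioned in the figure caption.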
Iterative Policy Evaluation for the Small Gridworld

Same gridworld: an undiscounted ($\gamma = 1$) episodic task; nonterminal states 1, 2, …, 14; one terminal state, shown as a shaded square; actions that would take the agent off the grid leave the state unchanged; reward is $-1$ until the terminal state is reached. Policy $\pi$: equiprobable random actions.

[Figure: estimates $v^{[k]}$ for the random policy, up to $k = \infty$.]
Generalized Policy Iteration

• Does policy evaluation need to converge to $v_\pi$?
• Or should we introduce a stopping condition
  • e.g. $\epsilon$-convergence of the value function
• Or simply stop after $k$ iterations of iterative policy evaluation?
  • For example, in the small gridworld $k = 3$ was sufficient to achieve the optimal policy
• Why not update the policy every iteration? i.e. stop after $k = 1$
  • This is equivalent to value iteration (next section)
Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement, independent of their granularity.

[Figure: a geometric metaphor for the convergence of GPI — starting from $(V_0, \pi_0)$, evaluation steps ($V = v_\pi$) and improvement steps ($\pi = \mathrm{greedy}(V)$) alternately pull the pair toward $(v_*, \pi_*)$.]
Principle of Optimality

• Any optimal policy can be subdivided into two components:
  • an optimal first action $A_*$
  • followed by an optimal policy from the successor state $S'$
• Theorem (Principle of Optimality). A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if, for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$: $v_\pi(s') = v_*(s')$.
Value Iteration

• If we know the solution to the subproblems $v_*(s')$
• then the solution $v_*(s)$ can be found by one-step lookahead:

$$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_*(s') \right)$$

• The idea of value iteration is to apply these updates iteratively
• Intuition: start with the final rewards and work backwards
• Still works with loopy, stochastic MDPs
Example: Shortest Path

[Figure: a 4×4 gridworld with goal state $g$ in the top-left corner and reward $-1$ per step. Successive value-iteration estimates $V_1$ through $V_7$: all values start at 0, and each iteration propagates the correct negative distance-to-goal ($0, -1, -2, \dots, -6$) one step further out from the goal, until $V_7$ holds the optimal values.]
Value Iteration

• Problem: find the optimal policy $\pi$
• Solution: iterative application of the Bellman optimality backup
• $v_1 \to v_2 \to \dots \to v_*$
• Using synchronous backups:
  • at each iteration $k + 1$,
  • for all states $s \in \mathcal{S}$,
  • update $v_{k+1}(s)$ from $v_k(s')$
Value Iteration (2)

[Backup diagram: state $s$ with value $v_{k+1}(s)$, a max over actions $a$ and rewards $r$, to successor states $s'$ with values $v_k(s')$.]

$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_k(s') \right)$$

or, in matrix form,

$$v_{k+1} = \max_{a \in \mathcal{A}} \left( r(a) + \gamma T(a)\, v_k \right)$$
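The synchronous Bellman optimality backup can be sketched as follows, assuming NumPy and the same made-up 2-state, 2-action MDP as earlier examples:

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma = 0.9

def value_iteration(T, r, gamma, theta=1e-10):
    v = np.zeros(r.shape[0])
    while True:
        # Bellman optimality backup, synchronously for all states:
        # q[s, a] = r(s, a) + gamma * sum_s' T(s'|s, a) v(s')
        q = r + gamma * np.einsum('asp,p->sa', T, v)
        v_new = q.max(axis=1)
        if np.abs(v_new - v).max() < theta:
            # Extract the greedy policy w.r.t. the (near-)optimal values.
            return v_new, q.argmax(axis=1)
        v = v_new

v_star, pi_star = value_iteration(T, r, gamma)
```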
Bellman Optimality Backup is a Contraction

• Define the Bellman optimality backup operator $F^*$,

$$F^*(v) = \max_{a \in \mathcal{A}} \left( r(a) + \gamma T(a)\, v \right)$$

• This operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$ (similar to the previous proof):

$$\|F^*(u) - F^*(v)\|_\infty \le \gamma \|u - v\|_\infty$$
Convergence of Value Iteration

• The Bellman optimality operator $F^*$ has a unique fixed point
• $v_*$ is a fixed point of $F^*$ (by the Bellman optimality equation)
• By the contraction mapping theorem,
• value iteration converges on $v_*$
Synchronous Dynamic Programming Algorithms

| Problem    | Bellman Equation                                         | Algorithm                   |
|------------|----------------------------------------------------------|-----------------------------|
| Prediction | Bellman expectation equation                             | Iterative policy evaluation |
| Control    | Bellman expectation equation + greedy policy improvement | Policy iteration            |
| Control    | Bellman optimality equation                              | Value iteration             |

• Algorithms are based on the state-value function $v_\pi(s)$ or $v_*(s)$
• Complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states
• Could also apply to the action-value function $q_\pi(s, a)$ or $q_*(s, a)$
• Complexity $O(m^2n^2)$ per iteration
Efficiency of DP

• Finding an optimal policy is polynomial in the number of states…
• BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
• In practice, classical DP can be applied to problems with a few million states.
Asynchronous DP
• All the DP methods described so far require exhaustive sweeps of the entire state set.
• Asynchronous DP does not use sweeps. Instead it works like this:
• Repeat until convergence criterion is met:
• Pick a state at random and apply the appropriate backup
• Still need lots of computation, but does not get locked into hopelessly long sweeps
• Guaranteed to converge if all states continue to be selected
• Can you select states to backup intelligently? YES: an agent’s experience can act as a guide.
Asynchronous Dynamic Programming
• Three simple ideas for asynchronous dynamic programming:
• In-place dynamic programming
• Prioritized sweeping
• Real-time dynamic programming
In-Place Dynamic Programming

• Synchronous value iteration stores two copies of the value function: for all $s$ in $\mathcal{S}$,

$$v_{\text{new}}(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_{\text{old}}(s') \right)$$

then $v_{\text{old}} \leftarrow v_{\text{new}}$.

• In-place value iteration only stores one copy of the value function: for all $s$ in $\mathcal{S}$,

$$v(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v(s') \right)$$
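The two variants differ only in whether the sweep reads from a separate copy. A sketch assuming NumPy, on the same hypothetical 2-state MDP; both converge to the same fixed point:

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma, n = 0.9, 2

def q_from(v, s):
    # One-step lookahead: q(s, a) for all a, given current values v.
    return r[s] + gamma * T[:, s, :] @ v

# Synchronous: read v_old everywhere, write v_new, then swap.
v_old = np.zeros(n)
for _ in range(1000):
    v_new = np.array([q_from(v_old, s).max() for s in range(n)])
    v_old = v_new

# In-place: a single array; later states in the sweep already see updates.
v = np.zeros(n)
for _ in range(1000):
    for s in range(n):
        v[s] = q_from(v, s).max()

assert np.allclose(v, v_old, atol=1e-6)   # both converge to v*
```

The in-place sweep often converges in fewer sweeps, since fresh values propagate within a sweep, at the cost of the result depending on the state ordering.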
Prioritized Sweeping

• Use the magnitude of the Bellman error to guide state selection, e.g.

$$\left| \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v(s') \right) - v(s) \right|$$

• Back up the state with the largest remaining Bellman error
• Update the Bellman error of affected states after each backup
• Requires knowledge of the reverse dynamics (predecessor states)
• Can be implemented efficiently by maintaining a priority queue
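A minimal sketch of the idea, assuming NumPy and Python's `heapq`; the MDP is the same made-up 2-state example, and for simplicity every state is re-prioritized after each backup (a real implementation would update only the predecessors of the backed-up state):

```python
import heapq
import numpy as np

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma, n = 0.9, 2

def bellman_error(v, s):
    # | max_a ( r(s,a) + gamma sum_s' T(s'|s,a) v(s') ) - v(s) |
    return abs((r[s] + gamma * T[:, s, :] @ v).max() - v[s])

v = np.zeros(n)
# Priority queue keyed on negative error (heapq is a min-heap).
pq = [(-bellman_error(v, s), s) for s in range(n)]
heapq.heapify(pq)

for _ in range(2000):
    _, s = heapq.heappop(pq)                       # worst remaining state
    v[s] = (r[s] + gamma * T[:, s, :] @ v).max()   # back it up
    # Affected states changed priority; here we simply rebuild the queue.
    pq = [(-bellman_error(v, t), t) for t in range(n)]
    heapq.heapify(pq)

assert max(bellman_error(v, s) for s in range(n)) < 1e-6
```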
Real-Time Dynamic Programming

• Idea: only back up states that are relevant to the agent
• Use the agent's experience to guide the selection of states
• After each time-step $S_t, A_t, R_{t+1}$,
• back up the state $S_t$:

$$v(S_t) \leftarrow \max_{a \in \mathcal{A}} \left( r(S_t, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid S_t, a)\, v(s') \right)$$
Sample Backups

• In subsequent lectures we will consider sample backups,
• using sample rewards and sample transitions $(S, A, R, S')$
• instead of the reward function $r$ and transition dynamics $T$
• Advantages:
  • model-free: no advance knowledge of the MDP required
  • breaks the curse of dimensionality through sampling
  • cost of a backup is constant, independent of $n = |\mathcal{S}|$
Approximate Dynamic Programming

• Approximate the value function using a function approximator $v(s, \mathbf{w})$
• Apply dynamic programming to $v(\cdot, \mathbf{w})$
• e.g. Fitted Value Iteration repeats at each iteration $k$:
  • sample states $\tilde{\mathcal{S}} \subseteq \mathcal{S}$
  • for each state $s \in \tilde{\mathcal{S}}$, estimate the target value using the Bellman optimality equation,

$$\tilde{v}_k(s) = \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v(s', \mathbf{w}_k) \right)$$

  • train the next value function $v(\cdot, \mathbf{w}_{k+1})$ using targets $\{\langle s, \tilde{v}_k(s) \rangle\}$
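A sketch of fitted value iteration with a linear approximator, assuming NumPy. The MDP is the usual made-up 2-state example; one-hot features are chosen so the approximation is exact and the sketch behaves like tabular value iteration (with richer features, the fit would be approximate):

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma, n = 0.9, 2

phi = np.eye(n)   # phi[s]: feature vector of state s (one-hot here)
w = np.zeros(n)   # weights of the linear approximator v(s, w) = phi[s] @ w

for _ in range(500):
    v = phi @ w   # current values at the sampled states (all states here)
    # Bellman optimality targets for each sampled state s
    targets = (r + gamma * np.einsum('asp,p->sa', T, v)).max(axis=1)
    # "Train" the next value function: least-squares fit of w to the targets
    w, *_ = np.linalg.lstsq(phi, targets, rcond=None)

v = phi @ w
q = r + gamma * np.einsum('asp,p->sa', T, v)
assert np.allclose(v, q.max(axis=1), atol=1e-6)
```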
Linear Programming (LP)

• Recall, at value iteration convergence we have, $\forall s \in \mathcal{S}$:

$$V^*(s) = \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, V^*(s') \right)$$

• LP formulation to find $V^*$:

$$\min_V \sum_{s \in \mathcal{S}} \mu_0(s)\, V(s) \quad \text{s.t.} \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A}: \; V(s) \ge r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, V(s')$$

where $\mu_0$ is a probability distribution over $\mathcal{S}$, with $\mu_0(s) > 0$ for all $s$ in $\mathcal{S}$.

Theorem. $V^*$ is the solution to the above LP.
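The LP above can be handed to an off-the-shelf solver. A sketch assuming SciPy's `scipy.optimize.linprog` is available, on the usual made-up 2-state, 2-action MDP:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma, n, m = 0.9, 2, 2
mu0 = np.full(n, 1.0 / n)                  # any distribution with mu0(s) > 0

# One inequality per (s, a):  V(s) >= r(s,a) + gamma * sum_s' T(s'|s,a) V(s'),
# rewritten for linprog as:   (gamma * T[a, s, :] - e_s) @ V <= -r(s, a)
A_ub, b_ub = [], []
for s in range(n):
    for a in range(m):
        row = gamma * T[a, s, :].copy()
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-r[s, a])

res = linprog(c=mu0, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)   # V is unconstrained in sign
V_star = res.x
```

The minimizer of $\mu_0^\top V$ over this feasible set is exactly $V^*$, so `V_star` should satisfy the Bellman optimality equation up to solver tolerance.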
LP Con't

• Let $F$ be the Bellman optimality operator, i.e., $V_{i+1} = F(V_i)$. Then the LP can be written as:

$$\min_V \mu_0^\top V \quad \text{s.t.} \quad V \ge F(V)$$

Monotonicity property: if $U \ge V$ then $F(U) \ge F(V)$. Hence, if $V \ge F(V)$ then $F(V) \ge F(F(V))$, and by repeated application,

$$V \ge F(V) \ge F^2(V) \ge F^3(V) \ge \dots \ge F^\infty(V) = V^*$$

Any feasible solution to the LP must satisfy $V \ge F(V)$, and hence must satisfy $V \ge V^*$. Hence, assuming all entries in $\mu_0$ are positive, $V^*$ is the optimal solution to the LP.
LP: The Dual → $q^*$

$$\max_\lambda \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \lambda(s, a)\, r(s, a) \quad \text{s.t.} \quad \forall s' \in \mathcal{S}: \; \sum_{a' \in \mathcal{A}} \lambda(s', a') = \mu_0(s') + \gamma \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \lambda(s, a)\, T(s' \mid s, a), \quad \lambda \ge 0$$

• Interpretation:

$$\lambda(s, a) = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}(s_t = s, a_t = a)$$

i.e. $\lambda(s, a)$ is the discounted state–action occupancy.
• Equation 2 (the constraint) ensures that $\lambda$ has the above meaning
• Equation 1 (the objective) maximizes the expected discounted sum of rewards
• Optimal policy: $\pi^*(s) = \arg\max_a \lambda(s, a)$