
Page 1: Planning in Markov Decision Processes

Planning in Markov Decision Processes

Deep Reinforcement Learning and Control

Katerina Fragkiadaki

Carnegie Mellon, School of Computer Science

Lecture 3, CMU 10703

Page 2: Planning in Markov Decision Processes

Markov Decision Process (MDP)

A Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$:

• $\mathcal{S}$ is a finite set of states
• $\mathcal{A}$ is a finite set of actions
• $T$ is a state transition probability function, $T(s' \mid s, a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
• $r$ is a reward function, $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
• $\gamma \in [0, 1]$ is a discount factor

Page 3: Planning in Markov Decision Processes

Solving MDPs

• Prediction: given an MDP $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$ and a policy $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$, find the state and action value functions.

• Optimal control: given an MDP $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$, find the optimal policy (aka the planning problem). Compare with the learning problem, where information about the rewards/dynamics is missing.

• We still consider finite MDPs (finite $\mathcal{S}$ and $\mathcal{A}$) with known dynamics!

Page 4: Planning in Markov Decision Processes

Outline

• Policy iteration

• Value iteration

• Linear programming

• Asynchronous DP

Page 5: Planning in Markov Decision Processes

Policy Evaluation

Policy evaluation: for a given policy $\pi$, compute the state value function

$$v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s \right]$$

where $v_\pi(s)$ is implicitly given by the Bellman equation

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \right),$$

a system of $|\mathcal{S}|$ simultaneous equations.

Page 6: Planning in Markov Decision Processes

MDPs to MRPs

An MDP under a fixed policy $\pi$ becomes a Markov Reward Process (MRP):

$$\begin{aligned}
v_\pi(s) &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \right) \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s, a) + \gamma \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \\
&= r^\pi_s + \gamma \sum_{s' \in \mathcal{S}} T^\pi_{s' s}\, v_\pi(s')
\end{aligned}$$

where $r^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s, a)$ and $T^\pi_{s' s} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, T(s' \mid s, a)$.

Page 7: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Lecture 3: Planning by Dynamic Programming

Policy Evaluation

Iterative Policy Evaluation

Iterative Policy Evaluation (2)

a

r

vk+1(s) � s

vk(s0) � s0

vk+1

(s) =X

a2A⇡(a|s)

Ra

s

+ �X

s

02SPa

ss

0vk

(s 0)

!

vk+1 = R⇡R⇡R⇡ + �P⇡P⇡P⇡vk

r

Back Up Diagram

MDP

v⇡(s) s

v⇡(s0) s0

Page 8: Planning in Markov Decision Processes

Backup Diagram: MDP vs. MRP

Under the induced MRP, the backup at state $s$ averages over the policy once:

$$v_\pi(s) = r^\pi_s + \gamma \sum_{s' \in \mathcal{S}} T^\pi_{s' s}\, v_\pi(s')$$

which corresponds to the iterative update

$$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_k(s') \right), \qquad v_{k+1} = r^\pi + \gamma T^\pi v_k$$

(Backup diagrams for the MDP and the MRP: root $v_\pi(s)$ at state $s$, leaves $v_\pi(s')$ at the successor states $s'$.)

Page 9: Planning in Markov Decision Processes

Matrix Form

The Bellman expectation equation can be written concisely using the induced MRP as

$$v_\pi = r^\pi + \gamma T^\pi v_\pi$$

with direct solution

$$v_\pi = (I - \gamma T^\pi)^{-1} r^\pi$$

of complexity $O(N^3)$ for $N = |\mathcal{S}|$ states.
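As an illustration of the direct solution (not part of the slides), here is a minimal numpy sketch; the array names T_pi and r_pi and the tiny two-state example are made up for demonstration.

```python
import numpy as np

def evaluate_policy_direct(T_pi, r_pi, gamma):
    """Exact policy evaluation for the induced MRP: v = (I - gamma * T_pi)^{-1} r_pi.

    T_pi : (N, N) array, T_pi[s, s'] = probability of moving s -> s' under pi
    r_pi : (N,)  array, expected immediate reward in each state under pi
    """
    N = len(r_pi)
    # Solve the linear system (I - gamma * T_pi) v = r_pi instead of forming the inverse.
    return np.linalg.solve(np.eye(N) - gamma * T_pi, r_pi)

# Tiny two-state example (made up for illustration).
T_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])
print(evaluate_policy_direct(T_pi, r_pi, gamma=0.9))
```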

Page 10: Planning in Markov Decision Processes

Iterative Methods: Recall the Bellman Equation

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \right)$$

Page 11: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Lecture 3: Planning by Dynamic Programming

Policy Evaluation

Iterative Policy Evaluation

Iterative Policy Evaluation (2)

a

r

vk+1(s) � s

vk(s0) � s0

vk+1

(s) =X

a2A⇡(a|s)

Ra

s

+ �X

s

02SPa

ss

0vk

(s 0)

!

vk+1 = R⇡R⇡R⇡ + �P⇡P⇡P⇡vk

r

Iterative Methods: Backup Operation

Given an expected value function at iteration k, we back up the expected value function at iteration k+1:

v[k+1] sv[k+1](s) s

v[k+1](s) =X

a2A⇡(a|s)

r(s, a) + �

X

s02ST (s0|s, a)v[k](s0)

!

v[k](s0) s0

Page 12: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

A sweep consists of applying the backup operation for all the states in

Applying the back up operator iteratively

Iterative Methods: Sweep

v[0] ! v[1] ! v[2] ! . . . v⇡

v ! v0S
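A minimal sketch of sweep-based iterative policy evaluation, assuming tabular numpy arrays T (shape |S|×|A|×|S|), r (shape |S|×|A|), and a stochastic policy pi (shape |S|×|A|); the array layout and the stopping threshold theta are illustrative choices, not from the slides.

```python
import numpy as np

def iterative_policy_evaluation(T, r, pi, gamma, theta=1e-8):
    """Sweep-based iterative policy evaluation.

    T  : (S, A, S) array of transition probabilities T[s, a, s']
    r  : (S, A)    array of expected rewards r(s, a)
    pi : (S, A)    array of action probabilities pi(a | s)
    """
    S = T.shape[0]
    v = np.zeros(S)
    while True:
        # One sweep: back up every state from the current value function.
        q = r + gamma * T @ v            # q[s, a] = r(s,a) + gamma * sum_s' T(s'|s,a) v(s')
        v_new = np.sum(pi * q, axis=1)   # average over actions under pi
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```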

Page 13: Planning in Markov Decision Processes

A Small-Grid World

• An undiscounted episodic task ($\gamma = 1$)
• Nonterminal states: 1, 2, …, 14
• Terminal state: one, shown in the shaded square
• Actions that would take the agent off the grid leave the state unchanged
• Reward is $-1$ until the terminal state is reached
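To make the later sketches concrete, here is one possible encoding of this 4×4 gridworld as tabular arrays. The encoding is an assumption for illustration: index 0 denotes the single terminal state (both shaded corners map to it) and indices 1–14 are the nonterminal cells.

```python
import numpy as np

# 4x4 grid, cells numbered 0..15 row-major; cells 0 and 15 are the shaded corners,
# both treated as the single terminal state (index 0). Actions: up, down, left, right.
N_STATES, N_ACTIONS, GAMMA = 15, 4, 1.0
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def cell_to_state(cell):
    return 0 if cell in (0, 15) else cell

T = np.zeros((N_STATES, N_ACTIONS, N_STATES))
r = np.full((N_STATES, N_ACTIONS), -1.0)
T[0, :, 0] = 1.0      # terminal state is absorbing...
r[0, :] = 0.0         # ...and yields no further reward
for cell in range(1, 15):
    row, col = divmod(cell, 4)
    for a, (dr, dc) in enumerate(MOVES):
        nr, nc = row + dr, col + dc
        if not (0 <= nr < 4 and 0 <= nc < 4):
            nr, nc = row, col           # moves off the grid leave the state unchanged
        T[cell, a, cell_to_state(nr * 4 + nc)] = 1.0

# Equiprobable random policy used in the iterative policy evaluation example.
pi_random = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)
```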

Page 14: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

• An undiscounted episodic task

• Nonterminal states: 1, 2, … , 14

• Terminal state: one, shown in shaded square

• Actions that would take the agent off the grid leave the state unchanged

• Reward is -1 until the terminal state is reached

Policy , an equiprobable random action

Iterative Policy Evaluation

for therandom policyv[k]

1

Page 15: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

• An undiscounted episodic task

• Nonterminal states: 1, 2, … , 14

• Terminal state: one, shown in shaded square

• Actions that would take the agent off the grid leave the state unchanged

• Reward is -1 until the terminal state is reached

Policy , an equiprobable random action

Iterative Policy Evaluation

for therandom policyv[k]

1

Page 16: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

• An undiscounted episodic task

• Nonterminal states: 1, 2, … , 14

• Terminal state: one, shown in shaded square

• Actions that would take the agent off the grid leave the state unchanged

• Reward is -1 until the terminal state is reached

Policy , an equiprobable random action

Iterative Policy Evaluation

for therandom policyv[k]

1

Page 17: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

An operator on a normed vector space is a -contraction, for , provided for all

Theorem (Contraction mapping)For a -contraction in a complete normed vector space

• converges to a unique fixed point in

• at a linear convergence rate

Remark. In general we only need metric (vs normed) space

Contraction Mapping Theorem

F X �

||T (x)� T (y)|| �||x� y||

x, y 2 X0 < � < 1

� F X

F X�

Page 18: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Value Function Sapce

• Consider the vector space over value functions

• There are dimensions

• Each point in this space fully specifies a value function

• Bellman backup brings value functions closer in this space?

• And therefore the backup must converge to a unique solution

|S|v(s)

V

Page 19: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Value Function -Norm

• We will measure distance between state-value functions and by the -norm

• i.e. the largest difference between state values,

1

u v1

||u� v||1 = max

s2S|u(s)� v(s)|

Page 20: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Bellman Expectation Backup is a Contraction

• Define the Bellman expectation backup operator

• This operator is a -contraction, i.e. it makes value functions closer by at least , �

||F⇡(u)� F⇡(v)||1 = ||(r⇡ + �T⇡u)||1 � ||(r⇡ + �T⇡v)||1= ||�T⇡(u� v)||1 ||�T⇡||u� v||1||1 �||u� v||1

F⇡(v) = r⇡ + �T⇡v

Page 21: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Convergence of Iter. Policy Evaluation and Policy Iteration

• The Bellman expectation operator has a unique fixed point

• is a fixed point of (by Bellman expectation equation)

• By contraction mapping theorem

• Iterative policy evaluation converges on

F⇡

v⇡

F⇡v⇡

Page 22: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Policy Improvement

• Suppose we have computed for a deterministic policy

• For a given state , would it be better to do an action ?

• It is better to switch to action for state if and only if

• And we can compute from by:

v⇡ ⇡

s a 6= ⇡(s)

a s

q⇡(s, a) > v⇡(s)

q⇡(s, a) v⇡

q⇡(s, a) = E[Rt+1 + �v⇡(St+1)|St = s,At = a]

= r(s, a) + �X

s02ST (s0|s, a)v⇡(s0)
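A small sketch of this computation in the tabular setting; the array names follow the earlier snippets and the numerical tolerance is an implementation detail, not from the slides.

```python
import numpy as np

def q_from_v(T, r, v, gamma):
    """q_pi(s, a) = r(s, a) + gamma * sum_s' T(s'|s, a) v_pi(s')."""
    return r + gamma * T @ v                      # shape (|S|, |A|)

def improvable_states(T, r, v, gamma, tol=1e-10):
    """States s where some action a satisfies q_pi(s, a) > v_pi(s)."""
    q = q_from_v(T, r, v, gamma)
    return np.where(q.max(axis=1) > v + tol)[0]   # tol guards against round-off ties
```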

Page 23: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Policy Improvement Cont.

• Do this for all states to get a new policy that is greedy with respect to :

• What if the policy is unchanged by this?

• Then the policy must be optimal!

⇡0 � ⇡v⇡

⇡0(s) = argmax

aq⇡(s, a)

= argmax

aE[Rt+1 + �v⇡(s

0)|St = s,At = a]

= argmax r(s, a) + �X

s02ST (s0|s, a)v⇡(s0)

Page 24: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Policy Iteration

policy evaluation policy improvement“greedification”

⇡0E�! v⇡0

I�! ⇡1E�! v⇡1

I�! ⇡2E�! ...

I�! ⇡⇤E�! v⇤

Page 25: Planning in Markov Decision Processes

Policy Iteration

(Pseudocode from Chapter 4, Dynamic Programming, of Sutton & Barto, written in the course notation.)

1. Initialization
   $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$

2. Policy Evaluation
   Repeat
     $\Delta \leftarrow 0$
     For each $s \in \mathcal{S}$:
       $v \leftarrow V(s)$
       $V(s) \leftarrow \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, V(s') \right)$
       $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
   until $\Delta < \theta$ (a small positive number)

3. Policy Improvement
   policy-stable $\leftarrow$ true
   For each $s \in \mathcal{S}$:
     $a \leftarrow \pi(s)$
     $\pi(s) \leftarrow \arg\max_a \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, V(s') \right)$
     If $a \ne \pi(s)$, then policy-stable $\leftarrow$ false
   If policy-stable, then stop and return $V$ and $\pi$; else go to 2

Figure 4.3 (Sutton & Barto): Policy iteration (using iterative policy evaluation) for $v_*$. This algorithm has a subtle bug: it may never terminate if the policy continually switches between two or more policies that are equally good. The bug can be fixed by adding additional flags, but that makes the pseudocode so ugly that it is not worth it. :-)
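A compact Python version of this procedure for the tabular arrays used in the earlier sketches; theta, the evaluation sweep cap, and the tie-breaking of argmax are implementation choices, not part of the pseudocode above.

```python
import numpy as np

def policy_iteration(T, r, gamma, theta=1e-8, max_eval_sweeps=1000):
    """Policy iteration with iterative policy evaluation (cf. the pseudocode above)."""
    S, A = r.shape
    V = np.zeros(S)
    pi = np.zeros(S, dtype=int)              # deterministic policy: pi[s] = action index
    while True:
        # 2. Policy evaluation for the current deterministic policy.
        # The sweep cap guards against non-terminating evaluation when gamma = 1
        # and the current policy never reaches the terminal state.
        for _ in range(max_eval_sweeps):
            q = r + gamma * T @ V            # q[s, a]
            V_new = q[np.arange(S), pi]
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < theta:
                break
        # 3. Policy improvement ("greedification").
        pi_new = np.argmax(r + gamma * T @ V, axis=1)
        if np.array_equal(pi_new, pi):       # policy stable -> return V and pi
            return V, pi
        pi = pi_new
```

With the gridworld arrays sketched earlier, policy_iteration(T, r, GAMMA) should recover a shortest-path policy to the terminal state.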

Page 26: Planning in Markov Decision Processes

Iterative Policy Eval for the Small Gridworld

• An undiscounted episodic task ($\gamma = 1$)
• Nonterminal states: 1, 2, …, 14
• Terminal state: one, shown in the shaded square
• Actions that take the agent off the grid leave the state unchanged
• Reward is $-1$ until the terminal state is reached

Policy $\pi$: the equiprobable random policy. (Figure: $v^{[k]}$ and the corresponding greedy policy over sweeps; omitted.)

Page 27: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

• An undiscounted episodic task

• Nonterminal states: 1, 2, … , 14

• Terminal state: one, shown in shaded square

• Actions that take the agent off the grid leave the state unchanged

• Reward is -1 until the terminal state is reached

Iterative Policy Eval for the Small Gridworld

R

γ = 1

Policy , an equiprobable random action⇡

Page 28: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

• Does policy evaluation need to converge to ?

• Or should we introduce a stopping condition

• e.g. -convergence of value function

• Or simply stop after k iterations of iterative policy evaluation?

• For example, in the small grid world k = 3 was sufficient to achieve optimal policy

• Why not update policy every iteration? i.e. stop after k = 1

• This is equivalent to value iteration (next section)

Generalized Policy Iteration

v⇡

Page 29: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement, independent of their granularity.

A geometric metaphor forconvergence of GPI:

evaluation

improvement

⇡ greedy(V )

V⇡

V v⇡

v⇤⇡⇤

v⇤,⇡⇤

V0,⇡0

V = v⇡

⇡ = greedy(V )

v⇡

v⇡

v⇤

v⇤

Page 30: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Principle of Optimality

• Any optimal policy can be subdivided into two components:

• An optimal first action

• Followed by an optimal policy from successor state

• Theorem (Principle of Optimality)

• A policy achieves the optimal value from state , dfsfdsfdf dsfdf , if and only if

• For any state reachable from , achieves the optimal value from state ,

⇡(a|s) sv⇡(s) = v⇤(s)

s0 s ⇡s0 v⇡(s

0) = v⇤(s0)

A⇤

S 0

Page 31: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

• If we know the solution to subproblems

• Then solution can be found by one-step lookahead

• The idea of value iteration is to apply these updates iteratively

• Intuition: start with final rewards and work backwards

• Still works with loopy, stochastic MDPs

Value Iteration

v⇤(s0)

v⇤(s0)

v⇤(s) max

a2Ar(s, a) + �

X

s02ST (s0|s, a)v⇤(s0)
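A minimal synchronous value iteration sketch in the same tabular setup; array names and the convergence threshold are assumptions carried over from the earlier snippets.

```python
import numpy as np

def value_iteration(T, r, gamma, theta=1e-8):
    """Iteratively apply the Bellman optimality backup until the values stop changing."""
    S = r.shape[0]
    v = np.zeros(S)
    while True:
        q = r + gamma * T @ v            # q[s, a] = r(s,a) + gamma * sum_s' T(s'|s,a) v(s')
        v_new = q.max(axis=1)            # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    pi = q.argmax(axis=1)                # greedy policy w.r.t. the converged values
    return v_new, pi
```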

Page 32: Planning in Markov Decision Processes

Example: Shortest Path

(Figure omitted: value iteration on a small gridworld with goal state g in the top-left corner and a reward of $-1$ per step. The panels show the problem and the value functions $V_1$ through $V_7$ after successive sweeps: values start at 0 everywhere and propagate outward from the goal one step per sweep, until each cell holds minus its shortest-path distance to g.)

Page 33: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Value Iteration

• Problem: find optimal policy

• Solution: iterative application of Bellman optimality backup

• Using synchronous backups

• At each iteration k + 1

• For all states

• Update from

v1 ! v2 ! ... ! v⇤

vk+1(s) vk(s0)

s 2 S

Page 34: Planning in Markov Decision Processes

Value Iteration (2)

$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_k(s') \right)$$

or, in matrix form,

$$v_{k+1} = \max_{a \in \mathcal{A}} \left( r(a) + \gamma T(a)\, v_k \right)$$

(Backup diagram: $v_{k+1}(s)$ at the root state $s$, maximizing over actions $a$ and rewards $r$, with $v_k(s')$ at the successor states $s'$.)

Page 35: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Bellman Optimality Backup is a Contraction

• Define the Bellman optimality backup operator ,

• This operator is a -contraction, i.e. it makes value functions closer by at least (similar to previous proof)

F ⇤

F ⇤(v) = max

a2Ar(a) + �T (a)v

||F ⇤(u)� F ⇤(v)||1 �||u� v||1

��

Page 36: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Convergence of Value Iteration

• The Bellman optimality operator has a unique fixed point

• is a fixed point of (by Bellman optimality equation)

• By contraction mapping theorem

• Value iteration converges on

F ⇤

v⇤ F ⇤

v⇤

Page 37: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

• Algorithms are based on state-value function or

• Complexity per iteration, for actions and states

• Could also apply to action-value function or • Complexity per iteration

Synchronous Dynamic Programming Algorithms

Problem Bellman Equation Algorithm

Prediction Bellman Expectation Equation Iterative Policy Evaluation

Control Bellman Expectation Equation + Greedy Policy Improvement Policy Iteration

Control Bellman Optimality Equation Value Iteration

v⇤(s)v⇡(s)

q⇡(s, a) q⇤(s, a)O(mn2)

O(m2n2)

m n

Page 38: Planning in Markov Decision Processes

Efficiency of DP

• Finding an optimal policy is polynomial in the number of states…
• BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
• In practice, classical DP can be applied to problems with a few million states.

Page 39: Planning in Markov Decision Processes

Asynchronous DP

• All the DP methods described so far require exhaustive sweeps of the entire state set.
• Asynchronous DP does not use sweeps. Instead it works like this:
  • Repeat until a convergence criterion is met:
    • Pick a state at random and apply the appropriate backup.
• Still needs lots of computation, but does not get locked into hopelessly long sweeps.
• Guaranteed to converge if all states continue to be selected.
• Can you select the states to back up intelligently? YES: an agent's experience can act as a guide.

Page 40: Planning in Markov Decision Processes

Asynchronous Dynamic Programming

• Three simple ideas for asynchronous dynamic programming:

• In-place dynamic programming

• Prioritized sweeping

• Real-time dynamic programming

Page 41: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

• Synchronous value iteration stores two copies of value function

• for all in

• In-place value iteration only stores one copy of value function

• for all in

In-Place Dynamic Programming

s Sv

new

(s) max

a2A

r(s, a) + �

X

s

02ST (s0|s, a)v

old

(s0)

!

v(s) max

a2A

r(s, a) + �

X

s02ST (s0|s, a)v(s0)

!

vold

vnew

s S
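One possible in-place sweep, in which each state immediately sees the freshest values of states updated earlier in the same sweep (setup as in the earlier sketches):

```python
import numpy as np

def in_place_sweep(T, r, v, gamma):
    """One in-place sweep of value iteration; v is modified as the sweep proceeds."""
    for s in range(len(v)):
        # Later states in the sweep immediately see the updated values of earlier states.
        v[s] = np.max(r[s] + gamma * T[s] @ v)
    return v
```

Repeating in_place_sweep until the values stop changing gives in-place value iteration.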

Page 42: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Prioritized Sweeping

• Use magnitude of Bellman error to guide state selection, e.g.

• Backup the state with the largest remaining Bellman error

• Update Bellman poor of affected states after each backup

• Requires knowledge of reverse dynamics (predecessor states)

• Can be implemented efficiently by maintaining a priority queue

�����max

a2A

r(s, a) + �

X

s02ST (s0|s, a)v(s0)

!� v(s)

�����
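A rough sketch of this idea with a priority queue. Python's heapq is a min-heap, so priorities are negated, and the lazy re-insertion scheme, the backup budget n_backups, and the threshold eps are illustrative assumptions rather than a canonical implementation.

```python
import heapq
import numpy as np

def prioritized_sweeping(T, r, gamma, n_backups=10_000, eps=1e-8):
    """Back up whichever state currently has the largest Bellman error."""
    S = r.shape[0]
    v = np.zeros(S)
    # Predecessors of s_next: states s with T(s_next|s,a) > 0 for some a (reverse dynamics).
    pred = [np.nonzero(T[:, :, s_next].sum(axis=1))[0] for s_next in range(S)]

    def bellman_error(s):
        return abs(np.max(r[s] + gamma * T[s] @ v) - v[s])

    heap = [(-bellman_error(s), s) for s in range(S)]
    heapq.heapify(heap)
    for _ in range(n_backups):
        neg_err, s = heapq.heappop(heap)
        if -neg_err < eps:
            break                                  # all remaining errors are tiny
        v[s] = np.max(r[s] + gamma * T[s] @ v)     # backup of the selected state
        for p in pred[s]:                          # its predecessors' errors changed
            heapq.heappush(heap, (-bellman_error(p), p))
    return v
```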

Page 43: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Real-time Dynamic Programming

• Idea: only states that are relevant to agent

• Use agent’s experience to guide the selection of states

• After each time-step

• Backup the state

v(St) max

a2A

r(St, a) + �

X

s02ST (s0|St, a)v(s

0)

!

St,At, rt+1

St
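A sketch of the interaction loop, assuming a simple environment object whose reset() returns a state index and whose step(a) returns (next_state, reward, done); this interface and the greedy behavior policy are hypothetical scaffolding, not from the slides.

```python
import numpy as np

def rtdp_episode(env, T, r, v, gamma):
    """Run one episode, backing up only the states the agent actually visits."""
    s = env.reset()
    done = False
    while not done:
        q_s = r[s] + gamma * T[s] @ v
        v[s] = np.max(q_s)                 # backup of the visited state S_t
        a = int(np.argmax(q_s))            # act greedily w.r.t. the current values
        s, _, done = env.step(a)
    return v
```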

Page 44: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Full-width and sample backups

Sample Backups

In subsequent lectures we will consider sample backups

Using sample rewards and sample transitionshS ,A,R , S 0iInstead of reward function R and transition dynamics PAdvantages:

Model-free: no advance knowledge of MDP requiredBreaks the curse of dimensionality through samplingCost of backup is constant, independent of n = |S|

Sample Backups

• In subsequent lectures we will consider sample backups

• Using sample rewards and sample transitions

• Instead of reward function and transition dynamics

• Advantages:

• Model-free: no advance knowledge of MDP required

• Breaks the curse of dimensionality through sampling

• Cost of backup is constant, independent of n = |S|

(S,A, r,S 0)

Page 45: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Approximate Dynamic Programming

• Approximate the value function

• Using a function approximate

• Apply dynamic programming to

• e.g. Fitted Value Iteration repeats at each iteration k,

• Sample states

• For each state , estimate target value using Bellman optimality equation,

• Train next value function using targets

v(s,w)

v(· ,w)

S ✓ Ss 2 S

v(· ,wk+1) {hs, vk(s)i}

vk(s) = max

a2A

r(s, a) + �

X

s02ST (s0|s, a)v(s0,wk)

!
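A minimal fitted value iteration sketch with a linear approximator $v(s, \mathbf{w}) = \phi(s)^\top \mathbf{w}$; the feature map phi, the sampling scheme, and the least-squares fit are illustrative assumptions, not the slides' prescription.

```python
import numpy as np

def fitted_value_iteration(T, r, gamma, phi, n_iters=50, n_samples=32, seed=0):
    """Fitted VI with a linear value function v(s, w) = phi(s) . w."""
    rng = np.random.default_rng(seed)
    S = r.shape[0]
    Phi_all = np.stack([phi(s) for s in range(S)])   # features for every state
    w = np.zeros(Phi_all.shape[1])
    for _ in range(n_iters):
        states = rng.choice(S, size=min(n_samples, S), replace=False)     # sample states
        v_all = Phi_all @ w                                               # v(s', w_k) for all s'
        targets = np.max(r[states] + gamma * T[states] @ v_all, axis=1)   # Bellman targets
        # Fit w_{k+1} by least squares on the sampled (s, target) pairs.
        w, *_ = np.linalg.lstsq(Phi_all[states], targets, rcond=None)
    return w
```

With a one-hot feature map, e.g. phi = lambda s: np.eye(S)[s], this reduces to the tabular case.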

Page 46: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

Linear Programming (LP)

• Recall, at value iteration convergence we have

• LP formulation to find

8s 2 S :

Theorem. is the solution to the above LP.

is a probability distribution over , with for all in . µ0 µ0(s) > 0 sS S

V ⇤

minV

X

Sµ0(s)V (s)

s.t. 8s 2 S, 8a 2 A :

V (s) � r(s, a) + �X

s02ST (s0|s, a)V ⇤(s0)

V ⇤(s) = max

a2A

r(s, a) + �

X

s02ST (s0|s, a)V ⇤

(s0)

!
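A sketch of this LP using scipy.optimize.linprog; rewriting the constraints into A_ub x <= b_ub form and choosing a uniform mu_0 are implementation details, not from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, r, gamma):
    """Find V*: min_V sum_s mu0(s) V(s)  s.t.  V(s) >= r(s,a) + gamma * sum_s' T(s'|s,a) V(s')."""
    S, A = r.shape
    mu0 = np.full(S, 1.0 / S)                    # any distribution with mu0(s) > 0 works
    # Rewrite each constraint as (gamma * T[s, a, :] - e_s) . V <= -r(s, a).
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = gamma * T[s, a, :].copy()
            row[s] -= 1.0
            A_ub[s * A + a] = row
            b_ub[s * A + a] = -r[s, a]
    res = linprog(c=mu0, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * S, method="highs")
    return res.x                                  # V*
```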

Page 47: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

• Let be the Bellman operator, i.e., . Then the LP can be written as:

LP con’t

F V ⇤i+1 = F (Vi)

Monotonicity Property: If then .Hence, if then , and by repeated application,

Any feasible solution to the LP must satisfy , and hence must satisfy . Hence, assuming all entries in are positive, is the optimal solution to the LP.

U � V F (U) � F (V )V � F (V ) F (V ) � F (F (V ))

V � F (V ) � F 2V � F 3V � ... � F1V = V ⇤

V � V ⇤ V ⇤µ0

V � F (V )

minV

µ>0 V

s.t. V � F (V )

Page 48: Planning in Markov Decision Processesnlp.jbnu.ac.kr/AI/slides_cmu_RL/lecture3_mdp_planning.pdfdfsfdsfdf dsfdf , if and only if • For any state reachable from , achieves the optimal

• Interpretation

• Equation 2: ensures that has the above meaning

• Equation 1: maximize expected discounted sum of rewards

• Optimal policy:

LP: The Dual -> q*

⇡⇤(s) = argmax

a�(s, a)

�(s, a) =1X

t=0

�tP (st = s, at = a)

min�

X

s2S

X

a2A

X

s02S�(s, a)T (s, a, s0)r(s, a)

s.t. 8s 2 S :X

a02A�(s0, a0) = µ0(s) + �

X

s2S

X

a2A�(s, a)T (s, a, s0)
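For completeness, a sketch of the dual LP with the same tooling; the explicit nonnegativity bound on lambda and the uniform mu_0 are assumptions of this sketch rather than details given on the slide.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_dual_lp(T, r, gamma):
    """Solve the dual LP over visitation frequencies lambda(s, a); read off pi*(s) = argmax_a lambda(s, a)."""
    S, A = r.shape
    mu0 = np.full(S, 1.0 / S)
    # Since sum_s' T(s'|s,a) = 1, the triple-sum objective equals sum_{s,a} lambda(s,a) r(s,a).
    c = -r.reshape(S * A)                          # negate: linprog minimizes, the dual maximizes
    # Flow constraints: sum_a' lam(s',a') - gamma * sum_{s,a} lam(s,a) T(s'|s,a) = mu0(s') for every s'.
    A_eq = np.zeros((S, S * A))
    for s_next in range(S):
        A_eq[s_next, s_next * A:(s_next + 1) * A] = 1.0
        A_eq[s_next] -= gamma * T[:, :, s_next].reshape(S * A)
    res = linprog(c=c, A_eq=A_eq, b_eq=mu0,
                  bounds=[(0, None)] * (S * A), method="highs")
    lam = res.x.reshape(S, A)
    return lam, lam.argmax(axis=1)                 # visitation frequencies and greedy policy
```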