Planning in Markov Decision Processes
Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon School of Computer Science
Lecture 3, CMU 10703
Markov Decision Process (MDP)

A Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$
• $\mathcal{S}$ is a finite set of states
• $\mathcal{A}$ is a finite set of actions
• $T$ is a state transition probability function, $T(s' \mid s, a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
• $r$ is a reward function, $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
• $\gamma \in [0, 1]$ is a discount factor
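As a concrete illustration, a finite MDP can be stored directly as arrays. This is a minimal sketch assuming NumPy; the 2-state, 2-action MDP below is made up for illustration, not taken from the slides:

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP, stored as arrays (numbers made up):
# T[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
T = np.array([[[0.9, 0.1],    # action 0 from state 0
               [0.2, 0.8]],   # action 0 from state 1
              [[0.1, 0.9],    # action 1 from state 0
               [0.7, 0.3]]])  # action 1 from state 1
# r[s, a] = E[R_{t+1} | S_t = s, A_t = a]
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9  # discount factor

# Each (a, s) row of T must be a probability distribution over s'.
assert np.allclose(T.sum(axis=2), 1.0)
```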
Solving MDPs

• Prediction: Given an MDP $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$ and a policy $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$, find the state and action value functions.
• Optimal control: Given an MDP $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$, find the optimal policy (a.k.a. the planning problem). Compare with the learning problem, where information about the rewards/dynamics is missing.
• We still consider finite MDPs (finite $\mathcal{S}$ and $\mathcal{A}$) with known dynamics!
Outline
• Policy iteration
• Value iteration
• Linear programming
• Asynchronous DP
Policy Evaluation

Policy evaluation: for a given policy $\pi$, compute the state value function

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$$

where $v_\pi$ is implicitly given by the Bellman equation

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \right)$$

a system of $|\mathcal{S}|$ simultaneous equations.
MDPs to MRPs

An MDP under a fixed policy $\pi$ becomes a Markov Reward Process (MRP):

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \right)
= \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s, a) + \gamma \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s')
= r^\pi_s + \gamma \sum_{s' \in \mathcal{S}} T^\pi_{s's}\, v_\pi(s')$$

where $r^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s, a)$ and $T^\pi_{s's} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, T(s' \mid s, a)$.
Back Up Diagram

MDP

[Backup diagram: state $s$ with value $v_\pi(s)$ branches over actions $a$ and rewards $r$ to successor states $s'$ with values $v_\pi(s')$.]
Back Up Diagram

MDP vs. MRP

[Backup diagram: under the induced MRP, state $s$ with value $v_\pi(s)$ transitions directly to successor states $s'$ with values $v_\pi(s')$.]

$$v_\pi(s) = r^\pi_s + \gamma \sum_{s' \in \mathcal{S}} T^\pi_{s's}\, v_\pi(s')$$
Matrix Form

The Bellman expectation equation can be written concisely using the induced MRP as

$$v_\pi = r^\pi + \gamma T^\pi v_\pi$$

with direct solution

$$v_\pi = (I - \gamma T^\pi)^{-1} r^\pi$$

of complexity $O(N^3)$, where $N = |\mathcal{S}|$.
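The direct solution is one linear solve. A minimal sketch assuming NumPy, on a hypothetical 2-state, 2-action MDP with made-up numbers:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers made up for illustration).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma = 0.9
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi[s, a]: equiprobable policy

# Induced MRP: r_pi[s] = sum_a pi(a|s) r(s,a); T_pi[s,s'] = sum_a pi(a|s) T(s'|s,a)
r_pi = (pi * r).sum(axis=1)
T_pi = np.einsum('sa,asp->sp', pi, T)

# Direct O(N^3) solution: v_pi = (I - gamma T_pi)^{-1} r_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * T_pi, r_pi)

# The solution satisfies the Bellman expectation equation.
assert np.allclose(v_pi, r_pi + gamma * T_pi @ v_pi)
```

In practice one calls `solve` rather than forming the inverse explicitly; both are cubic in $N$, which motivates the iterative methods that follow.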
Iterative Methods: Recall the Bellman Equation

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \right)$$
Iterative Methods: Backup Operation

Given an expected value function at iteration $k$, we back up the expected value function at iteration $k+1$:

$$v^{[k+1]}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v^{[k]}(s') \right)$$

[Backup diagram: state $s$ with value $v^{[k+1]}(s)$ branches over actions $a$ and rewards $r$ to successor states $s'$ with values $v^{[k]}(s')$.]
Iterative Methods: Sweep

A sweep consists of applying the backup operation $v \to v'$ for all the states in $\mathcal{S}$. Applying the backup operator iteratively:

$$v^{[0]} \to v^{[1]} \to v^{[2]} \to \dots \to v_\pi$$
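The sweep above can be sketched in a few lines. This assumes NumPy and reuses the same made-up 2-state, 2-action MDP as earlier examples:

```python
import numpy as np

# Hypothetical MDP (made-up numbers) and an equiprobable policy.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma = 0.9
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi[s, a]

def backup(v):
    """One synchronous sweep: v[k] -> v[k+1] over all states."""
    # q[s, a] = r(s, a) + gamma * sum_s' T(s'|s, a) v(s')
    q = r + gamma * np.einsum('asp,p->sa', T, v)
    return (pi * q).sum(axis=1)

v = np.zeros(2)          # v[0]
for _ in range(500):     # v[0] -> v[1] -> ... -> v_pi
    v = backup(v)

# At convergence, v is a fixed point of the backup operator.
assert np.allclose(v, backup(v))
```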
A Small Gridworld

• An undiscounted episodic task ($\gamma = 1$)
• Nonterminal states: 1, 2, …, 14
• Terminal state: one, shown as a shaded square
• Actions that would take the agent off the grid leave the state unchanged
• Reward is $-1$ until the terminal state is reached
Iterative Policy Evaluation

Same gridworld, with policy $\pi$ the equiprobable random policy.

[Figure: successive estimates $v^{[k]}$ for the random policy.]
Contraction Mapping Theorem

An operator $F$ on a normed vector space $\mathcal{X}$ is a $\gamma$-contraction, for $0 < \gamma < 1$, provided for all $x, y \in \mathcal{X}$

$$\|F(x) - F(y)\| \le \gamma \|x - y\|$$

Theorem (Contraction mapping). For a $\gamma$-contraction $F$ in a complete normed vector space $\mathcal{X}$:
• iterating $F$ converges to a unique fixed point in $\mathcal{X}$
• at a linear convergence rate $\gamma$

Remark. In general we only need a metric (vs. normed) space.
Value Function Space

• Consider the vector space $\mathcal{V}$ over value functions $v(s)$
• There are $|\mathcal{S}|$ dimensions
• Each point in this space fully specifies a value function
• Does the Bellman backup bring value functions closer in this space?
• If so, the backup must converge to a unique solution
Value Function $\infty$-Norm

• We will measure the distance between state-value functions $u$ and $v$ by the $\infty$-norm
• i.e. the largest difference between state values:

$$\|u - v\|_\infty = \max_{s \in \mathcal{S}} |u(s) - v(s)|$$
Bellman Expectation Backup is a Contraction

• Define the Bellman expectation backup operator $F_\pi(v) = r^\pi + \gamma T^\pi v$
• This operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$:

$$\|F_\pi(u) - F_\pi(v)\|_\infty = \|(r^\pi + \gamma T^\pi u) - (r^\pi + \gamma T^\pi v)\|_\infty = \gamma \|T^\pi (u - v)\|_\infty \le \gamma \|u - v\|_\infty$$

where the last step uses that each row of $T^\pi$ sums to 1, so $\|T^\pi w\|_\infty \le \|w\|_\infty$.
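The contraction property is easy to check numerically. A sketch assuming NumPy, using a randomly generated (hypothetical) induced MRP rather than any MDP from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 4, 0.9

# Hypothetical induced MRP: a random row-stochastic T_pi and random rewards.
T_pi = rng.random((n, n))
T_pi /= T_pi.sum(axis=1, keepdims=True)
r_pi = rng.standard_normal(n)

def F(v):
    # Bellman expectation backup: F_pi(v) = r_pi + gamma * T_pi v
    return r_pi + gamma * T_pi @ v

u, v = rng.standard_normal(n), rng.standard_normal(n)
lhs = np.abs(F(u) - F(v)).max()    # ||F(u) - F(v)||_inf
rhs = gamma * np.abs(u - v).max()  # gamma * ||u - v||_inf
assert lhs <= rhs + 1e-12          # the backup contracts by at least gamma
```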
Convergence of Iterative Policy Evaluation and Policy Iteration

• The Bellman expectation operator $F_\pi$ has a unique fixed point
• $v_\pi$ is a fixed point of $F_\pi$ (by the Bellman expectation equation)
• By the contraction mapping theorem,
• iterative policy evaluation converges on $v_\pi$
Policy Improvement

• Suppose we have computed $v_\pi$ for a deterministic policy $\pi$
• For a given state $s$, would it be better to take an action $a \ne \pi(s)$?
• It is better to switch to action $a$ for state $s$ if and only if $q_\pi(s, a) > v_\pi(s)$
• And we can compute $q_\pi(s, a)$ from $v_\pi$ by:

$$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s')$$
Policy Improvement Cont.

• Do this for all states to get a new policy $\pi' \ge \pi$ that is greedy with respect to $v_\pi$:

$$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \arg\max_a \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_\pi(s') \right)$$

• What if the policy is unchanged by this?
• Then the policy must be optimal!
Policy Iteration

Alternate policy evaluation (E) with policy improvement / "greedification" (I):

$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \dots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$$
Policy Iteration

1. Initialization
   $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$

2. Policy Evaluation
   Repeat
     $\Delta \leftarrow 0$
     For each $s \in \mathcal{S}$:
       $v \leftarrow V(s)$
       $V(s) \leftarrow \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, V(s') \right)$
       $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
   until $\Delta < \theta$ (a small positive number)

3. Policy Improvement
   policy-stable $\leftarrow$ true
   For each $s \in \mathcal{S}$:
     $a \leftarrow \pi(s)$
     $\pi(s) \leftarrow \arg\max_a \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, V(s') \right)$
     If $a \ne \pi(s)$, then policy-stable $\leftarrow$ false
   If policy-stable, then stop and return $V$ and $\pi$; else go to 2

Figure 4.3 (Sutton & Barto, Chapter 4, Dynamic Programming): Policy iteration (using iterative policy evaluation) for $v_*$. This algorithm has a subtle bug, in that it may never terminate if the policy continually switches between two or more policies that are equally good. The bug can be fixed by adding additional flags, but it makes the pseudocode so ugly that it is not worth it. :-)
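The pseudocode above can be sketched directly in NumPy. The 2-state, 2-action MDP is hypothetical (made-up numbers), and the evaluation step runs to a tolerance `theta` as in step 2:

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma = 0.9

def policy_iteration(T, r, gamma, theta=1e-8):
    n = r.shape[0]
    V = np.zeros(n)
    pi = np.zeros(n, dtype=int)            # deterministic policy pi[s]
    while True:
        # 2. Policy evaluation (iterative, to accuracy theta)
        delta = np.inf
        while delta >= theta:
            V_new = r[np.arange(n), pi] + gamma * np.einsum(
                'sp,p->s', T[pi, np.arange(n)], V)
            delta = np.abs(V_new - V).max()
            V = V_new
        # 3. Policy improvement: greedify with respect to V
        q = r + gamma * np.einsum('asp,p->sa', T, V)
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):     # policy stable -> optimal
            return V, pi
        pi = pi_new

V_star, pi_star = policy_iteration(T, r, gamma)

# The result satisfies the Bellman optimality equation.
q = r + gamma * np.einsum('asp,p->sa', T, V_star)
assert np.allclose(V_star, q.max(axis=1), atol=1e-6)
```

Note that ties in `argmax` are broken deterministically (lowest index), which sidesteps the non-termination bug mentioned in the figure caption.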
Iterative Policy Evaluation for the Small Gridworld

Same gridworld: an undiscounted ($\gamma = 1$) episodic task; nonterminal states 1, 2, …, 14; one terminal state, shown as a shaded square; actions that would take the agent off the grid leave the state unchanged; reward is $-1$ until the terminal state is reached. Policy $\pi$: equiprobable random actions.

[Figure: estimates $v^{[k]}$ for the random policy, up to $k = \infty$.]
Generalized Policy Iteration

• Does policy evaluation need to converge to $v_\pi$?
• Or should we introduce a stopping condition
  • e.g. $\epsilon$-convergence of the value function
• Or simply stop after $k$ iterations of iterative policy evaluation?
  • For example, in the small gridworld $k = 3$ was sufficient to achieve the optimal policy
• Why not update the policy every iteration? i.e. stop after $k = 1$
  • This is equivalent to value iteration (next section)
Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement, independent of their granularity.

[Figure: a geometric metaphor for the convergence of GPI — starting from $(V_0, \pi_0)$, evaluation steps ($V = v_\pi$) and improvement steps ($\pi = \mathrm{greedy}(V)$) alternately pull the pair toward $(v_*, \pi_*)$.]
Principle of Optimality

• Any optimal policy can be subdivided into two components:
  • an optimal first action $A_*$
  • followed by an optimal policy from the successor state $S'$
• Theorem (Principle of Optimality). A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if, for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$: $v_\pi(s') = v_*(s')$.
Value Iteration

• If we know the solution to the subproblems $v_*(s')$
• then the solution $v_*(s)$ can be found by one-step lookahead:

$$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_*(s') \right)$$

• The idea of value iteration is to apply these updates iteratively
• Intuition: start with the final rewards and work backwards
• Still works with loopy, stochastic MDPs
Example: Shortest Path

[Figure: a 4×4 gridworld with goal state $g$ in the top-left corner and reward $-1$ per step. Successive value-iteration estimates $V_1$ through $V_7$: all values start at 0, and each iteration propagates the correct negative distance-to-goal ($0, -1, -2, \dots, -6$) one step further out from the goal, until $V_7$ holds the optimal values.]
Value Iteration

• Problem: find the optimal policy $\pi$
• Solution: iterative application of the Bellman optimality backup
• $v_1 \to v_2 \to \dots \to v_*$
• Using synchronous backups:
  • at each iteration $k + 1$,
  • for all states $s \in \mathcal{S}$,
  • update $v_{k+1}(s)$ from $v_k(s')$
Value Iteration (2)

[Backup diagram: state $s$ with value $v_{k+1}(s)$, a max over actions $a$ and rewards $r$, to successor states $s'$ with values $v_k(s')$.]

$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_k(s') \right)$$

or, in matrix form,

$$v_{k+1} = \max_{a \in \mathcal{A}} \left( r(a) + \gamma T(a)\, v_k \right)$$
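The synchronous Bellman optimality backup can be sketched as follows, assuming NumPy and the same made-up 2-state, 2-action MDP as earlier examples:

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma = 0.9

def value_iteration(T, r, gamma, theta=1e-10):
    v = np.zeros(r.shape[0])
    while True:
        # Bellman optimality backup, synchronously for all states:
        # q[s, a] = r(s, a) + gamma * sum_s' T(s'|s, a) v(s')
        q = r + gamma * np.einsum('asp,p->sa', T, v)
        v_new = q.max(axis=1)
        if np.abs(v_new - v).max() < theta:
            # Extract the greedy policy w.r.t. the (near-)optimal values.
            return v_new, q.argmax(axis=1)
        v = v_new

v_star, pi_star = value_iteration(T, r, gamma)
```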
Bellman Optimality Backup is a Contraction

• Define the Bellman optimality backup operator $F^*$,

$$F^*(v) = \max_{a \in \mathcal{A}} \left( r(a) + \gamma T(a)\, v \right)$$

• This operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$ (similar to the previous proof):

$$\|F^*(u) - F^*(v)\|_\infty \le \gamma \|u - v\|_\infty$$
Convergence of Value Iteration

• The Bellman optimality operator $F^*$ has a unique fixed point
• $v_*$ is a fixed point of $F^*$ (by the Bellman optimality equation)
• By the contraction mapping theorem,
• value iteration converges on $v_*$
Synchronous Dynamic Programming Algorithms

| Problem    | Bellman Equation                                         | Algorithm                   |
|------------|----------------------------------------------------------|-----------------------------|
| Prediction | Bellman expectation equation                             | Iterative policy evaluation |
| Control    | Bellman expectation equation + greedy policy improvement | Policy iteration            |
| Control    | Bellman optimality equation                              | Value iteration             |

• Algorithms are based on the state-value function $v_\pi(s)$ or $v_*(s)$
• Complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states
• Could also apply to the action-value function $q_\pi(s, a)$ or $q_*(s, a)$
• Complexity $O(m^2n^2)$ per iteration
Efficiency of DP

• Finding an optimal policy is polynomial in the number of states…
• BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
• In practice, classical DP can be applied to problems with a few million states.
Asynchronous DP
• All the DP methods described so far require exhaustive sweeps of the entire state set.
• Asynchronous DP does not use sweeps. Instead it works like this:
• Repeat until convergence criterion is met:
• Pick a state at random and apply the appropriate backup
• Still need lots of computation, but does not get locked into hopelessly long sweeps
• Guaranteed to converge if all states continue to be selected
• Can you select states to backup intelligently? YES: an agent’s experience can act as a guide.
Asynchronous Dynamic Programming
• Three simple ideas for asynchronous dynamic programming:
• In-place dynamic programming
• Prioritized sweeping
• Real-time dynamic programming
In-Place Dynamic Programming

• Synchronous value iteration stores two copies of the value function: for all $s$ in $\mathcal{S}$,

$$v_{\text{new}}(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v_{\text{old}}(s') \right)$$

then $v_{\text{old}} \leftarrow v_{\text{new}}$.

• In-place value iteration only stores one copy of the value function: for all $s$ in $\mathcal{S}$,

$$v(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v(s') \right)$$
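The two variants differ only in whether the sweep reads from a separate copy. A sketch assuming NumPy, on the same hypothetical 2-state MDP; both converge to the same fixed point:

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma, n = 0.9, 2

def q_from(v, s):
    # One-step lookahead: q(s, a) for all a, given current values v.
    return r[s] + gamma * T[:, s, :] @ v

# Synchronous: read v_old everywhere, write v_new, then swap.
v_old = np.zeros(n)
for _ in range(1000):
    v_new = np.array([q_from(v_old, s).max() for s in range(n)])
    v_old = v_new

# In-place: a single array; later states in the sweep already see updates.
v = np.zeros(n)
for _ in range(1000):
    for s in range(n):
        v[s] = q_from(v, s).max()

assert np.allclose(v, v_old, atol=1e-6)   # both converge to v*
```

The in-place sweep often converges in fewer sweeps, since fresh values propagate within a sweep, at the cost of the result depending on the state ordering.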
Prioritized Sweeping

• Use the magnitude of the Bellman error to guide state selection, e.g.

$$\left| \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v(s') \right) - v(s) \right|$$

• Back up the state with the largest remaining Bellman error
• Update the Bellman error of affected states after each backup
• Requires knowledge of the reverse dynamics (predecessor states)
• Can be implemented efficiently by maintaining a priority queue
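A minimal sketch of the idea, assuming NumPy and Python's `heapq`; the MDP is the same made-up 2-state example, and for simplicity every state is re-prioritized after each backup (a real implementation would update only the predecessors of the backed-up state):

```python
import heapq
import numpy as np

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma, n = 0.9, 2

def bellman_error(v, s):
    # | max_a ( r(s,a) + gamma sum_s' T(s'|s,a) v(s') ) - v(s) |
    return abs((r[s] + gamma * T[:, s, :] @ v).max() - v[s])

v = np.zeros(n)
# Priority queue keyed on negative error (heapq is a min-heap).
pq = [(-bellman_error(v, s), s) for s in range(n)]
heapq.heapify(pq)

for _ in range(2000):
    _, s = heapq.heappop(pq)                       # worst remaining state
    v[s] = (r[s] + gamma * T[:, s, :] @ v).max()   # back it up
    # Affected states changed priority; here we simply rebuild the queue.
    pq = [(-bellman_error(v, t), t) for t in range(n)]
    heapq.heapify(pq)

assert max(bellman_error(v, s) for s in range(n)) < 1e-6
```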
Real-Time Dynamic Programming

• Idea: only back up states that are relevant to the agent
• Use the agent's experience to guide the selection of states
• After each time-step $S_t, A_t, R_{t+1}$,
• back up the state $S_t$:

$$v(S_t) \leftarrow \max_{a \in \mathcal{A}} \left( r(S_t, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid S_t, a)\, v(s') \right)$$
Sample Backups

• In subsequent lectures we will consider sample backups,
• using sample rewards and sample transitions $(S, A, R, S')$
• instead of the reward function $r$ and transition dynamics $T$
• Advantages:
  • model-free: no advance knowledge of the MDP required
  • breaks the curse of dimensionality through sampling
  • cost of a backup is constant, independent of $n = |\mathcal{S}|$
Approximate Dynamic Programming

• Approximate the value function using a function approximator $v(s, \mathbf{w})$
• Apply dynamic programming to $v(\cdot, \mathbf{w})$
• e.g. Fitted Value Iteration repeats at each iteration $k$:
  • sample states $\tilde{\mathcal{S}} \subseteq \mathcal{S}$
  • for each state $s \in \tilde{\mathcal{S}}$, estimate the target value using the Bellman optimality equation,

$$\tilde{v}_k(s) = \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, v(s', \mathbf{w}_k) \right)$$

  • train the next value function $v(\cdot, \mathbf{w}_{k+1})$ using targets $\{\langle s, \tilde{v}_k(s) \rangle\}$
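A sketch of fitted value iteration with a linear approximator, assuming NumPy. The MDP is the usual made-up 2-state example; one-hot features are chosen so the approximation is exact and the sketch behaves like tabular value iteration (with richer features, the fit would be approximate):

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma, n = 0.9, 2

phi = np.eye(n)   # phi[s]: feature vector of state s (one-hot here)
w = np.zeros(n)   # weights of the linear approximator v(s, w) = phi[s] @ w

for _ in range(500):
    v = phi @ w   # current values at the sampled states (all states here)
    # Bellman optimality targets for each sampled state s
    targets = (r + gamma * np.einsum('asp,p->sa', T, v)).max(axis=1)
    # "Train" the next value function: least-squares fit of w to the targets
    w, *_ = np.linalg.lstsq(phi, targets, rcond=None)

v = phi @ w
q = r + gamma * np.einsum('asp,p->sa', T, v)
assert np.allclose(v, q.max(axis=1), atol=1e-6)
```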
Linear Programming (LP)

• Recall, at value iteration convergence we have, $\forall s \in \mathcal{S}$:

$$V^*(s) = \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, V^*(s') \right)$$

• LP formulation to find $V^*$:

$$\min_V \sum_{s \in \mathcal{S}} \mu_0(s)\, V(s) \quad \text{s.t.} \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A}: \; V(s) \ge r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a)\, V(s')$$

where $\mu_0$ is a probability distribution over $\mathcal{S}$, with $\mu_0(s) > 0$ for all $s$ in $\mathcal{S}$.

Theorem. $V^*$ is the solution to the above LP.
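The LP above can be handed to an off-the-shelf solver. A sketch assuming SciPy's `scipy.optimize.linprog` is available, on the usual made-up 2-state, 2-action MDP:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical MDP (made-up numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # T[a, s, s']
r = np.array([[0.0, 1.0], [2.0, 0.0]])     # r[s, a]
gamma, n, m = 0.9, 2, 2
mu0 = np.full(n, 1.0 / n)                  # any distribution with mu0(s) > 0

# One inequality per (s, a):  V(s) >= r(s,a) + gamma * sum_s' T(s'|s,a) V(s'),
# rewritten for linprog as:   (gamma * T[a, s, :] - e_s) @ V <= -r(s, a)
A_ub, b_ub = [], []
for s in range(n):
    for a in range(m):
        row = gamma * T[a, s, :].copy()
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-r[s, a])

res = linprog(c=mu0, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)   # V is unconstrained in sign
V_star = res.x
```

The minimizer of $\mu_0^\top V$ over this feasible set is exactly $V^*$, so `V_star` should satisfy the Bellman optimality equation up to solver tolerance.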
LP Con't

• Let $F$ be the Bellman optimality operator, i.e., $V_{i+1} = F(V_i)$. Then the LP can be written as:

$$\min_V \mu_0^\top V \quad \text{s.t.} \quad V \ge F(V)$$

Monotonicity property: if $U \ge V$ then $F(U) \ge F(V)$. Hence, if $V \ge F(V)$ then $F(V) \ge F(F(V))$, and by repeated application,

$$V \ge F(V) \ge F^2(V) \ge F^3(V) \ge \dots \ge F^\infty(V) = V^*$$

Any feasible solution to the LP must satisfy $V \ge F(V)$, and hence must satisfy $V \ge V^*$. Hence, assuming all entries in $\mu_0$ are positive, $V^*$ is the optimal solution to the LP.
LP: The Dual → $q^*$

$$\max_\lambda \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \lambda(s, a)\, r(s, a) \quad \text{s.t.} \quad \forall s' \in \mathcal{S}: \; \sum_{a' \in \mathcal{A}} \lambda(s', a') = \mu_0(s') + \gamma \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \lambda(s, a)\, T(s' \mid s, a), \quad \lambda \ge 0$$

• Interpretation:

$$\lambda(s, a) = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}(s_t = s, a_t = a)$$

i.e. $\lambda(s, a)$ is the discounted state–action occupancy.
• Equation 2 (the constraint) ensures that $\lambda$ has the above meaning
• Equation 1 (the objective) maximizes the expected discounted sum of rewards
• Optimal policy: $\pi^*(s) = \arg\max_a \lambda(s, a)$