
Page 1: Markov Decision Processes Infinite Horizon Problems

Markov Decision Processes
Infinite Horizon Problems

Alan Fern *

* Based in part on slides by Craig Boutilier and Daniel Weld

Page 2: Markov Decision Processes Infinite Horizon Problems

What is a solution to an MDP?

MDP Planning Problem:
- Input: an MDP (S, A, R, T)
- Output: a policy that achieves an "optimal value"

- This depends on how we define the value of a policy.

- There are several choices, and the solution algorithms depend on the choice.

- We will consider two common choices:
  - Finite-Horizon Value
  - Infinite Horizon Discounted Value

Page 3: Markov Decision Processes Infinite Horizon Problems

Discounted Infinite Horizon MDPs

- Defining value as total reward is problematic with infinite horizons (r1 + r2 + r3 + r4 + …):
  - many or all policies have infinite expected reward
  - some MDPs are OK (e.g., zero-cost absorbing states)

- "Trick": introduce a discount factor 0 ≤ β < 1
  - future rewards are discounted by β per time step

- Note (value definition below):

- Motivation: economic? probability of death? convenience?

$V^{\pi}(s) \;=\; E\!\left[\,\sum_{t=0}^{\infty} \beta^{t} r_{t} \;\middle|\; \pi, s \right]$

Bounded Value:

$V^{\pi}(s) \;\le\; \sum_{t=0}^{\infty} \beta^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\beta}$
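As a quick sanity check of the bound, here is a worked instance with assumed values β = 0.9 and R_max = 1 (these numbers are my own, not from the slides):

```latex
% Geometric-series bound on discounted value, assuming beta = 0.9 and R_max = 1
V^{\pi}(s) \;\le\; \sum_{t=0}^{\infty} \beta^{t} R_{\max}
          \;=\; \frac{R_{\max}}{1-\beta}
          \;=\; \frac{1}{1-0.9}
          \;=\; 10
```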

Page 4: Markov Decision Processes Infinite Horizon Problems
Page 5: Markov Decision Processes Infinite Horizon Problems

Notes: Discounted Infinite Horizon

- Optimal policies are guaranteed to exist (Howard, 1960):
  - i.e., there is a policy that maximizes value at each state

- Furthermore, there is always an optimal stationary policy:
  - Intuition: why would we change the action at s at a later time, when there is always forever ahead?

- We define V*(s) to be the optimal value function:
  - That is, V*(s) = Vπ(s) for some optimal stationary π

Page 6: Markov Decision Processes Infinite Horizon Problems

Computational Problems

- Policy Evaluation:
  - Given a policy π and an MDP, compute Vπ

- Policy Optimization:
  - Given an MDP, compute an optimal policy π* and V*
  - We'll cover two algorithms for doing this: value iteration and policy iteration

Page 7: Markov Decision Processes Infinite Horizon Problems

Policy Evaluation

- Value equation for a fixed policy π:

  $V^{\pi}(s) \;=\; R(s) \;+\; \beta \sum_{s'} T(s, \pi(s), s')\, V^{\pi}(s')$

  (immediate reward + discounted expected value of following policy π in the future)

- The equation can be derived from the original definition of infinite horizon discounted value.

Page 8: Markov Decision Processes Infinite Horizon Problems

Policy Evaluation

- Value equation for a fixed policy π:

  $V^{\pi}(s) \;=\; R(s) \;+\; \beta \sum_{s'} T(s, \pi(s), s')\, V^{\pi}(s')$

- How can we compute the value function for a fixed policy?
  - we are given R, T, and π and want to find Vπ(s) for each s
  - this is a linear system with n variables and n constraints
    - Variables are the values of the states: Vπ(s1), …, Vπ(sn)
    - Constraints: one value equation (above) per state
  - Use linear algebra to solve for Vπ (e.g., matrix inverse)

Page 9: Markov Decision Processes Infinite Horizon Problems
Page 10: Markov Decision Processes Infinite Horizon Problems

Policy Evaluation via Matrix Inverse

- Vπ and R are n-dimensional column vectors (one element for each state)

- Tπ is an n×n matrix such that $T^{\pi}(i, j) = T(s_i, \pi(s_i), s_j)$

  $V^{\pi} \;=\; R + \beta T^{\pi} V^{\pi}$

  $(I - \beta T^{\pi})\, V^{\pi} \;=\; R$

  $V^{\pi} \;=\; (I - \beta T^{\pi})^{-1} R$
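A minimal sketch of this computation in NumPy, assuming R, T, and the policy are stored as dense arrays (the array names and shapes here are my own choices, not from the slides):

```python
import numpy as np

def evaluate_policy(R, T, policy, beta):
    """Exact policy evaluation: solve (I - beta * T_pi) V_pi = R.

    R: rewards, shape (n,); T: transitions, shape (n, n_actions, n);
    policy: action index per state, shape (n,); beta: discount in [0, 1).
    """
    n = len(R)
    T_pi = T[np.arange(n), policy]                        # T_pi[i, j] = T(s_i, policy(s_i), s_j)
    return np.linalg.solve(np.eye(n) - beta * T_pi, R)    # solve the linear system
```

Using np.linalg.solve rather than forming the explicit inverse is the usual choice: it computes the same Vπ but is cheaper and numerically safer.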

Page 11: Markov Decision Processes Infinite Horizon Problems

Computing an Optimal Value Function

- Bellman equation for the optimal value function:

  $V^{*}(s) \;=\; R(s) \;+\; \beta \max_{a} \sum_{s'} T(s, a, s')\, V^{*}(s')$

  (immediate reward + discounted expected value of the best action, assuming we get optimal value in the future)

- Bellman proved this is always true for an optimal value function.

Page 12: Markov Decision Processes Infinite Horizon Problems

Computing an Optimal Value Function

- Bellman equation for the optimal value function:

  $V^{*}(s) \;=\; R(s) \;+\; \beta \max_{a} \sum_{s'} T(s, a, s')\, V^{*}(s')$

- How can we solve this equation for V*?
  - The max operator makes the system non-linear, so the problem is more difficult than policy evaluation.

- Idea: let's pretend that we have a finite, but very, very long, horizon and apply finite-horizon value iteration.
  - Adjust the Bellman backup to take discounting into account.

Page 13: Markov Decision Processes Infinite Horizon Problems

Bellman Backups (Revisited)

[Figure: Bellman backup diagram. From state s with value Vk+1(s), each action (a1, a2) branches to successor states (s1–s4) with transition probabilities (0.7/0.3 and 0.4/0.6); compute the expectation of Vk over successors for each action, then take the max.]

$V_{k+1}(s) \;=\; R(s) \;+\; \beta \max_{a} \sum_{s'} T(s, a, s')\, V_{k}(s')$

Page 14: Markov Decision Processes Infinite Horizon Problems

Value Iteration

- Can compute an optimal policy using value iteration based on Bellman backups, just like finite-horizon problems (but include the discount term):

  $V_{0}(s) \;=\; 0$  (could also initialize to R(s))

  $V_{k+1}(s) \;=\; R(s) \;+\; \beta \max_{a} \sum_{s'} T(s, a, s')\, V_{k}(s')$

- Will it converge to the optimal value function as k gets large? Yes:

  $\lim_{k \to \infty} V_{k} \;=\; V^{*}$

- Why?
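A compact sketch of this loop in NumPy, under the same assumed array layout as the earlier policy-evaluation sketch (the tolerance eps and iteration cap are my own additions):

```python
import numpy as np

def value_iteration(R, T, beta, eps=1e-6, max_iters=10_000):
    """Discounted value iteration via repeated Bellman backups.

    R: rewards, shape (n,); T: transitions, shape (n, n_actions, n);
    beta: discount factor in [0, 1).
    """
    V = np.zeros(len(R))                          # V_0 = 0 (could also start from R)
    for _ in range(max_iters):
        V_new = R + beta * (T @ V).max(axis=1)    # Bellman backup: max over actions
        if np.max(np.abs(V_new - V)) <= eps:      # stop when successive iterates are close
            return V_new
        V = V_new
    return V
```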

Page 15: Markov Decision Processes Infinite Horizon Problems

Convergence of Value Iteration

- Bellman Backup Operator: define B to be an operator that takes a value function V as input and returns a new value function after a Bellman backup:

  $B[V](s) \;=\; R(s) \;+\; \beta \max_{a} \sum_{s'} T(s, a, s')\, V(s')$

- Value iteration is just the iterative application of B:

  $V_{0} \;=\; 0, \qquad V_{k+1} \;=\; B[V_{k}]$

Page 16: Markov Decision Processes Infinite Horizon Problems

Convergence: Fixed Point Property

- Bellman equation for the optimal value function:

  $V^{*}(s) \;=\; R(s) \;+\; \beta \max_{a} \sum_{s'} T(s, a, s')\, V^{*}(s')$

- Bellman backup operator:

  $B[V](s) \;=\; R(s) \;+\; \beta \max_{a} \sum_{s'} T(s, a, s')\, V(s')$

- Fixed Point Property: the optimal value function is a fixed point of the Bellman backup operator B.
  - That is, B[V*] = V*

Page 17: Markov Decision Processes Infinite Horizon Problems

Convergence: Contraction Property

- Let ||V|| denote the max-norm of V, which returns the maximum element (in absolute value) of the vector.
  - E.g., ||(0.1 100 5 6)|| = 100

- B[V] is a contraction operator w.r.t. the max-norm.

- For any V and V', || B[V] - B[V'] || ≤ β || V - V' ||
  - You will prove this.

- That is, applying B to any two value functions causes them to get closer together in the max-norm sense!
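As an illustration (not a proof), here is a small NumPy check of the contraction inequality on a randomly generated MDP; the names, sizes, and random-seed choices are assumptions made for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, beta = 4, 3, 0.9

# Random MDP: rewards R[s] and transition model T[s, a, s'] with rows summing to 1.
T = rng.random((n, n_actions, n))
T /= T.sum(axis=2, keepdims=True)
R = rng.random(n)

def B(V):
    # Bellman backup: B[V](s) = R(s) + beta * max_a sum_s' T(s, a, s') V(s')
    return R + beta * (T @ V).max(axis=1)

V1, V2 = rng.random(n), rng.random(n)
lhs = np.max(np.abs(B(V1) - B(V2)))     # || B[V1] - B[V2] ||
rhs = beta * np.max(np.abs(V1 - V2))    # beta * || V1 - V2 ||
print(lhs <= rhs)                       # contraction property: prints True
```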

Page 18: Markov Decision Processes Infinite Horizon Problems
Page 19: Markov Decision Processes Infinite Horizon Problems

Convergence

- Using the properties of B we can prove the convergence of value iteration.

- Proof:
  1. For any V: || V* - B[V] || = || B[V*] - B[V] || ≤ β || V* - V ||
  2. So applying a Bellman backup to any value function V brings us closer to V* by a constant factor β:
     || V* - V_{k+1} || = || V* - B[V_k] || ≤ β || V* - V_k ||
  3. This means that || V* - V_k || ≤ β^k || V* - V_0 ||
  4. Thus  $\lim_{k \to \infty} \| V^{*} - V_{k} \| = 0$

Page 20: Markov Decision Processes Infinite Horizon Problems

Value Iteration: Stopping Condition

- Want to stop when we can guarantee the value function is near optimal.

- Key property (not hard to prove):

  If ||V_k - V_{k-1}|| ≤ ε then ||V_k - V*|| ≤ εβ/(1-β)

- Continue iterating until ||V_k - V_{k-1}|| ≤ ε
  - Select a small enough ε for the desired error guarantee
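One way to verify the key property, using the contraction of B (a short derivation added here, not from the slides):

```latex
\|V_k - V^*\| \;\le\; \|V_k - V_{k+1}\| + \|V_{k+1} - V^*\|
             \;=\; \|B[V_{k-1}] - B[V_k]\| + \|B[V_k] - B[V^*]\|
             \;\le\; \beta\,\|V_{k-1} - V_k\| + \beta\,\|V_k - V^*\|
             \;\le\; \beta\varepsilon + \beta\,\|V_k - V^*\| ,
\qquad\text{so}\qquad
(1-\beta)\,\|V_k - V^*\| \;\le\; \beta\varepsilon
\;\;\Longrightarrow\;\;
\|V_k - V^*\| \;\le\; \frac{\varepsilon\beta}{1-\beta}.
```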

Page 21: Markov Decision Processes Infinite Horizon Problems

How to Act

- Given a Vk from value iteration that closely approximates V*, what should we use as our policy?

- Use the greedy policy (one-step lookahead):

  $\mathrm{greedy}[V_k](s) \;=\; \arg\max_{a} \sum_{s'} T(s, a, s')\, V_k(s')$

- Note that the value of the greedy policy may not be equal to Vk
  - Why?
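A one-line version of this lookahead in NumPy, under the same assumed array layout as the earlier sketches:

```python
import numpy as np

def greedy_policy(V, T):
    """greedy[V](s) = argmax_a sum_s' T(s, a, s') V(s').

    V: value estimate, shape (n,); T: transitions, shape (n, n_actions, n).
    R(s) is omitted since it does not depend on the action being chosen.
    """
    return np.argmax(T @ V, axis=1)
```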

Page 22: Markov Decision Processes Infinite Horizon Problems

How to Act

- Use the greedy policy (one-step lookahead):

  $\mathrm{greedy}[V_k](s) \;=\; \arg\max_{a} \sum_{s'} T(s, a, s')\, V_k(s')$

- We care about the value of the greedy policy, which we denote by Vg.
  - This is how good the greedy policy will be in practice.

- How close is Vg to V*?

Page 23: Markov Decision Processes Infinite Horizon Problems

Value of Greedy Policy

  $\mathrm{greedy}[V_k](s) \;=\; \arg\max_{a} \sum_{s'} T(s, a, s')\, V_k(s')$

- Define Vg to be the value of this greedy policy
  - This is likely not the same as Vk

- Property: If ||Vk - V*|| ≤ λ then ||Vg - V*|| ≤ 2λβ/(1-β)
  - Thus, Vg is not too far from optimal if Vk is close to optimal

- Our previous stopping condition allows us to bound λ based on ||Vk+1 - Vk||

- Set the stopping condition so that ||Vg - V*|| ≤ Δ
  - How?

Page 24: Markov Decision Processes Infinite Horizon Problems

Property: If ||Vk – V*|| ≤ λ then ||Vg - V*|| ≤ 2λβ /(1-β)

Property: If ||Vk - Vk-1||≤ ε then ||Vk – V*|| ≤ εβ /(1-β)

Goal: ||Vg - V*|| ≤ Δ

Answer: If ||Vk - Vk-1|| ≤ Δ(1-β)²/(2β²) then ||Vg - V*|| ≤ Δ
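For completeness, the threshold follows by chaining the two properties above (a short derivation, not from the slides):

```latex
\|V_k - V_{k-1}\| \le \varepsilon
\;\Rightarrow\; \|V_k - V^*\| \le \frac{\varepsilon\beta}{1-\beta} = \lambda
\;\Rightarrow\; \|V^g - V^*\| \le \frac{2\lambda\beta}{1-\beta}
               = \frac{2\varepsilon\beta^{2}}{(1-\beta)^{2}},
\qquad\text{so choosing } \varepsilon = \frac{\Delta(1-\beta)^{2}}{2\beta^{2}}
\text{ gives } \|V^g - V^*\| \le \Delta .
```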

Page 25: Markov Decision Processes Infinite Horizon Problems

Policy Evaluation Revisited

- Sometimes policy evaluation is expensive due to the matrix operations.

- Can we have an iterative algorithm like value iteration for policy evaluation?

- Idea: Given a policy π and an MDP M, create a new MDP M[π] that is identical to M, except that in each state s we only allow a single action, π(s).
  - What is the optimal value function V* for M[π]?

- Since the only valid policy for M[π] is π, V* = Vπ.

Page 26: Markov Decision Processes Infinite Horizon Problems

Policy Evaluation Revisited

- Running VI on M[π] will converge to V* = Vπ.
  - What does the Bellman backup look like here?

- The Bellman backup now only considers one action in each state, so there is no max.
  - We are effectively applying a backup restricted by π.

- Restricted Bellman Backup:

  $B_{\pi}[V](s) \;=\; R(s) \;+\; \beta \sum_{s'} T(s, \pi(s), s')\, V(s')$

Page 27: Markov Decision Processes Infinite Horizon Problems

Iterative Policy Evaluation

- Running VI on M[π] is equivalent to iteratively applying the restricted Bellman backup.

- Iterative Policy Evaluation:

  $V_{0} \;=\; 0, \qquad V_{k+1} \;=\; B_{\pi}[V_{k}]$

- Convergence:

  $\lim_{k \to \infty} V_{k} \;=\; V^{\pi}$

- Vk often becomes close to Vπ for small k.
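A sketch of this iteration in NumPy, with the same assumed array layout as before and an assumed fixed number of backups k:

```python
import numpy as np

def iterative_policy_evaluation(R, T, policy, beta, k=100):
    """Approximate V_pi by applying the restricted Bellman backup B_pi k times."""
    n = len(R)
    T_pi = T[np.arange(n), policy]     # transition matrix under the fixed policy
    V = np.zeros(n)                    # V_0 = 0
    for _ in range(k):
        V = R + beta * T_pi @ V        # V_{k+1} = B_pi[V_k]; no max over actions
    return V
```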

Page 28: Markov Decision Processes Infinite Horizon Problems

Optimization via Policy Iteration

- Policy iteration uses policy evaluation as a subroutine for optimization.

- It alternates steps of policy evaluation and policy improvement:

  1. Choose a random policy π
  2. Loop:
     (a) Evaluate Vπ
     (b) π' = ImprovePolicy(Vπ)    [given Vπ, returns a strictly better policy if π isn't optimal]
     (c) Replace π with π'
     Until no improving action is possible at any state

Page 29: Markov Decision Processes Infinite Horizon Problems

Policy Improvement

- Given Vπ, how can we compute a policy π' that is strictly better than a sub-optimal π?

- Idea: given a state s, take the action that looks best assuming that we follow policy π thereafter.
  - That is, assume the next state s' has value Vπ(s')

Proposition: Vπ' ≥ Vπ, with strict inequality for sub-optimal π.

For each s in S, set

  $\pi'(s) \;=\; \arg\max_{a} \sum_{s'} T(s, a, s')\, V^{\pi}(s')$

Page 30: Markov Decision Processes Infinite Horizon Problems

Proposition: Vπ' ≥ Vπ, with strict inequality for sub-optimal π.

  $\pi'(s) \;=\; \arg\max_{a} \sum_{s'} T(s, a, s')\, V^{\pi}(s')$

For any two value functions V1 and V2, we write V1 ≥ V2 to indicate that V1(s) ≥ V2(s) for all states s.

Useful Properties for Proof:

1) Monotonicity: for any V1 and V2, if V1 ≥ V2 then B[V1] ≥ B[V2] (and likewise for the restricted backup Bπ).

Page 31: Markov Decision Processes Infinite Horizon Problems

Proposition: Vπ' ≥ Vπ, with strict inequality for sub-optimal π.

  $\pi'(s) \;=\; \arg\max_{a} \sum_{s'} T(s, a, s')\, V^{\pi}(s')$

Proof:

Page 32: Markov Decision Processes Infinite Horizon Problems

Proposition: Vπ' ≥ Vπ, with strict inequality for sub-optimal π.

  $\pi'(s) \;=\; \arg\max_{a} \sum_{s'} T(s, a, s')\, V^{\pi}(s')$

Proof:

Page 33: Markov Decision Processes Infinite Horizon Problems

Optimization via Policy Iteration

1. Choose a random policy π
2. Loop:
   (a) Evaluate Vπ
   (b) For each s in S, set  $\pi'(s) \;=\; \arg\max_{a} \sum_{s'} T(s, a, s')\, V^{\pi}(s')$
   (c) Replace π with π'
   Until no improving action is possible at any state

Proposition: Vπ' ≥ Vπ, with strict inequality for sub-optimal π. Policy iteration goes through a sequence of improving policies.
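A compact NumPy sketch combining the exact evaluation and greedy improvement steps above; the array layout and argmax tie-breaking are my own assumptions, not prescribed by the slides:

```python
import numpy as np

def policy_iteration(R, T, beta):
    """Policy iteration: exact policy evaluation plus greedy policy improvement."""
    n, n_actions, _ = T.shape
    policy = np.zeros(n, dtype=int)                          # arbitrary initial policy
    while True:
        T_pi = T[np.arange(n), policy]
        V = np.linalg.solve(np.eye(n) - beta * T_pi, R)      # (a) evaluate V_pi exactly
        new_policy = np.argmax(T @ V, axis=1)                # (b) improvement step
        if np.array_equal(new_policy, policy):               # (c) stop when nothing changes
            return policy, V
        policy = new_policy
```

Since there are finitely many policies and each improvement step cannot decrease the value, the loop terminates; consistent tie-breaking by np.argmax keeps it from cycling between equally good policies.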

Page 34: Markov Decision Processes Infinite Horizon Problems

Policy Iteration: Convergence

- Convergence is assured in a finite number of iterations:
  - Since there is a finite number of policies and each step improves the value, it must converge to an optimal policy.

- Gives the exact value of the optimal policy.

Page 35: Markov Decision Processes Infinite Horizon Problems

Policy Iteration Complexity

- Each iteration runs in polynomial time in the number of states and actions.

- There are at most |A|^n policies, and PI never repeats a policy:
  - So at most an exponential number of iterations
  - Not a very good complexity bound

- Empirically O(n) iterations are required; often it seems like O(1).
  - Challenge: try to generate an MDP that requires more than n iterations

- Still no polynomial bound on the number of PI iterations (open problem)!
  - But it may have been solved recently?

Page 36: Markov Decision Processes Infinite Horizon Problems

Value Iteration vs. Policy Iteration

- Which is faster, VI or PI?
  - It depends on the problem.

- VI takes more iterations than PI, but PI requires more time on each iteration.
  - PI must perform policy evaluation on each iteration, which involves solving a linear system.

- VI is easier to implement since it does not require the policy evaluation step.
  - But see the next slide.

- We will see that both algorithms serve as inspiration for more advanced algorithms.

Page 37: Markov Decision Processes Infinite Horizon Problems

Modified Policy Iteration

- Modified Policy Iteration: replaces the exact policy evaluation step with inexact iterative evaluation.
  - Uses a small number of restricted Bellman backups for evaluation.

- Avoids the expensive policy evaluation step.

- Perhaps easier to implement.

- Often faster than PI and VI.

- Still guaranteed to converge under mild assumptions on starting points.

Page 38: Markov Decision Processes Infinite Horizon Problems

Modified Policy Iteration

1. Choose an initial value function V
2. Loop:
   (a) For each s in S, set  $\pi(s) \;=\; \arg\max_{a} \sum_{s'} T(s, a, s')\, V(s')$   [the improvement step from policy iteration]
   (b) Partial Policy Evaluation: repeat K times:  $V \;\leftarrow\; B_{\pi}[V]$   (approximate evaluation)
   Until the change in V is minimal
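A possible NumPy sketch of this scheme, with the same assumed array layout as before; the number of backups K, the tolerance eps, and the iteration cap are illustrative choices, not values from the slides:

```python
import numpy as np

def modified_policy_iteration(R, T, beta, K=20, eps=1e-6, max_iters=10_000):
    """Modified policy iteration: greedy improvement plus K restricted backups."""
    n = len(R)
    V = np.zeros(n)                             # initial value function
    for _ in range(max_iters):
        policy = np.argmax(T @ V, axis=1)       # (a) greedy policy w.r.t. current V
        T_pi = T[np.arange(n), policy]
        V_old = V.copy()
        for _ in range(K):                      # (b) partial evaluation: K backups
            V = R + beta * T_pi @ V
        if np.max(np.abs(V - V_old)) <= eps:    # until the change in V is minimal
            return policy, V
    return policy, V
```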

Page 39: Markov Decision Processes Infinite Horizon Problems

Recap: Things You Should Know

- What is an MDP?

- What is a policy?
  - Stationary and non-stationary

- What is a value function?
  - Finite-horizon and infinite-horizon

- How to evaluate policies?
  - Finite-horizon and infinite-horizon
  - Time/space complexity?

- How to optimize policies?
  - Finite-horizon and infinite-horizon
  - Time/space complexity?
  - Why are they correct?