Module 6
Value Iteration
CS 886 Sequential Decision Making and
Reinforcement Learning
University of Waterloo
CS886 (c) 2013 Pascal Poupart
Markov Decision Process
• Definition
– Set of states: S
– Set of actions (i.e., decisions): A
– Transition model: Pr(s_t | s_{t−1}, a_{t−1})
– Reward model (i.e., utility): R(s_t, a_t)
– Discount factor: γ, where 0 ≤ γ ≤ 1
– Horizon (i.e., # of time steps): h
• Goal: find an optimal policy π*
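As a concrete illustration (not part of the original slides), a finite MDP can be stored as a pair of NumPy arrays; the two-state, two-action numbers below are made up purely for the sketches that follow.

    import numpy as np

    # Hypothetical 2-state, 2-action MDP; all numbers are illustrative only
    gamma = 0.9                      # discount factor γ
    # T[a, s, s2] = Pr(s2 | s, a); each row T[a, s, :] sums to 1
    T = np.array([[[0.7, 0.3],
                   [0.2, 0.8]],
                  [[0.9, 0.1],
                   [0.4, 0.6]]])
    # R[s, a] = reward for taking action a in state s
    R = np.array([[ 1.0,  0.0],
                  [-0.5,  2.0]])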
Finite Horizon
• Policy evaluation
  V_h^π(s) = Σ_{t=0}^{h} γ^t Σ_{s′} Pr(s_t = s′ | s_0 = s, π) R(s′, π_t(s′))
• Recursive form (dynamic programming)
  V_0^π(s) = R(s, π_0(s))
  V_t^π(s) = R(s, π_t(s)) + γ Σ_{s′} Pr(s′ | s, π_t(s)) V_{t−1}^π(s′)
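The recursion maps directly onto a loop over t. A minimal sketch using the illustrative T, R, and gamma arrays above; here policy is indexed by steps-to-go, so policy[t][s] is the action taken with t steps remaining.

    def evaluate_policy(T, R, gamma, policy):
        n = R.shape[0]
        s = np.arange(n)
        V = R[s, policy[0]]                   # V_0^π(s) = R(s, π_0(s))
        for t in range(1, len(policy)):
            a = policy[t]
            # V_t^π(s) = R(s, π_t(s)) + γ Σ_{s'} Pr(s'|s, π_t(s)) V_{t−1}^π(s')
            V = R[s, a] + gamma * (T[a, s] @ V)
        return V

For example, evaluate_policy(T, R, gamma, [np.array([0, 1])] * 3) evaluates a policy that repeats the same decision rule over horizon h = 2.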
Finite Horizon
• Optimal policy π*
  V_t^{π*}(s) ≥ V_t^π(s)   ∀π, s
• Optimal value function V* (shorthand for V^{π*})
  V_0*(s) = max_a R(s, a)
  V_t*(s) = max_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_{t−1}*(s′)   (Bellman's equation)
Value Iteration Algorithm
Optimal policy π*:
  t = 0:  π_0*(s) ← argmax_a R(s, a)   ∀s
  t > 0:  π_t*(s) ← argmax_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_{t−1}*(s′)   ∀s
NB: π* is non-stationary (i.e., time-dependent)

valueIteration(MDP)
  V_0*(s) ← max_a R(s, a)   ∀s
  For t = 1 to h do
    V_t*(s) ← max_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_{t−1}*(s′)   ∀s
  Return V*
Value Iteration
• Matrix form:
  R^a: |S| × 1 column vector of rewards for action a
  V_t*: |S| × 1 column vector of state values
  T^a: |S| × |S| matrix of transition probabilities for action a

valueIteration(MDP)
  V_0* ← max_a R^a
  For t = 1 to h do
    V_t* ← max_a R^a + γ T^a V_{t−1}*
  Return V*
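In NumPy the inner update becomes a single vectorized expression over actions. A sketch under the same illustrative arrays as before (T stacked as an |A| × |S| × |S| tensor, R as an |S| × |A| matrix):

    def value_iteration_finite(T, R, gamma, h):
        V = R.max(axis=1)              # V_0* = max_a R^a
        for t in range(1, h + 1):
            # Q[s, a] = R(s, a) + γ (T^a V_{t−1}*)(s)
            Q = R + gamma * (T @ V).T
            V = Q.max(axis=1)          # V_t* = max_a Q[s, a]
        return V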
Infinite Horizon
• Let h → ∞
• Then V_h^π → V_∞^π and V_{h−1}^π → V_∞^π
• Policy evaluation:
  V_∞^π(s) = R(s, π_∞(s)) + γ Σ_{s′} Pr(s′ | s, π_∞(s)) V_∞^π(s′)   ∀s
• Bellman's equation:
  V_∞*(s) = max_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_∞*(s′)
Policy evaluation
• Linear system of equations
  V_∞^π(s) = R(s, π_∞(s)) + γ Σ_{s′} Pr(s′ | s, π_∞(s)) V_∞^π(s′)   ∀s
• Matrix form:
  R: |S| × 1 column vector of state rewards for π
  V: |S| × 1 column vector of state values for π
  T: |S| × |S| matrix of transition probabilities for π
  V = R + γTV
Solving linear equations
• Linear system: V = R + γTV
• Gaussian elimination: (I − γT)V = R
• Compute inverse: V = (I − γT)^{−1} R
• Iterative methods
  – Value iteration (a.k.a. Richardson iteration)
  – Repeat V ← R + γTV
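Both routes are a few lines of NumPy. A sketch assuming Tpi and Rpi are the transition matrix and reward vector under a fixed policy pi (e.g., Tpi = T[pi, np.arange(len(pi))] and Rpi = R[np.arange(len(pi)), pi] for the arrays above); the tolerance is an arbitrary choice.

    def policy_eval_direct(Tpi, Rpi, gamma):
        # Solve (I − γT)V = R by Gaussian elimination
        n = Tpi.shape[0]
        return np.linalg.solve(np.eye(n) - gamma * Tpi, Rpi)

    def policy_eval_richardson(Tpi, Rpi, gamma, tol=1e-8):
        # Richardson iteration: repeat V ← R + γTV until the change is tiny
        V = np.zeros_like(Rpi)
        while True:
            V_new = Rpi + gamma * (Tpi @ V)
            if np.max(np.abs(V_new - V)) <= tol:
                return V_new
            V = V_new

The direct solve costs O(|S|³) once, while each Richardson sweep costs only O(|S|²), so iteration tends to win for large state spaces when γ is not too close to 1.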
Contraction
• Let H(V) ≡ R + γTV be the policy evaluation operator
• Lemma 1: H is a contraction mapping:
  ‖H(V) − H(Ṽ)‖∞ ≤ γ ‖V − Ṽ‖∞
• Proof:
  ‖H(V) − H(Ṽ)‖∞
  = ‖R + γTV − R − γTṼ‖∞   (by definition)
  = γ ‖T(V − Ṽ)‖∞   (simplification)
  ≤ γ ‖T‖∞ ‖V − Ṽ‖∞   (since ‖AB‖ ≤ ‖A‖ ‖B‖)
  = γ ‖V − Ṽ‖∞   (since ‖T‖∞ = max_s Σ_{s′} T(s, s′) = 1)
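A quick numerical sanity check of the lemma (illustrative values only): apply H to two arbitrary vectors and compare the gap before and after.

    # Contraction check: ‖H(V1) − H(V2)‖∞ ≤ γ ‖V1 − V2‖∞
    Tpi = np.array([[0.7, 0.3],
                    [0.2, 0.8]])          # row-stochastic transition matrix
    Rpi = np.array([1.0, -0.5])
    H = lambda V: Rpi + 0.9 * (Tpi @ V)   # γ = 0.9
    V1, V2 = np.array([5.0, -2.0]), np.array([0.0, 3.0])
    print(np.max(np.abs(H(V1) - H(V2))) <= 0.9 * np.max(np.abs(V1 - V2)))  # True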
Convergence
• Theorem 2: Policy evaluation converges to V^π for any initial estimate V:
  lim_{n→∞} H^(n)(V) = V^π   ∀V
• Proof
  – By definition V^π = H^(∞)(0), but policy evaluation computes H^(∞)(V) for any initial V
  – By Lemma 1, ‖H^(n)(V) − H^(n)(V′)‖∞ ≤ γ^n ‖V − V′‖∞
  – Hence, as n → ∞, ‖H^(n)(V) − H^(n)(0)‖∞ → 0 and H^(∞)(V) = V^π   ∀V
Approximate Policy Evaluation
• In practice, we can't perform an infinite number of iterations.
• Suppose we perform value iteration for n steps and ‖H^(n)(V) − H^(n−1)(V)‖∞ = ε: how far is H^(n)(V) from V^π?
Approximate Policy Evaluation
• Theorem 3: If ‖H^(n)(V) − H^(n−1)(V)‖∞ ≤ ε, then
  ‖V^π − H^(n)(V)‖∞ ≤ ε / (1 − γ)
• Proof
  ‖V^π − H^(n)(V)‖∞
  = ‖H^(∞)(V) − H^(n)(V)‖∞   (by Theorem 2)
  = ‖Σ_{t=1}^{∞} (H^(t+n)(V) − H^(t+n−1)(V))‖∞   (telescoping sum)
  ≤ Σ_{t=1}^{∞} ‖H^(t+n)(V) − H^(t+n−1)(V)‖∞   (since ‖A + B‖ ≤ ‖A‖ + ‖B‖)
  ≤ Σ_{t=1}^{∞} γ^t ε = εγ / (1 − γ) ≤ ε / (1 − γ)   (by Lemma 1)
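• For example, with γ = 0.9, stopping when successive iterates differ by at most ε = 0.01 guarantees the estimate is within 0.01 / (1 − 0.9) = 0.1 of V^π in max-norm.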
Optimal Value Function
• Non-linear system of equations
  V_∞*(s) = max_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_∞*(s′)   ∀s
• Matrix form:
  R^a: |S| × 1 column vector of rewards for action a
  V*: |S| × 1 column vector of optimal values
  T^a: |S| × |S| matrix of transition probabilities for action a
  V* = max_a (R^a + γ T^a V*)
Contraction
• Let H*(V) ≡ max_a (R^a + γ T^a V) be the operator in value iteration
• Lemma 3: H* is a contraction mapping:
  ‖H*(V) − H*(Ṽ)‖∞ ≤ γ ‖V − Ṽ‖∞
• Proof: without loss of generality,
  let H*(V)(s) ≥ H*(Ṽ)(s), and
  let a_s* ← argmax_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′)
Contraction
• Proof continued:
  Then 0 ≤ H*(V)(s) − H*(Ṽ)(s)   (by assumption)
  ≤ R(s, a_s*) + γ Σ_{s′} Pr(s′ | s, a_s*) V(s′) − R(s, a_s*) − γ Σ_{s′} Pr(s′ | s, a_s*) Ṽ(s′)
    (by definition: a_s* maximizes the first term, but is not necessarily optimal for Ṽ)
  = γ Σ_{s′} Pr(s′ | s, a_s*) (V(s′) − Ṽ(s′))
  ≤ γ Σ_{s′} Pr(s′ | s, a_s*) ‖V − Ṽ‖∞   (max-norm upper bound)
  = γ ‖V − Ṽ‖∞   (since Σ_{s′} Pr(s′ | s, a_s*) = 1)
• Repeat the same argument for H*(Ṽ)(s) ≥ H*(V)(s), and for each s
Convergence
• Theorem 4: Value iteration converges to V* for any initial estimate V:
  lim_{n→∞} H*^(n)(V) = V*   ∀V
• Proof
  – By definition V* = H*^(∞)(0), but value iteration computes H*^(∞)(V) for some initial V
  – By Lemma 3, ‖H*^(n)(V) − H*^(n)(V′)‖∞ ≤ γ^n ‖V − V′‖∞
  – Hence, as n → ∞, ‖H*^(n)(V) − H*^(n)(0)‖∞ → 0 and H*^(∞)(V) = V*   ∀V
Value Iteration
• Even when the horizon is infinite, we perform only finitely many iterations
• Stop when ‖V_n − V_{n−1}‖∞ ≤ ε

valueIteration(MDP)
  V_0 ← max_a R^a; n ← 0
  Repeat
    n ← n + 1
    V_n ← max_a R^a + γ T^a V_{n−1}
  Until ‖V_n − V_{n−1}‖∞ ≤ ε
  Return V_n
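The same loop in NumPy, again over the illustrative T, R, and gamma defined earlier; the tolerance epsilon is an arbitrary choice.

    def value_iteration(T, R, gamma, epsilon=1e-6):
        V = R.max(axis=1)                             # V_0 = max_a R^a
        while True:
            # V_n = max_a R^a + γ T^a V_{n−1}
            V_new = (R + gamma * (T @ V).T).max(axis=1)
            if np.max(np.abs(V_new - V)) <= epsilon:  # ‖V_n − V_{n−1}‖∞ ≤ ε
                return V_new
            V = V_new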
Induced Policy
• Since ‖V_n − V_{n−1}‖∞ ≤ ε, by Theorem 4 and the argument of Theorem 3 (applied to H* via Lemma 3) we know that
  ‖V_n − V*‖∞ ≤ ε / (1 − γ)
• But how good is the stationary policy π_n extracted from V_n?
  π_n(s) = argmax_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_n(s′)
• How far is V^{π_n} from V*?
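Extracting the induced greedy policy is one argmax per state. A sketch reusing the arrays from the earlier examples:

    def extract_policy(T, R, gamma, V):
        # π_n(s) = argmax_a R(s, a) + γ Σ_{s'} Pr(s'|s, a) V(s')
        Q = R + gamma * (T @ V).T
        return Q.argmax(axis=1)

    # e.g.: V = value_iteration(T, R, gamma); pi = extract_policy(T, R, gamma, V)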
Induced Policy
• Theorem 5: ‖V^{π_n} − V*‖∞ ≤ 2ε / (1 − γ)
• Proof
  ‖V^{π_n} − V*‖∞
  = ‖V^{π_n} − V_n + V_n − V*‖∞
  ≤ ‖V^{π_n} − V_n‖∞ + ‖V_n − V*‖∞   (since ‖A + B‖ ≤ ‖A‖ + ‖B‖)
  = ‖H_{π_n}^(∞)(V_n) − V_n‖∞ + ‖V_n − H*^(∞)(V_n)‖∞   (by Theorems 2 and 4)
  ≤ ε / (1 − γ) + ε / (1 − γ)   (by the argument of Theorem 3)
  = 2ε / (1 − γ)
Summary
• Value iteration
  – Simple dynamic programming algorithm
  – Complexity: O(n |A| |S|²), where n is the number of iterations; each iteration performs one |S| × |S| matrix-vector product per action
• Can we optimize the policy directly instead of optimizing the value function and then inducing a policy?
  – Yes: by policy iteration