Module 6
Value Iteration
CS 886 Sequential Decision Making and
Reinforcement Learning
University of Waterloo
CS886 (c) 2013 Pascal Poupart
Markov Decision Process
• Definition
– Set of states: S
– Set of actions (i.e., decisions): A
– Transition model: Pr(s_t | s_{t−1}, a_{t−1})
– Reward model (i.e., utility): R(s_t, a_t)
– Discount factor: γ, where 0 ≤ γ ≤ 1
– Horizon (i.e., # of time steps): h
• Goal: find an optimal policy π*
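As a concrete illustration (not part of the original slides), a finite MDP can be stored as a pair of NumPy arrays; the two-state, two-action numbers below are made up purely for the sketches that follow.

    import numpy as np

    # Hypothetical 2-state, 2-action MDP; all numbers are illustrative only
    gamma = 0.9                      # discount factor γ
    # T[a, s, s2] = Pr(s2 | s, a); each row T[a, s, :] sums to 1
    T = np.array([[[0.7, 0.3],
                   [0.2, 0.8]],
                  [[0.9, 0.1],
                   [0.4, 0.6]]])
    # R[s, a] = reward for taking action a in state s
    R = np.array([[ 1.0,  0.0],
                  [-0.5,  2.0]])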
Finite Horizon
• Policy evaluation
  V_h^π(s) = Σ_{t=0}^{h} γ^t Σ_{s′} Pr(s_t = s′ | s_0 = s, π) R(s′, π_t(s′))
• Recursive form (dynamic programming)
  V_0^π(s) = R(s, π_0(s))
  V_t^π(s) = R(s, π_t(s)) + γ Σ_{s′} Pr(s′ | s, π_t(s)) V_{t−1}^π(s′)
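The recursion maps directly onto a loop over t. A minimal sketch using the illustrative T, R, and gamma arrays above; here policy is indexed by steps-to-go, so policy[t][s] is the action taken with t steps remaining.

    def evaluate_policy(T, R, gamma, policy):
        n = R.shape[0]
        s = np.arange(n)
        V = R[s, policy[0]]                   # V_0^π(s) = R(s, π_0(s))
        for t in range(1, len(policy)):
            a = policy[t]
            # V_t^π(s) = R(s, π_t(s)) + γ Σ_{s'} Pr(s'|s, π_t(s)) V_{t−1}^π(s')
            V = R[s, a] + gamma * (T[a, s] @ V)
        return V

For example, evaluate_policy(T, R, gamma, [np.array([0, 1])] * 3) evaluates a policy that repeats the same decision rule over horizon h = 2.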
Finite Horizon
• Optimal policy π*
  V_t^{π*}(s) ≥ V_t^π(s)   ∀π, s
• Optimal value function V* (shorthand for V^{π*})
  V_0*(s) = max_a R(s, a)
  V_t*(s) = max_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_{t−1}*(s′)   (Bellman's equation)
Value Iteration Algorithm
Optimal policy π*:
  t = 0:  π_0*(s) ← argmax_a R(s, a)   ∀s
  t > 0:  π_t*(s) ← argmax_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_{t−1}*(s′)   ∀s
NB: π* is non-stationary (i.e., time-dependent)

valueIteration(MDP)
  V_0*(s) ← max_a R(s, a)   ∀s
  For t = 1 to h do
    V_t*(s) ← max_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_{t−1}*(s′)   ∀s
  Return V*
Value Iteration
• Matrix form:
  R^a: |S| × 1 column vector of rewards for action a
  V_t*: |S| × 1 column vector of state values
  T^a: |S| × |S| matrix of transition probabilities for action a

valueIteration(MDP)
  V_0* ← max_a R^a
  For t = 1 to h do
    V_t* ← max_a R^a + γ T^a V_{t−1}*
  Return V*
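In NumPy the inner update becomes a single vectorized expression over actions. A sketch under the same illustrative arrays as before (T stacked as an |A| × |S| × |S| tensor, R as an |S| × |A| matrix):

    def value_iteration_finite(T, R, gamma, h):
        V = R.max(axis=1)              # V_0* = max_a R^a
        for t in range(1, h + 1):
            # Q[s, a] = R(s, a) + γ (T^a V_{t−1}*)(s)
            Q = R + gamma * (T @ V).T
            V = Q.max(axis=1)          # V_t* = max_a Q[s, a]
        return V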
Infinite Horizon
• Let h → ∞
• Then V_h^π → V_∞^π and V_{h−1}^π → V_∞^π
• Policy evaluation:
  V_∞^π(s) = R(s, π_∞(s)) + γ Σ_{s′} Pr(s′ | s, π_∞(s)) V_∞^π(s′)   ∀s
• Bellman's equation:
  V_∞*(s) = max_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_∞*(s′)
Policy evaluation
• Linear system of equations
  V_∞^π(s) = R(s, π_∞(s)) + γ Σ_{s′} Pr(s′ | s, π_∞(s)) V_∞^π(s′)   ∀s
• Matrix form:
  R: |S| × 1 column vector of state rewards for π
  V: |S| × 1 column vector of state values for π
  T: |S| × |S| matrix of transition probabilities for π
  V = R + γTV
Solving linear equations
• Linear system: V = R + γTV
• Gaussian elimination: (I − γT)V = R
• Compute inverse: V = (I − γT)^{−1} R
• Iterative methods
  – Value iteration (a.k.a. Richardson iteration)
  – Repeat V ← R + γTV
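Both routes are a few lines of NumPy. A sketch assuming Tpi and Rpi are the transition matrix and reward vector under a fixed policy pi (e.g., Tpi = T[pi, np.arange(len(pi))] and Rpi = R[np.arange(len(pi)), pi] for the arrays above); the tolerance is an arbitrary choice.

    def policy_eval_direct(Tpi, Rpi, gamma):
        # Solve (I − γT)V = R by Gaussian elimination
        n = Tpi.shape[0]
        return np.linalg.solve(np.eye(n) - gamma * Tpi, Rpi)

    def policy_eval_richardson(Tpi, Rpi, gamma, tol=1e-8):
        # Richardson iteration: repeat V ← R + γTV until the change is tiny
        V = np.zeros_like(Rpi)
        while True:
            V_new = Rpi + gamma * (Tpi @ V)
            if np.max(np.abs(V_new - V)) <= tol:
                return V_new
            V = V_new

The direct solve costs O(|S|³) once, while each Richardson sweep costs only O(|S|²), so iteration tends to win for large state spaces when γ is not too close to 1.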
Contraction
• Let H(V) ≡ R + γTV be the policy evaluation operator
• Lemma 1: H is a contraction mapping:
  ‖H(V) − H(Ṽ)‖∞ ≤ γ ‖V − Ṽ‖∞
• Proof:
  ‖H(V) − H(Ṽ)‖∞
  = ‖R + γTV − R − γTṼ‖∞   (by definition)
  = γ ‖T(V − Ṽ)‖∞   (simplification)
  ≤ γ ‖T‖∞ ‖V − Ṽ‖∞   (since ‖AB‖ ≤ ‖A‖ ‖B‖)
  = γ ‖V − Ṽ‖∞   (since ‖T‖∞ = max_s Σ_{s′} T(s, s′) = 1)
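A quick numerical sanity check of the lemma (illustrative values only): apply H to two arbitrary vectors and compare the gap before and after.

    # Contraction check: ‖H(V1) − H(V2)‖∞ ≤ γ ‖V1 − V2‖∞
    Tpi = np.array([[0.7, 0.3],
                    [0.2, 0.8]])          # row-stochastic transition matrix
    Rpi = np.array([1.0, -0.5])
    H = lambda V: Rpi + 0.9 * (Tpi @ V)   # γ = 0.9
    V1, V2 = np.array([5.0, -2.0]), np.array([0.0, 3.0])
    print(np.max(np.abs(H(V1) - H(V2))) <= 0.9 * np.max(np.abs(V1 - V2)))  # True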
Convergence
• Theorem 2: Policy evaluation converges to V^π for any initial estimate V:
  lim_{n→∞} H^(n)(V) = V^π   ∀V
• Proof
  – By definition V^π = H^(∞)(0), but policy evaluation computes H^(∞)(V) for any initial V
  – By Lemma 1, ‖H^(n)(V) − H^(n)(V′)‖∞ ≤ γ^n ‖V − V′‖∞
  – Hence, as n → ∞, ‖H^(n)(V) − H^(n)(0)‖∞ → 0 and H^(∞)(V) = V^π   ∀V
Approximate Policy Evaluation
• In practice, we can't perform an infinite number of iterations.
• Suppose we perform value iteration for n steps and ‖H^(n)(V) − H^(n−1)(V)‖∞ = ε: how far is H^(n)(V) from V^π?
Approximate Policy Evaluation
• Theorem 3: If ‖H^(n)(V) − H^(n−1)(V)‖∞ ≤ ε, then
  ‖V^π − H^(n)(V)‖∞ ≤ ε / (1 − γ)
• Proof
  ‖V^π − H^(n)(V)‖∞
  = ‖H^(∞)(V) − H^(n)(V)‖∞   (by Theorem 2)
  = ‖Σ_{t=1}^{∞} (H^(t+n)(V) − H^(t+n−1)(V))‖∞   (telescoping sum)
  ≤ Σ_{t=1}^{∞} ‖H^(t+n)(V) − H^(t+n−1)(V)‖∞   (since ‖A + B‖ ≤ ‖A‖ + ‖B‖)
  ≤ Σ_{t=1}^{∞} γ^t ε = εγ / (1 − γ) ≤ ε / (1 − γ)   (by Lemma 1)
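• For example, with γ = 0.9, stopping when successive iterates differ by at most ε = 0.01 guarantees the estimate is within 0.01 / (1 − 0.9) = 0.1 of V^π in max-norm.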
Optimal Value Function
• Non-linear system of equations
  V_∞*(s) = max_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_∞*(s′)   ∀s
• Matrix form:
  R^a: |S| × 1 column vector of rewards for action a
  V*: |S| × 1 column vector of optimal values
  T^a: |S| × |S| matrix of transition probabilities for action a
  V* = max_a (R^a + γ T^a V*)
Contraction
• Let H*(V) ≡ max_a (R^a + γ T^a V) be the operator in value iteration
• Lemma 3: H* is a contraction mapping:
  ‖H*(V) − H*(Ṽ)‖∞ ≤ γ ‖V − Ṽ‖∞
• Proof: without loss of generality,
  let H*(V)(s) ≥ H*(Ṽ)(s), and
  let a_s* ← argmax_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′)
Contraction
• Proof continued:
  Then 0 ≤ H*(V)(s) − H*(Ṽ)(s)   (by assumption)
  ≤ R(s, a_s*) + γ Σ_{s′} Pr(s′ | s, a_s*) V(s′) − R(s, a_s*) − γ Σ_{s′} Pr(s′ | s, a_s*) Ṽ(s′)
    (by definition: a_s* maximizes the first term, but is not necessarily optimal for Ṽ)
  = γ Σ_{s′} Pr(s′ | s, a_s*) (V(s′) − Ṽ(s′))
  ≤ γ Σ_{s′} Pr(s′ | s, a_s*) ‖V − Ṽ‖∞   (max-norm upper bound)
  = γ ‖V − Ṽ‖∞   (since Σ_{s′} Pr(s′ | s, a_s*) = 1)
• Repeat the same argument for H*(Ṽ)(s) ≥ H*(V)(s), and for each s
Convergence
• Theorem 4: Value iteration converges to V* for any initial estimate V:
  lim_{n→∞} H*^(n)(V) = V*   ∀V
• Proof
  – By definition V* = H*^(∞)(0), but value iteration computes H*^(∞)(V) for some initial V
  – By Lemma 3, ‖H*^(n)(V) − H*^(n)(V′)‖∞ ≤ γ^n ‖V − V′‖∞
  – Hence, as n → ∞, ‖H*^(n)(V) − H*^(n)(0)‖∞ → 0 and H*^(∞)(V) = V*   ∀V
Value Iteration
• Even when the horizon is infinite, we perform only finitely many iterations
• Stop when ‖V_n − V_{n−1}‖∞ ≤ ε

valueIteration(MDP)
  V_0 ← max_a R^a; n ← 0
  Repeat
    n ← n + 1
    V_n ← max_a R^a + γ T^a V_{n−1}
  Until ‖V_n − V_{n−1}‖∞ ≤ ε
  Return V_n
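The same loop in NumPy, again over the illustrative T, R, and gamma defined earlier; the tolerance epsilon is an arbitrary choice.

    def value_iteration(T, R, gamma, epsilon=1e-6):
        V = R.max(axis=1)                             # V_0 = max_a R^a
        while True:
            # V_n = max_a R^a + γ T^a V_{n−1}
            V_new = (R + gamma * (T @ V).T).max(axis=1)
            if np.max(np.abs(V_new - V)) <= epsilon:  # ‖V_n − V_{n−1}‖∞ ≤ ε
                return V_new
            V = V_new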
Induced Policy
• Since ‖V_n − V_{n−1}‖∞ ≤ ε, by Theorem 4 and the argument of Theorem 3 (applied to H* via Lemma 3) we know that
  ‖V_n − V*‖∞ ≤ ε / (1 − γ)
• But how good is the stationary policy π_n extracted from V_n?
  π_n(s) = argmax_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_n(s′)
• How far is V^{π_n} from V*?
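Extracting the induced greedy policy is one argmax per state. A sketch reusing the arrays from the earlier examples:

    def extract_policy(T, R, gamma, V):
        # π_n(s) = argmax_a R(s, a) + γ Σ_{s'} Pr(s'|s, a) V(s')
        Q = R + gamma * (T @ V).T
        return Q.argmax(axis=1)

    # e.g.: V = value_iteration(T, R, gamma); pi = extract_policy(T, R, gamma, V)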
Induced Policy
• Theorem 5: ‖V^{π_n} − V*‖∞ ≤ 2ε / (1 − γ)
• Proof
  ‖V^{π_n} − V*‖∞
  = ‖V^{π_n} − V_n + V_n − V*‖∞
  ≤ ‖V^{π_n} − V_n‖∞ + ‖V_n − V*‖∞   (since ‖A + B‖ ≤ ‖A‖ + ‖B‖)
  = ‖H_{π_n}^(∞)(V_n) − V_n‖∞ + ‖V_n − H*^(∞)(V_n)‖∞   (by Theorems 2 and 4)
  ≤ ε / (1 − γ) + ε / (1 − γ)   (by the argument of Theorem 3)
  = 2ε / (1 − γ)
Summary
• Value iteration
  – Simple dynamic programming algorithm
  – Complexity: O(n |A| |S|²), where n is the number of iterations; each iteration performs one |S| × |S| matrix-vector product per action
• Can we optimize the policy directly instead of optimizing the value function and then inducing a policy?
  – Yes: by policy iteration