
VI extensions

May 6, 2019

B4M36PUI/BE4M36PUI — Planning for Artificial Intelligence


Outline

• Review of MDP concepts

• Value Iteration algorithm

• VI extensions


Review of last tutorial


Value function of a policy

Look at the following definition of a value function of a policy for an
infinite-horizon MDP. It contains multiple mistakes; correct them on a piece of
paper:

Def: Value function of a policy for infinite-horizon MDP

Assume an infinite-horizon MDP with γ ∈ [0, 100]. Then let the value function of a
policy π for every state s ∈ S be defined as

V^π(s) = ∑_{s′∈S} R(s, π(s), s′) R(s, π(s), s′) + γ π(s′)

Corrected: γ ∈ [0, 1), and

V^π(s) = ∑_{s′∈S} T(s, π(s), s′) [R(s, π(s), s′) + γ V^π(s′)]

Question: What is the difference to the definition of an optimal value function?



Bellman Equations

Write down the equations for finding the value function of a policy π for the MDP below. How would you solve these equations? (A solver sketch follows the problem data.)

• T :

T (S0, a0, S1) = 0.6

T (S0, a0, S2) = 0.4

T (S1, a1, S3) = 1

T (S2, a2, S3) = 0.7

T (S2, a2, S0) = 0.3

• R :

R(S0, a0,S1) = 5

R(S0, a0,S2) = 2

R(S1, a1,S3) = 1

R(S2, a2,S3) = 4

R(S2, a2,S0) = 3

R(S1, a3,S4) = 4

R(S2, a4,S3) = 10

• S : S0, S1, S2, S3,S4

• A: a0, a1, a2, a3, a4
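A minimal sketch of one way to solve these equations numerically: for a fixed policy the Bellman equations are linear in V^π, so we can assemble V^π = r^π + γ T^π V^π and solve the linear system directly with numpy. The policy π, the discount γ = 0.9, and treating S3 and S4 as terminal (value 0) are assumptions made only for illustration; the slide does not fix them.

```python
import numpy as np

# States and a hypothetical policy (the slide leaves pi unspecified);
# S3 and S4 are assumed terminal, i.e. have no outgoing transitions.
states = ["S0", "S1", "S2", "S3", "S4"]
idx = {s: i for i, s in enumerate(states)}
pi = {"S0": "a0", "S1": "a1", "S2": "a2"}          # assumed policy
gamma = 0.9                                         # assumed discount

# Transition and reward data from the slide: (s, a, s') -> value
T = {("S0", "a0", "S1"): 0.6, ("S0", "a0", "S2"): 0.4,
     ("S1", "a1", "S3"): 1.0,
     ("S2", "a2", "S3"): 0.7, ("S2", "a2", "S0"): 0.3}
R = {("S0", "a0", "S1"): 5, ("S0", "a0", "S2"): 2,
     ("S1", "a1", "S3"): 1,
     ("S2", "a2", "S3"): 4, ("S2", "a2", "S0"): 3}

n = len(states)
T_pi = np.zeros((n, n))   # T_pi[s, s'] = T(s, pi(s), s')
r_pi = np.zeros(n)        # r_pi[s] = sum_s' T(s, pi(s), s') * R(s, pi(s), s')
for (s, a, s2), p in T.items():
    if pi.get(s) == a:
        T_pi[idx[s], idx[s2]] = p
        r_pi[idx[s]] += p * R[(s, a, s2)]

# Bellman equations for a fixed policy: V = r_pi + gamma * T_pi V,
# i.e. (I - gamma * T_pi) V = r_pi, a plain linear system.
V = np.linalg.solve(np.eye(n) - gamma * T_pi, r_pi)
print(dict(zip(states, V.round(3))))
```

The direct solve is possible precisely because no max over actions appears for a fixed policy; the same system could also be solved iteratively, which is what Value Iteration does for the optimal value function.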


Value Iteration


Value Iteration algorithm

Basic algorithm for iteratively finding the solution of the Bellman equations.

1. Initialize V_0 arbitrarily for each state, e.g. to 0; set n = 0.

2. Set n = n + 1.

3. Compute the Bellman backup, i.e. for each s ∈ S:

3.1 V_n(s) = max_{a∈A} ∑_{s′∈S} T(s, a, s′)[R(s, a, s′) + γ V_{n−1}(s′)]

4. GOTO 2.
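As a minimal sketch, one full synchronous Bellman backup sweep (step 3) might look as follows; the dictionary-based representation of T and R and the helper name bellman_backup_sweep are illustrative assumptions, not part of the slides.

```python
def bellman_backup_sweep(V_prev, states, actions, T, R, gamma):
    """One synchronous Bellman backup: returns V_n computed from V_{n-1}.

    T and R are assumed to be dicts keyed by (s, a, s'), with missing
    entries treated as probability / reward 0.
    """
    V_new = {}
    for s in states:
        best = float("-inf")
        for a in actions:
            q = sum(T.get((s, a, s2), 0.0) *
                    (R.get((s, a, s2), 0.0) + gamma * V_prev[s2])
                    for s2 in states)
            best = max(best, q)
        V_new[s] = best
    return V_new
```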


VI example

Robot Emil-like domain. Assuming γ = 1.

Using VI, initializing V_0(s) = 0 for all s, calculate V_1, V_2, V_3 for the nine states around the +10 tile.

• Moving into edges gives -1 reward.

• Moving onto marked tiles gives the corresponding reward.


Value Iteration algorithm

Basic algorithm for iteratively finding the solution of the Bellman equations.

1. Initialize V_0 arbitrarily for each state, e.g. to 0; set n = 0.

2. Set n = n + 1.

3. Compute the Bellman backup, i.e. for each s ∈ S:

3.1 V_n(s) = max_{a∈A} ∑_{s′∈S} T(s, a, s′)[R(s, a, s′) + γ V_{n−1}(s′)]

4. GOTO 2.

Question: Does it converge? How fast? When do we stop?


Residual - definition

Def: Residual

The residual of value function V_n from V_{n+1} at state s ∈ S is defined by:

Res_{V_n}(s) = |V_n(s) − V_{n+1}(s)|

The residual of value function V_n from V_{n+1} is given by:

Res_{V_n} = ||V_n − V_{n+1}||_∞ = max_{s∈S} |V_n(s) − V_{n+1}(s)|



VI stopping criterion

Stopping criterion: stop when the residual of consecutive value functions drops below a small ε:

||V_n − V_{n+1}||_∞ < ε

However, this does not imply that the value of the greedy policy is within ε of the optimal value function.

For general MDPs, theorems exist of the form:

V_n, V* as above ⟹ ∀s |V_n(s) − V*(s)| < ε · (some MDP-dependent term)

In the case of discounted (γ < 1) infinite-horizon MDPs:

V_n, V* as above ⟹ ∀s |V_n(s) − V*(s)| < 2εγ / (1 − γ)

For a proof, see: Singh, Yee: An Upper Bound on the Loss from Approximate Optimal-Value Functions, p. 229.
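For instance, assuming γ = 0.9 and ε = 0.01 (values chosen only for illustration), the guarantee becomes ∀s |V_n(s) − V*(s)| < 2 · 0.01 · 0.9 / (1 − 0.9) = 0.18, so the bound grows rapidly as γ approaches 1.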


VI with stopping criterion

1. Initialize V_0 arbitrarily for each state, e.g. to 0; set n = 0.

2. Set n = n + 1.

3. Compute the Bellman backup, i.e. for each s ∈ S:

3.1 V_n(s) = max_{a∈A} ∑_{s′∈S} T(s, a, s′)[R(s, a, s′) + γ V_{n−1}(s′)]

3.2 Calculate the residual Res = max_{s∈S} |V_n(s) − V_{n−1}(s)|

4. If Res > ε GOTO 2, else TERMINATE.

Question: What is the policy?

• The greedy policy π_{V_n} is given by the argmax of the Bellman backup under V_n:
π_{V_n}(s) = argmax_{a∈A} ∑_{s′∈S} T(s, a, s′)[R(s, a, s′) + γ V_n(s′)]
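A minimal sketch of the full loop and of greedy-policy extraction, reusing the bellman_backup_sweep helper sketched earlier; both helper names and the default eps are illustrative assumptions.

```python
def value_iteration(states, actions, T, R, gamma, eps=1e-6):
    """Iterate synchronous Bellman backups until the residual drops below eps."""
    V = {s: 0.0 for s in states}                 # V_0 initialized to 0
    while True:
        V_new = bellman_backup_sweep(V, states, actions, T, R, gamma)
        res = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if res <= eps:
            return V

def greedy_policy(V, states, actions, T, R, gamma):
    """Extract the greedy policy: argmax over actions of the one-step lookahead."""
    def q(s, a):
        return sum(T.get((s, a, s2), 0.0) *
                   (R.get((s, a, s2), 0.0) + gamma * V[s2]) for s2 in states)
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}
```

Note that only two value tables are alive at any time (V and V_new), which is exactly the memory requirement discussed on the next slide.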


Value iteration properties

• Convergence: VI converges from any initialization (unlike PI).

• Termination: when the residual is "small".

Question: What are the memory requirements of VI?

• The value of each state needs to be stored twice (the current V_n and the previous V_{n−1}).


VI extensions


Another beautiful MDP example

• All undeclared rewards are -1.

Task: Initialize VI with the negative distance to S5 and calculate the first 3 iterations of VI, with state ordering S0 to S5.

See: Mausam, Kolobov: Planning with Markov Decision Processes, p. 41.


Asynchronous VI

1. Initialize V arbitrarily for each state, e.g. to 0.

2. While Res_V > ε, do:

2.1 Pick some state s.

2.2 Bellman backup: V(s) ← max_{a∈A} ∑_{s′∈S} T(s, a, s′)[R(s, a, s′) + γ V(s′)]

2.3 Update the residual at s: Res_V(s) = |V_old(s) − V_new(s)|

Question: Memory requirements compared to VI?

Question: Convergence condition?

• Asymptotic, as for VI, under the condition that every state is visited infinitely often.

Question: How to pick s in 2.1?

• Simplest is Gauss-Seidel VI, i.e. run AVI over all states in a fixed order, sweep after sweep.
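A minimal sketch of Gauss-Seidel asynchronous VI, updating a single value table V in place so that each backup immediately reuses the freshest values; the dict-based T/R representation and the helper name are assumptions for illustration.

```python
def gauss_seidel_vi(states, actions, T, R, gamma, eps=1e-6):
    """Asynchronous VI: back up one state at a time, in place.
    Gauss-Seidel ordering = repeated sweeps over all states in a fixed order."""
    V = {s: 0.0 for s in states}
    while True:
        max_res = 0.0
        for s in states:                              # 2.1: pick states in sweep order
            old = V[s]
            V[s] = max(                               # 2.2: in-place Bellman backup
                sum(T.get((s, a, s2), 0.0) *
                    (R.get((s, a, s2), 0.0) + gamma * V[s2]) for s2 in states)
                for a in actions)
            max_res = max(max_res, abs(V[s] - old))   # 2.3: residual at s
        if max_res <= eps:                            # stop when all residuals are small
            return V
```

Only one value table is stored, which answers the memory question above.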


Prioritized VI

How else can we pick states to update? (Answer: by ordering them in clever ways.)

• Build a priority queue of states to update → Prioritized Sweeping VI.

Update states in the order of the queue. After backing up a state s′, the priority of each predecessor state s is updated as:

priority_PS(s) ← max{priority_PS(s), max_{a∈A} T(s, a, s′) · Res_V(s′)}

EXAMPLE ON BOARD

Convergence?

• If all states start with a non-zero priority,

• OR if you interleave regular VI sweeps with Prioritized VI.
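A minimal sketch of Prioritized Sweeping VI using a lazily updated max-heap: back up the state with the largest priority, then bump the priorities of its predecessors by T(s, a, s′) · Res_V(s′). The heapq-based queue, the initial priority of 1.0 for every state (so that each state is backed up at least once), and the eps cutoff are assumptions for illustration.

```python
import heapq

def prioritized_sweeping_vi(states, actions, T, R, gamma, eps=1e-6):
    """Prioritized Sweeping VI sketch over dict-based T and R."""
    V = {s: 0.0 for s in states}
    # Predecessor map: s' -> list of (s, a) with T(s, a, s') > 0
    preds = {s: [] for s in states}
    for (s, a, s2), p in T.items():
        if p > 0:
            preds[s2].append((s, a))

    def backup(s):
        return max(sum(T.get((s, a, s2), 0.0) *
                       (R.get((s, a, s2), 0.0) + gamma * V[s2]) for s2 in states)
                   for a in actions)

    priority = {s: 1.0 for s in states}          # non-zero start for every state
    queue = [(-1.0, s) for s in states]          # max-heap via negated priorities
    heapq.heapify(queue)

    while queue:
        neg_p, s2 = heapq.heappop(queue)
        if -neg_p != priority[s2]:
            continue                             # stale heap entry, skip
        if priority[s2] <= eps:
            break                                # largest remaining priority is tiny
        new_v = backup(s2)
        res = abs(new_v - V[s2])                 # Res_V(s') after this backup
        V[s2] = new_v
        priority[s2] = 0.0
        for s, a in preds[s2]:                   # bump predecessors' priorities
            p = T[(s, a, s2)] * res
            if p > priority[s]:
                priority[s] = p
                heapq.heappush(queue, (-p, s))   # lazy insert; old entries go stale
    return V
```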


Partitioned VI

Subdivide the state space into partitions and solve the parts independently.

• An example is Topological VI.

Topological VI:

1. Find an acyclic partitioning (by finding the strongly connected components of the transition graph).

2. Run VI in each partition to convergence, proceeding backward from the terminal states.

Question: Why an acyclic partitioning?

EXAMPLE ON BOARD
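A minimal sketch of Topological VI: build the transition graph from the sparse T dictionary, take its condensation into strongly connected components (the acyclic partitioning), and run VI to convergence on each component in reverse topological order, so that every component's successors are already solved. The use of networkx and the dict-based T/R representation are assumptions for illustration.

```python
import networkx as nx

def topological_vi(states, actions, T, R, gamma, eps=1e-6):
    """Topological VI sketch: solve SCCs backward from the terminal states."""
    # Directed transition graph s -> s' for every T(s, a, s') > 0.
    G = nx.DiGraph()
    G.add_nodes_from(states)
    for (s, a, s2), p in T.items():
        if p > 0:
            G.add_edge(s, s2)

    # The condensation is a DAG whose nodes are the SCCs (acyclic partitioning).
    C = nx.condensation(G)
    order = list(nx.topological_sort(C))

    V = {s: 0.0 for s in states}
    # Process components last-to-first, so successor components are already solved.
    for comp in reversed(order):
        members = C.nodes[comp]["members"]
        while True:                               # VI restricted to this component
            res = 0.0
            for s in members:
                old = V[s]
                V[s] = max(
                    sum(T.get((s, a, s2), 0.0) *
                        (R.get((s, a, s2), 0.0) + gamma * V[s2]) for s2 in states)
                    for a in actions)
                res = max(res, abs(V[s] - old))
            if res <= eps:
                break
    return V
```

The acyclic partitioning is what allows each component to be solved once and never revisited: values only flow backward along the condensation DAG.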
