
CS 6789: Foundations of Reinforcement Learning

Fall 2020

Sham Kakade (UW & MSR) & Wen Sun

https://wensun.github.io/CS6789.html

TA: Jonathan Chang

Progress of RL in Practice

TD-Gammon [Tesauro 95], AlphaZero [Silver et al., 17], OpenAI Five [OpenAI, 18]

This course focuses on RL Theory

We care about sample complexity:

Total number of environment interactions needed to learn a high quality policy (i.e., achieve the task)

Four main themes we will cover in this course:

1. Exploration strategies (not just random)

2. Policy Optimization (gradient descent)

3. Control (LQR and nonlinear control)

4. Imitation Learning (i.e., learning from demonstrations)

Logistics

Four (HW0-HW3) assignments (total 55%) and Course Project (45%)

HW0 is out today and due in one week

(HW0 10%, HW1-3 15% each)

Prerequisites (HW0)

Deep understanding of Machine Learning, Optimization, Statistics

ML: sample complexity analysis for supervised learning (PAC)

Opt: Convex (linear) optimization, e.g., gradient descent for convex functions

Stats: basics of concentration (e.g., Hoeffding’s), tricks such as the union bound

Check out HW0 asap!

Course projects (45%)

• Team size: 3

• Midterm report (5%), final presentation (15%), and final report (25%)

• Basics: survey a set of similar RL theory papers, reproduce the analysis, and provide a coherent story

• Advanced: identify extensions of existing RL papers, formulate theory questions, and provide proofs

Course Notes

We will use the monograph Reinforcement Learning: Theory and Algorithms

https://rltheorybook.github.io/rl_monograph_AJK_V2.pdf

Basics of Markov Decision Processes

Supervised Learning

Given i.i.d. examples at training time: (image, “cat”), (image, “cat”), (image, “dog”), …

Learn a predictor f ∈ F.

Passive: the data come from a fixed data distribution; the learner only makes predictions and does not influence which examples it sees.

Markov Decision Process

The learning agent and the environment interact over multiple steps:

Policy: the agent determines its action based on the current state, π(s) → a

Environment: sends back a reward and the next state drawn from Markovian transition dynamics, r(s, a), s′ ∼ P( ⋅ |s, a)
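To make the interaction protocol concrete, here is a minimal sketch in Python of the agent-environment loop for a tabular MDP (the array layout, the `rollout` helper, and the toy MDP are illustrative assumptions, not part of the course material): at each step the agent picks a ∼ π(s), and the environment returns r(s, a) and s′ ∼ P(⋅|s, a).

```python
import numpy as np

def rollout(P, r, policy, s0, horizon, seed=0):
    """Simulate the agent-environment loop: a_h = policy(s_h), then the
    environment returns r(s_h, a_h) and s_{h+1} ~ P(.|s_h, a_h).
    P has shape (S, A, S); r has shape (S, A)."""
    rng = np.random.default_rng(seed)
    s, trajectory = s0, []
    for _ in range(horizon):
        a = policy(s)                                # agent: pi(s) -> a
        reward = r[s, a]                             # environment: reward r(s, a)
        s_next = rng.choice(P.shape[2], p=P[s, a])   # environment: s' ~ P(.|s, a)
        trajectory.append((s, a, reward))
        s = s_next
    return trajectory

# Tiny made-up 2-state, 2-action MDP, purely for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
r = np.array([[0.0, 1.0],
              [0.5, 0.0]])                 # r[s, a]
print(rollout(P, r, policy=lambda s: 1 - s, s0=0, horizon=5))
```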

Supervised Learning vs. Reinforcement Learning

[Table from slides; columns: Learn from Experience, Generalize, Interactive, Exploration, Credit assignment; rows: Supervised Learning, Reinforcement Learning. In contrast to passive supervised learning, reinforcement learning is interactive and must handle exploration and credit assignment.]

Table content based on slides from Emma Brunskill

Infinite Horizon Discounted Setting

ℳ = {S, A, P, r, γ}, where P : S × A ↦ Δ(S), r : S × A → [0,1], γ ∈ [0,1)

Policy: π : S ↦ Δ(A)

Value function: V^π(s) = 𝔼[ ∑_{h=0}^∞ γ^h r(s_h, a_h) | s_0 = s, a_h ∼ π(s_h), s_{h+1} ∼ P(⋅|s_h, a_h) ]

Q function: Q^π(s, a) = 𝔼[ ∑_{h=0}^∞ γ^h r(s_h, a_h) | (s_0, a_0) = (s, a), a_h ∼ π(s_h), s_{h+1} ∼ P(⋅|s_h, a_h) ]

Bellman Equation:

V^π(s) = 𝔼_{a∼π(s)}[ r(s, a) + γ 𝔼_{s′∼P(⋅|s,a)} V^π(s′) ]

Q^π(s, a) = r(s, a) + γ 𝔼_{s′∼P(⋅|s,a)} V^π(s′)

Optimal Policy

For an infinite horizon discounted MDP, there exists a deterministic stationary policy π⋆ : S ↦ A such that V^{π⋆}(s) ≥ V^π(s) for all s and all π.

[Puterman 94, Chapter 6; also see Theorem 1.4 in the RL monograph]

We denote V⋆ := V^{π⋆}, Q⋆ := Q^{π⋆}.

Theorem 1: Bellman Optimality

V⋆(s) = max_a [ r(s, a) + γ 𝔼_{s′∼P(⋅|s,a)} V⋆(s′) ]

Proof of Bellman Optimality

Denote π̂(s) := argmax_a Q⋆(s, a); we will prove V^{π̂}(s) = V⋆(s) for all s.

V⋆(s) = r(s, π⋆(s)) + γ 𝔼_{s′∼P(⋅|s,π⋆(s))} V⋆(s′)

≤ max_a [ r(s, a) + γ 𝔼_{s′∼P(⋅|s,a)} V⋆(s′) ] = r(s, π̂(s)) + γ 𝔼_{s′∼P(⋅|s,π̂(s))} V⋆(s′)

= r(s, π̂(s)) + γ 𝔼_{s′∼P(⋅|s,π̂(s))} [ r(s′, π⋆(s′)) + γ 𝔼_{s′′∼P(⋅|s′,π⋆(s′))} V⋆(s′′) ]

≤ r(s, π̂(s)) + γ 𝔼_{s′∼P(⋅|s,π̂(s))} [ r(s′, π̂(s′)) + γ 𝔼_{s′′∼P(⋅|s′,π̂(s′))} V⋆(s′′) ]

≤ r(s, π̂(s)) + γ 𝔼_{s′∼P(⋅|s,π̂(s))} [ r(s′, π̂(s′)) + γ 𝔼_{s′′∼P(⋅|s′,π̂(s′))} [ r(s′′, π̂(s′′)) + γ 𝔼_{s′′′∼P(⋅|s′′,π̂(s′′))} V⋆(s′′′) ] ]

≤ … ≤ 𝔼[ r(s, π̂(s)) + γ r(s′, π̂(s′)) + γ² r(s′′, π̂(s′′)) + … ] = V^{π̂}(s)

Since π⋆ is optimal, we also have V^{π̂}(s) ≤ V⋆(s); hence V^{π̂}(s) = V⋆(s) for all s.

This implies that π̂(s) = argmax_a Q⋆(s, a) is an optimal policy.
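The Bellman optimality equation also suggests the classic fixed-point iteration, value iteration: repeatedly apply V ← max_a [ r(s, a) + γ 𝔼 V(s′) ], then read off the greedy policy π̂(s) = argmax_a Q⋆(s, a), which by Theorem 1 is optimal. A hedged sketch on a small tabular MDP (the example MDP and names are illustrative):

```python
import numpy as np

def value_iteration(P, r, gamma, num_iters=1000):
    """Iterate V <- max_a [ r(s,a) + gamma * E_{s'~P(.|s,a)} V(s') ], then
    return the greedy policy pi_hat(s) = argmax_a Q*(s, a)."""
    V = np.zeros(P.shape[0])
    for _ in range(num_iters):
        V = (r + gamma * P @ V).max(axis=1)   # Bellman optimality backup
    Q = r + gamma * P @ V                     # (approximate) Q*
    return V, Q, Q.argmax(axis=1)             # greedy deterministic stationary policy

# Illustrative 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0], [0.5, 0.0]])
V_star, Q_star, pi_hat = value_iteration(P, r, gamma=0.9)
print(V_star, pi_hat)
```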

Theorem 2:

For any V : S → ℝ, if V(s) = max_a [ r(s, a) + γ 𝔼_{s′∼P(⋅|s,a)} V(s′) ] for all s, then V(s) = V⋆(s) for all s.

Proof:

|V(s) − V⋆(s)| = | max_a ( r(s, a) + γ 𝔼_{s′∼P(⋅|s,a)} V(s′) ) − max_a ( r(s, a) + γ 𝔼_{s′∼P(⋅|s,a)} V⋆(s′) ) |

≤ max_a | ( r(s, a) + γ 𝔼_{s′∼P(⋅|s,a)} V(s′) ) − ( r(s, a) + γ 𝔼_{s′∼P(⋅|s,a)} V⋆(s′) ) |

≤ max_a γ 𝔼_{s′∼P(⋅|s,a)} | V(s′) − V⋆(s′) |

≤ max_a γ 𝔼_{s′∼P(⋅|s,a)} ( max_{a′} γ 𝔼_{s′′∼P(⋅|s′,a′)} | V(s′′) − V⋆(s′′) | )

≤ max_{a_0, a_1, …, a_{k−1}} γ^k 𝔼_{s_k} | V(s_k) − V⋆(s_k) |

Since γ < 1 and |V − V⋆| is bounded, letting k → ∞ gives V(s) = V⋆(s) for all s.
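The key step above is that the Bellman optimality operator is a γ-contraction in the sup norm, which is what forces V = V⋆. A small numerical sanity check of the contraction property on a randomly generated tabular MDP (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# Random illustrative tabular MDP.
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
r = rng.uniform(size=(S, A))

def bellman_opt(V):
    """(T V)(s) = max_a [ r(s, a) + gamma * E_{s'~P(.|s,a)} V(s') ]."""
    return (r + gamma * P @ V).max(axis=1)

# Check ||T V - T W||_inf <= gamma * ||V - W||_inf for random V, W.
for _ in range(5):
    V, W = rng.normal(size=S), rng.normal(size=S)
    lhs = np.abs(bellman_opt(V) - bellman_opt(W)).max()
    rhs = gamma * np.abs(V - W).max()
    assert lhs <= rhs + 1e-12
print("contraction check passed")
```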

Finite Horizon Setting

ℳ = {S, A, P, r, μ_0, H}, where P : S × A ↦ Δ(S), r : S × A → [0,1], H ∈ ℕ^+, μ_0 ∈ Δ(S)

Given π : S ↦ Δ(A), an episode is generated as
s_0 ∼ μ_0, a_0 ∼ π(s_0), r(s_0, a_0), s_1 ∼ P(⋅|s_0, a_0), …, s_{H−1} ∼ P(⋅|s_{H−2}, a_{H−2}), a_{H−1} ∼ π(s_{H−1}), r(s_{H−1}, a_{H−1})

Objective function: J(π) = 𝔼[ ∑_{h=0}^{H−1} r(s_h, a_h) ]

Time-dependent value/Q functions:

V^π_h(s) = 𝔼[ ∑_{t=h}^{H−1} r(s_t, a_t) | s_h = s, a_t ∼ π(s_t) ],  Q^π_h(s, a) = 𝔼[ ∑_{t=h}^{H−1} r(s_t, a_t) | s_h = s, a_h = a, a_t ∼ π(s_t) ]

Bellman equation: V^π_h(s_h) = 𝔼_{a_h∼π(s_h)}[ r(s_h, a_h) + 𝔼_{s_{h+1}∼P(⋅|s_h,a_h)} V^π_{h+1}(s_{h+1}) ]

Time-dependent optimal policy: π⋆ = {π⋆_0, …, π⋆_{H−1}}, computed by backward dynamic programming:

Q⋆_{H−1}(s, a) = r(s, a),  π⋆_{H−1}(s) = argmax_a Q⋆_{H−1}(s, a),  V⋆_{H−1}(s) = max_a Q⋆_{H−1}(s, a)

Q⋆_h(s, a) = r(s, a) + 𝔼_{s′∼P(⋅|s,a)} V⋆_{h+1}(s′),  π⋆_h(s) = argmax_a Q⋆_h(s, a),  V⋆_h(s) = max_a Q⋆_h(s, a)
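This backward recursion can be run directly on a tabular finite-horizon MDP. A minimal backward-induction sketch (array layout and names are assumptions for illustration):

```python
import numpy as np

def backward_induction(P, r, H):
    """Compute Q*_h, V*_h, pi*_h for h = H-1, ..., 0 via
    Q*_h(s,a) = r(s,a) + E_{s'~P(.|s,a)} V*_{h+1}(s'),  V*_h = max_a Q*_h,
    with Q*_{H-1}(s,a) = r(s,a)."""
    S, A = r.shape
    V = np.zeros((H + 1, S))          # convention: V[H] = 0 (terminal value)
    Q = np.zeros((H, S, A))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q[h] = r + P @ V[h + 1]       # no discounting in the finite-horizon setting
        V[h] = Q[h].max(axis=1)
        pi[h] = Q[h].argmax(axis=1)   # time-dependent optimal policy pi*_h
    return Q, V, pi

# Illustrative 2-state, 2-action MDP with horizon H = 3.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0], [0.5, 0.0]])
Q, V, pi = backward_induction(P, r, H=3)
print(V[0], pi)   # V*_0 and the time-dependent greedy policies
```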

State (Action) Occupancy

ℙ_h(s; s_0, π): probability that π visits s at time step h ∈ ℕ, starting at s_0

d^π_{s_0}(s) = (1 − γ) ∑_{h=0}^∞ γ^h ℙ_h(s; s_0, π),   V^π(s_0) = (1/(1 − γ)) ∑_{s,a} d^π_{s_0}(s) π(a|s) r(s, a)

ℙ_h(s, a; s_0, π): probability that π visits (s, a) at time step h ∈ ℕ, starting at s_0

d^π_{s_0}(s, a) = (1 − γ) ∑_{h=0}^∞ γ^h ℙ_h(s, a; s_0, π),   V^π(s_0) = (1/(1 − γ)) ∑_{s,a} d^π_{s_0}(s, a) r(s, a)
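For a tabular MDP, the discounted occupancy has the closed form d^π_{s_0} = (1 − γ) e_{s_0}^⊤ (I − γ P^π)^{−1}, and the identity V^π(s_0) = (1/(1 − γ)) ∑_{s,a} d^π_{s_0}(s, a) r(s, a) can be checked numerically. A sketch (illustrative names and toy MDP):

```python
import numpy as np

def occupancy(P, pi, gamma, s0):
    """d^pi_{s0}(s) = (1 - gamma) * sum_h gamma^h P_h(s; s0, pi), computed
    in closed form as (1 - gamma) * e_{s0}^T (I - gamma * P^pi)^{-1}."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)                 # state transitions under pi
    d_s = (1 - gamma) * np.linalg.solve(
        (np.eye(S) - gamma * P_pi).T, np.eye(S)[s0])      # state occupancy d(s)
    d_sa = d_s[:, None] * pi                              # d(s, a) = d(s) * pi(a|s)
    return d_s, d_sa

# Check V^pi(s0) = 1/(1 - gamma) * sum_{s,a} d(s, a) r(s, a) on a toy MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0], [0.5, 0.0]])
pi = np.full((2, 2), 0.5)
gamma, s0 = 0.9, 0
d_s, d_sa = occupancy(P, pi, gamma, s0)

# V^pi(s0) from the Bellman linear system, for comparison.
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum("sa,sat->st", pi, P)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
assert np.isclose(V[s0], (d_sa * r).sum() / (1 - gamma))
print(d_sa, V[s0])
```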
