CS 6789: Foundations of Reinforcement Learning
Fall 2020
Sham Kakade (UW & MSR) & Wen Sun
https://wensun.github.io/CS6789.html
TA: Jonathan Chang
Progress of RL in Practice
TD-Gammon [Tesauro 95], [AlphaZero, Silver et al., 17], [OpenAI Five, 18]
This course focuses on RL Theory
We care about sample complexity:
Total number of environment interactions needed to learn a high quality policy (i.e., achieve the task)
Four main themes we will cover in this course:
1. Exploration strategies (not just random)
2. Policy Optimization (gradient descent)
3. Control (LQR and nonlinear control)
4. Imitation Learning (i.e., learning from demonstrations)
Logistics
Four assignments (HW0–HW3, total 55%) and a Course Project (45%)
HW0 is out today and due in one week
(HW0 10%, HW1-3 15% each)
Prerequisites (HW0)
Deep understanding of Machine Learning, Optimization, Statistics
ML: sample complexity analysis for supervised learning (PAC)
Opt: Convex (linear) optimization, e.g., gradient descent for convex functions
Stats: basics of concentration (e.g., Hoeffding’s), tricks such as union bound
Check out HW0 asap!
Course projects (45%)
• Team work: size 3
• Midterm report (5%), Final presentation (15%), and Final report (25%)
• Basics: survey of a set of similar RL theory papers. Reproduce analysis and provide a coherent story
• Advanced: identify extensions of existing RL papers, formulate theory questions, and provide proofs
Course Notes
We will use the monograph: Reinforcement Learning Theory and Algorithms
https://rltheorybook.github.io/rl_monograph_AJK_V2.pdf
Basics of Markov Decision Processes
Supervised Learning
Given i.i.d. examples at training: (image, cat), (image, cat), (image, dog), learn a predictor f ∈ F
Passive: prediction under a fixed data distribution
Markov Decision Process
Learning Agent ⟷ Environment, over multiple steps:
Policy: determine action based on state, π(s) → a
Environment sends reward and next state from a Markovian transition dynamics: r(s, a), s′ ∼ P( ⋅ |s, a)
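Below is a minimal sketch of this interaction protocol for a tabular MDP, assuming transition probabilities P (an S × A × S array), rewards r (S × A), and a stochastic policy pi (S × A) are given as NumPy arrays; the names and the rollout helper are illustrative, not part of the course materials.

```python
import numpy as np

def rollout(P, r, pi, s0, num_steps, rng=np.random.default_rng(0)):
    """Generate one trajectory: at each step the agent picks a ~ pi(s),
    the environment returns r(s, a) and the next state s' ~ P(.|s, a)."""
    s, traj = s0, []
    for _ in range(num_steps):
        a = rng.choice(P.shape[1], p=pi[s])          # a ~ pi(s)
        s_next = rng.choice(P.shape[2], p=P[s, a])   # s' ~ P(.|s, a)
        traj.append((s, a, r[s, a]))
        s = s_next
    return traj
```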
Supervised Learning vs. Reinforcement Learning (table content based on slides from Emma Brunskill):
                         Learn from Experience | Generalize | Interactive | Exploration | Credit assignment
Supervised Learning               yes          |    yes     |     no      |     no      |        no
Reinforcement Learning            yes          |    yes     |     yes     |     yes     |        yes
Infinite Horizon Discounted Setting
ℳ = {S, A, P, r, γ}, where P : S × A ↦ Δ(S), r : S × A → [0,1], γ ∈ [0,1)
Policy π : S ↦ Δ(A)
Value function: Vπ(s) = 𝔼[ ∑_{h=0}^{∞} γ^h r(sh, ah) | s0 = s, ah ∼ π(sh), sh+1 ∼ P( ⋅ |sh, ah) ]
Q function: Qπ(s, a) = 𝔼[ ∑_{h=0}^{∞} γ^h r(sh, ah) | (s0, a0) = (s, a), ah ∼ π(sh), sh+1 ∼ P( ⋅ |sh, ah) ]
Bellman Equation:
Vπ(s) = 𝔼a∼π(s) [ r(s, a) + γ 𝔼s′∼P(⋅|s,a) Vπ(s′) ]
Qπ(s, a) = r(s, a) + γ 𝔼s′∼P(⋅|s,a) Vπ(s′)
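As an illustration (not from the slides): in the tabular case the Bellman equation is a linear system, so Vπ can be computed exactly as Vπ = (I − γ Pπ)⁻¹ rπ, where Pπ and rπ are the transition matrix and reward vector induced by π. The sketch below assumes the same tabular arrays P, r, pi as above.

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma):
    """Exact policy evaluation for a tabular MDP: solve (I - gamma * P_pi) V = r_pi."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    r_pi = np.einsum("sa,sa->s", pi, r)     # r_pi[s]     = sum_a pi(a|s) r(s, a)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```

The returned V satisfies the Bellman equation above, which can be checked numerically by comparing V against r_pi + gamma * P_pi @ V.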
Optimal Policy
For an infinite-horizon discounted MDP, there exists a deterministic stationary policy π⋆ : S ↦ A such that Vπ⋆(s) ≥ Vπ(s), ∀s, π.
[Puterman 94, Chapter 6; also see Theorem 1.4 in the RL monograph]
We denote V⋆ := Vπ⋆, Q⋆ := Qπ⋆.
Theorem 1 (Bellman Optimality): V⋆(s) = maxa [ r(s, a) + γ 𝔼s′∼P(⋅|s,a) V⋆(s′) ]

Proof of Bellman Optimality
Denote π̂(s) := arg maxa Q⋆(s, a); we will prove Vπ̂(s) = V⋆(s), ∀s:
V⋆(s) = r(s, π⋆(s)) + γ 𝔼s′∼P(⋅|s,π⋆(s)) V⋆(s′)
≤ maxa [ r(s, a) + γ 𝔼s′∼P(⋅|s,a) V⋆(s′) ] = r(s, π̂(s)) + γ 𝔼s′∼P(⋅|s,π̂(s)) V⋆(s′)
= r(s, π̂(s)) + γ 𝔼s′∼P(⋅|s,π̂(s)) [ r(s′, π⋆(s′)) + γ 𝔼s′′∼P(⋅|s′,π⋆(s′)) V⋆(s′′) ]
≤ r(s, π̂(s)) + γ 𝔼s′∼P(⋅|s,π̂(s)) [ r(s′, π̂(s′)) + γ 𝔼s′′∼P(⋅|s′,π̂(s′)) V⋆(s′′) ]
≤ r(s, π̂(s)) + γ 𝔼s′∼P(⋅|s,π̂(s)) [ r(s′, π̂(s′)) + γ 𝔼s′′∼P(⋅|s′,π̂(s′)) [ r(s′′, π̂(s′′)) + γ 𝔼s′′′∼P(⋅|s′′,π̂(s′′)) V⋆(s′′′) ] ]
≤ … ≤ 𝔼 [ r(s, π̂(s)) + γ r(s′, π̂(s′)) + γ² r(s′′, π̂(s′′)) + … ] = Vπ̂(s)
Since Vπ̂(s) ≤ V⋆(s) by definition of π⋆, every inequality above is an equality; in particular V⋆(s) = maxa [ r(s, a) + γ 𝔼s′∼P(⋅|s,a) V⋆(s′) ] and Vπ̂(s) = V⋆(s), ∀s.
This implies that π̂(s) = arg maxa Q⋆(s, a) is an optimal policy.
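A minimal value-iteration sketch (illustrative, assuming the same tabular P and r arrays as before): it repeatedly applies the Bellman optimality operator and then reads off the greedy policy π̂(s) = arg maxa Q⋆(s, a), which Theorem 1 shows is optimal.

```python
import numpy as np

def value_iteration(P, r, gamma, num_iters=1000):
    """Approximate V* by iterating (T V)(s) = max_a [ r(s,a) + gamma * E_{s'~P(.|s,a)} V(s') ]."""
    V = np.zeros(P.shape[0])
    for _ in range(num_iters):
        Q = r + gamma * P @ V        # Q[s,a] = r(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
        V = Q.max(axis=1)            # apply the Bellman optimality operator
    return V, Q.argmax(axis=1)       # (approximate) V*, and the greedy policy pi_hat
```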
Proof of Bellman Optimality (uniqueness)
Theorem 2: For any V : S → ℝ, if V(s) = maxa [ r(s, a) + γ 𝔼s′∼P(⋅|s,a) V(s′) ] for all s, then V(s) = V⋆(s), ∀s.
|V(s) − V⋆(s)| = | maxa ( r(s, a) + γ 𝔼s′∼P(⋅|s,a) V(s′) ) − maxa ( r(s, a) + γ 𝔼s′∼P(⋅|s,a) V⋆(s′) ) |
≤ maxa | ( r(s, a) + γ 𝔼s′∼P(⋅|s,a) V(s′) ) − ( r(s, a) + γ 𝔼s′∼P(⋅|s,a) V⋆(s′) ) |
≤ maxa γ 𝔼s′∼P(⋅|s,a) | V(s′) − V⋆(s′) |
≤ maxa γ 𝔼s′∼P(⋅|s,a) ( maxa′ γ 𝔼s′′∼P(⋅|s′,a′) | V(s′′) − V⋆(s′′) | )
≤ maxa1,a2,…,ak−1 γ^k 𝔼sk | V(sk) − V⋆(sk) |
Since |V − V⋆| is bounded, letting k → ∞ sends the right-hand side to 0, hence V(s) = V⋆(s), ∀s.
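The proof of Theorem 2 is really a contraction argument: one application of the Bellman optimality operator shrinks ‖V − V⋆‖∞ by a factor of γ. A small numerical illustration on a random tabular MDP (the sizes and names below are arbitrary assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
r = rng.uniform(size=(S, A))

def bellman_opt(V):
    """(T V)(s) = max_a [ r(s,a) + gamma * E_{s'~P(.|s,a)} V(s') ]."""
    return (r + gamma * P @ V).max(axis=1)

V, W = rng.normal(size=S), rng.normal(size=S)
lhs = np.abs(bellman_opt(V) - bellman_opt(W)).max()
rhs = gamma * np.abs(V - W).max()
print(lhs <= rhs + 1e-12)   # gamma-contraction in the infinity norm: always True
```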
Finite Horizon Setting
ℳ = {S, A, P, r, μ0, H}, where P : S × A ↦ Δ(S), r : S × A → [0,1], H ∈ ℕ+, μ0 ∈ Δ(S)
Given π : S ↦ Δ(A), a trajectory is generated as:
s0 ∼ μ0, a0 ∼ π(s0), r(s0, a0), s1 ∼ P( ⋅ |s0, a0), …, sH−1 ∼ P( ⋅ |sH−2, aH−2), aH−1 ∼ π(sH−1), r(sH−1, aH−1)
Objective function: J(π) = 𝔼[ ∑_{h=0}^{H−1} r(sh, ah) ]
Time-dependent value/Q functions:
Vπh(s) = 𝔼[ ∑_{t=h}^{H−1} r(st, at) | sh = s, at ∼ π(st) ],   Qπh(s, a) = 𝔼[ ∑_{t=h}^{H−1} r(st, at) | sh = s, ah = a, at ∼ π(st) ]
Bellman equation: Vπh(sh) = 𝔼ah∼π(sh) [ r(sh, ah) + 𝔼sh+1∼P(⋅|sh,ah) Vπh+1(sh+1) ]
Time-dependent optimal policy π⋆ = {π⋆0, …, π⋆H−1}, computed backwards from h = H−1:
Q⋆H−1(s, a) = r(s, a),   π⋆H−1(s) = arg maxa Q⋆H−1(s, a),   V⋆H−1(s) = maxa Q⋆H−1(s, a)
Q⋆h(s, a) = r(s, a) + 𝔼s′∼P(⋅|s,a) V⋆h+1(s′),   π⋆h(s) = arg maxa Q⋆h(s, a),   V⋆h(s) = maxa Q⋆h(s, a)
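These recursions give a backward dynamic program; a minimal sketch for the tabular case (assuming P and r as before, and using the convention V⋆_H = 0 so that Q⋆_{H−1}(s, a) = r(s, a)):

```python
import numpy as np

def finite_horizon_dp(P, r, H):
    """Backward DP: returns time-dependent optimal values V*_h and greedy policies pi*_h."""
    S, A = r.shape
    V = np.zeros((H + 1, S))              # convention: V*_H = 0
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r + P @ V[h + 1]              # Q*_h(s,a) = r(s,a) + E_{s'~P(.|s,a)} V*_{h+1}(s')
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)
    return V, pi
```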
State (Action) Occupancy
ℙh(s; s0, π): probability of π visiting s at time step h ∈ ℕ, starting at s0
dπs0(s) = (1 − γ) ∑_{h=0}^{∞} γ^h ℙh(s; s0, π),   Vπ(s0) = 1/(1 − γ) ∑s,a dπs0(s) π(a|s) r(s, a)
Similarly, ℙh(s, a; s0, π): probability of π visiting (s, a) at time step h ∈ ℕ, starting at s0
dπs0(s, a) = (1 − γ) ∑_{h=0}^{∞} γ^h ℙh(s, a; s0, π),   Vπ(s0) = 1/(1 − γ) ∑s,a dπs0(s, a) r(s, a)
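A small sketch (illustrative, not from the slides) computing the discounted occupancy dπ_{s0} in the tabular case and checking the identity Vπ(s0) = 1/(1 − γ) ∑_{s,a} dπ_{s0}(s, a) r(s, a); it assumes the same tabular P, r, pi arrays as earlier.

```python
import numpy as np

def occupancy(P, r, pi, gamma, s0):
    """Discounted state-action occupancy d^pi_{s0} and the induced value V^pi(s0)."""
    S, A = r.shape
    P_pi = np.einsum("sa,sat->st", pi, P)                    # state-to-state transitions under pi
    # d(s) = (1-gamma) * sum_h gamma^h P_h(s; s0, pi) = (1-gamma) * [(I - gamma P_pi^T)^{-1} e_{s0}](s)
    d_state = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, np.eye(S)[s0])
    d_sa = d_state[:, None] * pi                             # d(s, a) = d(s) * pi(a|s) for stationary pi
    V_s0 = (d_sa * r).sum() / (1 - gamma)                    # equals V^pi(s0) by the identity above
    return d_sa, V_s0
```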