
Page 1:

CS 6789: Foundations of Reinforcement Learning

Fall 2020

Sham Kakade (UW & MSR) & Wen Sun

https://wensun.github.io/CS6789.html

TA: Jonathan Chang

Page 2:

Progress of RL in Practice

TD-Gammon [Tesauro 95], [AlphaZero, Silver et al., 17], [OpenAI Five, 18]

Page 3:

This course focuses on RL Theory

We care about sample complexity:

the total number of environment interactions needed to learn a high-quality policy (i.e., to achieve the task)

Page 4:

Four main themes we will cover in this course:

1. Exploration strategies (not just random)

2. Policy Optimization (gradient descent)

3. Control (LQR and nonlinear control)

4. Imitation Learning (i.e., learning from demonstrations)

Page 5:

Logistics

Four assignments (HW0–HW3, total 55%) and a Course Project (45%)

HW0 is out today and due in one week

(HW0 10%, HW1-3 15% each)

Pages 6–7:

Prerequisites (HW0)

Deep understanding of Machine Learning, Optimization, and Statistics

ML: sample complexity analysis for supervised learning (PAC)

Opt: convex (linear) optimization, e.g., gradient descent for convex functions

Stats: basics of concentration (e.g., Hoeffding's inequality), tricks such as the union bound

Check out HW0 asap!

Page 8:

Course projects (45%)

• Team work: groups of size 3

• Midterm report (5%), Final presentation (15%), and Final report (25%)

• Basics: survey a set of related RL theory papers, reproduce the analyses, and provide a coherent story

• Advanced: identify extensions of existing RL papers, formulate theory questions, and provide proofs

Page 9:

Course Notes

We will use the monograph Reinforcement Learning: Theory and Algorithms

https://rltheorybook.github.io/rl_monograph_AJK_V2.pdf

Page 10:

Basics of Markov Decision Processes

Pages 11–15:

Supervised Learning

Given i.i.d. examples at training: (image, cat), (image, cat), (image, dog), …

Learn a predictor f ∈ F

Passive: prediction under a fixed data distribution; at test time, predict labels for new examples drawn from the same distribution [test images labeled "Cersei" and "Tasha" on the slide]

Pages 16–20:

Markov Decision Process

[Diagram: a Learning Agent interacting with the Environment over multiple steps]

Policy: determine action based on state, π(s) → a

The environment sends a reward and the next state from a Markovian transition dynamics: r(s, a), s′ ∼ P(⋅ | s, a)
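To make the interaction protocol above concrete, here is a minimal sketch of the agent-environment loop on a toy tabular MDP; the arrays P, r, and policy and their shapes are illustrative assumptions, not part of the slides.

import numpy as np

# Minimal sketch of the agent-environment loop (toy tabular MDP; names and sizes are assumptions).
rng = np.random.default_rng(0)
num_states, num_actions = 3, 2
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))  # P[s, a] = distribution over next states
r = rng.uniform(size=(num_states, num_actions))                         # r(s, a) in [0, 1]
policy = rng.integers(num_actions, size=num_states)                     # deterministic policy: pi(s) -> a

s = 0
for h in range(10):
    a = policy[s]                          # agent chooses an action from the current state
    reward = r[s, a]                       # environment reveals the reward r(s, a)...
    s = rng.choice(num_states, p=P[s, a])  # ...and the next state s' ~ P(. | s, a)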

Pages 21–26:

[Table comparing Supervised Learning and Reinforcement Learning along: Learn from Experience, Generalize, Interactive, Exploration, Credit assignment; the entries are filled in incrementally across these slides]

Table content based on slides from Emma Brunskill

Pages 27–30:

Infinite Horizon Discounted Setting

ℳ = {S, A, P, r, γ}, with P : S × A ↦ Δ(S), r : S × A → [0, 1], γ ∈ [0, 1)

Policy: π : S ↦ Δ(A)

Value function: V^π(s) = 𝔼[ ∑_{h=0}^{∞} γ^h r(s_h, a_h) | s_0 = s, a_h ∼ π(s_h), s_{h+1} ∼ P(⋅ | s_h, a_h) ]

Q function: Q^π(s, a) = 𝔼[ ∑_{h=0}^{∞} γ^h r(s_h, a_h) | (s_0, a_0) = (s, a), a_h ∼ π(s_h) for h ≥ 1, s_{h+1} ∼ P(⋅ | s_h, a_h) ]
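Since V^π(s) is a discounted sum over an infinite horizon, one simple way to approximate it is with Monte Carlo rollouts truncated at an effective horizon; the sketch below assumes a toy tabular MDP with arrays P, r, and a deterministic policy (illustrative names, not from the slides).

import numpy as np

# Sketch: Monte Carlo estimate of V^pi(s0), truncating the discounted sum once gamma^h is negligible.
def monte_carlo_value(P, r, policy, s0, gamma=0.9, num_rollouts=1000, seed=0):
    rng = np.random.default_rng(seed)
    num_states = P.shape[0]
    horizon = int(np.ceil(np.log(1e-6) / np.log(gamma)))   # gamma^horizon <= 1e-6
    returns = []
    for _ in range(num_rollouts):
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]
            total += discount * r[s, a]
            discount *= gamma
            s = rng.choice(num_states, p=P[s, a])
        returns.append(total)
    return np.mean(returns)   # biased only by truncation, at most gamma^horizon / (1 - gamma)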

Pages 31–34:

Bellman Equation:

V^π(s) = 𝔼_{a ∼ π(s)} [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^π(s′) ]

Q^π(s, a) = r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^π(s′)

(with V^π and Q^π the discounted value and Q functions defined above)
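For a fixed policy, the Bellman equation is a linear system in V^π, so in the tabular case it can be solved exactly; the sketch below assumes P as an |S| x |A| x |S| array, r as |S| x |A|, and a deterministic policy array (illustrative assumptions, not from the slides).

import numpy as np

# Sketch: exact policy evaluation by solving the linear Bellman equation V^pi = r_pi + gamma * P_pi V^pi.
def policy_evaluation(P, r, policy, gamma):
    num_states = P.shape[0]
    P_pi = P[np.arange(num_states), policy]   # P_pi[s, s'] = P(s' | s, pi(s))
    r_pi = r[np.arange(num_states), policy]   # r_pi[s] = r(s, pi(s))
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)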

Pages 35–37:

Optimal Policy

For an infinite horizon discounted MDP, there exists a deterministic stationary policy π^⋆ : S ↦ A such that V^{π^⋆}(s) ≥ V^π(s), ∀s, ∀π.

[Puterman 94, Chapter 6; also see Theorem 1.4 in the RL monograph]

We denote V^⋆ := V^{π^⋆}, Q^⋆ := Q^{π^⋆}

Theorem 1 (Bellman Optimality): V^⋆(s) = max_a [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ]

Pages 38–45:

Proof of Bellman Optimality

Theorem 1 (Bellman Optimality): V^⋆(s) = max_a [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ]

Denote π̂(s) := arg max_a Q^⋆(s, a); we will prove V^{π̂}(s) = V^⋆(s), ∀s.

V^⋆(s) = r(s, π^⋆(s)) + γ 𝔼_{s′ ∼ P(⋅|s, π^⋆(s))} V^⋆(s′)

≤ max_a [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ] = r(s, π̂(s)) + γ 𝔼_{s′ ∼ P(⋅|s, π̂(s))} V^⋆(s′)

= r(s, π̂(s)) + γ 𝔼_{s′ ∼ P(⋅|s, π̂(s))} [ r(s′, π^⋆(s′)) + γ 𝔼_{s′′ ∼ P(⋅|s′, π^⋆(s′))} V^⋆(s′′) ]

≤ r(s, π̂(s)) + γ 𝔼_{s′ ∼ P(⋅|s, π̂(s))} [ r(s′, π̂(s′)) + γ 𝔼_{s′′ ∼ P(⋅|s′, π̂(s′))} V^⋆(s′′) ]

≤ r(s, π̂(s)) + γ 𝔼_{s′ ∼ P(⋅|s, π̂(s))} [ r(s′, π̂(s′)) + γ 𝔼_{s′′ ∼ P(⋅|s′, π̂(s′))} [ r(s′′, π̂(s′′)) + γ 𝔼_{s′′′ ∼ P(⋅|s′′, π̂(s′′))} V^⋆(s′′′) ] ]

≤ ⋯ ≤ 𝔼 [ r(s, π̂(s)) + γ r(s′, π̂(s′)) + γ² r(s′′, π̂(s′′)) + ⋯ ] = V^{π̂}(s)

Since π^⋆ is optimal, we also have V^{π̂}(s) ≤ V^⋆(s), so every inequality above is an equality; in particular V^{π̂}(s) = V^⋆(s) and V^⋆(s) = max_a [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ].

Pages 46–47:

Proof of Bellman Optimality (cont.)

Denote π̂(s) := arg max_a Q^⋆(s, a); we just proved V^{π̂}(s) = V^⋆(s), ∀s.

This implies that π̂(s) = arg max_a Q^⋆(s, a) is an optimal policy.
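In other words, once Q^⋆ is available, acting greedily with respect to it is optimal; a minimal sketch, assuming Q^⋆ is stored as an |S| x |A| array:

import numpy as np

# Sketch: the greedy policy with respect to Q^* (one action index per state) is optimal.
def greedy_policy(Q_star):
    return np.argmax(Q_star, axis=1)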

Pages 48–53:

Proof of Bellman Optimality

Theorem 2: For any V : S → ℝ, if V(s) = max_a [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V(s′) ] for all s, then V(s) = V^⋆(s), ∀s.

|V(s) − V^⋆(s)| = | max_a ( r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V(s′) ) − max_a ( r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ) |

≤ max_a | ( r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V(s′) ) − ( r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ) |

≤ max_a γ 𝔼_{s′ ∼ P(⋅|s,a)} | V(s′) − V^⋆(s′) |

≤ max_a γ 𝔼_{s′ ∼ P(⋅|s,a)} ( max_{a′} γ 𝔼_{s′′ ∼ P(⋅|s′,a′)} | V(s′′) − V^⋆(s′′) | )

≤ ⋯ ≤ max_{a_1, a_2, …, a_{k−1}} γ^k 𝔼_{s_k} | V(s_k) − V^⋆(s_k) |

Since |V − V^⋆| is bounded and γ < 1, the right-hand side vanishes as k → ∞, hence V(s) = V^⋆(s) for all s.
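The contraction argument above also suggests an algorithm: repeatedly applying the Bellman optimality operator converges to its unique fixed point V^⋆ (value iteration). This is a standard consequence rather than something spelled out on these slides; the sketch assumes a toy tabular MDP with P of shape |S| x |A| x |S| and r of shape |S| x |A|.

import numpy as np

# Sketch: value iteration. The Bellman optimality backup is a gamma-contraction,
# so the iterates converge geometrically to the unique fixed point V^*.
def value_iteration(P, r, gamma, num_iters=1000):
    num_states, num_actions = r.shape
    V = np.zeros(num_states)
    for _ in range(num_iters):
        Q = r + gamma * (P @ V)       # Q[s, a] = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[V(s')]
        V = Q.max(axis=1)             # V(s) <- max_a Q(s, a)
    return V, Q.argmax(axis=1)        # approximate V^* and a greedy policy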

Pages 54–56:

Finite Horizon Setting

ℳ = {S, A, P, r, μ_0, H}, with P : S × A ↦ Δ(S), r : S × A → [0, 1], H ∈ ℕ^+, μ_0 ∈ Δ(S)

Given a policy π : S ↦ Δ(A), a trajectory is generated as:

s_0 ∼ μ_0, a_0 ∼ π(s_0), r(s_0, a_0), s_1 ∼ P(⋅ | s_0, a_0), …, s_{H−1} ∼ P(⋅ | s_{H−2}, a_{H−2}), a_{H−1} ∼ π(s_{H−1}), r(s_{H−1}, a_{H−1})

Objective function: J(π) = 𝔼 [ ∑_{h=0}^{H−1} r(s_h, a_h) ]
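J(π) can be estimated by sampling trajectories exactly as described above and averaging the total reward; a short sketch under the same illustrative tabular assumptions as before:

import numpy as np

# Sketch: Monte Carlo estimate of J(pi) from sampled finite-horizon trajectories.
def estimate_J(P, r, policy, mu0, H, num_trajectories=1000, seed=0):
    rng = np.random.default_rng(seed)
    num_states = P.shape[0]
    total = 0.0
    for _ in range(num_trajectories):
        s = rng.choice(num_states, p=mu0)              # s_0 ~ mu_0
        for h in range(H):
            a = policy[s]                              # a_h = pi(s_h) (deterministic for simplicity)
            total += r[s, a]
            s = rng.choice(num_states, p=P[s, a])      # s_{h+1} ~ P(. | s_h, a_h)
    return total / num_trajectories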

Pages 57–58:

Finite Horizon Setting (cont.)

Time-dependent value / Q functions:

V^π_h(s) = 𝔼 [ ∑_{t=h}^{H−1} r(s_t, a_t) | s_h = s, a_t ∼ π(s_t) ]

Q^π_h(s, a) = 𝔼 [ ∑_{t=h}^{H−1} r(s_t, a_t) | s_h = s, a_h = a, a_t ∼ π(s_t) ]

Bellman equation: V^π_h(s_h) = 𝔼_{a_h ∼ π(s_h)} [ r(s_h, a_h) + 𝔼_{s_{h+1} ∼ P(⋅|s_h, a_h)} V^π_{h+1}(s_{h+1}) ]

Pages 59–61:

Finite Horizon Setting (cont.)

Time-dependent optimal policy: π^⋆ = {π^⋆_0, …, π^⋆_{H−1}}

Q^⋆_{H−1}(s, a) = r(s, a),  π^⋆_{H−1}(s) = arg max_a Q^⋆_{H−1}(s, a),  V^⋆_{H−1}(s) = max_a Q^⋆_{H−1}(s, a)

For h < H − 1: Q^⋆_h(s, a) = r(s, a) + 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆_{h+1}(s′),  π^⋆_h(s) = arg max_a Q^⋆_h(s, a),  V^⋆_h(s) = max_a Q^⋆_h(s, a)
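The recursion above is exactly backward dynamic programming: compute Q^⋆_{H−1} first, then work backwards to h = 0. A minimal sketch under the same illustrative tabular assumptions (P of shape |S| x |A| x |S|, r of shape |S| x |A|):

import numpy as np

# Sketch: backward induction for the time-dependent optimal policy pi*_0, ..., pi*_{H-1}.
def finite_horizon_dp(P, r, H):
    num_states, num_actions = r.shape
    V_next = np.zeros(num_states)                # serves as V*_H = 0, so Q*_{H-1} = r
    policies = np.zeros((H, num_states), dtype=int)
    for h in reversed(range(H)):
        Q_h = r + P @ V_next                     # Q*_h(s, a) = r(s, a) + E_{s'}[V*_{h+1}(s')]
        policies[h] = Q_h.argmax(axis=1)         # pi*_h(s) = argmax_a Q*_h(s, a)
        V_next = Q_h.max(axis=1)                 # V*_h(s) = max_a Q*_h(s, a)
    return policies, V_next                      # V_next is V*_0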

Page 62:

State (Action) Occupancy

ℙ_h(s; s_0, π): probability of π visiting s at time step h ∈ ℕ, starting at s_0

d^π_{s_0}(s) = (1 − γ) ∑_{h=0}^{∞} γ^h ℙ_h(s; s_0, π),   V^π(s_0) = 1/(1 − γ) ∑_{s,a} d^π_{s_0}(s) π(a | s) r(s, a)

Page 63:

State (Action) Occupancy

ℙ_h(s, a; s_0, π): probability of π visiting (s, a) at time step h ∈ ℕ, starting at s_0

d^π_{s_0}(s, a) = (1 − γ) ∑_{h=0}^{∞} γ^h ℙ_h(s, a; s_0, π),   V^π(s_0) = 1/(1 − γ) ∑_{s,a} d^π_{s_0}(s, a) r(s, a)
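In the tabular case the state occupancy has a closed form, d^π_{s_0} = (1 − γ) e_{s_0}^T (I − γ P_π)^{−1}, which makes the identity above easy to check numerically; a sketch under the same illustrative assumptions (deterministic policy array, P of shape |S| x |A| x |S|, r of shape |S| x |A|):

import numpy as np

# Sketch: compute the discounted state-action occupancy d^pi_{s0} and verify
# V^pi(s0) = 1/(1 - gamma) * sum_{s,a} d^pi_{s0}(s, a) r(s, a).
def occupancy_identity_check(P, r, policy, s0, gamma):
    num_states = P.shape[0]
    P_pi = P[np.arange(num_states), policy]           # P_pi[s, s'] = P(s' | s, pi(s))
    r_pi = r[np.arange(num_states), policy]           # r_pi[s] = r(s, pi(s))
    e_s0 = np.zeros(num_states); e_s0[s0] = 1.0
    # d(s) = (1 - gamma) * sum_h gamma^h P_h(s; s0, pi) = (1 - gamma) * [e_{s0}^T (I - gamma P_pi)^{-1}]_s
    d_state = (1 - gamma) * np.linalg.solve((np.eye(num_states) - gamma * P_pi).T, e_s0)
    d_sa = np.zeros_like(r)
    d_sa[np.arange(num_states), policy] = d_state     # deterministic pi: all mass on a = pi(s)
    V_s0 = np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)[s0]
    return np.isclose(V_s0, (d_sa * r).sum() / (1 - gamma))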