
Page 1:

CS 6789: Foundations of Reinforcement Learning

Fall 2020

Sham Kakade (UW & MSR) & Wen Sun

https://wensun.github.io/CS6789.html

TA: Jonathan Chang

Page 2:

Progress of RL in Practice

TD-Gammon [Tesauro 95], [AlphaZero, Silver et al., 17], [OpenAI Five, 18]

Page 3:

This course focuses on RL Theory

We care about sample complexity:

the total number of environment interactions needed to learn a high-quality policy (i.e., to achieve the task)

Page 4:

Four main themes we will cover in this course:

1. Exploration strategies (not just random)

2. Policy Optimization (gradient descent)

3. Control (LQR and nonlinear control)

4. Imitation Learning (i.e., learning from demonstrations)

Page 5:

Logistics

Four assignments (HW0–HW3, total 55%) and a Course Project (45%)

HW0 is out today and due in one week

(HW0 10%, HW1-3 15% each)

Pages 6–7:

Prerequisites (HW0)

Deep understanding of Machine Learning, Optimization, and Statistics

ML: sample complexity analysis for supervised learning (PAC)

Opt: convex (linear) optimization, e.g., gradient descent for convex functions

Stats: basics of concentration (e.g., Hoeffding's inequality), tricks such as the union bound

Check out HW0 asap!

Page 8:

Course projects (45%)

• Team work: groups of size 3

• Midterm report (5%), Final presentation (15%), and Final report (25%)

• Basics: survey a set of related RL theory papers, reproduce the analyses, and provide a coherent story

• Advanced: identify extensions of existing RL papers, formulate theory questions, and provide proofs

Page 9:

Course Notes

We will use the monograph Reinforcement Learning: Theory and Algorithms

https://rltheorybook.github.io/rl_monograph_AJK_V2.pdf

Page 10:

Basics of Markov Decision Processes

Pages 11–15:

Supervised Learning

Given i.i.d. examples at training: (image, cat), (image, cat), (image, dog), …

Learn a predictor f ∈ F

Passive: prediction under a fixed data distribution; at test time, predict labels for new examples drawn from the same distribution [test images labeled "Cersei" and "Tasha" on the slide]

Pages 16–20:

Markov Decision Process

[Diagram: a Learning Agent interacting with the Environment over multiple steps]

Policy: determine action based on state, π(s) → a

The environment sends a reward and the next state from a Markovian transition dynamics: r(s, a), s′ ∼ P(⋅ | s, a)
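To make the interaction protocol above concrete, here is a minimal sketch of the agent-environment loop on a toy tabular MDP; the arrays P, r, and policy and their shapes are illustrative assumptions, not part of the slides.

import numpy as np

# Minimal sketch of the agent-environment loop (toy tabular MDP; names and sizes are assumptions).
rng = np.random.default_rng(0)
num_states, num_actions = 3, 2
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))  # P[s, a] = distribution over next states
r = rng.uniform(size=(num_states, num_actions))                         # r(s, a) in [0, 1]
policy = rng.integers(num_actions, size=num_states)                     # deterministic policy: pi(s) -> a

s = 0
for h in range(10):
    a = policy[s]                          # agent chooses an action from the current state
    reward = r[s, a]                       # environment reveals the reward r(s, a)...
    s = rng.choice(num_states, p=P[s, a])  # ...and the next state s' ~ P(. | s, a)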

Pages 21–26:

[Table comparing Supervised Learning and Reinforcement Learning along: Learn from Experience, Generalize, Interactive, Exploration, Credit assignment; the entries are filled in incrementally across these slides]

Table content based on slides from Emma Brunskill

Pages 27–30:

Infinite Horizon Discounted Setting

ℳ = {S, A, P, r, γ}, with P : S × A ↦ Δ(S), r : S × A → [0, 1], γ ∈ [0, 1)

Policy: π : S ↦ Δ(A)

Value function: V^π(s) = 𝔼[ ∑_{h=0}^{∞} γ^h r(s_h, a_h) | s_0 = s, a_h ∼ π(s_h), s_{h+1} ∼ P(⋅ | s_h, a_h) ]

Q function: Q^π(s, a) = 𝔼[ ∑_{h=0}^{∞} γ^h r(s_h, a_h) | (s_0, a_0) = (s, a), a_h ∼ π(s_h) for h ≥ 1, s_{h+1} ∼ P(⋅ | s_h, a_h) ]
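Since V^π(s) is a discounted sum over an infinite horizon, one simple way to approximate it is with Monte Carlo rollouts truncated at an effective horizon; the sketch below assumes a toy tabular MDP with arrays P, r, and a deterministic policy (illustrative names, not from the slides).

import numpy as np

# Sketch: Monte Carlo estimate of V^pi(s0), truncating the discounted sum once gamma^h is negligible.
def monte_carlo_value(P, r, policy, s0, gamma=0.9, num_rollouts=1000, seed=0):
    rng = np.random.default_rng(seed)
    num_states = P.shape[0]
    horizon = int(np.ceil(np.log(1e-6) / np.log(gamma)))   # gamma^horizon <= 1e-6
    returns = []
    for _ in range(num_rollouts):
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]
            total += discount * r[s, a]
            discount *= gamma
            s = rng.choice(num_states, p=P[s, a])
        returns.append(total)
    return np.mean(returns)   # biased only by truncation, at most gamma^horizon / (1 - gamma)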

Pages 31–34:

Bellman Equation:

V^π(s) = 𝔼_{a ∼ π(s)} [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^π(s′) ]

Q^π(s, a) = r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^π(s′)

(with V^π and Q^π the discounted value and Q functions defined above)
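For a fixed policy, the Bellman equation is a linear system in V^π, so in the tabular case it can be solved exactly; the sketch below assumes P as an |S| x |A| x |S| array, r as |S| x |A|, and a deterministic policy array (illustrative assumptions, not from the slides).

import numpy as np

# Sketch: exact policy evaluation by solving the linear Bellman equation V^pi = r_pi + gamma * P_pi V^pi.
def policy_evaluation(P, r, policy, gamma):
    num_states = P.shape[0]
    P_pi = P[np.arange(num_states), policy]   # P_pi[s, s'] = P(s' | s, pi(s))
    r_pi = r[np.arange(num_states), policy]   # r_pi[s] = r(s, pi(s))
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)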

Pages 35–37:

Optimal Policy

For an infinite horizon discounted MDP, there exists a deterministic stationary policy π^⋆ : S ↦ A such that V^{π^⋆}(s) ≥ V^π(s), ∀s, ∀π.

[Puterman 94, Chapter 6; also see Theorem 1.4 in the RL monograph]

We denote V^⋆ := V^{π^⋆}, Q^⋆ := Q^{π^⋆}

Theorem 1 (Bellman Optimality): V^⋆(s) = max_a [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ]

Pages 38–45:

Proof of Bellman Optimality

Theorem 1 (Bellman Optimality): V^⋆(s) = max_a [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ]

Denote π̂(s) := arg max_a Q^⋆(s, a); we will prove V^{π̂}(s) = V^⋆(s), ∀s.

V^⋆(s) = r(s, π^⋆(s)) + γ 𝔼_{s′ ∼ P(⋅|s, π^⋆(s))} V^⋆(s′)

≤ max_a [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ] = r(s, π̂(s)) + γ 𝔼_{s′ ∼ P(⋅|s, π̂(s))} V^⋆(s′)

= r(s, π̂(s)) + γ 𝔼_{s′ ∼ P(⋅|s, π̂(s))} [ r(s′, π^⋆(s′)) + γ 𝔼_{s′′ ∼ P(⋅|s′, π^⋆(s′))} V^⋆(s′′) ]

≤ r(s, π̂(s)) + γ 𝔼_{s′ ∼ P(⋅|s, π̂(s))} [ r(s′, π̂(s′)) + γ 𝔼_{s′′ ∼ P(⋅|s′, π̂(s′))} V^⋆(s′′) ]

≤ r(s, π̂(s)) + γ 𝔼_{s′ ∼ P(⋅|s, π̂(s))} [ r(s′, π̂(s′)) + γ 𝔼_{s′′ ∼ P(⋅|s′, π̂(s′))} [ r(s′′, π̂(s′′)) + γ 𝔼_{s′′′ ∼ P(⋅|s′′, π̂(s′′))} V^⋆(s′′′) ] ]

≤ ⋯ ≤ 𝔼 [ r(s, π̂(s)) + γ r(s′, π̂(s′)) + γ² r(s′′, π̂(s′′)) + ⋯ ] = V^{π̂}(s)

Since π^⋆ is optimal, we also have V^{π̂}(s) ≤ V^⋆(s), so every inequality above is an equality; in particular V^{π̂}(s) = V^⋆(s) and V^⋆(s) = max_a [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ].

Pages 46–47:

Proof of Bellman Optimality (cont.)

Denote π̂(s) := arg max_a Q^⋆(s, a); we just proved V^{π̂}(s) = V^⋆(s), ∀s.

This implies that π̂(s) = arg max_a Q^⋆(s, a) is an optimal policy.
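In other words, once Q^⋆ is available, acting greedily with respect to it is optimal; a minimal sketch, assuming Q^⋆ is stored as an |S| x |A| array:

import numpy as np

# Sketch: the greedy policy with respect to Q^* (one action index per state) is optimal.
def greedy_policy(Q_star):
    return np.argmax(Q_star, axis=1)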

Pages 48–53:

Proof of Bellman Optimality

Theorem 2: For any V : S → ℝ, if V(s) = max_a [ r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V(s′) ] for all s, then V(s) = V^⋆(s), ∀s.

|V(s) − V^⋆(s)| = | max_a ( r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V(s′) ) − max_a ( r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ) |

≤ max_a | ( r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V(s′) ) − ( r(s, a) + γ 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆(s′) ) |

≤ max_a γ 𝔼_{s′ ∼ P(⋅|s,a)} | V(s′) − V^⋆(s′) |

≤ max_a γ 𝔼_{s′ ∼ P(⋅|s,a)} ( max_{a′} γ 𝔼_{s′′ ∼ P(⋅|s′,a′)} | V(s′′) − V^⋆(s′′) | )

≤ ⋯ ≤ max_{a_1, a_2, …, a_{k−1}} γ^k 𝔼_{s_k} | V(s_k) − V^⋆(s_k) |

Since |V − V^⋆| is bounded and γ < 1, the right-hand side vanishes as k → ∞, hence V(s) = V^⋆(s) for all s.
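The contraction argument above also suggests an algorithm: repeatedly applying the Bellman optimality operator converges to its unique fixed point V^⋆ (value iteration). This is a standard consequence rather than something spelled out on these slides; the sketch assumes a toy tabular MDP with P of shape |S| x |A| x |S| and r of shape |S| x |A|.

import numpy as np

# Sketch: value iteration. The Bellman optimality backup is a gamma-contraction,
# so the iterates converge geometrically to the unique fixed point V^*.
def value_iteration(P, r, gamma, num_iters=1000):
    num_states, num_actions = r.shape
    V = np.zeros(num_states)
    for _ in range(num_iters):
        Q = r + gamma * (P @ V)       # Q[s, a] = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[V(s')]
        V = Q.max(axis=1)             # V(s) <- max_a Q(s, a)
    return V, Q.argmax(axis=1)        # approximate V^* and a greedy policy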

Pages 54–56:

Finite Horizon Setting

ℳ = {S, A, P, r, μ_0, H}, with P : S × A ↦ Δ(S), r : S × A → [0, 1], H ∈ ℕ^+, μ_0 ∈ Δ(S)

Given a policy π : S ↦ Δ(A), a trajectory is generated as:

s_0 ∼ μ_0, a_0 ∼ π(s_0), r(s_0, a_0), s_1 ∼ P(⋅ | s_0, a_0), …, s_{H−1} ∼ P(⋅ | s_{H−2}, a_{H−2}), a_{H−1} ∼ π(s_{H−1}), r(s_{H−1}, a_{H−1})

Objective function: J(π) = 𝔼 [ ∑_{h=0}^{H−1} r(s_h, a_h) ]
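J(π) can be estimated by sampling trajectories exactly as described above and averaging the total reward; a short sketch under the same illustrative tabular assumptions as before:

import numpy as np

# Sketch: Monte Carlo estimate of J(pi) from sampled finite-horizon trajectories.
def estimate_J(P, r, policy, mu0, H, num_trajectories=1000, seed=0):
    rng = np.random.default_rng(seed)
    num_states = P.shape[0]
    total = 0.0
    for _ in range(num_trajectories):
        s = rng.choice(num_states, p=mu0)              # s_0 ~ mu_0
        for h in range(H):
            a = policy[s]                              # a_h = pi(s_h) (deterministic for simplicity)
            total += r[s, a]
            s = rng.choice(num_states, p=P[s, a])      # s_{h+1} ~ P(. | s_h, a_h)
    return total / num_trajectories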

Pages 57–58:

Finite Horizon Setting (cont.)

Time-dependent value / Q functions:

V^π_h(s) = 𝔼 [ ∑_{t=h}^{H−1} r(s_t, a_t) | s_h = s, a_t ∼ π(s_t) ]

Q^π_h(s, a) = 𝔼 [ ∑_{t=h}^{H−1} r(s_t, a_t) | s_h = s, a_h = a, a_t ∼ π(s_t) ]

Bellman equation: V^π_h(s_h) = 𝔼_{a_h ∼ π(s_h)} [ r(s_h, a_h) + 𝔼_{s_{h+1} ∼ P(⋅|s_h, a_h)} V^π_{h+1}(s_{h+1}) ]

Pages 59–61:

Finite Horizon Setting (cont.)

Time-dependent optimal policy: π^⋆ = {π^⋆_0, …, π^⋆_{H−1}}

Q^⋆_{H−1}(s, a) = r(s, a),  π^⋆_{H−1}(s) = arg max_a Q^⋆_{H−1}(s, a),  V^⋆_{H−1}(s) = max_a Q^⋆_{H−1}(s, a)

For h < H − 1: Q^⋆_h(s, a) = r(s, a) + 𝔼_{s′ ∼ P(⋅|s,a)} V^⋆_{h+1}(s′),  π^⋆_h(s) = arg max_a Q^⋆_h(s, a),  V^⋆_h(s) = max_a Q^⋆_h(s, a)
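The recursion above is exactly backward dynamic programming: compute Q^⋆_{H−1} first, then work backwards to h = 0. A minimal sketch under the same illustrative tabular assumptions (P of shape |S| x |A| x |S|, r of shape |S| x |A|):

import numpy as np

# Sketch: backward induction for the time-dependent optimal policy pi*_0, ..., pi*_{H-1}.
def finite_horizon_dp(P, r, H):
    num_states, num_actions = r.shape
    V_next = np.zeros(num_states)                # serves as V*_H = 0, so Q*_{H-1} = r
    policies = np.zeros((H, num_states), dtype=int)
    for h in reversed(range(H)):
        Q_h = r + P @ V_next                     # Q*_h(s, a) = r(s, a) + E_{s'}[V*_{h+1}(s')]
        policies[h] = Q_h.argmax(axis=1)         # pi*_h(s) = argmax_a Q*_h(s, a)
        V_next = Q_h.max(axis=1)                 # V*_h(s) = max_a Q*_h(s, a)
    return policies, V_next                      # V_next is V*_0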

Page 62:

State (Action) Occupancy

ℙ_h(s; s_0, π): probability of π visiting s at time step h ∈ ℕ, starting at s_0

d^π_{s_0}(s) = (1 − γ) ∑_{h=0}^{∞} γ^h ℙ_h(s; s_0, π),   V^π(s_0) = 1/(1 − γ) ∑_{s,a} d^π_{s_0}(s) π(a | s) r(s, a)

Page 63:

State (Action) Occupancy

ℙ_h(s, a; s_0, π): probability of π visiting (s, a) at time step h ∈ ℕ, starting at s_0

d^π_{s_0}(s, a) = (1 − γ) ∑_{h=0}^{∞} γ^h ℙ_h(s, a; s_0, π),   V^π(s_0) = 1/(1 − γ) ∑_{s,a} d^π_{s_0}(s, a) r(s, a)
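In the tabular case the state occupancy has a closed form, d^π_{s_0} = (1 − γ) e_{s_0}^T (I − γ P_π)^{−1}, which makes the identity above easy to check numerically; a sketch under the same illustrative assumptions (deterministic policy array, P of shape |S| x |A| x |S|, r of shape |S| x |A|):

import numpy as np

# Sketch: compute the discounted state-action occupancy d^pi_{s0} and verify
# V^pi(s0) = 1/(1 - gamma) * sum_{s,a} d^pi_{s0}(s, a) r(s, a).
def occupancy_identity_check(P, r, policy, s0, gamma):
    num_states = P.shape[0]
    P_pi = P[np.arange(num_states), policy]           # P_pi[s, s'] = P(s' | s, pi(s))
    r_pi = r[np.arange(num_states), policy]           # r_pi[s] = r(s, pi(s))
    e_s0 = np.zeros(num_states); e_s0[s0] = 1.0
    # d(s) = (1 - gamma) * sum_h gamma^h P_h(s; s0, pi) = (1 - gamma) * [e_{s0}^T (I - gamma P_pi)^{-1}]_s
    d_state = (1 - gamma) * np.linalg.solve((np.eye(num_states) - gamma * P_pi).T, e_s0)
    d_sa = np.zeros_like(r)
    d_sa[np.arange(num_states), policy] = d_state     # deterministic pi: all mass on a = pi(s)
    V_s0 = np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)[s0]
    return np.isclose(V_s0, (d_sa * r).sum() / (1 - gamma))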