Reinforcement Learning (chercheurs.lille.inria.fr/ekaufman/RLCours1.pdf)
TRANSCRIPT
Reinforcement Learning
Lecture 1: Markov Decision Processes
Emilie Kaufmann
Ecole Centrale de Lille, 2019/2020
Emilie Kaufmann | CRIStAL
Outline of the class
• Lecture 1. Markov Decision Processes (MDP), a formalization for reinforcement learning problems
• Lecture 2. One state, several actions: solving multi-armed bandits. UCB algorithms, Thompson Sampling
• Lecture 3. Solving an MDP with known parameters. Dynamic Programming, Value/Policy Iteration
• Lecture 4. First Reinforcement Learning algorithms. TD Learning, Q-Learning
• Lecture 5. Approximate Dynamic Programming
• Lecture 6. Deep Reinforcement Learning (O. Pietquin)
• Lecture 7. Policy Gradient Methods (O. Pietquin)
• Lecture 8. Bandit tools for RL. Bandit-based exploration, Monte-Carlo Tree Search methods
1 Markov Decision Processes
2 Examples
3 Objectives: Policies and Values
4 Trying Policies
5 What is Reinforcement Learning?
Markov Decision Process

A Markov Decision Process (MDP) models a situation in which repeated decisions (= choices of actions) are made. The MDP provides models for the consequences of each decision:
• in terms of reward
• in terms of the evolution of the system's state

At each (discrete) decision time t = 0, 1, 2, ..., a learning agent
• selects an action a_t based on its current state s_t (or possibly all the previous observations),
• gets a reward r_t ∈ ℝ depending on its choice,
• transits to a new state s_{t+1} depending on its choice.
Markov Decision Process

An MDP is parameterized by a tuple (S, A, R, P) where
• S is the state space
• A is the action space (sometimes A_s for each s ∈ S)
• R = (ν(s,a))_{(s,a) ∈ S×A}, where ν(s,a) ∈ Δ(ℝ) is the reward distribution for the state-action pair (s, a)
• P = (p(·|s,a))_{(s,a) ∈ S×A}, where p(·|s,a) ∈ Δ(S) is the transition kernel associated with the state-action pair (s, a)

At each (discrete) decision time t = 0, 1, 2, ..., a learning agent
• selects an action a_t based on its current state s_t (or possibly all the previous observations),
• gets a reward r_t ∼ ν(s_t, a_t),
• transits to a new state s_{t+1} ∼ p(·|s_t, a_t).

[Bellman 1957, Howard 1960, Blackwell 70s...]
Goal (made more precise later): select actions so as to maximize some notion of expected cumulative reward.

Mean reward of action a in state s:
r(s, a) = E_{R ∼ ν(s,a)}[R]
• The tabular case: finite state and action spaces
S = {1, ..., S},  A = {1, ..., A}
For every s, s′ ∈ S and a ∈ A, p(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a).
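In the tabular case the parameters (S, A, R, P) fit in plain arrays. A minimal sketch of one interaction step, with made-up toy sizes and a reward distribution degenerate at its mean:

```python
import numpy as np

# Tabular MDP: toy sizes chosen for illustration only
S, A = 3, 2
rng = np.random.default_rng(0)

# r[s, a]: mean reward of the state-action pair (s, a)
r = rng.uniform(size=(S, A))

# P[s, a, s']: transition kernel p(s' | s, a); each row must sum to 1
P = rng.uniform(size=(S, A, S))
P /= P.sum(axis=2, keepdims=True)

def step(s, a):
    """Sample one transition (r_t, s_{t+1}) from state s under action a."""
    s_next = rng.choice(S, p=P[s, a])
    return r[s, a], s_next  # reward distribution degenerate at its mean here

reward, s_next = step(0, 1)
```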
Markovian Dynamics

• Reminder: Markov chain

Definition: A Markov chain on a discrete space X is a stochastic process (X_t)_{t ∈ ℕ} that satisfies the Markov property:
P(X_t = x_t | X_{t−1} = x_{t−1}, ..., X_0 = x_0) = P(X_t = x_t | X_{t−1} = x_{t−1})
for all t ∈ ℕ and (x_0, ..., x_t) ∈ X^{t+1}. It is homogeneous if
P(X_t = y | X_{t−1} = x) = P(X_{t−1} = y | X_{t−2} = x).

• A homogeneous Markov chain is characterized by its transition probabilities p(y|x) = P(X_t = y | X_{t−1} = x) and its initial state.
• If X is continuous, this definition can be extended by means of a transition kernel such that p(·|x) ∈ Δ(X) for all x ∈ X.
Markovian Dynamics

• Reminder: Markov chain

[Figure – An example of a Markov chain with 4 states (play, sleep, cry, eat); the arrows carry the transition probabilities]
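A chain like the one in the figure can be simulated directly. Since the exact arrow weights are not recoverable from the figure, the matrix below uses placeholder probabilities over the same four states:

```python
import numpy as np

states = ["play", "sleep", "cry", "eat"]
# Hypothetical transition matrix P[x, y] = p(y | x); rows sum to 1
# (placeholder numbers, not the ones from the figure)
P = np.array([
    [0.6, 0.1, 0.0, 0.3],
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.1, 0.1, 0.6],
    [0.3, 0.0, 0.0, 0.7],
])

rng = np.random.default_rng(0)

def simulate(x0, n_steps):
    """Sample a trajectory X_0, ..., X_n of the homogeneous chain."""
    traj = [x0]
    for _ in range(n_steps):
        traj.append(rng.choice(len(states), p=P[traj[-1]]))
    return traj

traj = simulate(x0=0, n_steps=10)
print([states[x] for x in traj])
```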
Markovian Dynamics

• Back to Markov Decision Processes

In an MDP, the sequence of successive states / actions / rewards
s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{t−1}, a_{t−1}, r_{t−1}, s_t
satisfies an extension of the Markov property:
P(s_t = s, r_{t−1} = r | s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{t−1}, a_{t−1}) = P(s_t = s, r_{t−1} = r | s_{t−1}, a_{t−1})
(discrete actions and rewards)
Illustration of a MDP

credit: Ronan Fruit

• S = {s_0, s_1, s_2}
• A = {a_0, a_1}
• the mean rewards in state s_1: r(s_1, a_0) = 0 and r(s_1, a_1) = r_max
• the transition probabilities when performing action a_1 in state s_0 are p(s_1|s_0, a_1) = 0.1 and p(s_2|s_0, a_1) = 0.9
2 Examples
The Centrale Student Dilemma
credit: Remi Munos, Alessandro Lazaric
Tetris

• State: current board and next blocks to add
• Action: orientation + position of the dropped block
• Reward: increment in the score / number of lines
• Transition: new board + randomness in the new block

→ difficulty: large state space!
The RiverSwim MDP

Two actions are available in each state, drawn as → (swim right, against the current) and ⇢ (swim left):

[Figure – the RiverSwim chain s_1, ..., s_N; swimming left is deterministic, swimming right only succeeds with some probability (the figure shows weights 0.6, 0.4, 0.05 and 0.95); the only rewards are r = 0.05 on the left bank s_1 and r = 1 on the right bank s_N]

credit: Sadegh Talebi

→ difficulty: delayed, sparse reward
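A tabular encoding of RiverSwim makes the delayed-reward structure explicit. The probabilities below follow the figure where readable and are otherwise guesses (in particular the 0.35 of moving right from a middle state):

```python
import numpy as np

# RiverSwim with N states and actions LEFT (dashed arrow) / RIGHT (solid arrow)
N = 6
LEFT, RIGHT = 0, 1

P = np.zeros((N, 2, N))   # P[s, a, s'] = p(s' | s, a)
R = np.zeros((N, 2))      # mean reward of each state-action pair

for s in range(N):
    P[s, LEFT, max(s - 1, 0)] = 1.0        # swimming left always succeeds
    if s == 0:
        P[s, RIGHT, s] = 0.4
        P[s, RIGHT, s + 1] = 0.6
    elif s == N - 1:
        P[s, RIGHT, s] = 0.95
        P[s, RIGHT, s - 1] = 0.05
    else:                                   # middle states: fighting the current
        P[s, RIGHT, s + 1] = 0.35           # guessed value
        P[s, RIGHT, s] = 0.6
        P[s, RIGHT, s - 1] = 0.05

R[0, LEFT] = 0.05       # small reward on the left bank
R[N - 1, RIGHT] = 1.0   # large reward on the right bank
```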
Grid worlds

• State: position of the robot
• Actions: ←, ↑, →, ↓
• Transitions: (quasi) deterministic
• Rewards: depend on the behavior to incentivise (positive or negative rewards on some states / −1 for each step before a goal...)
Retail Store Management (1/2)

You own a bike store. During week t, the (random) demand is D_t units. On Monday morning you may choose to order a_t additional units: they are delivered immediately before the shop opens.

For each week:
• Maintenance cost: h per unit left in your stock
• Ordering cost: c per unit ordered + fixed cost c_0 if an order is placed
• Sales profit: p per unit sold

Constraints:
• your warehouse has a maximal capacity of M bikes (any additional bike gets stolen)
• you cannot sell bikes that you don't have in stock
Retail Store Management (2/2)

• State: number of bikes in stock on Sunday. State space: S = {0, ..., M}
• Action: number of bikes ordered at the beginning of the week. Action space: A = {0, ..., M}
• Reward = balance of the week: if you order a_t bikes,
r_t = −c_0 1(a_t > 0) − c × a_t − h × s_t + p × min(D_t, s_t + a_t, M)
• Transition: you end the week with
s_{t+1} = max(0, min(M, s_t + a_t) − D_t) bikes

→ Markov Decision Process: what are r(s, a) and p(·|s, a)?
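The reward and transition above translate directly into a one-week simulation step. A sketch with made-up parameters (the values of c_0, c, h, p, M and the Poisson demand distribution are placeholders):

```python
import numpy as np

# Placeholder parameters: fixed order cost, unit cost, holding cost,
# sale price, warehouse capacity
c0, c, h, p, M = 2.0, 1.0, 0.2, 3.0, 10

rng = np.random.default_rng(0)

def week(s_t, a_t):
    """One week of the retail MDP: returns (r_t, s_{t+1})."""
    d_t = rng.poisson(4)            # made-up weekly demand distribution
    stock = min(M, s_t + a_t)       # bikes above capacity are lost
    sold = min(d_t, stock)          # = min(D_t, s_t + a_t, M)
    r_t = -c0 * (a_t > 0) - c * a_t - h * s_t + p * sold
    s_next = max(0, stock - d_t)    # = max(0, min(M, s_t + a_t) - D_t)
    return r_t, s_next

r0, s1 = week(s_t=3, a_t=5)
```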
3 Objectives: Policies and Values
RL objective (informal)

Learn / act according to a
Good Policy
in a potentially unknown MDP
Policies

Definition: A (Markovian) policy is a sequence π = (π_t)_{t ∈ ℕ} of mappings
π_t : S → Δ(A),
where Δ(A) is the set of probability distributions over the action space.

→ An agent acting under policy π selects at round t the action a_t ∼ π_t(s_t)

• Remark: one could also consider history-dependent policies π_t : H_t → Δ(A), where the next action is chosen based on
h_t = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{t−1}, a_{t−1}, r_{t−1}, s_t)
Policies

A policy may be:
• deterministic (π_t : S → A) or stochastic (π_t : S → Δ(A))
• stationary (π = (π, π, π, ...)) or non-stationary (π = (π_0, π_1, π_2, ...))

• Terminology: policy = strategy = decision rule = control
Policies

Under a stationary (deterministic) policy π : S → A, the random process (s_t)_{t ∈ ℕ} is a Markov chain, with transition probabilities
P_π(s_{t+1} = s′ | s_t = s) = P(s_{t+1} = s′ | s_t = s, a_t = π(s)) = p(s′ | s, π(s))
(this can be extended to stochastic policies and continuous spaces)

→ An MDP is sometimes referred to as a controlled Markov chain
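For a tabular MDP, the transition matrix of the induced chain is obtained by plugging the policy into the kernel. A minimal sketch (the kernel P and the policy pi are made-up toy arrays):

```python
import numpy as np

S, A = 3, 2
rng = np.random.default_rng(1)

# Toy transition kernel P[s, a, s'], rows normalized to sum to 1
P = rng.uniform(size=(S, A, S))
P /= P.sum(axis=2, keepdims=True)

pi = np.array([0, 1, 0])  # a stationary deterministic policy pi : S -> A

# Transition matrix of the induced Markov chain:
# P_pi[s, s'] = p(s' | s, pi(s))
P_pi = P[np.arange(S), pi]
```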
What is a good policy?

It is a policy that yields a large value in each state, where the value is always some notion of cumulative reward.

(1) Finite horizon. Given a known horizon T ∈ ℕ*,
V^π(s) = E_π[ Σ_{t=0}^{T−1} r_t + r_T | s_0 = s ]
where a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

→ When? In the presence of a natural notion of duration of an episode (e.g. a maximal number of steps in a game)
What is a good policy?

(2) Infinite time horizon with a discount parameter. Given a known discount parameter γ ∈ (0, 1),
V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s ]
where a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

→ When? To put more weight on short-term rewards / when there is a natural notion of discount
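The discounted value can be estimated by simulating rollouts: since γ < 1, the tail of the series is negligible, so truncating at a large horizon gives a good estimate. A sketch on a made-up 2-state MDP with a made-up policy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-state, 2-action MDP (rewards degenerate at their means)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
r = np.array([[0.0, 0.1], [0.5, 1.0]])     # mean reward r[s, a]
pi = np.array([1, 1])                       # stationary deterministic policy

def mc_value(s, gamma=0.9, n_rollouts=500, horizon=100):
    """Monte-Carlo estimate of V^pi(s) for the discounted criterion."""
    total = 0.0
    for _ in range(n_rollouts):
        state, ret, disc = s, 0.0, 1.0
        for _ in range(horizon):            # gamma**100 is negligible
            a = pi[state]
            ret += disc * r[state, a]
            disc *= gamma
            state = rng.choice(2, p=P[state, a])
        total += ret
    return total / n_rollouts

v0 = mc_value(0)
```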
What is a good policy?

(3) Infinite time horizon with a terminal state. Let τ be the (random) time at which a terminal state is first reached. Then
V^π(s) = E_π[ Σ_{t=0}^{τ} r_t | s_0 = s ]
where a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

→ When? For tasks that have a natural notion of terminal state (e.g. achieving some goal)
What is a good policy?

(4) Infinite time horizon with average reward.
V^π(s) = lim_{T→∞} E_π[ (1/T) Σ_{t=0}^{T−1} r_t | s_0 = s ]
where a_t ∼ π(s_t), s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ ν(s_t, a_t), starting from state s.

→ When? The system should be controlled for a very long time
→ slightly harder to work with (not discussed much in this class)

cf. [Puterman, Markov Decision Processes, 1994]
Optimal policy

Given a value function ((1), (2), or (4)), one can define the following.

Definition: The optimal value in state s is given by
V*(s) = max_π V^π(s).

Definition: An optimal policy π* satisfies
π* ∈ argmax_π V^π,
that is, ∀s ∈ S, π* ∈ argmax_π V^π(s), or equivalently V* = V^{π*}.
Optimal policy

Properties:
• there exists an optimal policy! (i.e. a policy maximizing the value in all states)
• there exists an optimal policy that is deterministic
• ... and even stationary (except in the finite-horizon setting (1))
4 Trying Policies
Example: Cart-Pole

Task: maintain the pole as long as possible in a quasi-vertical position, by applying some force on the cart towards the left or right.
Introductory notebook
Back to Retail Store Management

Recall the MDP: the state s_t ∈ {0, ..., M} is the stock on Sunday, the action a_t ∈ {0, ..., M} is the number of bikes ordered, the reward is the balance of the week
r_t = −c_0 1(a_t > 0) − c × a_t − h × s_t + p × min(D_t, s_t + a_t, M),
and the transition is s_{t+1} = max(0, min(M, s_t + a_t) − D_t).

Goal: From an initial stock s, maximize the sum of discounted rewards
V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s ]
Possible policies

• Uniform policy (over reasonable orders):
π(s) ∼ U({0, ..., M − s})
• Constant policy: always order m_0 bikes (capped by the remaining capacity):
π(s) = min(M − s, m_0)
• Threshold policy: whenever there are at most m_1 bikes in stock, refill up to m_2 bikes; otherwise, do not order:
π(s) = 1(s ≤ m_1)(m_2 − s)
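The threshold policy is easy to try in simulation. A self-contained sketch (the Poisson demand and the capacity M = 10 are placeholders, chosen to match the figure's m_1 = 4, m_2 = 10):

```python
import numpy as np

M = 10          # warehouse capacity (placeholder)
m1, m2 = 4, 10  # threshold-policy parameters, as in the figure
rng = np.random.default_rng(0)

def threshold_policy(s):
    """Refill up to m2 bikes whenever the stock is at most m1."""
    return (m2 - s) if s <= m1 else 0

def transition(s, a):
    d = rng.poisson(4)                  # made-up weekly demand
    return max(0, min(M, s + a) - d)    # s_{t+1}

stock = [0]
for _ in range(100):                    # 100 weeks, as in the figure
    a = threshold_policy(stock[-1])
    stock.append(transition(stock[-1], a))
```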
Simulations

Notebook

[Figure – Evolution of the stock s_t over 100 weeks under a threshold policy (m_1 = 4, m_2 = 10)]
5 What is Reinforcement Learning?
Questions

In a known Markov Decision Process:
• can we compute an optimal policy? (based on the explicit knowledge of r(s, a) and p(·|s, a))
• ... even with very large (or infinite) state and/or action spaces? (e.g. based on a simulator for transitions)

Beyond:
• Can we learn a good policy in an unknown MDP, only by selecting actions and performing transitions?
• ... and can we do it while maximizing reward?

Broad goal of Reinforcement Learning: learning an optimal policy in an unknown (or very large) MDP, by acting (= choosing actions) and observing transitions.
Reinforcement Learning

During learning, we grow a database of observed transitions
D_t = {(s_0, a_0, r_0, s_1), (s_1, a_1, r_1, s_2), ..., (s_{t−1}, a_{t−1}, r_{t−1}, s_t)},
which is used to
→ select the next action to perform, a_t:
r_t ∼ ν(s_t, a_t),  s_{t+1} ∼ p(·|s_t, a_t),  D_{t+1} = D_t ∪ {(s_t, a_t, r_t, s_{t+1})}
→ possibly output a guess π_t for π*

Possible goal: Policy Estimation
Make sure that the learnt policy π_t is eventually a good policy
• sample complexity result: if t ≥ ..., then |V^{π_t}(s) − V*(s)| ≤ ε

→ Exploration/Exploitation trade-off
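The interaction protocol above can be written as a generic learning loop. The agent interface (`act`, `update`) and the tiny 2-state environment are made up for illustration; the placeholder agent just acts uniformly at random while growing the database D_t:

```python
import random

class RandomAgent:
    """Placeholder learner: acts uniformly and stores the observed transitions."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.data = []                       # the database D_t

    def act(self, s):
        return random.randrange(self.n_actions)

    def update(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))  # D_{t+1} = D_t + {(s_t,a_t,r_t,s_{t+1})}

def run(env_step, s0, agent, n_steps):
    """Generic loop: act, observe (r_t, s_{t+1}), grow the database."""
    s = s0
    for _ in range(n_steps):
        a = agent.act(s)
        r, s_next = env_step(s, a)           # r_t ~ nu(s_t,a_t), s_{t+1} ~ p(.|s_t,a_t)
        agent.update(s, a, r, s_next)
        s = s_next
    return agent

# Tiny made-up 2-state environment, only for demonstration
def env_step(s, a):
    return float(s == a), random.randrange(2)

agent = run(env_step, s0=0, agent=RandomAgent(2), n_steps=50)
```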
Possible goal: Rewards Maximization
Maximize the rewards accumulated during learning: the value of the learning policy should be close to the optimal value, e.g. maximize
E[ Σ_{s=0}^{t} γ^s r_s ]

→ Exploration/Exploitation trade-off
Let's get started

... with one-state MDPs, a.k.a. multi-armed bandits
(next class)