Introduction to Reinforcement Learning
E0397 Lecture Slides by Richard Sutton (with small changes)
Reinforcement Learning: Basic Idea
Learn to take correct actions over time, by experience. Similar to how humans learn: "trial and error".
Try an action; "see" what happens; "remember" what happens. Use this to change the choice of action the next time you are in the same situation. "Converge" to learning the correct actions.
Focus on long-term return, not immediate benefit: an action may not seem beneficial in the short term, but its long-term return can be good.
RL in Cloud Computing
Why are we talking about RL in this class? Which problems do you think can be solved by a learning approach? Is a focus on long-term benefit appropriate?
RL Agent
The objective is to affect the environment; the environment is stochastic and uncertain.
[Diagram: the agent sends an action to the environment; the environment returns a state and a reward.]
What is Reinforcement Learning?
An approach to Artificial Intelligence. Learning from interaction; goal-oriented learning. Learning about, from, and while interacting with an external environment. Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal.
Key Features of RL
The learner is not told which actions to take. Trial-and-error search. The possibility of delayed reward: sacrifice short-term gains for greater long-term gains. The need to explore and exploit. RL considers the whole problem of a goal-directed agent interacting with an uncertain environment.
Examples of Reinforcement Learning
Robocup soccer teams (Stone & Veloso, Riedmiller et al.): world's best player of simulated soccer, 1999; runner-up 2000.
Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry-standard methods.
Dynamic channel assignment (Singh & Bertsekas, Nie & Haykin): world's best assigner of radio channels to mobile telephone calls.
Elevator control (Crites & Barto): (probably) world's best down-peak elevator controller.
Many robots: navigation, bipedal walking, grasping, switching between skills...
TD-Gammon and Jellyfish (Tesauro, Dahl): world's best backgammon player.
Supervised Learning
Training info = desired (target) outputs. Inputs → Supervised Learning System → Outputs. Error = (target output − actual output).
•Best examples: classification/identification systems, e.g. fault classification, WiFi location determination, workload identification.
•E.g. classification systems: give the agent lots of data that has already been classified; the agent can "train" itself using this knowledge.
•The training phase is non-trivial.
•(Normally) no learning happens in the on-line phase.
Reinforcement Learning
Training info = evaluations ("rewards"/"penalties"). Inputs → RL System → Outputs ("actions"). Objective: get as much reward as possible.
•Absolutely no already-existing training data.
•The agent learns ONLY by experience.
Elements of RL
State: the state of the system.
Actions: suggested by the RL agent, taken by actuators in the system. Actions change the state (deterministically or stochastically).
Reward: the immediate gain when taking action a in state s.
Value: the long-term benefit of taking action a in state s; rewards gained over a long time horizon.
The n-Armed Bandit Problem
Choose repeatedly from one of n actions; each choice is called a play. After each play a_t, you get a reward r_t, where E[r_t | a_t] = Q*(a_t). These are the unknown action values; the distribution of r_t depends only on a_t.
The objective is to maximize the reward in the long term, e.g., over 1000 plays. To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them.
The Exploration/Exploitation Dilemma
Suppose you form action-value estimates Q_t(a) ≈ Q*(a). The greedy action at time t is a_t* = argmax_a Q_t(a). Choosing a_t = a_t* is exploitation; choosing a_t ≠ a_t* is exploration.
You can't exploit all the time; you can't explore all the time. You can never stop exploring; but you should always reduce exploring. Maybe.
Action-Value Methods
Methods that adapt action-value estimates and nothing else. For example: suppose that by the t-th play, action a has been chosen k_a times, producing rewards r_1, r_2, ..., r_{k_a}. Then the "sample average" estimate is
Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a,
and as k_a → ∞, Q_t(a) → Q*(a).
ε-Greedy Action Selection
Greedy action selection: a_t = a_t* = argmax_a Q_t(a).
ε-greedy: a_t = a_t* with probability 1 − ε, and a random action with probability ε.
... the simplest way to balance exploration and exploitation.
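The rule above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides; the function name and the list-of-estimates representation are my own choices.

```python
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """Pick the greedy action with probability 1 - epsilon,
    and a uniformly random action with probability epsilon.
    Q is a list of action-value estimates Q_t(a)."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))                   # explore
    return max(range(len(Q)), key=Q.__getitem__)       # exploit: argmax_a Q_t(a)
```

With epsilon = 0 this reduces to pure greedy selection; with epsilon = 1 it is pure random exploration.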
10-Armed Testbed
n = 10 possible actions. Each Q*(a) is chosen randomly from a normal distribution; each reward r_t is also normally distributed, around Q*(a_t). Run for 1000 plays, repeat the whole thing 2000 times, and average the results.
ε-Greedy Methods on the 10-Armed Testbed
Incremental Implementation
Recall the sample-average estimation method. The average of the first k rewards is (dropping the dependence on a):
Q_k = (r_1 + r_2 + ... + r_k) / k
Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:
Q_{k+1} = Q_k + (1 / (k + 1)) [r_{k+1} − Q_k]
This is a common form for update rules:
NewEstimate = OldEstimate + StepSize [Target − OldEstimate]
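The incremental update above can be checked directly in Python; a small sketch (the helper name is illustrative, not from the slides):

```python
def update_mean(q, k, reward):
    """Incremental sample average: Q_{k+1} = Q_k + (r_{k+1} - Q_k) / (k + 1).
    q is the average of the first k rewards; returns the average of k+1 rewards."""
    return q + (reward - q) / (k + 1)

# Feeding rewards one at a time reproduces the plain average,
# without storing the reward history.
q, k = 0.0, 0
for r in [1.0, 0.0, 2.0, 1.0]:
    q = update_mean(q, k, r)
    k += 1
```

After the loop, q equals the ordinary mean of the four rewards.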
Tracking a Nonstationary Problem
Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time, but not in a nonstationary problem. Better in the nonstationary case is:
Q_{k+1} = Q_k + α [r_{k+1} − Q_k], for constant α, 0 < α ≤ 1
        = (1 − α)^k Q_0 + Σ_{i=1}^{k} α (1 − α)^{k−i} r_i,
an exponential, recency-weighted average.
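A small sketch verifying that the constant-α update and the closed-form recency-weighted average agree (variable names are illustrative):

```python
def update_const_alpha(q, reward, alpha):
    """Constant step size: Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k)."""
    return q + alpha * (reward - q)

alpha, q0 = 0.5, 0.0
rewards = [1.0, 1.0, 0.0]

# Apply the update once per reward
q = q0
for r in rewards:
    q = update_const_alpha(q, r, alpha)

# Closed form: (1 - a)^k Q_0 + sum_{i=1..k} a (1 - a)^(k - i) r_i
k = len(rewards)
closed = (1 - alpha) ** k * q0 + sum(
    alpha * (1 - alpha) ** (k - i) * rewards[i - 1] for i in range(1, k + 1)
)
```

Recent rewards get exponentially more weight than old ones, which is why this tracks a drifting Q*(a).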
Formal Description of RL
The Agent-Environment Interface
The agent and environment interact at discrete time steps t = 0, 1, 2, ... The agent observes the state at step t, s_t ∈ S; produces an action at step t, a_t ∈ A(s_t); and gets the resulting reward r_{t+1} and resulting next state s_{t+1}:
... s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, ...
The Agent Learns a Policy
The policy at step t, π_t, is a mapping from states to action probabilities: π_t(s, a) = probability that a_t = a when s_t = s.
Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent's goal is to get as much reward as it can over the long run.
Getting the Degree of Abstraction Right
Time steps need not refer to fixed intervals of real time. Actions can be low-level (e.g., voltages to motors), high-level (e.g., accept a job offer), "mental" (e.g., a shift in the focus of attention), etc. States can be low-level "sensations", or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being "surprised" or "lost").
An RL agent is not like a whole animal or robot. Reward computation is in the agent's environment because the agent cannot change it arbitrarily. The environment is not necessarily unknown to the agent, only incompletely controllable.
Goals and Rewards
Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible. A goal should specify what we want to achieve, not how we want to achieve it. A goal must be outside the agent's direct control, and thus outside the agent. The agent must be able to measure success: explicitly, and frequently during its lifespan.
Returns
Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, ... What do we want to maximize? In general, we want to maximize the expected return, E[R_t], for each step t.
Episodic tasks: the interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. Then
R_t = r_{t+1} + r_{t+2} + ... + r_T,
where T is a final time step at which a terminal state is reached, ending an episode.
Returns for Continuing Tasks
Continuing tasks: the interaction does not have natural episodes. Instead we use the discounted return:
R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1},
where γ, 0 ≤ γ ≤ 1, is the discount rate. γ close to 0: shortsighted; γ close to 1: farsighted.
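For a finite reward sequence, the discounted return is a one-liner; a minimal sketch (function name is illustrative):

```python
def discounted_return(rewards, gamma):
    """R_t = sum over k of gamma**k * r_{t+k+1},
    for a finite list of rewards [r_{t+1}, r_{t+2}, ...]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```

With gamma = 0.5 and three unit rewards, the return is 1 + 0.5 + 0.25 = 1.75; with gamma = 0 only the first reward counts (shortsighted).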
An Example
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
As an episodic task, where an episode ends upon failure: reward = +1 for each step before failure; return = number of steps before failure.
As a continuing task with discounted return: reward = −1 upon failure, 0 otherwise; return = −γ^k, for k steps before failure.
In either case, the return is maximized by avoiding failure for as long as possible.
A Unified Notation
In episodic tasks, we number the time steps of each episode starting from zero. We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
Think of each episode as ending in an absorbing state that always produces a reward of zero. We can then cover all cases by writing
R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},
where γ can be 1 only if a zero-reward absorbing state is always reached.
The Markov Property
"The state" at step t means whatever information is available to the agent at step t about its environment. The state can include immediate "sensations", highly processed sensations, and structures built up over time from sequences of sensations.
Ideally, a state should summarize past sensations so as to retain all "essential" information; i.e., it should have the Markov property:
Pr{s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0} = Pr{s_{t+1} = s', r_{t+1} = r | s_t, a_t}
for all s', r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0.
Markov Decision Processes
If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give:
the state and action sets;
the one-step "dynamics", defined by the transition probabilities
P^a_{ss'} = Pr{s_{t+1} = s' | s_t = s, a_t = a} for all s, s' ∈ S, a ∈ A(s);
and the expected rewards
R^a_{ss'} = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'] for all s, s' ∈ S, a ∈ A(s).
An Example Finite MDP: Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge. Searching is better but runs down the battery; if the battery runs out while searching, the robot has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high or low. Reward = number of cans collected.
Recycling Robot MDP
S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}
R^search = expected no. of cans while searching
R^wait = expected no. of cans while waiting
R^search > R^wait
Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy. The state-value function for policy π is:
V^π(s) = E_π[R_t | s_t = s] = E_π[Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s]
The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π. The action-value function for policy π is:
Q^π(s, a) = E_π[R_t | s_t = s, a_t = a] = E_π[Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a]
Bellman Equation for a Policy π
The basic idea:
R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + γ^3 r_{t+4} + ...
    = r_{t+1} + γ (r_{t+2} + γ r_{t+3} + γ^2 r_{t+4} + ...)
    = r_{t+1} + γ R_{t+1}
So: V^π(s) = E_π[R_t | s_t = s] = E_π[r_{t+1} + γ V^π(s_{t+1}) | s_t = s]
Or, without the expectation operator:
V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V^π(s')]
More on the Bellman Equation
V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V^π(s')]
This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution.
[Backup diagrams for V^π and Q^π.]
Optimal Value Functions
For finite MDPs, policies can be partially ordered: π ≥ π' if and only if V^π(s) ≥ V^π'(s) for all s ∈ S.
There are always one or more policies that are better than or equal to all the others; these are the optimal policies. We denote them all by π*.
Optimal policies share the same optimal state-value function:
V*(s) = max_π V^π(s) for all s ∈ S
Optimal policies also share the same optimal action-value function:
Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s)
This is the expected return for taking action a in state s and thereafter following an optimal policy.
Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
V*(s) = max_{a ∈ A(s)} Q*(s, a)
      = max_{a ∈ A(s)} E[r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a]
      = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')]
V* is the unique solution of this system of nonlinear equations.
Bellman Optimality Equation for Q*
Q*(s, a) = E[r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a]
         = Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ max_{a'} Q*(s', a')]
Q* is the unique solution of this system of nonlinear equations.
Gridworld
Actions: north, south, east, west; deterministic. An action that would take the agent off the grid leaves its position unchanged but gives reward = −1. Other actions produce reward = 0, except actions that move the agent out of the special states A and B, as shown.
[Figure: state-value function for the equiprobable random policy; γ = 0.9.]
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to V* is an optimal policy. Therefore, given V*, one-step-ahead search produces the long-term optimal actions.
E.g., back to the gridworld.
What About Optimal Action-Value Functions?
Given Q*, the agent does not even have to do a one-step-ahead search:
π*(s) = argmax_{a ∈ A(s)} Q*(s, a)
Solving the Bellman Optimality Equation
Finding an optimal policy by solving the Bellman optimality equation requires the following: accurate knowledge of the environment dynamics; enough space and time to do the computation; the Markov property.
How much space and time do we need? Polynomial in the number of states (via dynamic programming methods; Chapter 4), BUT the number of states is often huge (e.g., backgammon has about 10^20 states). We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman optimality equation.
Solving MDPs
The goal in an MDP is to find the policy that maximizes long-term reward (value). If all the information is there, this can be done by solving the equations explicitly, or by iterative methods.
Policy Iteration
π_0 → V^{π_0} → π_1 → V^{π_1} → ... → π* → V* → π*
The arrows alternate policy evaluation (computing V^{π_i}) with policy improvement ("greedification").
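The alternation above can be sketched for a tiny finite MDP. This is an illustrative sketch, not the slides' algorithm box: the dense-array MDP representation and the toy two-state example are my own assumptions.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Iterate the Bellman equation for a fixed policy until V stops changing.
    P[a, s, s'] are transition probabilities, R[a, s, s'] expected rewards."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        V_new = np.array([np.sum(P[policy[s], s] * (R[policy[s], s] + gamma * V))
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedification until the policy is stable."""
    n_actions, n_states = P.shape[0], P.shape[1]
    policy = np.zeros(n_states, dtype=int)
    while True:
        V = policy_evaluation(P, R, policy, gamma)
        q = np.array([[np.sum(P[a, s] * (R[a, s] + gamma * V))
                       for a in range(n_actions)] for s in range(n_states)])
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Toy 2-state MDP: both actions keep the agent in place;
# action 0 pays 0, action 1 pays 1, so the optimal policy always picks action 1
# and V*(s) = 1 / (1 - gamma) = 10.
P = np.stack([np.eye(2), np.eye(2)])                 # P[a, s, s']
R = np.stack([np.zeros((2, 2)), np.ones((2, 2))])    # R[a, s, s']
policy, V = policy_iteration(P, R, gamma=0.9)
```

For this toy problem, policy iteration converges after one improvement step to the policy that always takes action 1.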
From MDPs to reinforcement learning
MDPs vs RL
MDPs are great if we know the transition probabilities
P^a_{ss'} = Pr{s_{t+1} = s' | s_t = s, a_t = a} for all s, s' ∈ S, a ∈ A(s)
and the expected immediate rewards
R^a_{ss'} = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'] for all s, s' ∈ S, a ∈ A(s).
But in real-life tasks, this is precisely what we do NOT know! This is what turns the task into a learning problem. The simplest method to learn: Monte Carlo.
Monte Carlo Methods
Monte Carlo methods learn from complete sample returns; they are only defined for episodic tasks. Monte Carlo methods learn directly from experience. On-line: no model is necessary, and the methods still attain optimality. Simulated: no need for a full model.
Monte Carlo Policy Evaluation
Goal: learn V^π(s). Given: some number of episodes under π which contain s. Idea: average the returns observed after visits to s.
Every-visit MC: average the returns for every time s is visited in an episode.
First-visit MC: average the returns only for the first time s is visited in an episode.
Both converge asymptotically.
First-visit Monte Carlo policy evaluation
Monte Carlo Estimation of Action Values (Q)
Monte Carlo is most useful when a model is not available; we want to learn Q*. Q^π(s, a) is the average return starting from state s and action a, following π. This also converges asymptotically, provided every state-action pair is visited.
Exploring starts: every state-action pair has a non-zero probability of being the starting pair.
Monte Carlo Control
MC policy iteration: policy evaluation using MC methods, followed by policy improvement. Policy improvement step: greedify with respect to the value (or action-value) function.
Monte Carlo Exploring Starts
The fixed point is the optimal policy π*; this is now proven (almost).
The problem with Monte Carlo: learning happens only after the end of an entire episode.
TD Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.
Recall the simple every-visit Monte Carlo method:
V(s_t) ← V(s_t) + α [R_t − V(s_t)]
Its target, R_t, is the actual return after time t.
The simplest TD method, TD(0):
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
Its target, r_{t+1} + γ V(s_{t+1}), is an estimate of the return.
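The TD(0) rule above is one line of Python; a minimal sketch (function name and dict-of-values representation are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
    Updates V in place from a single observed transition (s, r, s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {'s': 0.0, 's_next': 1.0}
td0_update(V, 's', 0.5, 's_next', alpha=0.5, gamma=1.0)
```

Unlike the MC rule, this needs only the one-step transition, not the full return R_t, so learning can happen during the episode.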
Simple Monte Carlo
V(s_t) ← V(s_t) + α [R_t − V(s_t)], where R_t is the actual return following state s_t.
[Backup diagram: the update backs up an entire sampled episode, from s_t to a terminal state.]
Simplest TD Method
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
[Backup diagram: the update backs up a single sampled transition, s_t, r_{t+1}, s_{t+1}.]
cf. Dynamic Programming
V(s_t) ← E_π[r_{t+1} + γ V(s_{t+1})]
[Backup diagram: the update backs up a full one-step expectation over all possible successor states.]
TD Methods Bootstrap and Sample
Bootstrapping: the update involves an estimate. MC does not bootstrap; DP bootstraps; TD bootstraps.
Sampling: the update does not involve an expected value. MC samples; DP does not sample; TD samples.
Example: Driving Home

State              | Elapsed Time (minutes) | Predicted Time to Go | Predicted Total Time
leaving office     | 0  | 30 | 30
reach car, raining | 5  | 35 | 40
exit highway       | 20 | 15 | 35
behind truck       | 30 | 10 | 40
home street        | 40 | 3  | 43
arrive home        | 43 | 0  | 43
Driving Home
[Figure: changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1).]
Advantages of TD Learning
TD methods do not require a model of the environment, only experience. TD, but not MC, methods can be fully incremental: you can learn before knowing the final outcome (less memory, less peak computation), and you can learn without the final outcome (from incomplete sequences).
Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
Random Walk Example
[Figure: values learned by TD(0) after various numbers of episodes.]
TD and MC on the Random Walk
[Figure: data averaged over 100 sequences of episodes.]
Optimality of TD(0)
Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence. Compute updates according to TD(0), but only update the estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α. Constant-α MC also converges under these conditions, but to a different answer!
Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All of this was repeated 100 times.
You are the Predictor
Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
What is V(A)? What is V(B)?
You are the Predictor
The prediction that best matches the training data is V(A) = 0
This minimizes the mean-square error on the training set
This is what a batch Monte Carlo method gets
If we consider the sequentiality of the problem, then we would set V(A) = 0.75
This is the correct maximum-likelihood estimate for a Markov model generating the data
i.e., if we fit a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?)
This is called the certainty-equivalence estimate
This is what TD(0) gets
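The two batch answers can be checked numerically. Below is a hypothetical helper script (not from the slides) that computes both the batch Monte Carlo estimate and the certainty-equivalence estimate for the eight episodes above:

```python
# Hypothetical helper script (not from the slides). Each episode is a
# list of (state, reward-received-on-leaving-that-state) steps.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch Monte Carlo: V(s) = average return following each visit to s.
returns = {"A": [], "B": []}
for ep in episodes:
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(r for _, r in ep[i:]))  # undiscounted return
v_mc = {s: sum(g) / len(g) for s, g in returns.items()}

# Certainty-equivalence: fit the maximum-likelihood Markov model, then
# solve it exactly. In the data A always goes to B with reward 0, and
# B terminates with reward 1 in 6 of 8 episodes, so V(B) = 0.75.
v_b = 6 / 8
v_ce = {"A": 0 + v_b, "B": v_b}

print(v_mc)  # {'A': 0.0, 'B': 0.75}  <- batch MC
print(v_ce)  # {'A': 0.75, 'B': 0.75} <- what batch TD(0) converges to
```

The two methods agree on V(B) but not on V(A): MC looks only at the single observed return from A, while the certainty-equivalence answer propagates B's value back to A through the fitted model.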
![Page 68: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/68.jpg)
Learning An Action-Value Function
Estimate Q^π for the current behavior policy π.
After every transition from a nonterminal state s_t, do this:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
![Page 69: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/69.jpg)
Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
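As a sketch, Sarsa with an ε-greedy policy might look as follows; the `env` interface (`reset()`, `step()`, `actions`) is an assumption for illustration, not something defined in the slides:

```python
import random

def sarsa(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control (Sarsa) sketch. Assumes a hypothetical env
    with reset() -> state, step(a) -> (next_state, reward, done), and
    a list env.actions; this interface is illustrative."""
    Q = {}  # tabular action values, default 0
    def q(s, a):
        return Q.get((s, a), 0.0)
    def eps_greedy(s):
        # explore with probability epsilon, otherwise act greedily
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: q(s, a))

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)  # choose a' before updating: on-policy
            target = r if done else r + gamma * q(s2, a2)  # Q of terminal is 0
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s, a = s2, a2
    return Q
```

The name Sarsa comes from the quintuple used in each update: (s, a, r, s', a').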
![Page 70: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/70.jpg)
Windy Gridworld
undiscounted, episodic, reward = –1 until goal
![Page 71: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/71.jpg)
Results of Sarsa on the Windy Gridworld
![Page 72: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/72.jpg)
Q-Learning: Off-Policy TD Control
One-step Q-learning:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
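The backup above can be written as a single update function (a sketch; the dictionary representation and names are illustrative):

```python
def q_learning_update(Q, s, a, r, s2, actions,
                      alpha=0.1, gamma=1.0, terminal=False):
    """One-step Q-learning backup (illustrative sketch). Unlike Sarsa,
    the target maxes over all actions in s2, independently of whatever
    action the behavior policy actually takes next: off-policy."""
    best_next = 0.0 if terminal else max(Q.get((s2, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q
```

Because the target does not depend on the behavior policy's next action, Q-learning can learn the greedy policy's values while following an exploratory policy.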
![Page 73: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/73.jpg)
Planning and Learning
Use of environment models
Integration of planning and learning methods
![Page 74: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/74.jpg)
The Original Idea
Sutton, 1990
![Page 75: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/75.jpg)
The Original Idea
Sutton, 1990
![Page 76: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/76.jpg)
Models
Model: anything the agent can use to predict how the environment will respond to its actions
Distribution model: description of all possibilities and their probabilities, e.g., P^a_{ss'} and R^a_{ss'} for all s, s', and a ∈ A(s)
Sample model: produces sample experiences, e.g., a simulation model
Both types of models can be used to produce simulated experience
Often sample models are much easier to come by
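The contrast can be made concrete in a few lines (the state and action names below are assumptions for illustration, not from the slides):

```python
import random

# A distribution model lists every outcome with its probability; a
# sample model only needs to return one draw, as a simulator would.
dist_model = {("s0", "a"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]}  # (prob, s', r)

def sample_model(s, a):
    """Draw one (next_state, reward) outcome from the distribution model."""
    p, cum = random.random(), 0.0
    for prob, s2, r in dist_model[(s, a)]:
        cum += prob
        if p < cum:
            return s2, r
    return s2, r  # fall through to the last outcome on rounding error
```

A simulator gives you `sample_model` directly without ever writing down `dist_model`, which is why sample models are often easier to come by.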
![Page 77: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/77.jpg)
Planning
Planning: any computational process that uses a model to create or improve a policy
Planning in AI: state-space planning; plan-space planning (e.g., partial-order planner)
We take the following (unusual) view: all state-space planning methods involve computing value functions, either explicitly or implicitly, and they all apply backups to simulated experience
![Page 78: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/78.jpg)
Planning Cont.
Classical DP methods are state-space planning methods
Heuristic search methods are state-space planning methods
A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning
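A minimal sketch of random-sample one-step tabular Q-planning, assuming a sample model `model(s, a) -> (next_state, reward)` (the interface and names are assumptions for illustration):

```python
import random

def q_planning(model, states, actions, n_steps, alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning sketch. The sample
    model returns (next_state, reward); next_state None means terminal."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_steps):
        s = random.choice(states)    # 1. pick a state-action pair at random
        a = random.choice(actions)
        s2, r = model(s, a)          # 2. ask the sample model for an outcome
        best = max(Q[(s2, b)] for b in actions) if s2 is not None else 0.0
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])  # 3. Q-learning backup
    return Q
```

The only difference from Q-learning is where the experience comes from: simulated transitions drawn from the model rather than real environment interaction.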
![Page 79: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/79.jpg)
Learning, Planning, and Acting
Two uses of real experience:
model learning: to improve the model
direct RL: to directly improve the value function and policy
Improving value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.
![Page 80: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/80.jpg)
Direct vs. Indirect RL
Indirect methods: make fuller use of experience; get a better policy with fewer environment interactions
Direct methods: simpler; not affected by bad models
But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
![Page 81: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/81.jpg)
The Dyna Architecture (Sutton 1990)
![Page 82: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/82.jpg)
The Dyna-Q Algorithm
(algorithm figure: its steps are annotated as model learning, planning, and direct RL)
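Putting those three steps together, a tabular Dyna-Q sketch might look as follows; the `env` interface and the simple last-outcome model (appropriate for a deterministic environment) are assumptions for illustration:

```python
import random

def dyna_q(env, n_episodes, n_planning=5, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q sketch. Assumes a hypothetical env with reset() -> state,
    step(a) -> (next_state, reward, done), and a list env.actions.
    The model stores the last observed (r, s', done) per (s, a)."""
    Q, model = {}, {}
    def q(s, a):
        return Q.get((s, a), 0.0)
    def backup(s, a, r, s2, done):
        best = 0.0 if done else max(q(s2, b) for b in env.actions)
        Q[(s, a)] = q(s, a) + alpha * (r + gamma * best - q(s, a))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:            # epsilon-greedy action
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: q(s, x))
            s2, r, done = env.step(a)
            backup(s, a, r, s2, done)                # direct RL
            model[(s, a)] = (r, s2, done)            # model learning
            for _ in range(n_planning):              # planning
                ps, pa = random.choice(list(model))  # previously seen pair
                pr, ps2, pdone = model[(ps, pa)]
                backup(ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```

Each real step is replayed through the model `n_planning` times, which is why Dyna-Q typically needs far fewer environment interactions than plain Q-learning.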
![Page 83: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/83.jpg)
Dyna-Q on a Simple Maze
reward = 0 until goal, when it is 1
![Page 84: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/84.jpg)
Value Prediction with FA
As usual: Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π
In earlier chapters, value functions were stored in lookup tables.
Here, the value function estimate at time t, V_t, depends on a parameter vector θ_t, and only the parameter vector is updated.
e.g., θ_t could be the vector of connection weights of a neural network.
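For example, gradient-descent TD(0) with a linear approximator updates only θ, never a table entry. A minimal sketch (the feature vectors and step size are illustrative assumptions):

```python
def td0_linear(theta, phi_s, phi_s2, r, alpha=0.01, gamma=1.0, terminal=False):
    """Gradient TD(0) with linear function approximation (sketch).
    V(s) = sum_i theta[i] * phi_s[i]; for a linear V, the gradient
    with respect to theta is just the feature vector phi_s."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    v_s = dot(theta, phi_s)
    v_s2 = 0.0 if terminal else dot(theta, phi_s2)
    delta = r + gamma * v_s2 - v_s                       # TD error
    return [t + alpha * delta * p for t, p in zip(theta, phi_s)]
```

With one-hot (table-lookup) features this reduces exactly to tabular TD(0), which is why the tabular case is a special case of function approximation.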
![Page 85: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/85.jpg)
Adapt Supervised Learning Algorithms
Supervised Learning System: Inputs → Outputs
Training Info = desired (target) outputs
Error = (target output – actual output)
Training example = {input, target output}
![Page 86: Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)](https://reader033.vdocuments.mx/reader033/viewer/2022051000/56649ee55503460f94bf3efb/html5/thumbnails/86.jpg)