CS 221: Artificial Intelligence
Reinforcement Learning
Peter Norvig and Sebastian Thrun
Slide credit: Dan Klein, Stuart Russell, Andrew Moore
Review: Markov Decision Processes

An MDP is defined by:
- a set of states s ∈ S
- a set of actions a ∈ A
- a transition function P(s′ | s, a), a.k.a. T(s, a, s′)
- a reward function R(s, a, s′), or R(s, a), or R(s′)

Problem: Find a policy π(s) that maximizes

∑_t γ^t R(s_t, π(s_t), s_{t+1})
Value Iteration

Idea: Start with U_0(s) = 0. Given U_i(s), calculate:

U_{i+1}(s) ← max_a ∑_{s′} P(s′ | s, a) [ R(s, a, s′) + γ U_i(s′) ]

Repeat until convergence.

Theorem: this will converge to the unique optimal utilities.

But: it requires a model (R and P).
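A minimal sketch of this update in Python (not from the slides; the `states`, `actions`, `P`, and `R` structures are assumed, and terminal states are modeled as having no actions):

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Compute optimal utilities for a known MDP.

    actions(s) lists the actions available in s, P[(s, a)] is a list of
    (next_state, prob) pairs, and R[(s, a, s2)] is the transition reward.
    """
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        for s in states:
            if not actions(s):          # terminal state: utility stays 0
                U_new[s] = 0.0
                continue
            U_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * U[s2]) for s2, p in P[(s, a)])
                for a in actions(s)
            )
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            return U_new
        U = U_new
```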
Reinforcement Learning
What if the state transition function P and the reward function R are unknown?
Example: Backgammon

- Reward only for a win / loss in terminal states, zero otherwise
- TD-Gammon (Tesauro, 1992) learns a function approximation to U(s) using a neural network
- Combined with depth-3 search, it reached the level of the top 3 human players in the world: good at positional judgment, but it makes mistakes in the endgame beyond depth 3
Example: Animal Learning

- RL has been studied experimentally for more than 60 years in psychology
- Rewards: food, pain, hunger, drugs, etc.
- Mechanisms and sophistication are debated

Example: foraging
- Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies
- Bees have a direct neural connection from the nectar intake measurement to the motor planning area
What to Learn

If you don't know P and R, what you learn depends on what you want to do:
- Utility-based agent: learn U(s)
- Q-learning agent: learn the action utility Q(s, a)
- Reflex agent: learn the policy π(s)

And on how patient/reckless you are:
- Passive: use a fixed policy π
- Active: modify the policy as you go
Passive Temporal-Difference
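In the standard formulation, the passive TD update after observing a transition from s to s′ with reward r is U(s) ← U(s) + α [ r + γ U(s′) − U(s) ]. A minimal sketch, assuming a fixed policy and a hypothetical `step(s, a) -> (next_state, reward, done)` environment interface:

```python
from collections import defaultdict

def passive_td(policy, step, start_state, gamma=0.9, alpha=0.1, episodes=1000):
    """TD(0) evaluation of the fixed policy `policy`."""
    U = defaultdict(float)
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            s2, r, done = step(s, policy(s))
            # Nudge U(s) toward the one-step sample; no bootstrap at terminals.
            target = r if done else r + gamma * U[s2]
            U[s] += alpha * (target - U[s])
            s = s2
    return U
```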
Example

[Figure sequence: passive TD in the 4×3 grid world with terminal rewards +1 and -1. All utility estimates start at 0; after a trial that reaches the +1 exit, the estimate of the neighboring state rises to about 0.9, and later trials propagate value backward (about 0.8 and 0.92). After many trials the estimates settle to intermediate values such as .36, .28, .20, .17, .12, -.01, -.12, -.16, and -.2 across the grid.]
Sample results
Problem

The passive algorithm computes the value of one specific policy, so it:
- may never try the best moves
- may never visit some states
- may waste time computing unneeded utilities
Quiz
Is there a fix?
Can we explore different policies (active reinforcement learning), yet still learn the optimal policy?
Greedy TD-Agent

- Learn U with TD-learning
- After each change to U, solve the MDP to compute a new optimal policy π(s)
- Continue learning with the new policy
Greedy Agent Results

- Finds the wrong policy: too much exploitation, not enough exploration
- Fix: assign higher U(s) to unexplored states
Exploratory Agent
Converges to correct policy
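One common way to implement the "higher U(s) for unexplored states" fix is an exploration function that is optimistic about rarely-tried states (the optimistic reward r_plus and visit threshold n_e below are illustrative assumptions, not values from the slides):

```python
def exploration_value(u, n, r_plus=1.0, n_e=5):
    """Optimistic utility: pretend states (or state-action pairs) that have
    been tried fewer than n_e times have the best possible reward r_plus."""
    return r_plus if n < n_e else u
```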
Alternative: Q-Learning

Instead of learning U(s), learn Q(s, a): the utility of taking action a in state s.

With utilities, we need a model P to act:
π(s) = argmax_a ∑_{s′} P(s′ | s, a) U(s′)

With Q-values, no model is needed:
π(s) = argmax_a Q(s, a)

Compute the Q-values with a TD/exploration algorithm.
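A minimal tabular Q-learning sketch using the standard update Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]; the `env` interface and the ε-greedy action choice are assumptions for illustration, not from the slides:

```python
import random
from collections import defaultdict

def q_learning(env, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=5000):
    """Tabular Q-learning with epsilon-greedy exploration.

    Assumed interface: env.reset() -> state, env.actions(s) -> list of
    actions, env.step(s, a) -> (next_state, reward, done).
    """
    Q = defaultdict(float)

    def greedy(s):
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(env.actions(s)) if random.random() < epsilon else greedy(s)
            s2, r, done = env.step(s, a)
            # Bootstrap from the best next action (off-policy), not the action we will actually take.
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```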
Q-Learning

[Figure sequence: the 4×3 grid world again, now with a Q-value for each state-action pair, all initialized to 0 (terminal rewards +1 and -1). After a trial that reaches the +1 exit, the Q-value of the action entering it rises to about .45; further trials push it toward .78 and propagate a value of about .33 back to the preceding state's Q-values.]
Q-Learning
Q-learning produces tables of Q-values: one entry for each (state, action) pair.
Q-Learning Properties

Amazing result: Q-learning converges to the optimal policy
- if you explore enough
- if you make the learning rate small enough
- ... but do not decrease it too quickly!
- and it basically doesn't matter how you select actions (!)

Neat property: off-policy learning. You can learn the optimal policy without following it (some caveats: see SARSA).
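For contrast with the off-policy update above, the on-policy SARSA update bootstraps from the action actually taken in the next state; a minimal sketch of just that update step (Q is the same kind of table keyed by (state, action)):

```python
def sarsa_update(Q, s, a, r, s2, a2, done, alpha=0.1, gamma=0.9):
    """On-policy SARSA: bootstrap from the action a2 actually chosen in s2
    (Q-learning would use the max over next actions instead)."""
    target = r if done else r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```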
Practical Q-Learning

In realistic situations, we cannot possibly learn about every single state!
- Too many states to visit them all in training
- Too many states to hold the Q-tables in memory

Instead, we want to generalize:
- Learn about a small number of training states from experience
- Generalize that experience to new, similar states
- This is a fundamental idea in machine learning, and we'll see it over and over again
Generalization Example

Let's say we discover through experience that this state is bad:

In naïve Q-learning, we know nothing about this state or its Q-states:

Or even this one!
Feature-Based Representations

Solution: describe a state using a vector of features
- Features are functions from states to real numbers (often 0/1) that capture important properties of the state
- Example features:
  - Distance to the closest ghost
  - Distance to the closest dot
  - Number of ghosts
  - 1 / (distance to the closest dot)^2
  - Is Pacman in a tunnel? (0/1)
  - ... etc.
- Can also describe a Q-state (s, a) with features (e.g. the action moves closer to food)
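A minimal sketch of what such a feature vector might look like in code (the state fields and helper names here are hypothetical, not from the course):

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def features(state, action):
    """Map a (state, action) pair to a small vector of real-valued features.

    `state` is assumed to expose Pacman's successor position, ghost and dot
    positions, and a tunnel test; all of these are illustrative names.
    """
    pos = state.successor_position(action)        # hypothetical helper
    ghost_dist = min(manhattan(pos, g) for g in state.ghosts)
    dot_dist = min(manhattan(pos, d) for d in state.dots)
    return [
        ghost_dist,                               # distance to closest ghost
        dot_dist,                                 # distance to closest dot
        len(state.ghosts),                        # number of ghosts
        1.0 / (dot_dist ** 2 + 1e-6),             # 1 / (distance to dot)^2
        1.0 if state.in_tunnel(pos) else 0.0,     # is Pacman in a tunnel?
    ]
```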
Linear Feature Functions

Using a feature representation, we can write a Q-function (or utility function) for any state using a few weights:

Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + ... + w_n f_n(s, a)

Advantage: our experience is summed up in a few powerful numbers.

Disadvantage: states may share features but be very different in value!
Function Approximation

Q-learning with linear Q-functions: after each observed transition (s, a, r, s′), compute the error

difference = [ r + γ max_{a′} Q(s′, a′) ] − Q(s, a)

and update each weight in proportion to its feature:

w_i ← w_i + α · difference · f_i(s, a)

Intuitive interpretation: adjust the weights of the active features. E.g. if something unexpectedly bad happens, disprefer all states with that state's features (prudence or superstition?).
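A minimal sketch of that weight update (the `features` function and the list of next-state actions are assumed inputs, in the spirit of the hypothetical example above):

```python
def linear_q_update(w, features, s, a, r, s2, next_actions, alpha=0.1, gamma=0.9):
    """One approximate Q-learning step on the weight vector w.

    features(s, a) returns a list of feature values; next_actions is the
    list of actions available in s2 (empty if s2 is terminal).
    """
    def q(state, action):
        return sum(wi * fi for wi, fi in zip(w, features(state, action)))

    target = r + (gamma * max(q(s2, a2) for a2 in next_actions) if next_actions else 0.0)
    difference = target - q(s, a)
    f = features(s, a)
    for i in range(len(w)):
        w[i] += alpha * difference * f[i]   # adjust only in proportion to active features
    return w
```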
Summary: MDPs and RL

| Things we know how to do | Techniques |
| --- | --- |
| If we know the MDP: compute U*, Q*, π* exactly; evaluate a fixed policy | Model-based DP: value iteration, policy iteration |
| If we don't know the MDP: estimate the MDP, then solve it | Model-based RL |
| Estimate U for a fixed policy | Model-free RL: value learning |
| Estimate Q*(s, a) for the optimal policy while executing an exploration policy | Model-free RL: Q-learning |