Learning to Maximize Reward: Reinforcement Learning
Brian C. Williams, 16.412J/6.834J, October 28th, 2002
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU
Reading
• Today: Reinforcement Learning
  • Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20.
  • Read “Reinforcement Learning: A Survey” by L. Kaelbling, M. Littman and A. Moore, Journal of Artificial Intelligence Research 4 (1996) 237–285.
• For Markov Decision Processes:
  • Read 1st/2nd ed. AIMA Chapter 17, sections 1–4.
• Optional Reading: “Planning and Acting in Partially Observable Stochastic Domains,” by L. Kaelbling, M. Littman and A. Cassandra, Artificial Intelligence 101 (1998) 99–134.
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon
Situations:
• Board configurations (~10²⁰)
Actions:
• Moves
Rewards:
• +100 if win
• −100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself.
Currently roughly equal to the best human players.
Reinforcement Learning Problem
Given: Repeatedly…
• Executed action
• Observed state
• Observed reward
Learn action policy π: S → A
• Maximizes lifetime reward
r₀ + γr₁ + γ²r₂ + … from any start state.
• Discount factor: 0 ≤ γ < 1
Note:
• Unsupervised learning
• Delayed reward
[Figure: agent-environment loop. At each step the agent observes state sₜ and reward rₜ from the environment and responds with action aₜ: s₀/r₀ → a₀ → s₁/r₁ → a₁ → s₂/r₂ → a₂ → s₃ …]
Goal: Learn to choose actions that maximize lifetime reward
r₀ + γr₁ + γ²r₂ + …
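To make the loop concrete, here is a minimal Python sketch of one episode; `env_reset`, `env_step`, and `policy` are hypothetical placeholders for the environment and agent interfaces, not names from the lecture:

```python
def run_episode(env_reset, env_step, policy, gamma=0.9, max_steps=1000):
    """One pass of the agent-environment loop, accumulating discounted reward."""
    s = env_reset()                      # observe initial state s0
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)                    # executed action
        s, r, done = env_step(s, a)      # observed state, observed reward
        total += discount * r            # r0 + gamma*r1 + gamma^2*r2 + ...
        discount *= gamma
        if done:
            break
    return total
```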
How About Learning the Policy Directly?
1. π*: S → A
2. Fill out table entries for π* by collecting statistics on training pairs <s, a>.
3. Where does a come from?
How About Learning the Value Function?
1. Have agent learn the value function of the optimal policy, denoted V*.
2. Given learned V*, agent selects optimal action by one-step lookahead:
π*(s) = argmaxₐ [r(s, a) + γV*(δ(s, a))]
Problem:
• Works well if agent knows the environment model:
  • δ: S × A → S
  • r: S × A → ℝ
• With no model, agent can't choose action from V*.
• With a model, could compute V* via value iteration, so why learn it?
How About Learning the Model as Well?
1. Have agent learn δ and r from statistics on training instances <sₜ, aₜ, rₜ₊₁, sₜ₊₁>.
2. Compute V* by value iteration:
Vₜ₊₁(s) ← maxₐ [r(s, a) + γVₜ(δ(s, a))]
3. Agent selects optimal action by one-step lookahead:
π*(s) = argmaxₐ [r(s, a) + γV*(δ(s, a))]
Problem: A viable strategy for many problems, but…
• When do you stop learning the model and compute V*?
• May take a long time to converge on the model.
• Would like to continuously interleave learning and acting, but repeatedly computing V* is costly.
• How can we avoid learning the model and V* explicitly?
Eliminating the Model with Q Functions
π*(s) = argmaxₐ [r(s, a) + γV*(δ(s, a))]
Key idea:
• Define a function that encapsulates V*, δ and r:
Q(s, a) = r(s, a) + γV*(δ(s, a))
• From learned Q, can choose an optimal action without knowing δ or r:
π*(s) = argmaxₐ Q(s, a)
V* = cumulative reward of being in s.
Q = cumulative reward of being in s and taking action a.
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
How Do We Learn Q?
Q(sₜ, aₜ) = r(sₜ, aₜ) + γV*(δ(sₜ, aₜ))
Idea:
• Create an update rule similar to the Bellman equation.
• Perform updates on training examples <sₜ, aₜ, rₜ₊₁, sₜ₊₁>:
Q(sₜ, aₜ) ← rₜ₊₁ + γV*(sₜ₊₁)
How do we eliminate V*?
• Q and V* are closely related:
V*(s) = maxₐ′ Q(s, a′)
• Substituting Q for V*:
Q(sₜ, aₜ) ← rₜ₊₁ + γ maxₐ′ Q(sₜ₊₁, a′)
This update is called a backup.
Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially:
• For each s, a initialize table entry Q̂(s, a) ← 0.
• Observe initial state s₀.
Do for all time t:
• Select an action aₜ and execute it.
• Receive immediate reward rₜ₊₁.
• Observe the new state sₜ₊₁.
• Update the table entry for Q̂(sₜ, aₜ) as follows:
Q̂(sₜ, aₜ) ← rₜ₊₁ + γ maxₐ′ Q̂(sₜ₊₁, a′)
• sₜ ← sₜ₊₁
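A tabular sketch of this procedure in Python; the `step(s, a) -> (s_next, r)` environment interface and the `actions(s)` helper are assumptions made for the example, not the slides' notation:

```python
import random
from collections import defaultdict

def q_learning_deterministic(step, actions, s0, gamma=0.9, n_steps=10000):
    """Tabular Q-learning for a deterministic world; Q-hat starts at 0."""
    Q = defaultdict(float)                        # Q[(s, a)] table
    s = s0
    for _ in range(n_steps):
        a = random.choice(actions(s))             # any exploration scheme works
        s_next, r = step(s, a)                    # execute, observe reward and state
        # Backup: Q(s_t, a_t) <- r_{t+1} + gamma * max_a' Q(s_{t+1}, a')
        Q[(s, a)] = r + gamma * max((Q[(s_next, a2)] for a2 in actions(s_next)),
                                    default=0.0)
        s = s_next
    return Q
```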
Example – Q Learning Update
[Figure: two adjacent grid states s₁ (left) and s₂ (right); the arrows leaving s₂ carry current Q̂ values 63, 81, and 100; the arrow from s₁ to s₂ carries 72; γ = 0.9, and 0 reward is received on the move.]
Example – Q Learning Update
Q̂(s₁, a_right) ← r(s₁, a_right) + γ maxₐ′ Q̂(s₂, a′)
  = 0 + 0.9 × max{63, 81, 100}
  = 90
Note: if rewards are non-negative, then:
• For all s, a, n: Q̂ₙ(s, a) ≤ Q̂ₙ₊₁(s, a)
• For all s, a, n: 0 ≤ Q̂ₙ(s, a) ≤ Q(s, a)
[Figure: the same two states; 0 reward is received on the move, and the arrow from s₁ to s₂ is relabeled from 72 to 90.]
Q-Learning Iterations: Episodic
• Start at upper left; move clockwise; table initially 0; γ = 0.8
Q̂(s, a) ← r + γ maxₐ′ Q̂(s′, a′)
[Figure: 2×3 grid world, top row s₁ s₂ s₃, bottom row s₆ s₅ s₄; G = s₅ is the absorbing goal; each of the three transitions into G has reward 10, all others 0.]

| Q̂(s₁,E) | Q̂(s₂,E) | Q̂(s₃,S) | Q̂(s₄,W) |
|---|---|---|---|
| 0 | | | |
Q-Learning Iterations: Episodic
• Start at upper left; move clockwise; table initially 0; γ = 0.8
Q̂(s, a) ← r + γ maxₐ′ Q̂(s′, a′)
[Figure: same 2×3 grid; G = s₅; reward 10 on each transition into G.]

| Q̂(s₁,E) | Q̂(s₂,E) | Q̂(s₃,S) | Q̂(s₄,W) |
|---|---|---|---|
| 0 | 0 | 0 | |
Q-Learning Iterations
• Start at upper left; move clockwise; γ = 0.8
Q̂(s, a) ← r + γ maxₐ′ Q̂(s′, a′)
[Figure: same 2×3 grid; G = s₅; reward 10 on each transition into G.]

| Q̂(s₁,E) | Q̂(s₂,E) | Q̂(s₃,S) | Q̂(s₄,W) |
|---|---|---|---|
| 0 | 0 | 0 | r + γ max{Q̂(s₅,loop)} = 10 + 0.8 × 0 = 10 |
Q-Learning Iterations
• Start at upper left; move clockwise; γ = 0.8
Q̂(s, a) ← r + γ maxₐ′ Q̂(s′, a′)
[Figure: same 2×3 grid; G = s₅; reward 10 on each transition into G.]

| Q̂(s₁,E) | Q̂(s₂,E) | Q̂(s₃,S) | Q̂(s₄,W) |
|---|---|---|---|
| 0 | 0 | 0 | 10 |
| 0 | 0 | r + γ max{Q̂(s₄,W), Q̂(s₄,N)} = 0 + 0.8 × max{10, 0} = 8 | |
Q-Learning Iterations
• Start at upper left; move clockwise; γ = 0.8
Q̂(s, a) ← r + γ maxₐ′ Q̂(s′, a′)
[Figure: same 2×3 grid; G = s₅; reward 10 on each transition into G.]

| Q̂(s₁,E) | Q̂(s₂,E) | Q̂(s₃,S) | Q̂(s₄,W) |
|---|---|---|---|
| 0 | 0 | 0 | 10 |
| 0 | 0 | 8 | 10 |
| 0 | r + γ max{Q̂(s₃,W), Q̂(s₃,S)} = 0 + 0.8 × max{0, 8} = 6.4 | | |
Q-Learning Iterations
• Start at upper left; move clockwise; γ = 0.8
Q̂(s, a) ← r + γ maxₐ′ Q̂(s′, a′)
[Figure: same 2×3 grid; G = s₅; reward 10 on each transition into G.]

| Q̂(s₁,E) | Q̂(s₂,E) | Q̂(s₃,S) | Q̂(s₄,W) |
|---|---|---|---|
| 0 | 0 | 0 | 10 |
| 0 | 0 | 8 | 10 |
| 0 | 6.4 | 8 | 10 |
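The whole walkthrough can be replayed in a few lines of Python. The grid encoding below is an assumption reconstructed from the slides (top row s1 s2 s3, bottom row s6 s5 s4, absorbing goal G = s5, reward 10 on the move into G, and a fixed clockwise route):

```python
GAMMA = 0.8
# Deterministic transitions along the walked route: (state, action) -> next state.
STEP = {("s1", "E"): "s2", ("s2", "E"): "s3", ("s3", "S"): "s4", ("s4", "W"): "s5"}
REWARD = {("s4", "W"): 10}                # only the move into G = s5 pays off
ACTIONS = {"s1": ["E"], "s2": ["E"], "s3": ["S", "W"],
           "s4": ["W", "N"], "s5": []}    # G is absorbing

Q = {(s, a): 0.0 for s in ACTIONS for a in ACTIONS[s]}

for episode in range(3):
    s = "s1"
    while s != "s5":
        a = ACTIONS[s][0]                 # follow the fixed clockwise route
        s_next = STEP[(s, a)]
        r = REWARD.get((s, a), 0)
        best_next = max((Q[(s_next, a2)] for a2 in ACTIONS[s_next]), default=0.0)
        Q[(s, a)] = r + GAMMA * best_next # deterministic Q backup
        s = s_next
    print(episode + 1, Q)
# Episode 1 fills Q(s4,W)=10; episode 2 propagates Q(s3,S)=8;
# episode 3 gives Q(s2,E)=6.4, matching the table above.
```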
Example Summary: Value Iteration and Q-Learning
[Figures, for the classic 2×3 grid with reward 100 on each transition into absorbing goal G (0 elsewhere) and γ = 0.9:
• r(s, a) values: 100 into G, 0 on all other transitions.
• V*(s) values: 100, 90, 81, decreasing with distance from G; V*(G) = 0.
• Q̂(s, a) values: 100, 90, 81, 72 depending on the transition.
• One optimal policy: arrows along the shortest paths to G.]
Exploration vs. Exploitation
How do you pick actions as you learn?
1. Greedy Action Selection:
• Always select the action that looks best:
π*(s) = argmaxₐ Q̂(s, a)
2. Probabilistic Action Selection:
• Likelihood of a increases with its current Q̂ value:
• P(aᵢ | s) = k^Q̂(s, aᵢ) / Σⱼ k^Q̂(s, aⱼ), for some k > 0
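A sketch of both selection rules, reusing the tabular Q dictionary from the earlier sketch (the `actions` helper is hypothetical):

```python
import random

def greedy_action(Q, s, actions):
    """Exploit: always pick the action with the highest current Q-hat."""
    return max(actions(s), key=lambda a: Q[(s, a)])

def probabilistic_action(Q, s, actions, k=2.0):
    """Explore: P(a_i | s) = k**Q(s, a_i) / sum_j k**Q(s, a_j).
    Larger k favors high-valued actions; k near 1 approaches uniform choice."""
    acts = actions(s)
    weights = [k ** Q[(s, a)] for a in acts]
    return random.choices(acts, weights=weights)[0]
```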
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
TD(λ): Temporal Difference Learning
(from lecture slides: Machine Learning, T. Mitchell, McGraw-Hill, 1997)
Q learning: reduce discrepancy between successive Q estimates.
One-step time difference:
Q⁽¹⁾(sₜ, aₜ) = rₜ + γ maxₐ Q̂(sₜ₊₁, a)
Why not two steps?
Q⁽²⁾(sₜ, aₜ) = rₜ + γrₜ₊₁ + γ² maxₐ Q̂(sₜ₊₂, a)
Or n?
Q⁽ⁿ⁾(sₜ, aₜ) = rₜ + γrₜ₊₁ + … + γⁿ⁻¹rₜ₊ₙ₋₁ + γⁿ maxₐ Q̂(sₜ₊ₙ, a)
Blend all of these:
Q^λ(sₜ, aₜ) = (1 − λ) [Q⁽¹⁾(sₜ, aₜ) + λQ⁽²⁾(sₜ, aₜ) + λ²Q⁽³⁾(sₜ, aₜ) + …]
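One way to compute the blended target from a recorded finite rollout; this is a sketch under the assumption that `rewards` and `bootstrap_qs` were logged along one trajectory, with the leftover geometric weight given to the longest backup:

```python
def lambda_return(rewards, bootstrap_qs, gamma=0.9, lam=0.5):
    """Blend the n-step targets Q^(n) into one TD(lambda) target.

    rewards[i]      = r_{t+i} observed along one trajectory
    bootstrap_qs[i] = max_a Q(s_{t+i+1}, a), closing off the (i+1)-step return
    """
    def q_n(n):  # Q^(n) = r_t + gamma*r_{t+1} + ... + gamma^n * max_a Q(s_{t+n}, a)
        return (sum(gamma ** i * rewards[i] for i in range(n))
                + gamma ** n * bootstrap_qs[n - 1])

    n_max = len(rewards)
    target = (1 - lam) * sum(lam ** (n - 1) * q_n(n) for n in range(1, n_max + 1))
    return target + lam ** n_max * q_n(n_max)  # leftover weight on longest backup
```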
Eligibility Traces
Idea: Perform backups on N previous data points, as well as the most recent data point.
• Select data to back up based on frequency of visitation.
• Bias towards frequent data by geometric decay λⁱ⁻ʲ.
[Figure: two time plots for a data point <s, a>: the top marks discrete visits over time t; the bottom shows the accumulating trace, which jumps at each visit and decays geometrically in between.]
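A minimal sketch of the accumulating-trace idea, in the style of Watkins-type Q(λ) rather than the exact rule on the slide; the learning rate `alpha`, the trace table `e`, and the `actions` helper are assumptions:

```python
def q_lambda_step(Q, e, s, a, r, s_next, actions, gamma=0.9, lam=0.7, alpha=0.1):
    """One backup that also refreshes all recently visited (s, a) pairs.
    Q and e are defaultdict(float) tables."""
    e[(s, a)] += 1.0                                   # accumulate on each visit
    best_next = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
    delta = r + gamma * best_next - Q[(s, a)]          # one-step TD error
    for key in list(e):
        Q[key] += alpha * delta * e[key]               # backup proportional to trace
        e[key] *= gamma * lam                          # geometric decay of the trace
```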
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs:
  • Value Iteration
  • Q Learning
• Function Approximators
• Model-based Learning
• Summary
Nondeterministic MDPs
State transitions become probabilistic: δ(s, a, s′)
Example (R = Research path, D = Development path):
[Figure: a four-state MDP with states S1 Unemployed, S2 Industry, S3 Grad School, S4 Academia; each state offers actions R and D, with transition probabilities of 0.9/0.1 on some arcs and 1.0 on others.]
Nondeterministic Case
• How do we redefine cumulative reward to handle nondeterminism?
• Define V and Q based on expected values:
Vπ(sₜ) = E[rₜ + γrₜ₊₁ + γ²rₜ₊₂ + …]
Vπ(sₜ) = E[Σᵢ γⁱ rₜ₊ᵢ]
Q(sₜ, aₜ) = E[r(sₜ, aₜ) + γV*(δ(sₜ, aₜ))]
Value Iteration for Nondeterministic MDPs
V₁(s) := 0 for all s
t := 1
loop
  t := t + 1
  loop for all s in S
    loop for all a in A
      Qₜ(s, a) := r(s, a) + γ Σ_{s′ ∈ S} δ(s, a, s′) Vₜ₋₁(s′)
    end loop
    Vₜ(s) := maxₐ [Qₜ(s, a)]
  end loop
until |Vₜ(s) − Vₜ₋₁(s)| < ε for all s in S
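The same algorithm as a Python sketch, assuming the model is given as a transition table `T[(s, a)] = {s_next: probability}` and an expected-reward function `r(s, a)` (both interface assumptions):

```python
def value_iteration(S, A, r, T, gamma=0.9, eps=1e-6):
    """Iterate Bellman backups until the value table stops changing."""
    V = {s: 0.0 for s in S}
    while True:
        Q = {(s, a): r(s, a) + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
             for s in S for a in A}
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new, Q
        V = V_new
```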
Q Learning for Nondeterministic MDPs
Q*(s, a) = r(s, a) + γ Σ_{s′ ∈ S} δ(s, a, s′) maxₐ′ [Q*(s′, a′)]
• Alter the training rule for the nondeterministic case:
Q̂ₙ(sₜ, aₜ) ← (1 − αₙ) Q̂ₙ₋₁(sₜ, aₜ) + αₙ [rₜ₊₁ + γ maxₐ′ Q̂ₙ₋₁(sₜ₊₁, a′)]
where αₙ = 1 / (1 + visitsₙ(s, a))
Can still prove convergence of Q̂ to Q [Watkins and Dayan, 92]
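One backup of this training rule as a sketch; `Q` and `visits` are assumed defaultdict tables, and `actions` is the same hypothetical helper as above:

```python
def nondeterministic_q_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
    """Q-learning backup with decaying learning rate alpha_n = 1/(1 + visits)."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max((Q[(s_next, a2)] for a2 in actions(s_next)),
                             default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```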
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
Function Approximation
Function Approximators:
• Backprop Neural Network
• Radial Basis Function Network
• CMAC Network
• Nearest Neighbor, Memory-based
• Decision Tree
The network-based approximators are trained by gradient-descent methods.
[Figure: the function approximator takes s and a as inputs, outputs Q̂(s, a), and is trained on targets or errors.]
Function Approximation Example: Adjusting Network Weights
Function Approximator:
• Q̂(s, a) = f(s, a, w), with weight vector w.
Update: Gradient-descent Sarsa:
• w ← w + α [rₜ₊₁ + γQ̂(sₜ₊₁, aₜ₊₁) − Q̂(sₜ, aₜ)] ∇w f(sₜ, aₜ, w)
Here rₜ₊₁ + γQ̂(sₜ₊₁, aₜ₊₁) is the target value, Q̂(sₜ, aₜ) the estimated value, and ∇w f(sₜ, aₜ, w) the standard backprop gradient.
[Figure: the approximator maps (s, a) to Q̂(s, a), trained on targets or errors.]
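A sketch of this update for the special case of a linear approximator, where ∇w f(s, a, w) is simply the feature vector φ(s, a); for a network, backprop would supply the gradient instead. The `features` function is an assumption:

```python
import numpy as np

def sarsa_gradient_step(w, features, s, a, r, s_next, a_next,
                        alpha=0.01, gamma=0.9):
    """One gradient-descent Sarsa update for Q(s, a) = w . features(s, a)."""
    phi = features(s, a)                  # gradient of w.phi w.r.t. w is phi
    q = w @ phi
    q_next = w @ features(s_next, a_next)
    td_error = r + gamma * q_next - q     # target minus estimate
    return w + alpha * td_error * phi
```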
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon
Situations:
• Board configurations (~10²⁰)
Actions:
• Moves
Rewards:
• +100 if win
• −100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself.
Currently roughly equal to the best human players.
Example: TD-Gammon [Tesauro, 1995]
[Figure: a neural network with 0 to 160 hidden units and random initial weights maps the raw board position (# of pieces at each location) to V̂(s), the predicted probability of winning. Training signal: outcome = 1 on a win, 0 on a loss, with TD error V̂(sₜ₊₁) − V̂(sₜ).]
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
Model-based Learning: Certainty-Equivalence Method
For every step:
1. Use new experience to update model parameters:
  • Transitions
  • Rewards
2. Solve the model for V and π:
  • Value iteration, or
  • Policy iteration.
3. Use the policy to choose the next action.
Learning the Model
For each state-action pair <s, a> visited, accumulate:
1. Mean Transition:
T(s, a, s′) = number-times-seen(s, a → s′) / number-times-tried(s, a)
2. Mean Reward:
R(s, a) = sum of rewards received on taking a in s / number-times-tried(s, a)
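Bookkeeping for these maximum-likelihood estimates might look like the following sketch (class and method names are illustrative, not from the slides):

```python
from collections import defaultdict

class ModelEstimator:
    """Accumulate estimates of T(s, a, s') and R(s, a) from experience."""
    def __init__(self):
        self.tried = defaultdict(int)            # (s, a) -> count
        self.seen = defaultdict(int)             # (s, a, s') -> count
        self.reward_sum = defaultdict(float)     # (s, a) -> total reward

    def update(self, s, a, r, s_next):
        self.tried[(s, a)] += 1
        self.seen[(s, a, s_next)] += 1
        self.reward_sum[(s, a)] += r

    def T(self, s, a, s_next):
        n = self.tried[(s, a)]
        return self.seen[(s, a, s_next)] / n if n else 0.0

    def R(self, s, a):
        n = self.tried[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0
```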
Comparison of Model-based and Model-free Methods
Temporal Differencing / Q Learning: only does computation for the states the system is actually in.
• Good real-time performance
• Inefficient use of data
Model-based methods: compute the best estimates for every state on every time step.
• Efficient use of data
• Terrible real-time performance
What is a middle ground?
Dyna: A Middle Ground [Sutton, Intro to RL, 97]
At each step, incrementally:
1. Update model based on new data.
2. Update policy based on new data.
3. Update policy based on updated model.
Performance, until optimal, on a grid world:
• Q-Learning: 531,000 steps, 531,000 backups
• Dyna: 61,908 steps, 3,055,000 backups
Dyna Algorithm
Given state s:
1. Choose action a using the estimated policy.
2. Observe new state s′ and reward r.
3. Update T and R of the model.
4. Update V at <s, a>:
V(s) ← maxₐ [r(s, a) + γ Σ_{s′} T(s, a, s′) V(s′)]
5. Perform k additional updates:
a) Pick k random states sⱼ in {s₁, s₂, …, sₖ}.
b) Update each V(sⱼ):
V(sⱼ) ← maxₐ [r(sⱼ, a) + γ Σ_{s′} T(sⱼ, a, s′) V(s′)]
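Steps 4 and 5 as a sketch, reusing the hypothetical `ModelEstimator` above for T and R; `states` and `actions` are assumed to be explicit lists:

```python
import random

def dyna_step(model, V, s, states, actions, gamma=0.9, k=10):
    """One Dyna planning step: back up the real state, then k simulated states."""
    def backup(state):
        V[state] = max(
            model.R(state, a) + gamma * sum(model.T(state, a, s2) * V[s2]
                                            for s2 in states)
            for a in actions)

    backup(s)                                    # update V at the real state
    for s_j in random.choices(states, k=k):      # k extra model-based backups
        backup(s_j)
```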
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
Ongoing Research
• Handling cases where state is only partially observable
• Design of optimal exploration strategies
• Extend to continuous action, state
• Learn and use δ: S × A → S
• Scaling up in the size of the state space
  • Function approximators (neural net instead of table)
  • Generalization
  • Macros
  • Exploiting substructure
• Multiple learners: multi-agent reinforcement learning
Markov Decision Processes (MDPs)
Model:
• Finite set of states, S
• Finite set of actions, A
• Probabilistic state transitions, δ(s, a)
• Reward for each state and action, R(s, a)
Process:
• Observe state sₜ in S
• Choose action aₜ in A
• Receive immediate reward rₜ
• State changes to sₜ₊₁
Deterministic Example:
[Figure: the 2×3 grid world with goal G; legal transitions shown; the three transitions into G have reward 10, and reward on unlabeled transitions is 0. Interaction trace: s₀/r₀ → a₀ → s₁/r₁ → a₁ → s₂/r₂ → a₂ → s₃ …]
Crib Sheet: MDPs by Value Iteration
Insight: Can calculate optimal values iteratively using Dynamic Programming.
Algorithm:
• Iteratively calculate value using Bellman's Equation:
V*ₜ₊₁(s) ← maxₐ [r(s, a) + γV*ₜ(δ(s, a))]
• Terminate when values are “close enough”:
|V*ₜ₊₁(s) − V*ₜ(s)| < ε
• Agent selects optimal action by one-step lookahead on V*:
π*(s) = argmaxₐ [r(s, a) + γV*(δ(s, a))]
Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially:
• For each s, a initialize table entry Q̂(s, a) ← 0.
• Observe current state s.
Do forever:
• Select an action a and execute it.
• Receive immediate reward r.
• Observe the new state s′.
• Update the table entry for Q̂(s, a) as follows:
Q̂(s, a) ← r + γ maxₐ′ Q̂(s′, a′)
• s ← s′