Machine Learning RL 1
Reinforcement Learning
Michael L. Littman
Slides from
http://www.cs.vu.nl/~elena/ml_13light.ppt
which appear to have been adapted from
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/l20.ps
Machine Learning RL 2
Reinforcement Learning
[Read Ch. 13] [Exercise 13.2]
• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence
Machine Learning RL 3
Control Learning
Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon
Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that state only partially observable
• Possible need to learn multiple tasks with same sensors/effectors
Machine Learning RL 4
One Example: TD-Gammon
[Tesauro, 1995]
Learn to play Backgammon.
Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself.
Now approximately equal to best human player.
Machine Learning RL 5
Reinforcement Learning Problem
Goal: learn to choose actions that maximize
$r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$
Machine Learning RL 6
Markov Decision Processes
Assume
• finite set of states $S$
• set of actions $A$
• at each discrete time agent observes state $s_t \in S$ and chooses action $a_t \in A$
• then receives immediate reward $r_t$
• and state changes to $s_{t+1}$
• Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
  – i.e., $r_t$ and $s_{t+1}$ depend only on current state and action
  – functions $\delta$ and $r$ may be nondeterministic
  – functions $\delta$ and $r$ not necessarily known to agent
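To make the formalism concrete, here is a minimal Python sketch of a deterministic MDP; the three-state chain and the names delta and r are illustrative assumptions, not part of the slides.

```python
# A made-up three-state deterministic MDP illustrating the formalism above.
S = [0, 1, 2]          # finite set of states S
A = ["left", "right"]  # set of actions A

def delta(s, a):
    """Deterministic transition function: s_{t+1} = delta(s_t, a_t)."""
    return min(s + 1, 2) if a == "right" else max(s - 1, 0)

def r(s, a):
    """Deterministic reward function r(s_t, a_t): +100 for entering state 2."""
    return 100 if s != 2 and delta(s, a) == 2 else 0

# One step of the agent-environment loop: observe s_t, choose a_t,
# receive r_t, and move to s_{t+1}.
s, a = 0, "right"
reward, s_next = r(s, a), delta(s, a)
```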
Machine Learning RL 7
Agent’s Learning Task
Execute actions in environment, observe results, and
• learn action policy $\pi : S \to A$ that maximizes
$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots]$
from any starting state in $S$
• here $0 \le \gamma < 1$ is the discount factor for future rewards (sometimes makes sense with $\gamma = 1$)
Note something new:
• Target function is $\pi : S \to A$
• But, we have no training examples of form $\langle s, a \rangle$
• Training examples are of form $\langle \langle s, a \rangle, r \rangle$
Machine Learning RL 8
Value Function
To begin, consider deterministic worlds...
For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states

$V^\pi(s_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$

where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s_t$

Restated, the task is to learn the optimal policy $\pi^*$:

$\pi^* \equiv \arg\max_\pi V^\pi(s),\ (\forall s)$
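As a sketch of this definition, the discounted sum can be approximated by rolling the policy forward for a finite horizon (reusing the delta and r from the MDP sketch above; the truncation horizon is an assumption, since the true $V^\pi$ is an infinite sum):

```python
def V_pi(pi, s, gamma=0.9, horizon=100):
    # Approximates V^pi(s) = sum_i gamma^i r_{t+i} by truncating the sum;
    # the truncation error is bounded by gamma**horizon times the max reward.
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = pi(s)
        total += discount * r(s, a)   # accumulate gamma^i * r_{t+i}
        discount *= gamma
        s = delta(s, a)
    return total

print(V_pi(lambda s: "right", 0))  # value of the always-right policy from state 0
```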
Machine Learning RL 9
Machine Learning RL 10
What to Learn
We might try to have agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$)
It could then do a look-ahead search to choose the best action from any state $s$ because

$\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))]$

A problem:
• This can work if agent knows $\delta : S \times A \to S$, and $r : S \times A \to \Re$
• But when it doesn't, it can't choose actions this way
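A one-function sketch of this look-ahead rule (again reusing delta and r; representing $V^*$ as a state-to-value dict is an assumption):

```python
def pi_star_lookahead(V_star, s, actions, gamma=0.9):
    # pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]
    # Only possible when the agent knows delta and r.
    return max(actions, key=lambda a: r(s, a) + gamma * V_star[delta(s, a)])
```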
Machine Learning RL 11
Q Function
Define a new function very similar to V*
$Q(s,a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$

If agent learns Q, it can choose optimal action even without knowing $\delta$!

$\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))] = \arg\max_a Q(s,a)$
Q is the evaluation function the agent will learn [Watkins 1989].
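By contrast, acting from a learned Q needs neither $\delta$ nor $r$ at decision time; a minimal sketch, assuming Q is stored as a dict keyed by (state, action) pairs:

```python
def pi_star_from_Q(Q, s, actions):
    # pi*(s) = argmax_a Q(s, a): the one-step look-ahead is folded into Q,
    # so no model of delta or r is consulted when choosing an action.
    return max(actions, key=lambda a: Q[(s, a)])
```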
Machine Learning RL 12
Training Rule to Learn Q
Note Q and V* are closely related:
$V^*(s) = \max_{a'} Q(s, a')$
This allows us to write Q recursively as
$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
Nice! Let $\hat{Q}$ denote the learner's current approximation to Q. Consider training rule

$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$

where $s'$ is the state resulting from applying action $a$ in state $s$.
Machine Learning RL 13
Q Learning for Deterministic Worlds
For each $s, a$ initialize table entry $\hat{Q}(s,a) \leftarrow 0$
Observe current state $s$
Do forever:
• Select an action $a$ and execute it
• Receive immediate reward $r$
• Observe the new state $s'$
• Update the table entry for $\hat{Q}(s,a)$ as follows:
  $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
• $s \leftarrow s'$
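A direct transcription of this algorithm into Python, as a sketch: the delta/r interface matches the earlier chain example, and uniformly random action selection stands in for the exploration strategy, which the algorithm leaves unspecified.

```python
import random
from collections import defaultdict

def q_learning(states, actions, delta, r, gamma=0.9,
               episodes=1000, steps_per_episode=20):
    Q = defaultdict(float)  # every table entry Q_hat(s, a) starts at 0
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(steps_per_episode):
            a = random.choice(actions)             # select an action and execute it
            reward, s_next = r(s, a), delta(s, a)  # receive r, observe s'
            # Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next                             # s <- s'
    return Q
```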
Machine Learning RL 14
Updating Q
$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a')$
$= 0 + 0.9 \max\{63, 81, 100\}$
$= 90$

Notice if rewards non-negative, then
$(\forall s, a, n)\ \hat{Q}_{n+1}(s,a) \ge \hat{Q}_n(s,a)$
and
$(\forall s, a, n)\ 0 \le \hat{Q}_n(s,a) \le Q(s,a)$
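The arithmetic of this worked example, checked directly:

```python
gamma = 0.9
r_immediate = 0
next_state_values = [63, 81, 100]  # Q_hat(s2, a') for the available actions
print(r_immediate + gamma * max(next_state_values))  # 90.0, as on the slide
```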
Machine Learning RL 15
Convergence Theorem

$\hat{Q}$ converges to $Q$. Consider case of deterministic world, with bounded immediate rewards, where each $\langle s, a \rangle$ is visited infinitely often.

Proof: Define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. During each full interval the largest error in the $\hat{Q}$ table is reduced by factor of $\gamma$.

Let $\hat{Q}_n$ be the table after $n$ updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is

$\Delta_n = \max_{s,a} |\hat{Q}_n(s,a) - Q(s,a)|$

For any table entry $\hat{Q}_n(s,a)$ updated on iteration $n+1$, the error in the revised estimate $\hat{Q}_{n+1}(s,a)$ is

$|\hat{Q}_{n+1}(s,a) - Q(s,a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s',a')) - (r + \gamma \max_{a'} Q(s',a'))|$
$= \gamma\,|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')|$
Machine Learning RL 16
$|\hat{Q}_{n+1}(s,a) - Q(s,a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s',a')) - (r + \gamma \max_{a'} Q(s',a'))|$
$= \gamma\,|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')|$
$\le \gamma\,\max_{a'} |\hat{Q}_n(s',a') - Q(s',a')|$
$\le \gamma\,\max_{s'',a'} |\hat{Q}_n(s'',a') - Q(s'',a')|$
$= \gamma\,\Delta_n$
Note we used general fact that:
$|\max_a f_1(a) - \max_a f_2(a)| \le \max_a |f_1(a) - f_2(a)|$

This works with things other than max that satisfy this non-expansion property [Szepesvári & Littman, 1999].
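A randomized spot-check of the non-expansion property (illustrative only, not part of the proof):

```python
import random

# |max_a f1(a) - max_a f2(a)| <= max_a |f1(a) - f2(a)| for arbitrary f1, f2.
for _ in range(10_000):
    f1 = [random.uniform(-10, 10) for _ in range(5)]
    f2 = [random.uniform(-10, 10) for _ in range(5)]
    lhs = abs(max(f1) - max(f2))
    rhs = max(abs(x - y) for x, y in zip(f1, f2))
    assert lhs <= rhs + 1e-12
```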
Machine Learning RL 17
Nondeterministic Case (1)
What if reward and next state are non-deterministic?
We redefine V and Q by taking expected values.
$V^\pi(s_t) \equiv E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots] \equiv E\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]$

$Q(s,a) \equiv E[r(s,a) + \gamma V^*(\delta(s,a))]$
Machine Learning RL 18
Nondeterministic Case (2)
Q learning generalizes to nondeterministic worlds
Alter training rule to

$\hat{Q}_n(s,a) \leftarrow (1-\alpha_n)\,\hat{Q}_{n-1}(s,a) + \alpha_n\,[r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a')]$

where

$\alpha_n = \dfrac{1}{1 + visits_n(s,a)}$

Can still prove convergence of $\hat{Q}$ to $Q$ [Watkins and Dayan, 1992]. Standard properties: $\sum_n \alpha_n = \infty$ and $\sum_n \alpha_n^2 < \infty$.
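A sketch of this altered update as a single function, assuming dict-based tables and the visit-count schedule from the slide; all names are illustrative:

```python
from collections import defaultdict

Q = defaultdict(float)     # Q_hat table
visits = defaultdict(int)  # visit count n per (s, a) pair

def q_update_nondeterministic(s, a, reward, s_next, actions, gamma=0.9):
    # alpha_n = 1 / (1 + visits_n(s, a)), decaying per state-action pair
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Q_hat_n <- (1 - alpha_n) Q_hat_{n-1} + alpha_n [r + gamma max_a' Q_hat_{n-1}(s', a')]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```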
Machine Learning RL 19
Temporal Difference Learning (1)
Q learning: reduce discrepancy between successive Q estimates

One step time difference:
$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)$

Why not two steps?
$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_a \hat{Q}(s_{t+2}, a)$

Or $n$?
$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)$

Blend all of these:
$Q^\lambda(s_t, a_t) \equiv (1-\lambda)\left[Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots\right]$
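A sketch of the n-step and blended returns (truncated to the finitely many lookahead depths actually supplied; all names are illustrative):

```python
def n_step_return(rewards, Q_hat, s_t_plus_n, actions, gamma=0.9):
    # Q^(n)(s_t, a_t) = r_t + gamma r_{t+1} + ... + gamma^(n-1) r_{t+n-1}
    #                   + gamma^n max_a Q_hat(s_{t+n}, a)
    n = len(rewards)  # rewards holds r_t ... r_{t+n-1}
    g = sum(gamma**i * r_i for i, r_i in enumerate(rewards))
    return g + gamma**n * max(Q_hat[(s_t_plus_n, a)] for a in actions)

def lambda_return(q_n_values, lam):
    # Q^lambda = (1 - lambda) [Q^(1) + lambda Q^(2) + lambda^2 Q^(3) + ...],
    # truncated here at the n-step returns actually supplied.
    return (1 - lam) * sum(lam**k * q for k, q in enumerate(q_n_values))
```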
Machine Learning RL 20
Temporal Difference Learning (2)
$Q^\lambda(s_t, a_t) \equiv (1-\lambda)\left[Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots\right]$

TD($\lambda$) algorithm uses above training rule
• Sometimes converges faster than Q learning
• Converges for learning $V^*$ for any $0 \le \lambda \le 1$ [Dayan, 1992]
• Tesauro's TD-Gammon uses this algorithm

Equivalent expression:
$Q^\lambda(s_t, a_t) = r_t + \gamma\left[(1-\lambda)\max_a \hat{Q}(s_{t+1}, a) + \lambda\, Q^\lambda(s_{t+1}, a_{t+1})\right]$
Machine Learning RL 21
Subtleties and Ongoing Research
• Replace $\hat{Q}$ table with neural net or other generalizer
• Handle case where state only partially observable (next?)
• Design optimal exploration strategies
• Extend to continuous action, state
• Learn and use $\hat{\delta} : S \times A \to S$
• Relationship to dynamic programming (next?)