Machine Learning RL 1
Reinforcement Learning
Michael L. Littman
Slides from
http://www.cs.vu.nl/~elena/ml_13light.ppt
which appear to have been adapted from
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/l20.ps
Machine Learning RL 2
Reinforcement Learning
[Read Ch. 13] [Exercise 13.2]
• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence
Machine Learning RL 3
Control Learning
Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon
Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that state only partially observable
• Possible need to learn multiple tasks with same sensors/effectors
Machine Learning RL 4
One Example: TD-Gammon
[Tesauro, 1995]
Learn to play Backgammon.
Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself.
Now approximately equal to best human player.
Machine Learning RL 5
Reinforcement Learning Problem
Goal: learn to choose actions that maximize
$r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$
Machine Learning RL 6
Markov Decision Processes
Assume
• finite set of states $S$
• set of actions $A$
• at each discrete time agent observes state $s_t \in S$ and chooses action $a_t \in A$
• then receives immediate reward $r_t$
• and state changes to $s_{t+1}$
• Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
  – i.e., $r_t$ and $s_{t+1}$ depend only on current state and action
  – functions $\delta$ and $r$ may be nondeterministic
  – functions $\delta$ and $r$ not necessarily known to agent
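To make the formalism concrete, here is a minimal Python sketch of a deterministic MDP; the three-state chain and the names delta and r are illustrative assumptions, not part of the slides.

```python
# A made-up three-state deterministic MDP illustrating the formalism above.
S = [0, 1, 2]          # finite set of states S
A = ["left", "right"]  # set of actions A

def delta(s, a):
    """Deterministic transition function: s_{t+1} = delta(s_t, a_t)."""
    return min(s + 1, 2) if a == "right" else max(s - 1, 0)

def r(s, a):
    """Deterministic reward function r(s_t, a_t): +100 for entering state 2."""
    return 100 if s != 2 and delta(s, a) == 2 else 0

# One step of the agent-environment loop: observe s_t, choose a_t,
# receive r_t, and move to s_{t+1}.
s, a = 0, "right"
reward, s_next = r(s, a), delta(s, a)
```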
Machine Learning RL 7
Agent’s Learning Task
Execute actions in environment, observe results, and
• learn action policy $\pi : S \to A$ that maximizes
$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots]$
from any starting state in $S$
• here $0 \le \gamma < 1$ is the discount factor for future rewards (sometimes makes sense with $\gamma = 1$)
Note something new:
• Target function is $\pi : S \to A$
• But, we have no training examples of form $\langle s, a \rangle$
• Training examples are of form $\langle \langle s, a \rangle, r \rangle$
Machine Learning RL 8
Value Function
To begin, consider deterministic worlds...
For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states

$V^\pi(s_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$

where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s_t$

Restated, the task is to learn the optimal policy $\pi^*$:

$\pi^* \equiv \arg\max_\pi V^\pi(s),\ (\forall s)$
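As a sketch of this definition, the discounted sum can be approximated by rolling the policy forward for a finite horizon (reusing the delta and r from the MDP sketch above; the truncation horizon is an assumption, since the true $V^\pi$ is an infinite sum):

```python
def V_pi(pi, s, gamma=0.9, horizon=100):
    # Approximates V^pi(s) = sum_i gamma^i r_{t+i} by truncating the sum;
    # the truncation error is bounded by gamma**horizon times the max reward.
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = pi(s)
        total += discount * r(s, a)   # accumulate gamma^i * r_{t+i}
        discount *= gamma
        s = delta(s, a)
    return total

print(V_pi(lambda s: "right", 0))  # value of the always-right policy from state 0
```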
Machine Learning RL 9
Machine Learning RL 10
What to Learn
We might try to have agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$)
It could then do a look-ahead search to choose the best action from any state $s$ because

$\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))]$

A problem:
• This can work if agent knows $\delta : S \times A \to S$, and $r : S \times A \to \Re$
• But when it doesn't, it can't choose actions this way
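A one-function sketch of this look-ahead rule (again reusing delta and r; representing $V^*$ as a state-to-value dict is an assumption):

```python
def pi_star_lookahead(V_star, s, actions, gamma=0.9):
    # pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]
    # Only possible when the agent knows delta and r.
    return max(actions, key=lambda a: r(s, a) + gamma * V_star[delta(s, a)])
```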
Machine Learning RL 11
Q Function
Define a new function very similar to V*
$Q(s,a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$

If agent learns Q, it can choose optimal action even without knowing $\delta$!

$\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))] = \arg\max_a Q(s,a)$
Q is the evaluation function the agent will learn [Watkins 1989].
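By contrast, acting from a learned Q needs neither $\delta$ nor $r$ at decision time; a minimal sketch, assuming Q is stored as a dict keyed by (state, action) pairs:

```python
def pi_star_from_Q(Q, s, actions):
    # pi*(s) = argmax_a Q(s, a): the one-step look-ahead is folded into Q,
    # so no model of delta or r is consulted when choosing an action.
    return max(actions, key=lambda a: Q[(s, a)])
```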
Machine Learning RL 12
Training Rule to Learn Q
Note Q and V* are closely related:
$V^*(s) = \max_{a'} Q(s, a')$
This allows us to write Q recursively as
$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
Nice! Let $\hat{Q}$ denote the learner's current approximation to Q. Consider training rule

$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$

where $s'$ is the state resulting from applying action $a$ in state $s$.
Machine Learning RL 13
Q Learning for Deterministic Worlds
For each $s, a$ initialize table entry $\hat{Q}(s,a) \leftarrow 0$
Observe current state $s$
Do forever:
• Select an action $a$ and execute it
• Receive immediate reward $r$
• Observe the new state $s'$
• Update the table entry for $\hat{Q}(s,a)$ as follows:
  $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
• $s \leftarrow s'$
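A direct transcription of this algorithm into Python, as a sketch: the delta/r interface matches the earlier chain example, and uniformly random action selection stands in for the exploration strategy, which the algorithm leaves unspecified.

```python
import random
from collections import defaultdict

def q_learning(states, actions, delta, r, gamma=0.9,
               episodes=1000, steps_per_episode=20):
    Q = defaultdict(float)  # every table entry Q_hat(s, a) starts at 0
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(steps_per_episode):
            a = random.choice(actions)             # select an action and execute it
            reward, s_next = r(s, a), delta(s, a)  # receive r, observe s'
            # Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next                             # s <- s'
    return Q
```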
Machine Learning RL 14
Updating Q
$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a')$
$= 0 + 0.9 \max\{63, 81, 100\}$
$= 90$

Notice if rewards non-negative, then
$(\forall s, a, n)\ \hat{Q}_{n+1}(s,a) \ge \hat{Q}_n(s,a)$
and
$(\forall s, a, n)\ 0 \le \hat{Q}_n(s,a) \le Q(s,a)$
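The arithmetic of this worked example, checked directly:

```python
gamma = 0.9
r_immediate = 0
next_state_values = [63, 81, 100]  # Q_hat(s2, a') for the available actions
print(r_immediate + gamma * max(next_state_values))  # 90.0, as on the slide
```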
Machine Learning RL 15
Convergence Theorem

$\hat{Q}$ converges to $Q$. Consider case of deterministic world, with bounded immediate rewards, where each $\langle s, a \rangle$ is visited infinitely often.

Proof: Define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. During each full interval the largest error in the $\hat{Q}$ table is reduced by factor of $\gamma$.

Let $\hat{Q}_n$ be the table after $n$ updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is

$\Delta_n = \max_{s,a} |\hat{Q}_n(s,a) - Q(s,a)|$

For any table entry $\hat{Q}_n(s,a)$ updated on iteration $n+1$, the error in the revised estimate $\hat{Q}_{n+1}(s,a)$ is

$|\hat{Q}_{n+1}(s,a) - Q(s,a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s',a')) - (r + \gamma \max_{a'} Q(s',a'))|$
$= \gamma\,|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')|$
Machine Learning RL 16
$|\hat{Q}_{n+1}(s,a) - Q(s,a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s',a')) - (r + \gamma \max_{a'} Q(s',a'))|$
$= \gamma\,|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')|$
$\le \gamma\,\max_{a'} |\hat{Q}_n(s',a') - Q(s',a')|$
$\le \gamma\,\max_{s'',a'} |\hat{Q}_n(s'',a') - Q(s'',a')|$
$= \gamma\,\Delta_n$
Note we used general fact that:
$|\max_a f_1(a) - \max_a f_2(a)| \le \max_a |f_1(a) - f_2(a)|$

This works with things other than max that satisfy this non-expansion property [Szepesvári & Littman, 1999].
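A randomized spot-check of the non-expansion property (illustrative only, not part of the proof):

```python
import random

# |max_a f1(a) - max_a f2(a)| <= max_a |f1(a) - f2(a)| for arbitrary f1, f2.
for _ in range(10_000):
    f1 = [random.uniform(-10, 10) for _ in range(5)]
    f2 = [random.uniform(-10, 10) for _ in range(5)]
    lhs = abs(max(f1) - max(f2))
    rhs = max(abs(x - y) for x, y in zip(f1, f2))
    assert lhs <= rhs + 1e-12
```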
Machine Learning RL 17
Nondeterministic Case (1)
What if reward and next state are non-deterministic?
We redefine V and Q by taking expected values.
$V^\pi(s_t) \equiv E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots] \equiv E\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]$

$Q(s,a) \equiv E[r(s,a) + \gamma V^*(\delta(s,a))]$
Machine Learning RL 18
Nondeterministic Case (2)
Q learning generalizes to nondeterministic worlds
Alter training rule to

$\hat{Q}_n(s,a) \leftarrow (1-\alpha_n)\,\hat{Q}_{n-1}(s,a) + \alpha_n\,[r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a')]$

where

$\alpha_n = \dfrac{1}{1 + visits_n(s,a)}$

Can still prove convergence of $\hat{Q}$ to $Q$ [Watkins and Dayan, 1992]. Standard properties: $\sum_n \alpha_n = \infty$ and $\sum_n \alpha_n^2 < \infty$.
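A sketch of this altered update as a single function, assuming dict-based tables and the visit-count schedule from the slide; all names are illustrative:

```python
from collections import defaultdict

Q = defaultdict(float)     # Q_hat table
visits = defaultdict(int)  # visit count n per (s, a) pair

def q_update_nondeterministic(s, a, reward, s_next, actions, gamma=0.9):
    # alpha_n = 1 / (1 + visits_n(s, a)), decaying per state-action pair
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Q_hat_n <- (1 - alpha_n) Q_hat_{n-1} + alpha_n [r + gamma max_a' Q_hat_{n-1}(s', a')]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```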
Machine Learning RL 19
Temporal Difference Learning (1)
Q learning: reduce discrepancy between successive Q estimates

One step time difference:
$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)$

Why not two steps?
$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_a \hat{Q}(s_{t+2}, a)$

Or $n$?
$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)$

Blend all of these:
$Q^\lambda(s_t, a_t) \equiv (1-\lambda)\left[Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots\right]$
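A sketch of the n-step and blended returns (truncated to the finitely many lookahead depths actually supplied; all names are illustrative):

```python
def n_step_return(rewards, Q_hat, s_t_plus_n, actions, gamma=0.9):
    # Q^(n)(s_t, a_t) = r_t + gamma r_{t+1} + ... + gamma^(n-1) r_{t+n-1}
    #                   + gamma^n max_a Q_hat(s_{t+n}, a)
    n = len(rewards)  # rewards holds r_t ... r_{t+n-1}
    g = sum(gamma**i * r_i for i, r_i in enumerate(rewards))
    return g + gamma**n * max(Q_hat[(s_t_plus_n, a)] for a in actions)

def lambda_return(q_n_values, lam):
    # Q^lambda = (1 - lambda) [Q^(1) + lambda Q^(2) + lambda^2 Q^(3) + ...],
    # truncated here at the n-step returns actually supplied.
    return (1 - lam) * sum(lam**k * q for k, q in enumerate(q_n_values))
```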
Machine Learning RL 20
Temporal Difference Learning (2)
$Q^\lambda(s_t, a_t) \equiv (1-\lambda)\left[Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots\right]$

TD($\lambda$) algorithm uses above training rule
• Sometimes converges faster than Q learning
• Converges for learning $V^*$ for any $0 \le \lambda \le 1$ [Dayan, 1992]
• Tesauro's TD-Gammon uses this algorithm

Equivalent expression:
$Q^\lambda(s_t, a_t) = r_t + \gamma\left[(1-\lambda)\max_a \hat{Q}(s_{t+1}, a) + \lambda\, Q^\lambda(s_{t+1}, a_{t+1})\right]$
Machine Learning RL 21
Subtleties and Ongoing Research
• Replace $\hat{Q}$ table with neural net or other generalizer
• Handle case where state only partially observable (next?)
• Design optimal exploration strategies
• Extend to continuous action, state
• Learn and use $\hat{\delta} : S \times A \to S$
• Relationship to dynamic programming (next?)