Machine Learning RL 1: Reinforcement Learning. Michael L. Littman. Slides from http://www.cs.vu.nl/~elena/ml_13light.ppt which appear to have been adapted from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/l20.ps


Page 1

Machine Learning RL 1

Reinforcement Learning

Michael L. Littman

Slides from

http://www.cs.vu.nl/~elena/ml_13light.ppt

which appear to have been adapted from

http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/l20.ps

Page 2

Machine Learning RL 2

Reinforcement Learning

[Read Ch. 13][Exercise 13.2]

• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence

Page 3

Machine Learning RL 3

Control Learning

Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon

Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that state only partially observable
• Possible need to learn multiple tasks with same sensors/effectors

Page 4

Machine Learning RL 4

One Example: TD-Gammon

[Tesauro, 1995]

Learn to play Backgammon.

Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states

Trained by playing 1.5 million games against itself.

Now approximately equal to best human player.

Page 5

Machine Learning RL 5

Reinforcement Learning Problem

Goal: learn to choose actions that maximize

$r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$
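
As a concrete illustration (not from the original slides), here is a minimal Python sketch of this discounted sum for a finite reward sequence; the reward list and discount value are made-up placeholders:

```python
# Minimal sketch: the discounted return r_0 + gamma*r_1 + gamma^2*r_2 + ...
# for a finite list of rewards, with 0 <= gamma < 1.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_i over the given rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(discounted_return([0, 100], gamma=0.9))  # 0 + 0.9*100 = 90.0
```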

Page 6

Machine Learning RL 6

Markov Decision Processes

Assume
• finite set of states S
• set of actions A
• at each discrete time, agent observes state $s_t \in S$ and chooses action $a_t \in A$
• then receives immediate reward $r_t$
• and state changes to $s_{t+1}$
• Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
  – i.e., $r_t$ and $s_{t+1}$ depend only on current state and action
  – functions $\delta$ and $r$ may be nondeterministic
  – functions $\delta$ and $r$ not necessarily known to agent
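
To make the notation concrete, here is a minimal Python sketch (the two-state chain, its rewards, and the function names delta and r are illustrative assumptions, not from the slides) of a deterministic MDP given by a transition function $\delta(s,a)$ and a reward function $r(s,a)$:

```python
# Minimal sketch of a deterministic MDP: delta(s, a) gives the next state,
# r(s, a) the immediate reward. The two-state chain below is made up.

STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

def delta(s, a):
    """Next state: 'go' moves s0 -> s1; every other choice stays put."""
    return "s1" if (s == "s0" and a == "go") else s

def r(s, a):
    """Immediate reward: +100 for taking 'go' from s0, else 0."""
    return 100 if (s == "s0" and a == "go") else 0

s, a = "s0", "go"
print(delta(s, a), r(s, a))  # s1 100
```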

Page 7

Machine Learning RL 7

Agent’s Learning Task

Execute actions in environment, observe results, and
• learn action policy $\pi : S \to A$ that maximizes
  $E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots]$
  from any starting state in S
• here $0 \le \gamma < 1$ is the discount factor for future rewards (sometimes makes sense with $\gamma = 1$)

Note something new:
• Target function is $\pi : S \to A$
• But, we have no training examples of form $\langle s, a \rangle$
• Training examples are of form $\langle \langle s, a \rangle, r \rangle$

Page 8

Machine Learning RL 8

Value Function

To begin, consider deterministic worlds...

For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states

$V^{\pi}(s_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$

where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s_t$

Restated, the task is to learn the optimal policy $\pi^*$:

$\pi^* \equiv \arg\max_{\pi} V^{\pi}(s), \ (\forall s)$
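
A minimal Python sketch (not from the slides) of this definition: estimate $V^{\pi}(s)$ in a deterministic world by following a fixed policy $\pi$ and summing discounted rewards. The toy transition and reward functions are the same made-up two-state chain as above, and the finite horizon is an approximation of the infinite sum:

```python
# Minimal sketch: V^pi(s) as the (truncated) discounted sum of rewards obtained
# by following policy pi from state s in a deterministic world.

GAMMA = 0.9

def delta(s, a):                      # made-up deterministic transitions
    return "s1" if (s == "s0" and a == "go") else s

def r(s, a):                          # made-up immediate rewards
    return 100 if (s == "s0" and a == "go") else 0

def V_pi(s, pi, horizon=50):
    """r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... truncated after `horizon` steps."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = pi[s]
        total += discount * r(s, a)
        discount *= GAMMA
        s = delta(s, a)
    return total

pi = {"s0": "go", "s1": "stay"}
print(V_pi("s0", pi))  # 100.0: all reward is collected on the first step
```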

Page 9

Machine Learning RL 9

Page 10

Machine Learning RL 10

What to Learn

We might try to have agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$)

It could then do a look-ahead search to choose the best action from any state s because

$\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))]$

A problem:
• This can work if agent knows $\delta : S \times A \to S$, and $r : S \times A \to \Re$
• But when it doesn't, it can't choose actions this way
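
For contrast, a minimal Python sketch (not from the slides) of this one-step look-ahead: it only works because the toy $\delta$, $r$, and $V^*$ values (all made up here) are assumed to be known:

```python
# Minimal sketch: pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ],
# which presumes the agent knows delta, r, and V*.

GAMMA = 0.9
ACTIONS = ["stay", "go"]

def delta(s, a):
    return "s1" if (s == "s0" and a == "go") else s

def r(s, a):
    return 100 if (s == "s0" and a == "go") else 0

V_star = {"s0": 100.0, "s1": 0.0}     # assumed-known optimal values for the toy chain

def lookahead_policy(s):
    """Choose the action maximizing immediate reward plus discounted V* of the successor."""
    return max(ACTIONS, key=lambda a: r(s, a) + GAMMA * V_star[delta(s, a)])

print(lookahead_policy("s0"))  # 'go': 100 + 0.9*0 beats 0 + 0.9*100
```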

Page 11

Machine Learning RL 11

Q Function

Define a new function very similar to $V^*$:

$Q(s,a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$

If agent learns Q, it can choose optimal action even without knowing $\delta$!

$\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))]$

$\pi^*(s) = \arg\max_a Q(s,a)$

Q is the evaluation function the agent will learn [Watkins 1989].
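
A minimal Python sketch (not from the slides) of this point: with a learned Q table (values made up here), the greedy action $\arg\max_a Q(s,a)$ can be read off directly, with no transition model $\delta$ in sight:

```python
# Minimal sketch: pi*(s) = argmax_a Q(s, a), computed straight from a Q table.

ACTIONS = ["stay", "go"]
Q = {                                  # made-up learned values
    ("s0", "stay"): 90.0,
    ("s0", "go"):   100.0,
    ("s1", "stay"): 0.0,
    ("s1", "go"):   0.0,
}

def pi_star(s):
    """Greedy policy derived from the Q table."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])

print(pi_star("s0"))  # go
```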

Page 12

Machine Learning RL 12

Training Rule to Learn Q

Note Q and $V^*$ are closely related:

$V^*(s) = \max_{a'} Q(s, a')$

This allows us to write Q recursively as

$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$

Nice! Let $\hat{Q}$ denote the learner's current approximation to Q. Consider training rule

$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$

where s’ is the state resulting from applying action a in state s.

Page 13

Machine Learning RL 13

Q Learning for Deterministic Worlds

For each s, a initialize table entry $\hat{Q}(s,a) \leftarrow 0$

Observe current state s

Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for $\hat{Q}(s,a)$ as follows:
  $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
• $s \leftarrow s'$
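
A minimal Python sketch (not from the slides) of this loop on the made-up two-state chain used above; action selection here is uniformly random and the episode is restarted from s0 once the rewarding transition has been taken, both of which are assumptions the slide leaves open:

```python
# Minimal sketch of deterministic Q-learning: table initialized to 0, then
# repeatedly act, observe (r, s'), and apply
#   Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a').

import random

GAMMA = 0.9
STATES, ACTIONS = ["s0", "s1"], ["stay", "go"]

def delta(s, a):                       # environment dynamics, unknown to the learner
    return "s1" if (s == "s0" and a == "go") else s

def r(s, a):
    return 100 if (s == "s0" and a == "go") else 0

Q_hat = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # initialize all entries to 0

s = "s0"
for _ in range(1000):
    a = random.choice(ACTIONS)                     # select an action and execute it
    reward, s_next = r(s, a), delta(s, a)          # receive r, observe s'
    Q_hat[(s, a)] = reward + GAMMA * max(Q_hat[(s_next, b)] for b in ACTIONS)
    s = s_next if s_next != "s1" else "s0"         # restart from s0 (assumption)

print(Q_hat[("s0", "go")], Q_hat[("s0", "stay")])  # settles at 100.0 and 90.0
```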

Page 14

Machine Learning RL 14

Updating Q

$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$

Notice if rewards non-negative, then

$(\forall s, a, n) \ \hat{Q}_{n+1}(s,a) \ge \hat{Q}_n(s,a)$, and

$(\forall s, a, n) \ 0 \le \hat{Q}_n(s,a) \le Q(s,a)$
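
The arithmetic of this update can be checked in one line of Python (the numbers are exactly those in the slide's update):

```python
# r = 0, gamma = 0.9, successor Q-hat values {63, 81, 100}  ->  updated entry 90
r, gamma = 0, 0.9
print(r + gamma * max(63, 81, 100))  # 90.0
```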

Page 15

Machine Learning RL 15

Convergence Theorem

Theorem: $\hat{Q}$ converges to Q. Consider case of deterministic world, with bounded immediate rewards, where each $\langle s, a \rangle$ is visited infinitely often.

Proof: Define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. During each full interval the largest error in the $\hat{Q}$ table is reduced by factor of $\gamma$.

Let $\hat{Q}_n$ be the table after n updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is

$\Delta_n = \max_{s,a} |\hat{Q}_n(s,a) - Q(s,a)|$

For any table entry $\hat{Q}_n(s,a)$ updated on iteration n+1, the error in the revised estimate $\hat{Q}_{n+1}(s,a)$ is

$|\hat{Q}_{n+1}(s,a) - Q(s,a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s',a')) - (r + \gamma \max_{a'} Q(s',a'))|$
$= \gamma \, |\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')|$

Page 16

Machine Learning RL 16

$|\hat{Q}_{n+1}(s,a) - Q(s,a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s',a')) - (r + \gamma \max_{a'} Q(s',a'))|$
$= \gamma \, |\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')|$
$\le \gamma \, \max_{a'} |\hat{Q}_n(s',a') - Q(s',a')|$
$\le \gamma \, \max_{s'',a'} |\hat{Q}_n(s'',a') - Q(s'',a')|$
$= \gamma \, \Delta_n$

Note we used general fact that:

$|\max_a f_1(a) - \max_a f_2(a)| \le \max_a |f_1(a) - f_2(a)|$

This works with things other than max that satisfy this non-expansion property [Szepesvári & Littman, 1999].
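
A quick numeric spot-check of this non-expansion fact (the two value functions below are made up):

```python
# Check |max_a f1(a) - max_a f2(a)| <= max_a |f1(a) - f2(a)| on arbitrary values.
f1 = {"a1": 3.0, "a2": 7.0, "a3": 5.0}
f2 = {"a1": 6.0, "a2": 4.0, "a3": 5.5}

lhs = abs(max(f1.values()) - max(f2.values()))        # |7 - 6| = 1.0
rhs = max(abs(f1[a] - f2[a]) for a in f1)             # max{3, 3, 0.5} = 3.0
print(lhs, rhs, lhs <= rhs)                           # 1.0 3.0 True
```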

Page 17

Machine Learning RL 17

Non-deterministic Case (1)

What if reward and next state are non-deterministic?

We redefine V and Q by taking expected values.

$V^{\pi}(s_t) \equiv E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots] \equiv E\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]$

$Q(s,a) \equiv E[r(s,a) + \gamma V^*(\delta(s,a))]$

Page 18

Machine Learning RL 18

Nondeterministic Case (2)

Q learning generalizes to nondeterministic worlds

Alter training rule to

$\hat{Q}_n(s,a) \leftarrow (1 - \alpha_n)\hat{Q}_{n-1}(s,a) + \alpha_n [r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a')]$

where

$\alpha_n = \dfrac{1}{1 + visits_n(s,a)}$

Can still prove convergence of $\hat{Q}$ to Q [Watkins and Dayan, 1992]. Standard properties: $\sum_n \alpha_n = \infty$, $\sum_n \alpha_n^2 < \infty$.
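
A minimal Python sketch (not from the slides) of this decaying-learning-rate update; the states, actions, reward samples, and the helper name q_update are illustrative assumptions:

```python
# Minimal sketch:
#   Q_hat(s,a) <- (1 - alpha_n) * Q_hat(s,a) + alpha_n * [r + gamma * max_a' Q_hat(s',a')],
# with alpha_n = 1 / (1 + visits_n(s,a)).

from collections import defaultdict

GAMMA = 0.9
ACTIONS = ["stay", "go"]
Q_hat = defaultdict(float)             # table entries default to 0
visits = defaultdict(int)              # visit counts per (s, a)

def q_update(s, a, reward, s_next):
    """One nondeterministic-world Q-learning update with a decaying learning rate."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + GAMMA * max(Q_hat[(s_next, b)] for b in ACTIONS)
    Q_hat[(s, a)] = (1 - alpha) * Q_hat[(s, a)] + alpha * target

q_update("s0", "go", 100, "s1")        # one noisy sample: reward 100
q_update("s0", "go", 0, "s1")          # another noisy sample: reward 0
print(round(Q_hat[("s0", "go")], 2))   # 33.33: noisy targets are averaged in gradually
```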

Page 19

Machine Learning RL 19

Temporal Difference Learning (1)

Q learning: reduce discrepancy between successive Q estimates

One step time difference:

$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)$

Why not two steps?

$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_a \hat{Q}(s_{t+2}, a)$

Or n?

$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)$

Blend all of these:

$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda)\left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]$
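
A minimal Python sketch (not from the slides) of these n-step estimates and their $\lambda$-blend for one short, made-up trajectory; the reward list, the successor max-$\hat{Q}$ values, and the truncation after three terms are all illustrative assumptions:

```python
# Minimal sketch: Q^(n) = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
#                         + gamma^n * max_a Q_hat(s_{t+n}, a),
# blended as Q^lambda = (1 - lambda) * sum_n lambda^(n-1) * Q^(n).

GAMMA, LAM = 0.9, 0.5
rewards = [0, 0, 100]                    # r_t, r_{t+1}, r_{t+2}  (made up)
max_Q_next = [81.0, 90.0, 0.0]           # max_a Q_hat(s_{t+n}, a) for n = 1, 2, 3  (made up)

def q_n(n):
    """n-step lookahead estimate Q^(n)(s_t, a_t) for this trajectory."""
    discounted_rewards = sum((GAMMA ** i) * rewards[i] for i in range(n))
    return discounted_rewards + (GAMMA ** n) * max_Q_next[n - 1]

# Blend truncated after three terms, so the geometric weights do not quite sum to 1.
q_lambda = (1 - LAM) * sum((LAM ** (n - 1)) * q_n(n) for n in range(1, 4))
print(q_n(1), q_n(2), q_n(3), q_lambda)  # ~72.9  ~72.9  ~81.0  ~64.8
```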

Page 20

Machine Learning RL 20

Temporal Difference Learning (2)

$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda)\left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]$

TD($\lambda$) algorithm uses above training rule
• Sometimes converges faster than Q learning
• Converges for learning $V^*$ for any $0 \le \lambda \le 1$ [Dayan, 1992]
• Tesauro's TD-Gammon uses this algorithm

Equivalent expression:

$Q^{\lambda}(s_t, a_t) = r_t + \gamma \left[ (1 - \lambda) \max_a \hat{Q}(s_{t+1}, a) + \lambda Q^{\lambda}(s_{t+1}, a_{t+1}) \right]$

Page 21

Machine Learning RL 21

Subtleties and Ongoing Research

• Replace $\hat{Q}$ table with neural net or other generalizer
• Handle case where state only partially observable (next?)
• Design optimal exploration strategies
• Extend to continuous action, state
• Learn and use $\hat{\delta} : S \times A \to S$
• Relationship to dynamic programming (next?)