Exploration in Reinforcement Learning

Exploration in Reinforcement Learning
Jeremy Wyatt, Intelligent Robotics Lab, School of Computer Science, University of Birmingham, UK
jlw@cs.bham.ac.uk
www.cs.bham.ac.uk/research/robotics
www.cs.bham.ac.uk/~jlw


Page 2: Exploration in Reinforcement Learning


The talk in one slide

• Optimal learning problems: how to act while learning how to act

• We’re going to look at this while learning from rewards

• Old heuristic: be optimistic in the face of uncertainty

• Our method: apply the principle of optimism directly to a Bayesian model of how the world works

Page 3: Exploration in Reinforcement Learning


Plan

• Reinforcement Learning

• How to act while learning from rewards

• An approximate algorithm

• Results

• Learning with structure

Page 4: Exploration in Reinforcement Learning

Reinforcement Learning (RL)

• Learning from punishments and rewards

• Agent moves through world, observing states and rewards

• Adapts its behaviour to maximise some function of reward

[Figure: a trajectory through the world: states s1, s2, s3, s4, s5, ..., s9, actions a1, a2, a3, a4, a5, ..., a9, and rewards r1 = +3, r4 = -1, r5 = -1, r9 = +50]
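The loop described above is easy to sketch in code. A minimal illustration, not code from the talk: a Gym-style `env` with `reset()` and `step()` methods and a `policy` function are assumed.

```python
def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: observe a state, act, collect the reward."""
    state = env.reset()                      # observe the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)               # behaviour: map state to action
        state, reward, done = env.step(action)  # world returns next state, reward
        total_reward += reward               # some function of reward to maximise
        if done:
            break
    return total_reward
```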

Page 5: Exploration in Reinforcement Learning

Reinforcement Learning (RL)

• Let's assume our agent acts according to some rules, called a policy, π

• The return Rt is a measure of long-term reward collected after time t

[Figure: the reward sequence from the previous slide: r1 = +3, r4 = -1, r5 = -1, r9 = +50]

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

$$R_0 = 3 - \gamma^3 - \gamma^4 + 50\gamma^8, \qquad 0 \le \gamma \le 1$$
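As a concrete check of the return formula, the example above can be computed directly; a small sketch, taking all rewards not shown in the figure to be zero.

```python
def discounted_return(rewards, gamma):
    """R_t = sum over k of gamma^k * r_{t+k+1}, computed here from t = 0."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# r1 = +3, r4 = -1, r5 = -1, r9 = +50, every other reward zero
rewards = [3, 0, 0, -1, -1, 0, 0, 0, 50]
print(discounted_return(rewards, gamma=0.9))  # 3 - 0.9**3 - 0.9**4 + 50*0.9**8 ≈ 23.14
```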

Page 6: Exploration in Reinforcement Learning

Reinforcement Learning (RL)

• Rt is a random variable

• So it has an expected value in a state under a given policy

• The RL problem is to find the optimal policy, one that maximises the expected value in every state

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

$$V^{\pi}(s) = E\{ R_t \mid s_t = s, \pi \} = E\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, \pi \right\}$$
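Because Rt is a random variable, V can be estimated by simply averaging sampled returns. A minimal Monte Carlo sketch; the `env.set_state` method for restarting the environment in a chosen state is hypothetical.

```python
def mc_value_estimate(env, policy, start_state, gamma, n_episodes=1000):
    """Estimate V^pi(s) = E{R_t | s_t = s, pi} by averaging sampled returns."""
    total = 0.0
    for _ in range(n_episodes):
        state = env.set_state(start_state)   # hypothetical reset-to-state method
        ret, discount, done = 0.0, 1.0, False
        while not done:
            state, reward, done = env.step(policy(state))
            ret += discount * reward         # accumulate gamma^k * r_{t+k+1}
            discount *= gamma
        total += ret
    return total / n_episodes
```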

Page 7: Exploration in Reinforcement Learning

Markov Decision Processes (MDPs)

• The transitions between states are uncertain

• The probabilities depend only on the current state

• Transition matrix P, and reward function R

[Figure: a two-state MDP, states 1 and 2, actions a1 and a2; r = 0 in state 1, r = 2 in state 2; arcs labelled with transition probabilities $p^1_{11}$, $p^1_{12}$, $p^2_{11}$, $p^2_{12}$]

$$p^a_{ij} = \Pr(s_{t+1} = j \mid s_t = i, a_t = a), \quad \text{e.g. } p^3_{12} = \Pr(s_{t+1} = 2 \mid s_t = 1, a_t = 3)$$

$$P = \begin{pmatrix} p^1_{11} & p^1_{12} \\ p^2_{11} & p^2_{12} \end{pmatrix}, \qquad R = \begin{pmatrix} 0 \\ 2 \end{pmatrix}$$
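A two-state, two-action MDP like this one can be written down directly. A sketch in numpy: the transition probabilities below are invented for illustration, only R matches the figure.

```python
import numpy as np

# P[a, i, j] = Pr(s_{t+1} = j | s_t = i, a_t = a); illustrative numbers only
P = np.array([
    [[0.7, 0.3],     # action a1: rows are current states 1 and 2
     [0.4, 0.6]],
    [[0.1, 0.9],     # action a2
     [0.5, 0.5]],
])
R = np.array([0.0, 2.0])   # r = 0 in state 1, r = 2 in state 2, as in the figure

assert np.allclose(P.sum(axis=2), 1.0)   # every row must be a distribution
```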

Page 8: Exploration in Reinforcement Learning

Bellman equations and bootstrapping

• Conditional independence allows us to define the expected return V* for the optimal policy in terms of a recurrence relation:

$$Q^*(i,a) = \sum_{j \in S} p^a_{ij} \left( R^a_{ij} + \gamma V^*(j) \right)$$

where

$$V^*(i) = \max_{a \in A(i)} Q^*(i,a)$$

• We can use the recurrence relation to bootstrap our estimate of V in two ways

[Figure: backup diagram, an action a from a state leading to successor states 3, 4 and 5]
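Iterating the recurrence gives value iteration, the DP form of bootstrapping on the next slide. A sketch reusing the toy `P` and `R` above; note it uses state-based rewards R[j] rather than the slide's transition rewards R^a_ij.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Bootstrap V via V(i) <- max_a sum_j P[a,i,j] * (R[j] + gamma * V(j))."""
    V = np.zeros(P.shape[1])
    while True:
        Q = np.einsum('aij,j->ia', P, R + gamma * V)   # Q[i, a]
        V_new = Q.max(axis=1)                          # greedy backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
```

Calling `value_iteration(P, R)` on the toy model above returns the optimal V and Q for that model.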

Page 9: Exploration in Reinforcement Learning

Two types of bootstrapping

• We can bootstrap using explicit knowledge of P and R (dynamic programming):

$$Q^*_{n+1}(i,a) = \sum_{j \in S} p^a_{ij} \left( R^a_{ij} + \gamma V^*_n(j) \right)$$

[Figure: backup diagram, an action a from a state leading to successor states 3, 4 and 5]

• Or we can bootstrap using samples from P and R (temporal difference learning):

$$\hat{Q}^*(s_t, a_t) \leftarrow \hat{Q}^*(s_t, a_t) + \alpha_t \left[ r_{t+1} + \gamma \max_{b \in A} \hat{Q}^*(s_{t+1}, b) - \hat{Q}^*(s_t, a_t) \right]$$

[Figure: a single sampled transition: state st, action at, reward rt+1, next state st+1]
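The sample-based backup is the familiar Q-learning update. A minimal tabular sketch; the numeric values of the learning rate and discount are placeholders.

```python
from collections import defaultdict

Q = defaultdict(float)   # tabular Q-hat, zero for unseen (state, action) pairs

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_b Q(s',b) - Q(s,a)]."""
    td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```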

Page 10: Exploration in Reinforcement Learning

Multi-agent RL: Learning to play football

• Learning to play in a team

• Too time consuming to do on real robots

• There is a well established simulator league

• We can learn effectively from reinforcement

Page 11: Exploration in Reinforcement Learning

Learning to play backgammon

• TD(λ) learning and a backprop net with one hidden layer

• 1,500,000 training games (self play)

• Equivalent in skill to the top dozen human players

• Backgammon has ~10^20 states, so can't be solved using DP

Page 12: Exploration in Reinforcement Learning


The Exploration problem: intuition

• We are learning to maximise performance

• But how should we act while learning?

• Trade-off: exploit what we know or explore to gain new information?

• Optimal learning: maximise performance while learning given your imperfect knowledge

Page 13: Exploration in Reinforcement Learning


The optimal learning problem

• If we knew P it would be easy

• However …

– We estimate P from observations

– P is a random variable

– There is a density f(P) over the space of possible MDPs

– What is it? How does it help us solve our problem?

$$Q^*(i,a) = \sum_{j \in S} p^a_{ij} \left( R^a_{ij} + \gamma \max_{b \in A} Q^*(j,b) \right)$$

Page 14: Exploration in Reinforcement Learning

A density over MDPs

• Suppose we've wandered around for a while

• We have a matrix M containing the transition counts

• The density over possible P depends on M: f(P | M) is a product of Dirichlet densities

[Figure: the two-state MDP again (r = 0 in state 1, r = 2 in state 2, actions a1 and a2), now annotated with transition counts $m^1_{12} = 4$, $m^1_{11} = 2$, $m^2_{12} = 2$, $m^2_{11} = 4$, and a density plot over $(p^1_{12}, p^2_{12})$ on $[0,1]^2$]

$$M = \begin{pmatrix} m^1_{11} & m^1_{12} \\ m^2_{11} & m^2_{12} \end{pmatrix}, \qquad P = \begin{pmatrix} p^1_{11} & p^1_{12} \\ p^2_{11} & p^2_{12} \end{pmatrix}$$
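Given the counts in M, each transition row has a simple Dirichlet posterior. A sketch in numpy; the uniform Dirichlet(1, 1) prior is an assumption, since the slide does not state one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Counts from the figure, keyed by (state, action), over next states (1, 2):
# action a1 from state 1 saw (m11, m12) = (2, 4); action a2 saw (4, 2)
M = {
    (0, 0): np.array([2, 4]),
    (0, 1): np.array([4, 2]),
}

def sample_row(counts, prior=1.0):
    """Draw one plausible transition row from its Dirichlet posterior."""
    return rng.dirichlet(counts + prior)

print(sample_row(M[(0, 0)]))   # a draw from Dirichlet(3, 5); mean (0.375, 0.625)
```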

Page 15: Exploration in Reinforcement Learning

A density over multinomials

[Figure: for action a1 in state 1, the count vector $\mathbf{m}^1 = (m^1_{11}, m^1_{12}) = (2, 4)$ and the corresponding transition row $\mathbf{p}^1 = (p^1_{11}, p^1_{12})$; the Dirichlet (here Beta) density $f(p^1_{12} \mid \mathbf{m}^1)$ is plotted over $[0,1]$, becoming more sharply peaked for the larger count vector $\mathbf{m}^1 = (16, 8)$]

Page 16: Exploration in Reinforcement Learning

Optimal learning formally stated

• Given f(P | M), find the policy that maximises

$$Q^*(i,a,\mathbf{M}) = \int_{\mathbf{P}} Q^*(i,a,\mathbf{P}) \, f(\mathbf{P} \mid \mathbf{M}) \, d\mathbf{P}$$

[Figure: the density over $(p^1_{12}, p^2_{12})$ on $[0,1]^2$, and the two-state MDP (r = 0 in state 1, r = 2 in state 2) with its transition probabilities]
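The integral has no closed form in general, but it can be approximated by sampling models from f(P | M), solving each one, and averaging. A sketch under the same assumptions as the earlier snippets (uniform prior; `value_iteration` and `sample_row` as defined above; M holds counts for every state-action pair).

```python
import numpy as np

def bayes_mean_q(M, R, n_states, n_actions, n_samples=100, gamma=0.9):
    """Monte Carlo estimate of Q*(i,a,M) = integral of Q*(i,a,P) f(P|M) dP."""
    Q_sum = np.zeros((n_states, n_actions))
    for _ in range(n_samples):
        P = np.array([[sample_row(M[(i, a)]) for i in range(n_states)]
                      for a in range(n_actions)])   # one sampled model
        _, Q = value_iteration(P, R, gamma)         # solve the sampled MDP
        Q_sum += Q
    return Q_sum / n_samples
```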

Page 17: Exploration in Reinforcement Learning

Transforming the problem

• When we evaluate the integral we get another MDP!

• This one is defined over the space of information states

• This space grows exponentially in the depth of the look-ahead

$$Q^*(i,a,\mathbf{M}) = \sum_{j} p^a_{ij}(\mathbf{M}) \left( R^a_{ij} + \gamma V^*(j, T^a_{ij}(\mathbf{M})) \right)$$

[Figure: look-ahead tree over information states: from $\langle 1, \mathbf{M} \rangle$, action a1 leads to $\langle 2, T^1_{12}(\mathbf{M}) \rangle$, action a2 to $\langle 3, T^2_{13}(\mathbf{M}) \rangle$, and so on]
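The information-state transition T is just a count update: observing a jump from i to j under a increments one cell of M, and every possible successor j yields a different child, which is why the look-ahead tree grows exponentially. A small sketch using the dict-of-counts representation from earlier.

```python
def T(M, i, a, j):
    """Return the next information state: a copy of M with m^a_ij incremented."""
    M_next = {key: counts.copy() for key, counts in M.items()}
    M_next[(i, a)][j] += 1    # one more observed i -> j transition under a
    return M_next
```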

Page 18: Exploration in Reinforcement Learning

A heuristic: Optimistic Model Selection

• Solving the information state space MDP is intractable

• An old heuristic is to be optimistic in the face of uncertainty

• So here we pick an optimistic P

• Find V* for that P only

• How do we pick P optimistically?

Page 19: Exploration in Reinforcement Learning

Optimistic Model Selection

• Do some DP-style bootstrapping to improve the estimated V (one possible reading is sketched below)
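The detail of this slide did not survive extraction, so the following is only one plausible reading of "pick an optimistic P": tilt the posterior-mean transition row toward whichever successor currently looks best, then solve that single model. An illustrative sketch, not necessarily the talk's exact rule; `eta` and the uniform prior are assumptions.

```python
import numpy as np

def optimistic_row(counts, V, eta=2.0, prior=1.0):
    """Posterior-mean transition row, with eta pseudo-counts granted to the
    successor of highest estimated value (illustrative optimism only)."""
    alpha = counts.astype(float) + prior
    alpha[np.argmax(V)] += eta    # optimism in the face of uncertainty
    return alpha / alpha.sum()
```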

Page 20: Exploration in Reinforcement Learning


Experimental results

Page 21: Exploration in Reinforcement Learning


Bayesian view: performance while learning

Page 22: Exploration in Reinforcement Learning


Bayesian view: policy quality

Page 23: Exploration in Reinforcement Learning

Do we really care?

• Why solve for MDPs? While challenging, they are too simple to be useful

• Structured representations are more powerful

Page 24: Exploration in Reinforcement Learning

Model-based RL: structured models

• Transition model P is represented compactly using a Dynamic Bayes Net, or factored MDP (see the sketch below)

• V is represented as a tree

• Backups look like goal regression operators

• Converging with the AI planning community
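In a factored model, each next-state variable gets its own conditional probability table over a few parent variables instead of one flat P. A toy sketch; the variable, its parents, and the numbers are invented for illustration.

```python
# DBN-style transition model: one CPT per state variable (toy example)
dbn = {
    'has_coffee': {
        'parents': ('has_coffee', 'action'),
        'cpt': {   # Pr(has_coffee is True at t+1 | parent values at t)
            (True,  'drink'): 0.0,   # drinking uses the coffee up
            (True,  'wait'):  1.0,
            (True,  'buy'):   1.0,
            (False, 'drink'): 0.0,
            (False, 'wait'):  0.0,
            (False, 'buy'):   0.9,   # buying usually succeeds
        },
    },
}

def prob_true(dbn, var, parent_values):
    """P(var = True at t+1) given the values of its parents at time t."""
    return dbn[var]['cpt'][tuple(parent_values)]
```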

Page 25: Exploration in Reinforcement Learning


Structured Exploration: results

Page 26: Exploration in Reinforcement Learning


Challenge: Learning with Hidden State

• Learning in a POMDP, or k-Markov environment

• Planning in POMDPs is intractable

• Factored POMDPs look promising

• POMDPs are the basis of the state of the art in mobile robotics

[Figure: POMDP as a dynamic Bayes net: hidden states st, st+1, st+2; actions at, at+1, at+2; observations ot, ot+1, ot+2; rewards rt+1, rt+2]

Page 27: Exploration in Reinforcement Learning

Wrap up

• RL is a class of problems

• We can pose some optimal learning problems elegantly in this framework

• Can't be perfect, but we can do alright

• BUT: probabilistic representations, while very useful in many fields, are a frequent source of intractability

• General probabilistic representations are best avoided

• How?

Page 28: Exploration in Reinforcement Learning


Cumulative Discounted Return

Page 29: Exploration in Reinforcement Learning


Cumulative Discounted Return

Page 30: Exploration in Reinforcement Learning


Cumulative Discounted Return

Page 31: Exploration in Reinforcement Learning


Policy Quality

Page 32: Exploration in Reinforcement Learning


Policy Quality

Page 33: Exploration in Reinforcement Learning


Policy Quality