Exploration in Reinforcement Learning

Exploration in Reinforcement Learning
Jeremy Wyatt, Intelligent Robotics Lab, School of Computer Science, University of Birmingham, UK
jlw@cs.bham.ac.uk
www.cs.bham.ac.uk/research/robotics
www.cs.bham.ac.uk/~jlw


Page 2: Exploration in Reinforcement Learning


The talk in one slide

• Optimal learning problems: how to act while learning how to act

• We’re going to look at this while learning from rewards

• Old heuristic: be optimistic in the face of uncertainty

• Our method: apply the principle of optimism directly to a Bayesian model of how the world works

Page 3: Exploration in Reinforcement Learning


Plan

• Reinforcement Learning

• How to act while learning from rewards

• An approximate algorithm

• Results

• Learning with structure

Page 4: Exploration in Reinforcement Learning

Reinforcement Learning (RL)

• Learning from punishments and rewards

• Agent moves through world, observing states and rewards

• Adapts its behaviour to maximise some function of reward

[Figure: a trajectory through the world: states s1, s2, s3, s4, s5, ..., s9, actions a1, a2, a3, a4, a5, ..., a9, and rewards r1 = +3, r4 = -1, r5 = -1, r9 = +50]
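The loop described above is easy to sketch in code. A minimal illustration, not code from the talk: a Gym-style `env` with `reset()` and `step()` methods and a `policy` function are assumed.

```python
def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: observe a state, act, collect the reward."""
    state = env.reset()                      # observe the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)               # behaviour: map state to action
        state, reward, done = env.step(action)  # world returns next state, reward
        total_reward += reward               # some function of reward to maximise
        if done:
            break
    return total_reward
```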

Page 5: Exploration in Reinforcement Learning

Reinforcement Learning (RL)

• Let's assume our agent acts according to some rules, called a policy, π

• The return Rt is a measure of long-term reward collected after time t

[Figure: the reward sequence from the previous slide: r1 = +3, r4 = -1, r5 = -1, r9 = +50]

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

$$R_0 = 3 - \gamma^3 - \gamma^4 + 50\gamma^8, \qquad 0 \le \gamma \le 1$$
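As a concrete check of the return formula, the example above can be computed directly; a small sketch, taking all rewards not shown in the figure to be zero.

```python
def discounted_return(rewards, gamma):
    """R_t = sum over k of gamma^k * r_{t+k+1}, computed here from t = 0."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# r1 = +3, r4 = -1, r5 = -1, r9 = +50, every other reward zero
rewards = [3, 0, 0, -1, -1, 0, 0, 0, 50]
print(discounted_return(rewards, gamma=0.9))  # 3 - 0.9**3 - 0.9**4 + 50*0.9**8 ≈ 23.14
```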

Page 6: Exploration in Reinforcement Learning

Reinforcement Learning (RL)

• Rt is a random variable

• So it has an expected value in a state under a given policy

• The RL problem is to find the optimal policy, one that maximises the expected value in every state

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

$$V^{\pi}(s) = E\{ R_t \mid s_t = s, \pi \} = E\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, \pi \right\}$$
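Because Rt is a random variable, V can be estimated by simply averaging sampled returns. A minimal Monte Carlo sketch; the `env.set_state` method for restarting the environment in a chosen state is hypothetical.

```python
def mc_value_estimate(env, policy, start_state, gamma, n_episodes=1000):
    """Estimate V^pi(s) = E{R_t | s_t = s, pi} by averaging sampled returns."""
    total = 0.0
    for _ in range(n_episodes):
        state = env.set_state(start_state)   # hypothetical reset-to-state method
        ret, discount, done = 0.0, 1.0, False
        while not done:
            state, reward, done = env.step(policy(state))
            ret += discount * reward         # accumulate gamma^k * r_{t+k+1}
            discount *= gamma
        total += ret
    return total / n_episodes
```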

Page 7: Exploration in Reinforcement Learning

Markov Decision Processes (MDPs)

• The transitions between states are uncertain

• The probabilities depend only on the current state

• Transition matrix P, and reward function R

[Figure: a two-state MDP, states 1 and 2, actions a1 and a2; r = 0 in state 1, r = 2 in state 2; arcs labelled with transition probabilities $p^1_{11}$, $p^1_{12}$, $p^2_{11}$, $p^2_{12}$]

$$p^a_{ij} = \Pr(s_{t+1} = j \mid s_t = i, a_t = a), \quad \text{e.g. } p^3_{12} = \Pr(s_{t+1} = 2 \mid s_t = 1, a_t = 3)$$

$$P = \begin{pmatrix} p^1_{11} & p^1_{12} \\ p^2_{11} & p^2_{12} \end{pmatrix}, \qquad R = \begin{pmatrix} 0 \\ 2 \end{pmatrix}$$
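A two-state, two-action MDP like this one can be written down directly. A sketch in numpy: the transition probabilities below are invented for illustration, only R matches the figure.

```python
import numpy as np

# P[a, i, j] = Pr(s_{t+1} = j | s_t = i, a_t = a); illustrative numbers only
P = np.array([
    [[0.7, 0.3],     # action a1: rows are current states 1 and 2
     [0.4, 0.6]],
    [[0.1, 0.9],     # action a2
     [0.5, 0.5]],
])
R = np.array([0.0, 2.0])   # r = 0 in state 1, r = 2 in state 2, as in the figure

assert np.allclose(P.sum(axis=2), 1.0)   # every row must be a distribution
```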

Page 8: Exploration in Reinforcement Learning

Bellman equations and bootstrapping

• Conditional independence allows us to define the expected return V* for the optimal policy in terms of a recurrence relation:

$$Q^*(i,a) = \sum_{j \in S} p^a_{ij} \left( R^a_{ij} + \gamma V^*(j) \right)$$

where

$$V^*(i) = \max_{a \in A(i)} Q^*(i,a)$$

• We can use the recurrence relation to bootstrap our estimate of V in two ways

[Figure: backup diagram, an action a from a state leading to successor states 3, 4 and 5]
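Iterating the recurrence gives value iteration, the DP form of bootstrapping on the next slide. A sketch reusing the toy `P` and `R` above; note it uses state-based rewards R[j] rather than the slide's transition rewards R^a_ij.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Bootstrap V via V(i) <- max_a sum_j P[a,i,j] * (R[j] + gamma * V(j))."""
    V = np.zeros(P.shape[1])
    while True:
        Q = np.einsum('aij,j->ia', P, R + gamma * V)   # Q[i, a]
        V_new = Q.max(axis=1)                          # greedy backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
```

Calling `value_iteration(P, R)` on the toy model above returns the optimal V and Q for that model.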

Page 9: Exploration in Reinforcement Learning

Two types of bootstrapping

• We can bootstrap using explicit knowledge of P and R (dynamic programming):

$$Q^*_{n+1}(i,a) = \sum_{j \in S} p^a_{ij} \left( R^a_{ij} + \gamma V^*_n(j) \right)$$

[Figure: backup diagram, an action a from a state leading to successor states 3, 4 and 5]

• Or we can bootstrap using samples from P and R (temporal difference learning):

$$\hat{Q}^*(s_t, a_t) \leftarrow \hat{Q}^*(s_t, a_t) + \alpha_t \left[ r_{t+1} + \gamma \max_{b \in A} \hat{Q}^*(s_{t+1}, b) - \hat{Q}^*(s_t, a_t) \right]$$

[Figure: a single sampled transition: state st, action at, reward rt+1, next state st+1]
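The sample-based backup is the familiar Q-learning update. A minimal tabular sketch; the numeric values of the learning rate and discount are placeholders.

```python
from collections import defaultdict

Q = defaultdict(float)   # tabular Q-hat, zero for unseen (state, action) pairs

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_b Q(s',b) - Q(s,a)]."""
    td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```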

Page 10: Exploration in Reinforcement Learning

Multi-agent RL: Learning to play football

• Learning to play in a team

• Too time consuming to do on real robots

• There is a well established simulator league

• We can learn effectively from reinforcement

Page 11: Exploration in Reinforcement Learning

Learning to play backgammon

• TD(λ) learning and a backprop net with one hidden layer

• 1,500,000 training games (self play)

• Equivalent in skill to the top dozen human players

• Backgammon has ~10^20 states, so can't be solved using DP

Page 12: Exploration in Reinforcement Learning


The Exploration problem: intuition

• We are learning to maximise performance

• But how should we act while learning?

• Trade-off: exploit what we know or explore to gain new information?

• Optimal learning: maximise performance while learning given your imperfect knowledge

Page 13: Exploration in Reinforcement Learning


The optimal learning problem

• If we knew P it would be easy

• However …

– We estimate P from observations

– P is a random variable

– There is a density f(P) over the space of possible MDPs

– What is it? How does it help us solve our problem?

$$Q^*(i,a) = \sum_{j \in S} p^a_{ij} \left( R^a_{ij} + \gamma \max_{b \in A} Q^*(j,b) \right)$$

Page 14: Exploration in Reinforcement Learning

A density over MDPs

• Suppose we've wandered around for a while

• We have a matrix M containing the transition counts

• The density over possible P depends on M: f(P | M) is a product of Dirichlet densities

[Figure: the two-state MDP again (r = 0 in state 1, r = 2 in state 2, actions a1 and a2), now annotated with transition counts $m^1_{12} = 4$, $m^1_{11} = 2$, $m^2_{12} = 2$, $m^2_{11} = 4$, and a density plot over $(p^1_{12}, p^2_{12})$ on $[0,1]^2$]

$$M = \begin{pmatrix} m^1_{11} & m^1_{12} \\ m^2_{11} & m^2_{12} \end{pmatrix}, \qquad P = \begin{pmatrix} p^1_{11} & p^1_{12} \\ p^2_{11} & p^2_{12} \end{pmatrix}$$
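Given the counts in M, each transition row has a simple Dirichlet posterior. A sketch in numpy; the uniform Dirichlet(1, 1) prior is an assumption, since the slide does not state one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Counts from the figure, keyed by (state, action), over next states (1, 2):
# action a1 from state 1 saw (m11, m12) = (2, 4); action a2 saw (4, 2)
M = {
    (0, 0): np.array([2, 4]),
    (0, 1): np.array([4, 2]),
}

def sample_row(counts, prior=1.0):
    """Draw one plausible transition row from its Dirichlet posterior."""
    return rng.dirichlet(counts + prior)

print(sample_row(M[(0, 0)]))   # a draw from Dirichlet(3, 5); mean (0.375, 0.625)
```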

Page 15: Exploration in Reinforcement Learning

A density over multinomials

[Figure: for action a1 in state 1, the count vector $\mathbf{m}^1 = (m^1_{11}, m^1_{12}) = (2, 4)$ and the corresponding transition row $\mathbf{p}^1 = (p^1_{11}, p^1_{12})$; the Dirichlet (here Beta) density $f(p^1_{12} \mid \mathbf{m}^1)$ is plotted over $[0,1]$, becoming more sharply peaked for the larger count vector $\mathbf{m}^1 = (16, 8)$]

Page 16: Exploration in Reinforcement Learning

Optimal learning formally stated

• Given f(P | M), find the policy that maximises

$$Q^*(i,a,\mathbf{M}) = \int_{\mathbf{P}} Q^*(i,a,\mathbf{P}) \, f(\mathbf{P} \mid \mathbf{M}) \, d\mathbf{P}$$

[Figure: the density over $(p^1_{12}, p^2_{12})$ on $[0,1]^2$, and the two-state MDP (r = 0 in state 1, r = 2 in state 2) with its transition probabilities]
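The integral has no closed form in general, but it can be approximated by sampling models from f(P | M), solving each one, and averaging. A sketch under the same assumptions as the earlier snippets (uniform prior; `value_iteration` and `sample_row` as defined above; M holds counts for every state-action pair).

```python
import numpy as np

def bayes_mean_q(M, R, n_states, n_actions, n_samples=100, gamma=0.9):
    """Monte Carlo estimate of Q*(i,a,M) = integral of Q*(i,a,P) f(P|M) dP."""
    Q_sum = np.zeros((n_states, n_actions))
    for _ in range(n_samples):
        P = np.array([[sample_row(M[(i, a)]) for i in range(n_states)]
                      for a in range(n_actions)])   # one sampled model
        _, Q = value_iteration(P, R, gamma)         # solve the sampled MDP
        Q_sum += Q
    return Q_sum / n_samples
```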

Page 17: Exploration in Reinforcement Learning

Transforming the problem

• When we evaluate the integral we get another MDP!

• This one is defined over the space of information states

• This space grows exponentially in the depth of the look-ahead

$$Q^*(i,a,\mathbf{M}) = \sum_{j} p^a_{ij}(\mathbf{M}) \left( R^a_{ij} + \gamma V^*(j, T^a_{ij}(\mathbf{M})) \right)$$

[Figure: look-ahead tree over information states: from $\langle 1, \mathbf{M} \rangle$, action a1 leads to $\langle 2, T^1_{12}(\mathbf{M}) \rangle$, action a2 to $\langle 3, T^2_{13}(\mathbf{M}) \rangle$, and so on]
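The information-state transition T is just a count update: observing a jump from i to j under a increments one cell of M, and every possible successor j yields a different child, which is why the look-ahead tree grows exponentially. A small sketch using the dict-of-counts representation from earlier.

```python
def T(M, i, a, j):
    """Return the next information state: a copy of M with m^a_ij incremented."""
    M_next = {key: counts.copy() for key, counts in M.items()}
    M_next[(i, a)][j] += 1    # one more observed i -> j transition under a
    return M_next
```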

Page 18: Exploration in Reinforcement Learning

A heuristic: Optimistic Model Selection

• Solving the information state space MDP is intractable

• An old heuristic is to be optimistic in the face of uncertainty

• So here we pick an optimistic P

• Find V* for that P only

• How do we pick P optimistically?

Page 19: Exploration in Reinforcement Learning

Optimistic Model Selection

• Do some DP-style bootstrapping to improve the estimated V (one possible reading is sketched below)
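The detail of this slide did not survive extraction, so the following is only one plausible reading of "pick an optimistic P": tilt the posterior-mean transition row toward whichever successor currently looks best, then solve that single model. An illustrative sketch, not necessarily the talk's exact rule; `eta` and the uniform prior are assumptions.

```python
import numpy as np

def optimistic_row(counts, V, eta=2.0, prior=1.0):
    """Posterior-mean transition row, with eta pseudo-counts granted to the
    successor of highest estimated value (illustrative optimism only)."""
    alpha = counts.astype(float) + prior
    alpha[np.argmax(V)] += eta    # optimism in the face of uncertainty
    return alpha / alpha.sum()
```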

Page 20: Exploration in Reinforcement Learning


Experimental results

Page 21: Exploration in Reinforcement Learning


Bayesian view: performance while learning

Page 22: Exploration in Reinforcement Learning


Bayesian view: policy quality

Page 23: Exploration in Reinforcement Learning

Do we really care?

• Why solve for MDPs? While challenging, they are too simple to be useful

• Structured representations are more powerful

Page 24: Exploration in Reinforcement Learning

Model-based RL: structured models

• Transition model P is represented compactly using a Dynamic Bayes Net, or factored MDP (see the sketch below)

• V is represented as a tree

• Backups look like goal regression operators

• Converging with the AI planning community
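In a factored model, each next-state variable gets its own conditional probability table over a few parent variables instead of one flat P. A toy sketch; the variable, its parents, and the numbers are invented for illustration.

```python
# DBN-style transition model: one CPT per state variable (toy example)
dbn = {
    'has_coffee': {
        'parents': ('has_coffee', 'action'),
        'cpt': {   # Pr(has_coffee is True at t+1 | parent values at t)
            (True,  'drink'): 0.0,   # drinking uses the coffee up
            (True,  'wait'):  1.0,
            (True,  'buy'):   1.0,
            (False, 'drink'): 0.0,
            (False, 'wait'):  0.0,
            (False, 'buy'):   0.9,   # buying usually succeeds
        },
    },
}

def prob_true(dbn, var, parent_values):
    """P(var = True at t+1) given the values of its parents at time t."""
    return dbn[var]['cpt'][tuple(parent_values)]
```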

Page 25: Exploration in Reinforcement Learning


Structured Exploration: results

Page 26: Exploration in Reinforcement Learning


Challenge: Learning with Hidden State

• Learning in a POMDP, or k-Markov environment

• Planning in POMDPs is intractable

• Factored POMDPs look promising

• POMDPs are the basis of the state of the art in mobile robotics

[Figure: POMDP as a dynamic Bayes net: hidden states st, st+1, st+2; actions at, at+1, at+2; observations ot, ot+1, ot+2; rewards rt+1, rt+2]

Page 27: Exploration in Reinforcement Learning

Wrap up

• RL is a class of problems

• We can pose some optimal learning problems elegantly in this framework

• Can't be perfect, but we can do alright

• BUT: probabilistic representations, while very useful in many fields, are a frequent source of intractability

• General probabilistic representations are best avoided

• How?

Page 28: Exploration in Reinforcement Learning


Cumulative Discounted Return

Page 29: Exploration in Reinforcement Learning


Cumulative Discounted Return

Page 30: Exploration in Reinforcement Learning


Cumulative Discounted Return

Page 31: Exploration in Reinforcement Learning


Policy Quality

Page 32: Exploration in Reinforcement Learning


Policy Quality

Page 33: Exploration in Reinforcement Learning


Policy Quality