
  • Deep Recurrent Q-Learning for Partially Observable MDPs

    Matthew Hausknecht and Peter Stone

    University of Texas at Austin

    November 13, 2015

  • Motivation

    Intelligent decision making is the heart of AI

    Desire agents capable of learning to act intelligently in diverse environments

    Reinforcement Learning provides a general learning framework

    RL + deep neural networks yields robust controllers that learn from pixels (DQN)

    DQN lacks mechanisms for handling partial observability

    Extend DQN to handle Partially Observable Markov Decision Processes (POMDPs)


  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

  • Markov Decision Process (MDP)

    [Diagram: agent–environment loop with state st, action at, reward rt]

    At each timestep the agent performs action at and receives reward rt and state st+1 from the environment

    The Markov property ensures that st+1 depends only on st and at

    Learning an optimal policy π∗ requires no memory of past states
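The slides contain no code; as a hedged illustration of why learning in an MDP needs only the current state, here is a minimal tabular Q-learning sketch. The `env.reset()`/`env.step(action)` interface (returning next state, reward, done) is an assumption in the style of Gym-like APIs, not part of the original presentation.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions,
                       episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal Q-learning sketch: each update uses only (s, a, r, s'),
    relying on the Markov property -- no history of past states is kept."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection from the current state only
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # one-step temporal-difference update toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```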


  • Partially Observable Markov Decision Process (POMDP)

    [Diagram: agent–environment loop with observation ot, action at, reward rt]

    True state of the environment is hidden. Observations ot provide only partial information.

    Memory of past observations may help understand the true system state and improve the policy


  • Atari Domain

    [Diagram: action, observation, score]

    160 × 210 state space → 84 × 84 grayscale

    18 discrete actions

    Rewards clipped ∈ {−1, 0, 1}

    Source: www.arcadelearningenvironment.org
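A hedged sketch of the preprocessing listed above (grayscale conversion, downsampling to 84 × 84, reward clipping), assuming OpenCV and NumPy; the exact crop/resize pipeline used in the original DQN code may differ.

```python
import cv2
import numpy as np

def preprocess_frame(frame_rgb):
    """Convert an RGB Atari frame (210x160x3 array) to an 84x84 grayscale image in [0, 1]."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0

def clip_reward(reward):
    """Clip rewards to {-1, 0, 1} so learning rates transfer across games."""
    return float(np.sign(reward))
```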

  • Atari Domain: MDP or POMDP?

    [Diagram: action, observation, score]

    Depends on the state representation!

    • Single Frame ⇒ POMDP
    • Four Frames ⇒ MDP
    • Console RAM ⇒ MDP
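To illustrate why four stacked frames make most games approximately Markov while a single frame does not, here is a small frame-stacking sketch (a hypothetical helper of mine, not from the authors' code): it keeps a rolling window of the last four preprocessed frames as the agent's state, so quantities like ball velocity become recoverable from frame differences.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the last k preprocessed frames and exposes them as one (k, 84, 84) state."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        self.frames.clear()
        for _ in range(self.k):          # pad with the first frame at episode start
            self.frames.append(first_frame)
        return np.stack(self.frames)

    def step(self, new_frame):
        self.frames.append(new_frame)    # motion is recoverable from differences across the stack
        return np.stack(self.frames)
```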


  • Deep Q-Network (DQN)

    [Architecture diagram: 84 × 84 × 4 input → Conv1 → Conv2 → Conv3 → IP1 (512) → Q-Values (18)]

    Model-free Reinforcement Learning method using a deep neural network as Q-Value function approximator (Mnih et al., 2015)

    Takes the last four game screens as input: enough to make most Atari games Markov

    How well does DQN perform in partially observed domains?
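A minimal PyTorch sketch of a DQN-style network matching the diagram above (three convolutional layers, a 512-unit fully connected layer in place of IP1, and 18 Q-values). The kernel sizes and strides follow the architecture published by Mnih et al. (2015); the framework choice and exact code are my own, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # Conv1
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # Conv2
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # Conv3
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # fully connected layer (IP1)
            nn.Linear(512, n_actions),              # Q-values, one per action
        )

    def forward(self, x):            # x: (batch, 4, 84, 84)
        return self.head(self.conv(x))
```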


  • Flickering Atari

    [Diagram: observation ot, action at, reward rt]

    Induce partial observability by stochastically obscuring the game screen:

    ot = st               with p = 1/2
         ⟨0, . . . , 0⟩   otherwise

    Game state must now be inferred from past observations
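The flickering rule above is simple to implement. Here is a hedged sketch of an observation wrapper (the Gym-like `env` interface is my assumption, not the authors' actual code) that returns the true screen with probability 1/2 and an all-zero frame otherwise.

```python
import numpy as np

class FlickeringWrapper:
    """With probability p the true frame is shown; otherwise an all-zero frame is returned."""
    def __init__(self, env, p=0.5, seed=0):
        self.env = env
        self.p = p
        self.rng = np.random.default_rng(seed)

    def _obscure(self, frame):
        return frame if self.rng.random() < self.p else np.zeros_like(frame)

    def reset(self):
        return self._obscure(self.env.reset())

    def step(self, action):
        frame, reward, done = self.env.step(action)
        return self._obscure(frame), reward, done
```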


  • DQN Pong

    [Video: true game screen vs. perceived game screen]

  • DQN Flickering Pong

    [Video: true game screen vs. perceived game screen]

  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

  • Deep Recurrent Q-Network

    [Architecture diagram: single 84 × 84 frame per timestep → Conv layers → LSTM (512) → Q-Values (18), unrolled over timesteps . . . , t − 1, t]

    Long Short-Term Memory (Hochreiter and Schmidhuber, 1997)

    Identical to DQN except:

    • Replaces DQN's IP1 with a recurrent LSTM layer of the same dimension
    • Each timestep takes a single frame as input

    LSTM provides a selective memory of past game states

    Trained end-to-end using BPTT: unrolled for the last 10 timesteps
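A minimal PyTorch sketch of the DRQN idea described above: the same convolutional stack as DQN but a single frame per timestep, with an LSTM layer of width 512 in place of IP1; during training the sequence dimension would cover the last 10 timesteps for BPTT. Everything beyond the layer sizes named on the slide (e.g. kernel sizes, framework) is an assumption of mine.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Processes one 84x84 frame per timestep; the LSTM carries memory across timesteps."""
    def __init__(self, n_actions=18, hidden=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)  # replaces DQN's IP1
        self.q = nn.Linear(hidden, n_actions)

    def forward(self, frames, hidden_state=None):
        # frames: (batch, time, 1, 84, 84); e.g. time=10 when unrolling for BPTT
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, 1, 84, 84)).reshape(b, t, -1)
        out, hidden_state = self.lstm(feats, hidden_state)
        return self.q(out), hidden_state   # Q-values for every timestep in the unroll
```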


  • DRQN Maximal Activations

    Unit detects the agent missing the ball

    Unit detects ball reflection on paddle

    Unit detects ball reflection on wall



  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

  • DRQN Flickering Pong

    [Video: true game screen vs. perceived game screen]

  • Flickering Pong

    [Figure]

  • Pong Generalization: POMDP ⇒ MDP

    How does DRQN generalize when trained on Flickering Pong and evaluated on standard Pong?

  • Pong Generalization: POMDP ⇒ MDP

    [Plot: Pong score vs. observation probability for DRQN 1-frame, DQN 10-frame, and DQN 4-frame]


  • Performance on Flickering Atari Games

    Game          10-frame DRQN ±std    10-frame DQN ±std
    Pong          12.1 (±2.2)           -9.9 (±3.3)
    Beam Rider    618 (±115)            1685.6 (±875)
    Asteroids     1032 (±410)           1010 (±535)
    Bowling       65.5 (±13)            57.3 (±8)
    Centipede     4319.2 (±4378)        5268.1 (±2052)
    Chopper Cmd   1330 (±294)           1450 (±787.8)
    Double Dunk   -14 (±2.5)            -16.2 (±2.6)
    Frostbite     414 (±494)            436 (±462.5)
    Ice Hockey    -5.4 (±2.7)           -4.2 (±1.5)
    Ms. Pacman    1739 (±942)           1824 (±490)


  • Performance on Standard Atari Games

    Game          10-frame DRQN ±std    10-frame DQN ±std
    Double Dunk   -2 (±7.8)             -10 (±3.5)
    Frostbite     2875 (±535)           519 (±363)
    Beam Rider    3269 (±1167)          6923 (±1027)
    Asteroids     1020 (±312)           1070 (±345)
    Bowling       62 (±5.9)             72 (±11)
    Centipede     3534 (±1601)          3653 (±1903)
    Chopper Cmd   2070 (±875)           1460 (±976)
    Ice Hockey    -4.4 (±1.6)           -3.5 (±3.5)
    Ms. Pacman    2048 (±653)           2363 (±735)


  • DRQN Frostbite

    [Video: true game screen vs. perceived game screen]

  • Generalization: MDP ⇒ POMDP

    How does DRQN generalize when trained on standard Atari and evaluated on flickering Atari?

  • Generalization: MDP ⇒ POMDP

    [Plot: percentage of original score vs. observation probability for DRQN and DQN]

  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

  • Related Work

    Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Comput., 9(8):1735–1780.

    Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

    Narasimhan, K., Kulkarni, T., and Barzilay, R. (2015). Language understanding for text-based games using deep reinforcement learning. CoRR, abs/1506.08941.

    Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007). Solving deep memory POMDPs with recurrent policy gradients.

  • Thanks!

    [Architecture diagram: single 84 × 84 frame per timestep → LSTM, unrolled over timesteps . . . , t − 1, t]

    LSTM can help deal with partial observability

    Largest gains in generalization between MDP ⇔ POMDP

    Future work: understanding why DRQN does better/worse on certain games

    Source: https://github.com/mhauskn/dqn/tree/recurrent

    Matthew Hausknecht and Peter Stone

  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

  • Computational Efficiency

                  Backwards (ms)            Forwards (ms)
    Frames        1       4       10        1      4      10
    Baseline      8.82    13.6    26.7      2.0    4.0    9.0
    Unroll 1      18.2    22.3    33.7      2.4    4.4    9.4
    Unroll 10     77.3    111.3   180.5     2.5    4.4    8.3
    Unroll 30     204.5   263.4   491.1     2.5    3.8    9.4

    Table: Average milliseconds per backwards/forwards pass. Frames refers to the number of channels in the input image. Baseline is a non-recurrent network (e.g. DQN). Unroll refers to an LSTM network backpropagated through time for 1/10/30 steps.
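Per-pass timings like those in the table can be gathered with a simple loop. This is a hedged PyTorch sketch of mine (not the benchmark that produced the table, which used the authors' Caffe-based code), timing CPU forward and backward passes of the DRQN sketch shown earlier; GPU timing would additionally need synchronization.

```python
import time
import torch

def time_passes(model, batch, n_iters=50):
    """Average milliseconds per forward and per backward pass for a given input batch."""
    fwd_ms, bwd_ms = 0.0, 0.0
    for _ in range(n_iters):
        t0 = time.perf_counter()
        out = model(batch)
        out = out[0] if isinstance(out, tuple) else out  # DRQN returns (q_values, hidden)
        t1 = time.perf_counter()
        out.sum().backward()
        t2 = time.perf_counter()
        model.zero_grad()
        fwd_ms += (t1 - t0) * 1000.0
        bwd_ms += (t2 - t1) * 1000.0
    return fwd_ms / n_iters, bwd_ms / n_iters

# example: a 10-step unroll of single-frame inputs, as in the DRQN sketch above
# fwd, bwd = time_passes(DRQN(), torch.randn(1, 10, 1, 84, 84))
```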
