
  • Deep Recurrent Q-Learning for Partially Observable MDPs

    Matthew Hausknecht and Peter Stone

    University of Texas at Austin

    November 13, 2015

  • Motivation

    Intelligent decision making is the heart of AI

    Desire agents capable of learning to act intelligently in diverse environments

    Reinforcement Learning provides a general learning framework

    RL + deep neural networks yields robust controllers that learn from pixels (DQN)

    DQN lacks mechanisms for handling partial observability

    Extend DQN to handle Partially Observable Markov Decision Processes (POMDPs)


  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

  • Markov Decision Process (MDP)

    [Diagram: agent–environment loop with state st, action at, reward rt]

    At each timestep the agent performs action at and receives reward rt and state st+1 from the environment

    The Markov property ensures that st+1 depends only on st and at

    Learning an optimal policy π∗ requires no memory of past states
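The slides contain no code; as a hedged illustration of why learning in an MDP needs only the current state, here is a minimal tabular Q-learning sketch. The `env.reset()`/`env.step(action)` interface (returning next state, reward, done) is an assumption in the style of Gym-like APIs, not part of the original presentation.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions,
                       episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal Q-learning sketch: each update uses only (s, a, r, s'),
    relying on the Markov property -- no history of past states is kept."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection from the current state only
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # one-step temporal-difference update toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```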


  • Partially Observable Markov Decision Process (POMDP)

    [Diagram: agent–environment loop with observation ot, action at, reward rt]

    True state of the environment is hidden. Observations ot provide only partial information.

    Memory of past observations may help understand the true system state and improve the policy


  • Atari Domain

    [Diagram: action, observation, score]

    160 × 210 state space → 84 × 84 grayscale

    18 discrete actions

    Rewards clipped ∈ {−1, 0, 1}

    Source: www.arcadelearningenvironment.org
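A hedged sketch of the preprocessing listed above (grayscale conversion, downsampling to 84 × 84, reward clipping), assuming OpenCV and NumPy; the exact crop/resize pipeline used in the original DQN code may differ.

```python
import cv2
import numpy as np

def preprocess_frame(frame_rgb):
    """Convert an RGB Atari frame (210x160x3 array) to an 84x84 grayscale image in [0, 1]."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0

def clip_reward(reward):
    """Clip rewards to {-1, 0, 1} so learning rates transfer across games."""
    return float(np.sign(reward))
```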

  • Atari Domain: MDP or POMDP?

    [Diagram: action, observation, score]

    Depends on the state representation!

    • Single Frame ⇒ POMDP
    • Four Frames ⇒ MDP
    • Console RAM ⇒ MDP
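To illustrate why four stacked frames make most games approximately Markov while a single frame does not, here is a small frame-stacking sketch (a hypothetical helper of mine, not from the authors' code): it keeps a rolling window of the last four preprocessed frames as the agent's state, so quantities like ball velocity become recoverable from frame differences.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the last k preprocessed frames and exposes them as one (k, 84, 84) state."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        self.frames.clear()
        for _ in range(self.k):          # pad with the first frame at episode start
            self.frames.append(first_frame)
        return np.stack(self.frames)

    def step(self, new_frame):
        self.frames.append(new_frame)    # motion is recoverable from differences across the stack
        return np.stack(self.frames)
```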


  • Deep Q-Network (DQN)

    [Architecture diagram: 84 × 84 × 4 input → Conv1 → Conv2 → Conv3 → IP1 (512) → Q-Values (18)]

    Model-free Reinforcement Learning method using a deep neural network as Q-Value function approximator (Mnih et al., 2015)

    Takes the last four game screens as input: enough to make most Atari games Markov

    How well does DQN perform in partially observed domains?
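A minimal PyTorch sketch of a DQN-style network matching the diagram above (three convolutional layers, a 512-unit fully connected layer in place of IP1, and 18 Q-values). The kernel sizes and strides follow the architecture published by Mnih et al. (2015); the framework choice and exact code are my own, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # Conv1
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # Conv2
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # Conv3
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # fully connected layer (IP1)
            nn.Linear(512, n_actions),              # Q-values, one per action
        )

    def forward(self, x):            # x: (batch, 4, 84, 84)
        return self.head(self.conv(x))
```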


  • Flickering Atari

    [Diagram: observation ot, action at, reward rt]

    Induce partial observability by stochastically obscuring the game screen:

    ot = st               with p = 1/2
         ⟨0, . . . , 0⟩   otherwise

    Game state must now be inferred from past observations
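The flickering rule above is simple to implement. Here is a hedged sketch of an observation wrapper (the Gym-like `env` interface is my assumption, not the authors' actual code) that returns the true screen with probability 1/2 and an all-zero frame otherwise.

```python
import numpy as np

class FlickeringWrapper:
    """With probability p the true frame is shown; otherwise an all-zero frame is returned."""
    def __init__(self, env, p=0.5, seed=0):
        self.env = env
        self.p = p
        self.rng = np.random.default_rng(seed)

    def _obscure(self, frame):
        return frame if self.rng.random() < self.p else np.zeros_like(frame)

    def reset(self):
        return self._obscure(self.env.reset())

    def step(self, action):
        frame, reward, done = self.env.step(action)
        return self._obscure(frame), reward, done
```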


  • DQN Pong

    [Video: true game screen vs. perceived game screen]

  • DQN Flickering Pong

    [Video: true game screen vs. perceived game screen]

  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

  • Deep Recurrent Q-Network

    [Architecture diagram: single 84 × 84 frame per timestep → Conv layers → LSTM (512) → Q-Values (18), unrolled over timesteps . . . , t − 1, t]

    Long Short-Term Memory (Hochreiter and Schmidhuber, 1997)

    Identical to DQN except:

    • Replaces DQN's IP1 with a recurrent LSTM layer of the same dimension
    • Each timestep takes a single frame as input

    LSTM provides a selective memory of past game states

    Trained end-to-end using BPTT: unrolled for the last 10 timesteps
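A minimal PyTorch sketch of the DRQN idea described above: the same convolutional stack as DQN but a single frame per timestep, with an LSTM layer of width 512 in place of IP1; during training the sequence dimension would cover the last 10 timesteps for BPTT. Everything beyond the layer sizes named on the slide (e.g. kernel sizes, framework) is an assumption of mine.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Processes one 84x84 frame per timestep; the LSTM carries memory across timesteps."""
    def __init__(self, n_actions=18, hidden=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)  # replaces DQN's IP1
        self.q = nn.Linear(hidden, n_actions)

    def forward(self, frames, hidden_state=None):
        # frames: (batch, time, 1, 84, 84); e.g. time=10 when unrolling for BPTT
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, 1, 84, 84)).reshape(b, t, -1)
        out, hidden_state = self.lstm(feats, hidden_state)
        return self.q(out), hidden_state   # Q-values for every timestep in the unroll
```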


  • DRQN Maximal Activations

    Unit detects the agent missing the ball

    Unit detects ball reflection on paddle

    Unit detects ball reflection on wall



  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

  • DRQN Flickering Pong

    [Video: true game screen vs. perceived game screen]

  • Flickering Pong

    [Figure]

  • Pong Generalization: POMDP ⇒ MDP

    How does DRQN generalize when trained on Flickering Pong and evaluated on standard Pong?

  • Pong Generalization: POMDP ⇒ MDP

    [Plot: Pong score vs. observation probability for DRQN 1-frame, DQN 10-frame, and DQN 4-frame]


  • Performance on Flickering Atari Games

    Game          10-frame DRQN ±std    10-frame DQN ±std
    Pong          12.1 (±2.2)           -9.9 (±3.3)
    Beam Rider    618 (±115)            1685.6 (±875)
    Asteroids     1032 (±410)           1010 (±535)
    Bowling       65.5 (±13)            57.3 (±8)
    Centipede     4319.2 (±4378)        5268.1 (±2052)
    Chopper Cmd   1330 (±294)           1450 (±787.8)
    Double Dunk   -14 (±2.5)            -16.2 (±2.6)
    Frostbite     414 (±494)            436 (±462.5)
    Ice Hockey    -5.4 (±2.7)           -4.2 (±1.5)
    Ms. Pacman    1739 (±942)           1824 (±490)


  • Performance on Standard Atari Games

    Game          10-frame DRQN ±std    10-frame DQN ±std
    Double Dunk   -2 (±7.8)             -10 (±3.5)
    Frostbite     2875 (±535)           519 (±363)
    Beam Rider    3269 (±1167)          6923 (±1027)
    Asteroids     1020 (±312)           1070 (±345)
    Bowling       62 (±5.9)             72 (±11)
    Centipede     3534 (±1601)          3653 (±1903)
    Chopper Cmd   2070 (±875)           1460 (±976)
    Ice Hockey    -4.4 (±1.6)           -3.5 (±3.5)
    Ms. Pacman    2048 (±653)           2363 (±735)


  • DRQN Frostbite

    [Video: true game screen vs. perceived game screen]

  • Generalization: MDP ⇒ POMDP

    How does DRQN generalize when trained on standard Atari and evaluated on flickering Atari?

  • Generalization: MDP ⇒ POMDP

    [Plot: percentage of original score vs. observation probability for DRQN and DQN]

  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

  • Related Work

    Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Comput., 9(8):1735–1780.

    Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

    Narasimhan, K., Kulkarni, T., and Barzilay, R. (2015). Language understanding for text-based games using deep reinforcement learning. CoRR, abs/1506.08941.

    Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007). Solving deep memory POMDPs with recurrent policy gradients.

  • Thanks!

    [Architecture diagram: single 84 × 84 frame per timestep → LSTM, unrolled over timesteps . . . , t − 1, t]

    LSTM can help deal with partial observability

    Largest gains in generalization between MDP ⇔ POMDP

    Future work: understanding why DRQN does better/worse on certain games

    Source: https://github.com/mhauskn/dqn/tree/recurrent

    Matthew Hausknecht and Peter Stone

  • Outline

    Motivation

    Background: MDP, POMDP, Atari Domain, Deep Q-Network

    Deep Recurrent Q-Network

    Results

    Related Work

    Appendix

  • Computational Efficiency

                  Backwards (ms)            Forwards (ms)
    Frames        1       4       10        1      4      10
    Baseline      8.82    13.6    26.7      2.0    4.0    9.0
    Unroll 1      18.2    22.3    33.7      2.4    4.4    9.4
    Unroll 10     77.3    111.3   180.5     2.5    4.4    8.3
    Unroll 30     204.5   263.4   491.1     2.5    3.8    9.4

    Table: Average milliseconds per backwards/forwards pass. Frames refers to the number of channels in the input image. Baseline is a non-recurrent network (e.g. DQN). Unroll refers to an LSTM network backpropagated through time for 1/10/30 steps.
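Per-pass timings like those in the table can be gathered with a simple loop. This is a hedged PyTorch sketch of mine (not the benchmark that produced the table, which used the authors' Caffe-based code), timing CPU forward and backward passes of the DRQN sketch shown earlier; GPU timing would additionally need synchronization.

```python
import time
import torch

def time_passes(model, batch, n_iters=50):
    """Average milliseconds per forward and per backward pass for a given input batch."""
    fwd_ms, bwd_ms = 0.0, 0.0
    for _ in range(n_iters):
        t0 = time.perf_counter()
        out = model(batch)
        out = out[0] if isinstance(out, tuple) else out  # DRQN returns (q_values, hidden)
        t1 = time.perf_counter()
        out.sum().backward()
        t2 = time.perf_counter()
        model.zero_grad()
        fwd_ms += (t1 - t0) * 1000.0
        bwd_ms += (t2 - t1) * 1000.0
    return fwd_ms / n_iters, bwd_ms / n_iters

# example: a 10-step unroll of single-frame inputs, as in the DRQN sketch above
# fwd, bwd = time_passes(DRQN(), torch.randn(1, 10, 1, 84, 84))
```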
