TRANSCRIPT
Deep reinforcement learning: DQN and extensions
Petar Veličković
University of Cambridge, NVIDIA Deep Learning Day
30 June 2016
Deep reinforcement learning: DQN and extensions Petar Veličković
Reinforcement learning DQN Extensions
Introduction
Motivation
- In this talk I will provide a brief introduction to reinforcement learning, specifically focusing on the Deep Q-Network (DQN) and its variants.
- Deep reinforcement learning has become extremely popular in some of the recent major advances in artificial intelligence (Atari player, AlphaGo, ...).
Introduction
Context
We want to learn how to act optimally in a very general environment, performing state-changing actions on the environment which carry a reward signal.
[Diagram: the agent/environment interaction loop. Starting from state s_0, at each step t the agent performs action a_t on the environment, and receives back the new state s_{t+1} and the reward r_{t+1}.]
Mathematical formalism
Reinforcement learning
- The environment assumes a current state s out of a set S, and the agent may perform actions out of an action set A.
- There are also two functions: S : S × A → S and R : S × A → ℝ, which the agent does not typically have access to (both can be stochastic).
- After performing action a in state s, the environment assumes state S(s, a), and the agent gets a reward signal R(s, a).
- Our goal is to learn a policy function, p : S → A, choosing actions that maximise expected future rewards.
Mathematical formalism
Cumulative reward metric
- Aside from the aforementioned goal, it is also desirable to receive positive rewards as soon as possible (i.e. we assign larger importance to more immediate rewards).
- This is typically done by defining a discount factor γ ∈ [0, 1), used to scale future rewards in the total value of a policy:

  V^p(s) = Σ_{k=1}^{+∞} γ^{k−1} r_k = r_1 + γ r_2 + γ² r_3 + ...

  where r_k is the reward obtained after k steps, starting from state s and following policy p thereafter.
- Such a definition is required, among other things, to assure convergence of the relevant learning algorithms.
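As a quick numerical illustration of the discounted-return formula (a minimal sketch; the reward sequences are made up for this example):

```python
def discounted_return(rewards, gamma):
    """Compute V^p(s) = sum_k gamma^(k-1) * r_k for a finite reward sequence."""
    total = 0.0
    for k, r in enumerate(rewards):  # k = 0 corresponds to r_1
        total += (gamma ** k) * r
    return total

# With gamma = 0.9, a reward received immediately is worth more than
# the same reward received two steps later.
print(discounted_return([1, 0, 0], 0.9))  # prints 1.0
print(discounted_return([0, 0, 1], 0.9))  # ≈ 0.81
```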
Q-learning
Q function
- As mentioned, we want to find an optimal policy p_opt : S → A, where p_opt(s) = argmax_p {V^p(s)} for any state s. We also denote V_opt = V^{p_opt}.
- Define a Q function, for all states s and actions a:

  Q(s, a) = R(s, a) + γ V_opt(S(s, a))

  corresponding to the reward obtained upon executing action a from state s and assuming optimal behaviour thereafter.
- Learning the Q function is sufficient for determining the optimal policy, as:

  p_opt(s) = argmax_a {Q(s, a)}
Q-learning
Towards an iterative method
- Note that, for any state s, V_opt(s) = max_α {Q(s, α)}.
- Factoring this into the definition of Q:

  Q(s, a) = R(s, a) + γ max_α {Q(S(s, a), α)}

- This hints at the fact that, if Q′ is an estimate of Q, then

  R(s, a) + γ max_α {Q′(S(s, a), α)}

  will be a no-worse estimate of Q(s, a) than Q′(s, a).
- Thus far, we represent Q′ as a table of current estimates of Q for all elements of S × A.
Q-learning
Q-learning
Initialise the Q′ table with random values.
1. Choose an action a to perform in the current state, s.
2. Perform a and receive reward R(s, a).
3. Observe the new state, S(s, a).
4. Update:

   Q′(s, a) ← R(s, a) + γ max_α {Q′(S(s, a), α)}

5. If the next state is not terminal, go back to step 1.
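The table-based loop above can be sketched in a few lines of Python. The three-state chain MDP below is invented purely for illustration; the update rule is the slide's (note it has no learning rate, which suits a deterministic environment), and exploration is uniformly random:

```python
import random

# Hypothetical 3-state chain: from states 0 and 1, action 0 moves right,
# action 1 stays put. Entering state 2 yields reward 1 and is terminal.
S_fn = {(0, 0): 1, (0, 1): 0, (1, 0): 2, (1, 1): 1}
R_fn = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 0}
gamma = 0.9

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}  # terminal state excluded

for _ in range(100):              # episodes
    s = 0
    while s != 2:                 # 2 is the terminal state
        a = random.choice((0, 1))                        # explore uniformly
        r, s_next = R_fn[(s, a)], S_fn[(s, a)]
        future = max(Q[(s_next, b)] for b in (0, 1)) if s_next != 2 else 0.0
        Q[(s, a)] = r + gamma * future                   # the slide's update
        s = s_next

print(Q[(1, 0)])  # prints 1.0 (immediate reward, terminal next state)
print(Q[(0, 0)])  # prints 0.9 (one step from the reward, discounted once)
```

Note how the values propagate backwards from the rewarding transition: (1, 0) is learnt first, and (0, 0) then inherits its discounted value.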
Q-learning
Exploration vs. exploitation
- How to choose an action to perform?
- Ideally, we would like to start off with a period of exploring the environment's characteristics, converging towards the policy that fully exploits the learnt Q values (the greedy policy).
- A very active topic of research. Two common approaches:
  - Softmax (with temperature τ → 0):

    P(a|s) = exp(Q(s, a)/τ) / Σ_α exp(Q(s, α)/τ)

  - ε-greedy (with 1 → ε → 0):

    P(a|s) = ε · 1/|A| + (1 − ε) · I(a = argmax_α Q(s, α))
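Both selection rules are a few lines each; a minimal sketch (function names are mine, and ties in the argmax are broken arbitrarily):

```python
import math

def softmax_probs(q_values, tau):
    """P(a|s) = exp(Q(s,a)/tau) / sum_alpha exp(Q(s,alpha)/tau)."""
    exps = [math.exp(q / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def epsilon_greedy_probs(q_values, eps):
    """P(a|s) = eps * 1/|A| + (1 - eps) * I(a = argmax_alpha Q(s,alpha))."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [eps / len(q_values) + ((1 - eps) if a == best else 0.0)
            for a in range(len(q_values))]

q = [1.0, 2.0, 0.5]
print(softmax_probs(q, tau=1.0))         # greedy action (index 1) gets most mass
print(epsilon_greedy_probs(q, eps=0.1))  # index 1 gets 1 - eps plus eps/|A|
```

Lowering τ (or ε) shifts probability mass towards the greedy action, which is exactly the annealing-towards-exploitation behaviour described above.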
Initial setup
Playing Atari
- We now turn our attention to the initial success story that linked deep and reinforcement learning, allowing multiple Atari games to be played at superhuman level using the same basic algorithm (Mnih et al. (2013), NIPS).
Initial setup
Setup
- No prior knowledge about the game itself is incorporated: the algorithm perceives solely the RGB matrices of pixels in the framebuffer as states, and changes in score as rewards.
- Actions correspond to joypad inputs (along with a no-op).
- Further modifications:
  - The colour information is discarded; the matrices are converted to grayscale before further processing;
  - A sequence of six previous frames is used as a single state (to account for the fact that the system is only partially observed by the current frame);
  - The rewards are all represented as ±1 (to avoid scaling issues between different games).
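The preprocessing pipeline can be sketched as below (an illustrative sketch only: strings stand in for grayscale frame matrices, the grayscale conversion itself is omitted, and the class/function names are mine):

```python
from collections import deque

HISTORY = 6  # number of previous frames per state, as stated in the talk

def clip_reward(delta_score):
    """Represent every reward as +1/-1 (0 if the score did not change)."""
    return (delta_score > 0) - (delta_score < 0)

class FrameStack:
    """Keep only the HISTORY most recent grayscale frames; the stack is the state."""
    def __init__(self):
        self.frames = deque(maxlen=HISTORY)

    def push(self, frame):
        self.frames.append(frame)
        return tuple(self.frames)  # the current state

stack = FrameStack()
for t in range(10):
    state = stack.push(f"frame{t}")  # stand-in for a grayscale pixel matrix
print(len(state))        # prints 6: older frames are discarded automatically
print(clip_reward(250))  # prints 1
print(clip_reward(-10))  # prints -1
```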
Neural networks
Q-learning, revisited
- Should we be able to successfully learn the values of the Q function for all pixel-matrix and joypad-input pairs, we become capable of playing any Atari game.
- Major issue: we cannot store the entire Q table in memory!
- It needs to be approximated somehow...
- The authors used a deep neural network (dubbed a "Deep Q-Network") as the approximator. This builds on the earlier success of a shallow neural network as an approximator for playing backgammon (TD-Gammon).
Neural networks
Deep Q-Network
Conv [16] → ReLU → Conv [32] → ReLU → FC [256] → ReLU → FC [|A|]
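The slide lists only the layer types and channel counts. Filling in the filter sizes from Mnih et al. (2013), which are an assumption here since the slide omits them (84×84 preprocessed inputs; 8×8 filters with stride 4, then 4×4 filters with stride 2), the spatial dimensions work out as follows:

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a valid (unpadded) convolution."""
    return (size - kernel) // stride + 1

side = 84                    # preprocessed frame side length (per the paper)
side = conv_out(side, 8, 4)  # Conv [16]: 8x8 filters, stride 4 -> 20x20
side = conv_out(side, 4, 2)  # Conv [32]: 4x4 filters, stride 2 -> 9x9
flat = 32 * side * side      # features flattened into FC [256]
print(side, flat)            # prints: 9 2592
```

So the FC [256] layer maps 2592 features down to 256, and the final FC [|A|] layer emits one Q value per joypad action.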
Deep Q-learning
Deep Q-learning
Initialise an empty replay memory.
Initialise the DQN with random (small) weights.
1. Choose an action a to perform in the current state, s, using an ε-greedy strategy (with ε annealed from 1.0 to 0.1).
2. Perform a and receive reward R(s, a).
3. Observe the new state, S(s, a).
4. Add (s, a, R(s, a), S(s, a)) to the replay memory.
5. Sample a minibatch of tuples (s_i, a_i, r_i, s_{i+1}) from the replay memory, and perform stochastic gradient descent on the DQN, based on the loss function

   (Q′(s_i, a_i) − (r_i + γ max_α Q′(s_{i+1}, α)))²

   where Q′(s, ·) is computed by feeding s into the DQN.
6. If the next state is not terminal, go back to step 1.
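Steps 4 and 5 can be sketched independently of any particular network. In the sketch below (names and the toy constant Q estimate are mine; `None` marks a terminal next state), the replay memory is a bounded buffer and the target for each sampled tuple is the bracketed quantity inside the squared loss:

```python
import random
from collections import deque

GAMMA = 0.99
replay = deque(maxlen=10000)  # replay memory of (s, a, r, s_next) tuples

def td_target(r, s_next, q_fn, actions, terminal):
    """r + gamma * max_alpha Q'(s_next, alpha); just r for terminal states."""
    if terminal:
        return r
    return r + GAMMA * max(q_fn(s_next, a) for a in actions)

def minibatch_targets(batch, q_fn, actions):
    """Regression targets for one SGD step on the squared loss of step 5."""
    return [td_target(r, s_next, q_fn, actions, s_next is None)
            for (_, _, r, s_next) in batch]

# Toy usage with a hypothetical, constant Q estimate:
q_fn = lambda s, a: 0.5
replay.append(("s0", 0, 1.0, "s1"))
replay.append(("s1", 1, 0.0, None))           # terminal transition
batch = random.sample(list(replay), 2)
print(sorted(minibatch_targets(batch, q_fn, actions=(0, 1))))  # ≈ [0.0, 1.495]
```

Sampling uniformly from the replay memory, rather than training on consecutive frames, breaks the strong temporal correlations between successive states.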
Deep Q-learning
Leveraging GPU
- The discussed neural network architecture represents a common architecture used in computer vision, relying on convolutional layers followed by fully connected layers, with rectifier activations.
- Such operations are typically implemented as matrix operations, and they may be extremely sped up using SIMD instructions.
- The DQN is easily implementable in deep learning frameworks (e.g. TensorFlow, Theano, Torch, ...), which automatically leverage the GPUs available in the system to facilitate this.
Double Q-learning
Double Q-learning
- Original paper by van Hasselt et al. (2015).
- Q-learning is known to have a tendency to overestimate action values as training progresses; this is often due to the fact that the same table is used both for action selection and for action evaluation.
- Double Q-learning addresses this issue by using two sets of DQN weights: one used for evaluation, and the other for selection. They are symmetrically updated, by switching their roles after each training step of the algorithm.
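The decoupling of selection from evaluation shows up only in how the target is formed; a minimal sketch (the two constant Q estimates are made up to make the effect visible):

```python
GAMMA = 0.99

def double_q_target(r, s_next, q_select, q_eval, actions):
    """Select the action with one set of weights, evaluate it with the other.

    Standard target:  r + gamma * max_a q(s_next, a)     (select = evaluate)
    Double target:    r + gamma * q_eval(s_next, argmax_a q_select(s_next, a))
    """
    a_star = max(actions, key=lambda a: q_select(s_next, a))
    return r + GAMMA * q_eval(s_next, a_star)

# Toy illustration: q_select grossly overrates action 1, but the target uses
# q_eval's (lower) estimate of that action, damping the overestimation.
q_select = lambda s, a: {0: 1.0, 1: 5.0}[a]
q_eval   = lambda s, a: {0: 1.2, 1: 2.0}[a]
print(double_q_target(0.0, "s1", q_select, q_eval, actions=(0, 1)))  # 1.98
```

With a single network the same target would be 0.99 · 5.0 = 4.95, propagating the overestimate; the double target uses 0.99 · 2.0 instead.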
Proposed extensions
Three-dimensional setting
Currently evaluating the feasibility of DQN (along with several extensions) in the context of 3D games.
Proposed extensions
Ideas
- Investigating more powerful neural network architectures for the DQN;
- Leveraging unsupervised methods to extract the "most important" regions of the frame as the training progresses;
- Incorporating sound information (completely ignored in prior work), requiring recurrent neural networks (RNNs).
- All of the above was probably infeasible only a few years ago! The development of GPUs has made it all easier for us.
Proposed extensions
Thank you!
Questions?