TRANSCRIPT
Deep reinforcement learning: DQN and extensions
Petar Veličković
University of Cambridge, NVIDIA Deep Learning Day
30 June 2016
Deep reinforcement learning: DQN and extensions Petar Veličković
Reinforcement learning DQN Extensions
Introduction
Motivation
- In this talk I will provide a brief introduction to reinforcement learning, specifically focusing on the Deep Q-Network (DQN) and its variants.
- Deep reinforcement learning has become extremely popular in some of the recent major advances in artificial intelligence (Atari player, AlphaGo, ...).
Introduction
Context
We want to learn how to act optimally in a very general environment, performing state-changing actions on the environment which carry a reward signal.
[Diagram: the agent/environment interaction loop. Starting from state s_0, at each step t the agent performs action a_t on the environment, and receives back the new state s_{t+1} and the reward r_{t+1}.]
Mathematical formalism
Reinforcement learning
- The environment assumes a current state s out of a set S, and the agent may perform actions out of an action set A.
- There are also two functions: S : S × A → S and R : S × A → ℝ, which the agent does not typically have access to (both can be stochastic).
- After performing action a in state s, the environment assumes state S(s, a), and the agent gets a reward signal R(s, a).
- Our goal is to learn a policy function, p : S → A, choosing actions that maximise expected future rewards.
Mathematical formalism
Cumulative reward metric
- Aside from the aforementioned goal, it is also desirable to receive positive rewards as soon as possible (i.e. we assign larger importance to more immediate rewards).
- This is typically done by defining a discount factor γ ∈ [0, 1), used to scale future rewards in the total value of a policy:

  V^p(s) = Σ_{k=1}^{+∞} γ^{k−1} r_k = r_1 + γ r_2 + γ² r_3 + ...

  where r_k is the reward obtained after k steps, starting from state s and following policy p thereafter.
- Such a definition is required, among other things, to assure convergence of the relevant learning algorithms.
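As a quick numerical illustration of the discounted-return formula (a minimal sketch; the reward sequences are made up for this example):

```python
def discounted_return(rewards, gamma):
    """Compute V^p(s) = sum_k gamma^(k-1) * r_k for a finite reward sequence."""
    total = 0.0
    for k, r in enumerate(rewards):  # k = 0 corresponds to r_1
        total += (gamma ** k) * r
    return total

# With gamma = 0.9, a reward received immediately is worth more than
# the same reward received two steps later.
print(discounted_return([1, 0, 0], 0.9))  # prints 1.0
print(discounted_return([0, 0, 1], 0.9))  # ≈ 0.81
```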
Q-learning
Q function
- As mentioned, we want to find an optimal policy p_opt : S → A, where p_opt(s) = argmax_p {V^p(s)} for any state s. We also denote V_opt = V^{p_opt}.
- Define a Q function, for all states s and actions a:

  Q(s, a) = R(s, a) + γ V_opt(S(s, a))

  corresponding to the reward obtained upon executing action a from state s and assuming optimal behaviour thereafter.
- Learning the Q function is sufficient for determining the optimal policy, as:

  p_opt(s) = argmax_a {Q(s, a)}
Q-learning
Towards an iterative method
- Note that, for any state s, V_opt(s) = max_α {Q(s, α)}.
- Factoring this into the definition of Q:

  Q(s, a) = R(s, a) + γ max_α {Q(S(s, a), α)}

- This hints at the fact that, if Q′ is an estimate of Q, then

  R(s, a) + γ max_α {Q′(S(s, a), α)}

  will be a no-worse estimate of Q(s, a) than Q′(s, a).
- Thus far, we represent Q′ as a table of current estimates of Q for all elements of S × A.
Q-learning
Q-learning
Initialise the Q′ table with random values.
1. Choose an action a to perform in the current state, s.
2. Perform a and receive reward R(s, a).
3. Observe the new state, S(s, a).
4. Update:

   Q′(s, a) ← R(s, a) + γ max_α {Q′(S(s, a), α)}

5. If the next state is not terminal, go back to step 1.
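The table-based loop above can be sketched in a few lines of Python. The three-state chain MDP below is invented purely for illustration; the update rule is the slide's (note it has no learning rate, which suits a deterministic environment), and exploration is uniformly random:

```python
import random

# Hypothetical 3-state chain: from states 0 and 1, action 0 moves right,
# action 1 stays put. Entering state 2 yields reward 1 and is terminal.
S_fn = {(0, 0): 1, (0, 1): 0, (1, 0): 2, (1, 1): 1}
R_fn = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 0}
gamma = 0.9

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}  # terminal state excluded

for _ in range(100):              # episodes
    s = 0
    while s != 2:                 # 2 is the terminal state
        a = random.choice((0, 1))                        # explore uniformly
        r, s_next = R_fn[(s, a)], S_fn[(s, a)]
        future = max(Q[(s_next, b)] for b in (0, 1)) if s_next != 2 else 0.0
        Q[(s, a)] = r + gamma * future                   # the slide's update
        s = s_next

print(Q[(1, 0)])  # prints 1.0 (immediate reward, terminal next state)
print(Q[(0, 0)])  # prints 0.9 (one step from the reward, discounted once)
```

Note how the values propagate backwards from the rewarding transition: (1, 0) is learnt first, and (0, 0) then inherits its discounted value.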
Q-learning
Exploration vs. exploitation
- How to choose an action to perform?
- Ideally, we would like to start off with a period of exploring the environment's characteristics, converging towards the policy that fully exploits the learnt Q values (the greedy policy).
- A very active topic of research. Two common approaches:
  - Softmax (with temperature τ → 0):

    P(a|s) = exp(Q(s, a)/τ) / Σ_α exp(Q(s, α)/τ)

  - ε-greedy (with 1 → ε → 0):

    P(a|s) = ε · 1/|A| + (1 − ε) · I(a = argmax_α Q(s, α))
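Both selection rules are a few lines each; a minimal sketch (function names are mine, and ties in the argmax are broken arbitrarily):

```python
import math

def softmax_probs(q_values, tau):
    """P(a|s) = exp(Q(s,a)/tau) / sum_alpha exp(Q(s,alpha)/tau)."""
    exps = [math.exp(q / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def epsilon_greedy_probs(q_values, eps):
    """P(a|s) = eps * 1/|A| + (1 - eps) * I(a = argmax_alpha Q(s,alpha))."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [eps / len(q_values) + ((1 - eps) if a == best else 0.0)
            for a in range(len(q_values))]

q = [1.0, 2.0, 0.5]
print(softmax_probs(q, tau=1.0))         # greedy action (index 1) gets most mass
print(epsilon_greedy_probs(q, eps=0.1))  # index 1 gets 1 - eps plus eps/|A|
```

Lowering τ (or ε) shifts probability mass towards the greedy action, which is exactly the annealing-towards-exploitation behaviour described above.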
Initial setup
Playing Atari
- We now turn our attention to the initial success story that linked deep and reinforcement learning, allowing multiple Atari games to be played at superhuman level using the same basic algorithm (Mnih et al. (2013), NIPS).
Initial setup
Setup
- No prior knowledge about the game itself is incorporated: the algorithm perceives solely the RGB matrices of pixels in the framebuffer as states, and changes in score as rewards.
- Actions correspond to joypad inputs (along with a no-op).
- Further modifications:
  - The colour information is discarded; the matrices are converted to grayscale before further processing;
  - A sequence of six previous frames is used as a single state (to account for the fact that the system is only partially observed by the current frame);
  - The rewards are all represented as ±1 (to avoid scaling issues between different games).
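The preprocessing pipeline can be sketched as below (an illustrative sketch only: strings stand in for grayscale frame matrices, the grayscale conversion itself is omitted, and the class/function names are mine):

```python
from collections import deque

HISTORY = 6  # number of previous frames per state, as stated in the talk

def clip_reward(delta_score):
    """Represent every reward as +1/-1 (0 if the score did not change)."""
    return (delta_score > 0) - (delta_score < 0)

class FrameStack:
    """Keep only the HISTORY most recent grayscale frames; the stack is the state."""
    def __init__(self):
        self.frames = deque(maxlen=HISTORY)

    def push(self, frame):
        self.frames.append(frame)
        return tuple(self.frames)  # the current state

stack = FrameStack()
for t in range(10):
    state = stack.push(f"frame{t}")  # stand-in for a grayscale pixel matrix
print(len(state))        # prints 6: older frames are discarded automatically
print(clip_reward(250))  # prints 1
print(clip_reward(-10))  # prints -1
```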
Neural networks
Q-learning, revisited
- Should we be able to successfully learn the values of the Q function for all pixel-matrix and joypad-input pairs, we become capable of playing any Atari game.
- Major issue: we cannot store the entire Q table in memory!
- It needs to be approximated somehow...
- The authors used a deep neural network (dubbed a "Deep Q-Network") as the approximator. This builds on the earlier success of a shallow neural network as an approximator for playing backgammon (TD-Gammon).
Neural networks
Deep Q-Network
Conv [16] → ReLU → Conv [32] → ReLU → FC [256] → ReLU → FC [|A|]
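The slide lists only the layer types and channel counts. Filling in the filter sizes from Mnih et al. (2013), which are an assumption here since the slide omits them (84×84 preprocessed inputs; 8×8 filters with stride 4, then 4×4 filters with stride 2), the spatial dimensions work out as follows:

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a valid (unpadded) convolution."""
    return (size - kernel) // stride + 1

side = 84                    # preprocessed frame side length (per the paper)
side = conv_out(side, 8, 4)  # Conv [16]: 8x8 filters, stride 4 -> 20x20
side = conv_out(side, 4, 2)  # Conv [32]: 4x4 filters, stride 2 -> 9x9
flat = 32 * side * side      # features flattened into FC [256]
print(side, flat)            # prints: 9 2592
```

So the FC [256] layer maps 2592 features down to 256, and the final FC [|A|] layer emits one Q value per joypad action.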
Deep Q-learning
Deep Q-learning
Initialise an empty replay memory.
Initialise the DQN with random (small) weights.
1. Choose an action a to perform in the current state, s, using an ε-greedy strategy (with ε annealed from 1.0 to 0.1).
2. Perform a and receive reward R(s, a).
3. Observe the new state, S(s, a).
4. Add (s, a, R(s, a), S(s, a)) to the replay memory.
5. Sample a minibatch of tuples (s_i, a_i, r_i, s_{i+1}) from the replay memory, and perform stochastic gradient descent on the DQN, based on the loss function

   (Q′(s_i, a_i) − (r_i + γ max_α Q′(s_{i+1}, α)))²

   where Q′(s, ·) is computed by feeding s into the DQN.
6. If the next state is not terminal, go back to step 1.
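Steps 4 and 5 can be sketched independently of any particular network. In the sketch below (names and the toy constant Q estimate are mine; `None` marks a terminal next state), the replay memory is a bounded buffer and the target for each sampled tuple is the bracketed quantity inside the squared loss:

```python
import random
from collections import deque

GAMMA = 0.99
replay = deque(maxlen=10000)  # replay memory of (s, a, r, s_next) tuples

def td_target(r, s_next, q_fn, actions, terminal):
    """r + gamma * max_alpha Q'(s_next, alpha); just r for terminal states."""
    if terminal:
        return r
    return r + GAMMA * max(q_fn(s_next, a) for a in actions)

def minibatch_targets(batch, q_fn, actions):
    """Regression targets for one SGD step on the squared loss of step 5."""
    return [td_target(r, s_next, q_fn, actions, s_next is None)
            for (_, _, r, s_next) in batch]

# Toy usage with a hypothetical, constant Q estimate:
q_fn = lambda s, a: 0.5
replay.append(("s0", 0, 1.0, "s1"))
replay.append(("s1", 1, 0.0, None))           # terminal transition
batch = random.sample(list(replay), 2)
print(sorted(minibatch_targets(batch, q_fn, actions=(0, 1))))  # ≈ [0.0, 1.495]
```

Sampling uniformly from the replay memory, rather than training on consecutive frames, breaks the strong temporal correlations between successive states.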
Deep Q-learning
Leveraging GPU
- The discussed neural network architecture represents a common architecture used in computer vision, relying on convolutional layers followed by fully connected layers, with rectifier activations.
- Such operations are typically implemented as matrix operations, and they may be extremely sped up using SIMD instructions.
- The DQN is easily implementable in deep learning frameworks (e.g. TensorFlow, Theano, Torch, ...), which automatically leverage the GPUs available in the system to facilitate this.
Double Q-learning
Double Q-learning
- Original paper by van Hasselt et al. (2015).
- Q-learning is known to have a tendency to overestimate action values as training progresses; this is often due to the fact that the same table is used both for action selection and for action evaluation.
- Double Q-learning addresses this issue by using two sets of DQN weights: one used for evaluation, and the other for selection. They are symmetrically updated, by switching their roles after each training step of the algorithm.
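The decoupling of selection from evaluation shows up only in how the target is formed; a minimal sketch (the two constant Q estimates are made up to make the effect visible):

```python
GAMMA = 0.99

def double_q_target(r, s_next, q_select, q_eval, actions):
    """Select the action with one set of weights, evaluate it with the other.

    Standard target:  r + gamma * max_a q(s_next, a)     (select = evaluate)
    Double target:    r + gamma * q_eval(s_next, argmax_a q_select(s_next, a))
    """
    a_star = max(actions, key=lambda a: q_select(s_next, a))
    return r + GAMMA * q_eval(s_next, a_star)

# Toy illustration: q_select grossly overrates action 1, but the target uses
# q_eval's (lower) estimate of that action, damping the overestimation.
q_select = lambda s, a: {0: 1.0, 1: 5.0}[a]
q_eval   = lambda s, a: {0: 1.2, 1: 2.0}[a]
print(double_q_target(0.0, "s1", q_select, q_eval, actions=(0, 1)))  # 1.98
```

With a single network the same target would be 0.99 · 5.0 = 4.95, propagating the overestimate; the double target uses 0.99 · 2.0 instead.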
Proposed extensions
Three-dimensional setting
Currently evaluating the feasibility of DQN (along with several extensions) in the context of 3D games.
Proposed extensions
Ideas
- Investigating more powerful neural network architectures for the DQN;
- Leveraging unsupervised methods to extract the "most important" regions of the frame as the training progresses;
- Incorporating sound information (completely ignored in prior work), requiring recurrent neural networks (RNNs).
- All of the above was probably infeasible only a few years ago! The development of GPUs has made it all easier for us.
Proposed extensions
Thank you!
Questions?