

ACE (Actor–Critic–Explorer) paradigm for reinforcement learning in basal ganglia: Highlighting the role of subthalamic and pallidal nuclei

Denny Joseph b, Garipelli Gangadhar a, V. Srinivasa Chakravarthy b,*

a Machine Learning Group, IDIAP Research Institute, CH-1920 Martigny, Switzerland
b Department of Biotechnology, Indian Institute of Technology Madras, Chennai 600036, India

Article info

Article history:

Received 3 May 2007

Received in revised form 24 November 2009

Accepted 1 March 2010

Communicated by R. Kozma

Available online 2 April 2010

Keywords:

Basal ganglia

Dopamine

Norepinephrine

Reinforcement learning

Reaching


Abstract

We present a comprehensive model of basal ganglia in which the three important reinforcement learning components – Actor, Critic and Explorer (ACE) – are represented and their anatomical substrates are identified. Particularly, we identify the subthalamic nucleus and globus pallidus externa (STN–GPe) loop as the Explorer, and argue that the complex activity of STN and GPe neurons, found in experimental studies, provides the stochastic drive necessary for exploration. Simulations involving a two-link arm model show task-dependent variations in the complexity of STN–GPe activity when the ACE network is trained to perform simple reaching movements. Complexity and average levels of STN–GPe activity are observed to be higher before training than in post-training conditions. Further, in order to simulate Parkinsonian conditions, when dopamine levels in the substantia nigra portion of the model are reduced, the arm displayed, as a primary change, small-amplitude movements, which, on persistent network training, amplified to large-amplitude unregulated movements reminiscent of Parkinsonian tremor.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Such interactions are a major source of information about the environment and ourselves. Learning from interaction is a foundational idea underlying all theories of intelligence and learning [56]. This form of learning, with no explicit teacher, is termed reinforcement learning (RL). RL has become popular in the past decade in the artificial intelligence community [37,53,56,59] as well as in communities related to neuroscience and its allied areas [16,17,35,36,46,51].

It has been suggested that in the mammalian brain, learning by reinforcement is a function of brain nuclei known as the basal ganglia (BG). Specifically, the idea that dopamine release from the substantia nigra pars compacta (SNc) and ventral tegmental area (VTA) encodes the error in the predicted future reward for the animal enjoys significant experimental support [52]. It is now believed that the BG use this reward-related information to modulate sensory–motor pathways so as to render future behaviors more rewarding. Thus, such a reinforcement signal helps to control the acquisition of learned behaviors [48].


Other than reward-based learning, the BG are thought to be involved in diverse functions like sequence learning [10], working memory [25], action selection [9,47], action gating [61], motor preparation [12], timing [14], etc.

Though enormous progress has been made in terms of anatomy, pathology, electrophysiology, and imaging studies related to the BG, a comprehensive understanding of the contribution of these nuclei to behavioral control still remains elusive. There is a strong need for functional models which can assimilate the constraints imposed by neurobiological data and, at the same time, simulate the various candidate behavioral functions that the BG nuclei are thought to subserve [1]. From reviews of computational models of BG [44,25] it appears that most BG models are exclusive and capture specific functional roles. There are models that describe the role of BG in action gating [61], in action selection between competing actions [9,47], in sustaining working memory representations [25], in sequence learning [10], and most importantly in reinforcement learning [5,34,52]. In a recent review, Buhusi and Meck [14] highlight that the BG have a role in timing. An immense challenge that lies ahead of BG modelers is to forge the many exclusive albeit useful insights embodied by the models noted above into a single integrated framework.

Although exploration and exploitation are equally important in an RL framework, literature concerned with the role of the BG in RL seems to focus mainly on the reward signal – its chemical messenger, its anatomical site, and its consequences in learning,


etc. – but only presents a summary treatment of exploration. Even in studies where exploration is discussed, no anatomical site for the exploratory signal is hypothesized [34]. It is often said that activity of the dopaminergic cells in the SNc and/or the ventral tegmental area (VTA) signals reward [34]. But which part of the BG generates the stochastic signal necessary for exploration?

Experimental studies of activity in the subthalamic nucleus (STN) and globus pallidus externa (GPe) have revealed that under dopamine-depleted circumstances (analogous to Parkinsonian conditions), the activity of these nuclei exhibited a dramatic increase in correlations among neurons, even though not much reduction in firing rate was observed [8,11,13,58]. Correlated activity of neurons in the STN–GPe loop has been linked to Parkinsonian tremor frequencies [58]. Complex activity of the STN–GPe loop in normal BG, and its loss under Parkinsonian conditions, has been attributed a deep functional significance, and is interpreted as a source of the stochastic signal required by RL [55]. The model of Sridharan et al. [55] describes a simulated Morris water pool experiment, wherein a virtual rat explores for a hidden platform with the help of visible landmarks. When the platform (i.e. the set of landmarks associated with it) was invisible, the STN–GPe loop in the model exhibited complex activity, reflecting exploratory behavior. When the platform (i.e. its associated landmarks) fell within view, the STN–GPe activity dramatically switched to regular activity.

The functional architecture of BG proposed in this paper is mainly based on the idea that BG is involved in error correction. Motor deficits in disorders associated with basal ganglia (Parkinson's disease, Huntington's disease (HD), etc.) are thought to arise from impaired error correction. This idea has been cogently reviewed in [32]. In a study involving HD patients, the subjects were asked to make reaching movements to one of 8 surrounding targets [54]. Although HD subjects were comparable to normals in their ability to execute the initial trajectory of the movement, they had trouble correcting the error accurately as they approached the end point. Furthermore, when brief perturbative forces were applied, corrective movements of normals were far more effective than those of HD patients [54]. Some of the motor deficits in PD have been attributed to slowed error correction mechanisms [2]. A similar idea was echoed by Rosvold [49], who suggested that the "caudate nucleus forms part of neural mechanisms for achieving error correction in the motor system [32]".

Starting from the above idea, in the present model reaching error is linked to reward, and RL is used to describe the role of BG in correcting error and shaping reaching movements. In this paper, we present a model of BG in which nearly every nucleus of BG is incorporated. Particularly, the model highlights the role of the STN–GPe loop in exploration. The model is attached to a simple arm model that is trained to reach a small number of targets. It will be shown that complex activity of the STN–GPe loop is essential for the arm to learn to make a successful reach. It will also be shown that training the model under dopamine deficient conditions progressively results in a defective reach, with the arm displaying Parkinsonian-like tremor.

The paper is outlined as follows: Section 2 describes the proposed BG model, and Section 3 describes the results of simulations of the model. A discussion of the results is presented in the following section.

Fig. 1. ACE architecture.

2. Basal ganglia model

In this section, the proposed BG model, its architecture, training, and function are discussed. Detailed resemblances between the model architecture and known BG anatomy are also pointed out.

2.1. Functional anatomy of the basal ganglia

The basal ganglia receive inputs from most of the sensory and motor areas of the cerebral cortex, including primary and secondary somatosensory areas, primary motor cortex (M1), and a variety of premotor areas, including the supplementary motor area (SMA) and the dorsal and ventral premotor areas. BG consist of five extensively connected subcortical nuclei: caudate nucleus, putamen, globus pallidus, STN, and substantia nigra (pars compacta, SNc, and pars reticulata, SNr) (Fig. 1). The input nucleus of the BG is the striatum (caudate + putamen). Axons of dopaminergic neurons in the SNc project onto the striatum. There are two pathways from the striatum to the globus pallidus internal (GPi), one of the output nuclei of the BG. In the direct pathway, neurons in the striatum directly project onto the GPi. The other pathway, namely the indirect pathway, connects the striatum, GPe, STN, and GPi in that order. There also exist excitatory and inhibitory recurrent connections between the STN and the GPe.

2.2. Model architecture

The proposed BG model (Fig. 1) consists of three main components: the Actor, the Critic, and the Explorer. The Actor represents the sensory–motor cortical pathway, where the results of learning a motor task are consolidated. The Critic represents the striatum or the corticostriatal network. The Explorer, which is the new element in our BG model, represents the STN–GPe loop. The ACE model is coupled to a simple two-joint, four-muscle arm model (Fig. 2(a,b)). The goal of the network is to learn to reach a given target out of eight targets located on a circle within the arm's workspace, when instructed to do so (Fig. 2(c)). Reaching errors due to the Actor are corrected by perturbations arising out of the indirect pathway. Thus, the direct pathway connections between striatum and GPi are omitted in the model. Similarly, direct connections from the cortex to the STN, comprising the so-called hyperdirect pathway, are also ignored in the present model. The temporal difference (TD) error, which represents the dopamine signal, δ, is thought to be computed within the feedback loop: striatum → SNc → striatum. The Critic is thought to be implemented within the striatum itself, where the Value is computed as a function of corticostriatal inputs.

Page 3: ACE (Actor–Critic–Explorer) paradigm for reinforcement learning in basal ganglia: Highlighting the role of subthalamic and pallidal nuclei

Fig. 2. (a) A simple muscle model based on a spring–damper system, (b) a single link configuration with an agonist and an antagonist muscle, and (c) the shaded region represents the workspace of the arm.


2.3. Arm model

The arm model consists of two links with four muscles, as shown in Fig. 2(b). A single joint controlled by two muscles is shown in Fig. 2(b). For the realization of the reaching task, it is assumed that the muscles of the arm model are driven by the neural activation from the Actor. A simple muscle model is assumed, which consists of a spring and damper system as shown in Fig. 2(a) and whose dynamics are described by Eqs. (1) and (2). The resting length, Li, and tension, Ti, of the simple (lumped) muscle model are controlled by the neural activation as,

$$L_i(u_i) = a_{Muscle}(V_0 - u_i) \qquad (1)$$

$$T_i = k_i\left(x_i - L_i(u_i)\right) + b_i\,\frac{dx_i}{dt} \qquad (2)$$

where V0 and aMuscle are constants, ui is the neural input to the ith muscle, ki is the spring constant, xi is the actual length of the muscle, and bi is the damper coefficient. The effect of a given set of neural activations to the agonist and antagonist muscles is to place the arm in an equilibrium configuration. Note that the mapping from neural activations to arm configurations is many-to-one. Activations of an agonist and antagonist pair at a given joint can be increased in such a way that the joint angle remains the same. Therefore, to avoid multiple solutions for a single joint angle, the following constraint is placed on the neural activations to agonist and antagonist muscles, i.e., on their "resting lengths":

$$L_1 + L_2 = L_3 + L_4 = C \qquad (3)$$

where C is a constant, L1 and L2 are the resting lengths of the muscles corresponding to the shoulder joint 'A', while L3 and L4 are those of the elbow joint 'B' (Fig. 2(b)). The neural activations (u1 and u3) are calculated using the outputs of the Actor, ga and gb, as u1 = ga and u3 = gb. Combining Eq. (1) and the constraint of Eq. (3), we can express u2 and u4 in terms of u1 and u3 as follows: $u_1 + u_2 = (2 a_{Muscle} V_0 - C)/a_{Muscle} = u_3 + u_4$. For the given neural inputs, the resting lengths of all the muscles can be calculated using Eqs. (1) and (2). The joint angle corresponding to these neural activations

(valid for both the joint angles θ1 and θ2) is obtained by solving,

$$\frac{d\theta}{dt} = \frac{-\left(\dfrac{h_2 r_2}{x_2}\right)k_2(x_2 - L_2) - k_1(x_1 - L_1)\left(\dfrac{h_1 r_1}{x_1}\right)}{\left(\dfrac{2 h_1^2 r_1^2}{x_2 x_1}\right)b_1\sin\theta + \left(\dfrac{2 h_2^2 r_2^2}{x_1 x_2}\right)b_2\sin\theta} \qquad (4)$$

where r1 and r2 are the pivot lengths (at points P1 and P2), and h1 is the height of the pivot on Link-1 (i.e., the distance from point Q1 to the joint 'A') (Fig. 2(b)). Then x1 and x2 are calculated using the cosine rule as,

$$x_1^2 = h_1^2 + r_1^2 + 2 h_1 r_1 \cos\theta \qquad (5)$$

$$x_2^2 = h_2^2 + r_2^2 + 2 h_2 r_2 \cos\theta \qquad (6)$$

Eq. (4) is derived by equating the moments due to the tensions in the agonist and the antagonist muscles at the joints. The position of the end effector, which is a function of the joint angles, is calculated as,

$$x_e = l_1\cos\theta_1 + l_2\cos(\theta_1 + \theta_2) \qquad (7)$$

$$y_e = l_1\sin\theta_1 + l_2\sin(\theta_1 + \theta_2) \qquad (8)$$

where l1 and l2 are the lengths of Links 1 and 2, respectively. The region of space that may be reached by the two-link arm model is called the 'workspace' of the arm. The workspace of the current model with r1 = r2 = r3 = r4, h1 = h2, and l1 = l2 is shown in Fig. 2(c). The arm-related parameters used in the simulations of the next section are: h1 = 8 = h2; r1 = 8 = r2; k1 = 5 = k2; b1 = 5 = b2; V0 = 10; C = 10.
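For readers who prefer code, a minimal numerical sketch of the muscle and kinematic relations above (Eqs. (1), (3) and (5)–(8)) is given below. It is not the authors' implementation: the value of aMuscle and the link lengths are not listed in the text and are assumed here, and the function names are ours.

```python
import numpy as np

# Parameters from Section 2.3 (a_muscle and link lengths are illustrative
# assumptions; they are not listed with the other constants in the text).
h1 = h2 = 8.0
r1 = r2 = 8.0
k1 = k2 = 5.0
b1 = b2 = 5.0
V0 = 10.0
C = 10.0
a_muscle = 1.0
l1 = l2 = 1.0

def resting_length(u):
    """Eq. (1): resting length of a muscle as a function of neural input u."""
    return a_muscle * (V0 - u)

def antagonist_input(u):
    """Constraint of Eq. (3): u1 + u2 = (2*a*V0 - C)/a = u3 + u4."""
    return (2.0 * a_muscle * V0 - C) / a_muscle - u

def muscle_lengths(theta, h, r):
    """Eqs. (5)-(6): cosine-rule length of a muscle for joint angle theta."""
    return np.sqrt(h**2 + r**2 + 2.0 * h * r * np.cos(theta))

def end_effector(theta1, theta2):
    """Eqs. (7)-(8): planar forward kinematics of the two-link arm."""
    xe = l1 * np.cos(theta1) + l2 * np.cos(theta1 + theta2)
    ye = l1 * np.sin(theta1) + l2 * np.sin(theta1 + theta2)
    return xe, ye

if __name__ == "__main__":
    u1 = 3.0                       # activation of one shoulder muscle
    u2 = antagonist_input(u1)      # its antagonist, fixed by Eq. (3)
    print("resting lengths:", resting_length(u1), resting_length(u2))
    print("end effector at (45deg, 30deg):",
          end_effector(np.deg2rad(45), np.deg2rad(30)))
```

With these values the two resting lengths sum to C, as the constraint of Eq. (3) requires.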

2.4. Actor

The Actor performs actions on the environment so as to achieve a desired goal. It receives the command, xA, and performs an action, G = [ga, gb].

The Actor is modeled as a perceptron neural network [27], with eight input nodes representing the eight targets to be reached, and two output nodes, which generate the muscle activations needed for the arm model at Joint-A and Joint-B as ga and gb, respectively (Fig. 3(a)). The input to the network is given


Fig. 3. (a) Architecture of the Actor network and (b) the 2D arm and the targets to be reached.

Fig. 4. Architecture of the Critic network.

Fig. 5. Architecture of the Explorer.


by, xA = {1, −1, −1, −1, −1, −1, −1, −1} for target #1, for example, which communicates to the Actor its present goal. The position of the single +1 in the eight-dimensional vector specifies the target to be reached. The output of the Actor is computed as a weighted summation of the input, xA, and the corrective signal from the Explorer, and is given by,

$$g_a = \frac{1}{1 + e^{-\lambda_a\left(W_A^{a\prime} x_A + \Delta g_a\right)}} \qquad (9)$$

$$g_b = \frac{1}{1 + e^{-\lambda_a\left(W_A^{b\prime} x_A + \Delta g_b\right)}} \qquad (10)$$

where Δga and Δgb are the corrective signals from the Explorer network, $W_A^a$ and $W_A^b$ are the weights connecting the input, xA, to the muscle activation layer, G = [ga, gb], and λa (= 0.6) is the slope of the nonlinearity. The muscle activations cause the arm model to settle at a point in the workspace, which represents the equilibrium position of the arm dynamics.
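A minimal sketch of the Actor's forward pass (Eqs. (9)–(10)) follows; the random weights and the way the Explorer corrections are passed in are illustrative assumptions, not the trained network of the paper.

```python
import numpy as np

lambda_a = 0.6   # slope of the output nonlinearity (Section 2.4)

def actor_output(W_a, W_b, x_A, dg_a=0.0, dg_b=0.0):
    """Eqs. (9)-(10): sigmoid outputs g_a, g_b for the target-selection
    vector x_A and the Explorer's corrective signals (dg_a, dg_b)."""
    g_a = 1.0 / (1.0 + np.exp(-lambda_a * (W_a @ x_A + dg_a)))
    g_b = 1.0 / (1.0 + np.exp(-lambda_a * (W_b @ x_A + dg_b)))
    return g_a, g_b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_A = -np.ones(8)
    x_A[0] = 1.0                       # select target #1
    W_a = rng.normal(scale=0.1, size=8)
    W_b = rng.normal(scale=0.1, size=8)
    print(actor_output(W_a, W_b, x_A))
```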

2.5. Critic

The function of the Critic is to "assign value". The Critic assigns each output of the Actor a Value; this Value depends on the goal of the Actor. Accurate training of the Critic is paramount, since it provides the gradient information necessary for guiding the arm as it traverses the workspace searching for the target. A radial basis function (RBF) network [31] is used to implement the Critic. It receives as input a concatenated vector, consisting of the target selection vector, which is the current input to the Actor, and the muscle activation, G = [ga, gb], output by the Actor when instructed to reach the aforementioned target. Thus, the input to the Critic network is given by xC = [xA, ga, gb]. The network has one hidden layer, consisting of h (= 7) radial basis functions, each of which is a Gaussian with standard deviation σ (= 2). The output layer consists of a single neuron, whose output is the "Value", Q(t) (Fig. 4). Neuroanatomically, the Critic is assumed to be computed within the striatum, with the striatal interneurons playing the role of the hidden neurons of the RBF network.
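A small sketch of such an RBF Critic is given below. The centre locations and output weights used here are random placeholders; in the paper they result from the offline training described in Section 2.13.

```python
import numpy as np

def rbf_critic(x_C, centres, w_out, sigma=2.0):
    """Value Q for input x_C = [x_A, g_a, g_b] through a Gaussian RBF layer.
    centres: (h, d) array of RBF centres; w_out: (h,) output weights."""
    d2 = np.sum((centres - x_C) ** 2, axis=1)          # squared distances
    phi = np.exp(-d2 / (2.0 * sigma ** 2))             # hidden activations
    return float(w_out @ phi)                          # scalar Value Q

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x_A = -np.ones(8); x_A[2] = 1.0
    x_C = np.concatenate([x_A, [0.4, 0.6]])            # [x_A, g_a, g_b]
    centres = rng.normal(size=(7, 10))                  # h = 7 hidden units
    w_out = rng.normal(size=7)
    print("Q =", rbf_critic(x_C, centres, w_out))
```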

2.6. Explorer

The Explorer network has three stages: an input stage, which represents the striatum, a hidden layer (which is actually a double layer) representing the STN–GPe loop, and an output stage that represents the GPi (Fig. 5). The Explorer network input, xE, is given as xE = [xA, ga, gb] (which is the same as xC), where xA is the input to the Actor and [ga, gb] is the output of the Actor.

The processing of the input among these stages may be described by the following:

$$I^{GPe} = W_E^1\, x_E \qquad (11)$$

where $I^{GPe}$ is the input to the GPe network (described below) and $W_E^1$ represents the weights connecting the input layer to the GPe


network. The networks of the STN and GPe are implemented in a 2D grid fashion (Eqs. (17)–(19)). The output, Δg = [Δga, Δgb], of the Explorer is a corrective signal to the Actor and is given by

$$\Delta g_a = W_{Ea}^2\, U^{STN} \qquad (12)$$

$$\Delta g_b = W_{Eb}^2\, U^{STN} \qquad (13)$$

where $U^{STN}$ is the output of the STN layer, and $W_{Ea}^2$ and $W_{Eb}^2$ represent the weights connecting the STN layer to the output stage.

2.7. Architecture of the STN–GPe network

The STN–GPe network is implemented as a system consisting of a pair of layers connected in a feedforward positive, and feedback negative fashion (Fig. 6(a)). Terman et al. [58] performed simulations of detailed conductance-based models of neurons in the STN–GPe loop and showed the existence of various regimes of operation of the STN–GPe loop, viz., clustered activity, traveling waves and repetitive spiking. The regime of operation depends on the architecture – the pattern of connections both within and between these layers – and also on the strength of these connections [58]. Low-frequency periodicity (4–30 Hz) of firing and dramatically increased correlations among neurons in the STN–GPe system are observed in experimental preparations under dopamine deficient conditions [13,33,40,45]. Terman et al. [58] suggest that the destruction of dopaminergic neurons in Parkinson's disease and the consequent change in the architecture of connections are a possible explanation for these observations. An oscillatory neural network is designed (Fig. 6(b) and Eqs. (14)–(16)) based on the neurobiological data discussed above.

Fig. 6. (a) An STN–GPe neuron pair exhibiting excitatory–inhibitory connections, (b) a 2D grid of such STN–GPe neuron pairs, and (c) the lateral connection strengths in the GPe network, with 'a' as the height of the inverted Gaussian and 'e' as the positive bias to it.

A single STN–GPe neuron pair with glutamatergic (excitatory) and GABAergic (inhibitory) connections is shown in Fig. 6(a). On par with the level of abstraction used in other components of our model, we present a simplified model of the STN–GPe system. Accordingly, the dynamics of the GPe neuron are given by,

$$\tau_g\,\frac{dx}{dt} = -x + U^{GPe} + U^{STN} + I^{GPe} \qquad (14)$$

$$U^{GPe} = \tanh(\lambda_s x) \qquad (15)$$

where $U^{GPe}$ denotes the output of the GPe neuron with internal state $x$, $I^{GPe}$ is the external input to the GPe neuron, $U^{STN}$ is the state of the STN neuron, $\tau_g$ is the time constant, and $\lambda_s$ controls the slope of the 'tanh' function. Similarly, the dynamics of the STN neuron are given by,

$$\tau_s\,\frac{dU^{STN}}{dt} = -U^{STN} - U^{GPe} \qquad (16)$$

where $\tau_s$ is the time constant. Note that while $U^{GPe}$ has an inhibitory influence on $U^{STN}$, $U^{STN}$ in turn excites $U^{GPe}$. Oscillations are produced by the above system, but only within certain limits of the external input $I^{GPe}$. Existence of a limit cycle is proved in Appendix A. Note that the models of the STN and GPe neurons are not biophysical models but abstract models, with neuron output taking both positive and negative values. The only thing we would like to emphasize in the STN–GPe part of the model is that an excitatory–inhibitory loop system like the STN–GPe is capable of producing oscillations, which is exploited in the model.
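To make the excitatory–inhibitory loop of Eqs. (14)–(16) concrete, the sketch below integrates a single STN–GPe pair with forward Euler (the integration scheme named in Section 3). The time constants, slope and input used here are illustrative assumptions; the text does not fix their values at this point.

```python
import numpy as np

def simulate_pair(I_gpe, tau_g=1.0, tau_s=5.0, lam_s=3.0, dt=0.01, T=50.0):
    """Forward-Euler integration of the single STN-GPe pair, Eqs. (14)-(16)."""
    n = int(T / dt)
    x, u_stn = 0.1, 0.0                       # GPe internal state and STN state
    trace = np.zeros((n, 2))
    for t in range(n):
        u_gpe = np.tanh(lam_s * x)                         # Eq. (15)
        dx = (-x + u_gpe + u_stn + I_gpe) / tau_g          # Eq. (14)
        du = (-u_stn - u_gpe) / tau_s                      # Eq. (16)
        x += dt * dx
        u_stn += dt * du
        trace[t] = (u_gpe, u_stn)
    return trace

if __name__ == "__main__":
    tr = simulate_pair(I_gpe=0.2)
    # For a suitable range of I_gpe the pair settles into sustained oscillation;
    # the swing of U_GPe over the second half of the run indicates this.
    print("U_GPe swing:", float(np.ptp(tr[len(tr) // 2:, 0])))
```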

The pair of neurons described above is replicated and connected in a 2D grid fashion to realize the STN–GPe loop, as shown in Fig. 6(b). The connections between these nuclei are assumed to be one-to-one, with the inclusion of lateral connections in the GPe layer and no lateral connections in the STN layer. Lateral connections in the GPe layer are calculated using Eq. (20) (Fig. 6(c)). Each of these layers is implemented in a 2D grid



fashion, and the dynamics of the network are given by,

$$\tau_g\,\frac{dx_{ij}}{dt} = -x_{ij} + \sum_{q=1}^{n}\sum_{p=1}^{n} W^{lat}_{ij,pq}\,U^{GPe}_{pq} + U^{STN}_{ij} + I^{GPe}_{ij} + I^{Ne}_{ij} \qquad (17)$$

$$U^{GPe}_{ij} = \tanh(\lambda x_{ij}) \qquad (18)$$

$$\tau_s\,\frac{dU^{STN}_{ij}}{dt} = -U^{STN}_{ij} - U^{GPe}_{ij} \qquad (19)$$

where (i, j) and (p, q) denote the neuron positions on the 2D grid, n is the size of the 2D grid, $x_{ij}$ is the internal state of the (i, j)th neuron on the GPe grid, $U^{STN}_{ij}$ is the state of the (i, j)th neuron on the STN grid, $U^{GPe}_{ij}$ is the output of the (i, j)th neuron in the GPe grid, and $I^{Ne}_{ij}$ is the input to the (i, j)th GPe neuron that accounts for the effect of norepinephrine (DNe) on the STN–GPe layer (this is explained further in Section 2.8). The lateral connections within the GPe layer are assumed to be translation invariant and are given by (Fig. 6(c)),

$$W^{lat}_{ij,pq} = \begin{cases} e - a\,\exp\!\left(-r_{lat}^2/\sigma_{lat}^2\right) & \text{for } r_{lat} < R \\ 0 & \text{otherwise} \end{cases} \qquad (20)$$

where $r_{lat} = [(i-p)^2 + (j-q)^2]^{1/2}$ is the distance between the (i, j)th neuron and the (p, q)th neuron on the 2D grid, a controls the depth of the Gaussian bell function with $\sigma_{lat}$ as its width, and R is the neighborhood size. Thus each unit has a negative center and a positive surround; the relative sizes of the center and surround are determined by e. Smaller e implies more negative lateral GPe connections. GPe neurons are known to be inhibitory (GABAergic). However, it is not known whether or not they have self-connections. In the absence of specific data, we assumed a continuous neighborhood profile, which implies a strong self-inhibition of GPe neurons.
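A small sketch of the lateral-weight profile of Eq. (20) follows; the particular values of a, σ_lat, R and e are illustrative only.

```python
import numpy as np

def lateral_weights(n, a=1.0, sigma_lat=2.0, R=4.0, e=0.2):
    """Eq. (20): translation-invariant lateral weights in the GPe layer.
    Returns a 4-D array W[i, j, p, q] with a negative (inhibitory) centre of
    depth 'a' and a positive bias 'e', cut off beyond radius R."""
    W = np.zeros((n, n, n, n))
    for i in range(n):
        for j in range(n):
            for p in range(n):
                for q in range(n):
                    r_lat = np.hypot(i - p, j - q)
                    if r_lat < R:
                        W[i, j, p, q] = e - a * np.exp(-r_lat**2 / sigma_lat**2)
    return W

if __name__ == "__main__":
    W = lateral_weights(n=10)
    print("self weight (centre):", W[5, 5, 5, 5])   # e - a: most negative
    print("weight at distance 3:", W[5, 5, 5, 8])   # closer to the bias e
```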

In the absence of input from the input layer (i.e., $I^{GPe}_{ij} = 0$), as e is varied from 0 to a, the activity of the GPe traverses through three different regimes: (1) uncorrelated activity, (2) traveling waves, and (3) clustering (Fig. 7).

Similar dynamic regimes have also been observed in more detailed conductance-based models of the STN–GPe system [58]. Operation of the network in the first regime – uncorrelated activity – is most crucial, since it is the complexity in the activity of the STN–GPe system that helps the network extensively explore the output space.

Fig. 7. Regimes of operation of the STN–GPe loop. Three characteristic patterns of activity of the GPe layer are shown: (a) uncorrelated, (b) traveling waves, and (c) clustering. The three activity regimes (from left to right) are obtained by progressively increasing e from 0 to a. Increasing e increases the percentage of positive lateral connections in the GPe. In regime (c), the clustering regime, the array splits into a center and a surround, with neurons in either region forming a synchronized cluster. The size of the simulated GPe array is 30×30 (white = +1 and black = −1).

2.8. The norepinephrine feedback loop

Daw et al. [15] have suggested that subcortical structures may be implicated in the control of exploration, with norepinephrine regulating the global propensity to explore [21,60]. Discussing possible roles of various neuromodulators (dopamine, serotonin, norepinephrine, and acetylcholine) in brain function seen from an RL perspective, Doya [21] hypothesized that: (i) dopamine represents the global learning signal for the prediction of rewards and the reinforcement of actions, (ii) serotonin controls the balance between short-term and long-term prediction of reward, (iii) norepinephrine controls the balance between wide exploration and focused execution, and (iv) acetylcholine controls the balance between memory storage and renewal. Specifically, in support of the hypothesized role of norepinephrine in exploration, Doya [21] points out that noradrenergic neurons in the locus ceruleus (LC) are activated in emergency situations. Further, it is known that phasic response in the LC neurons at the time of stimulus presentation is correlated with a high accuracy of response [3]. There is also evidence to connect norepinephrine with the level of activity in the globus pallidus [50]. Such perceptions are much in tune with our idea of implicating the STN–GPe loop in exploratory behavior [55].

Keeping in line with the above-described perspective on the role of norepinephrine in exploration, in the present model we assume that norepinephrine activity controls the activity level of GPe neurons (and, therefore, STN neurons), and that the norepinephrine activity is in turn dependent on dopamine levels. Accordingly, we designate a quantity called DNe, which signifies the level of norepinephrine in the GPe and which also depends on the level of dopamine. We also assume that DNe controls the overall activity level (number of active neurons) in the STN. Further, we assume that the precise form of the DNe dependence on dopamine, δ, is given by (Fig. 8):

$$D_{Ne} = A_1\left(1 - \frac{1}{1 + e^{-(a\,\delta(t) - b)}}\right) + A_2 \qquad (21)$$

The value of DNe decreases from a maximum value of (A1 + A2) to a minimum value of A2 with increase in the value of δ. The parameters a and b control the slope and the bias of the nonlinearity, respectively.
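Eq. (21) in one line of code, using the 'normal condition' parameters quoted in Section 3.4 and the decreasing dependence of DNe on δ described above:

```python
import numpy as np

def d_ne(delta, A1=45.0, A2=5.0, a=3.0, b=0.33):
    """Eq. (21): norepinephrine level in the GPe as a decreasing function of
    the dopamine signal delta; ranges from A1 + A2 down to A2."""
    return A1 * (1.0 - 1.0 / (1.0 + np.exp(-(a * delta - b)))) + A2

if __name__ == "__main__":
    for d in (-1.0, 0.0, 1.0):
        print(d, round(float(d_ne(d)), 2))   # larger delta -> smaller D_Ne
```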

Now, a negative feedback loop to the GPe layer is designed to control the activity level (number of active neurons) of the GPe layer using the DNe signal. This is achieved as follows:

$$D_a = \frac{1}{2}\sum_{i,j}^{N}\left(U^{GPe}_{ij} + 1\right) \qquad (22)$$



Fig. 8. Variation of DNe with δ (A1 = 45, A2 = 5, a = 3, and b = 0.33).


$$e = D_{Ne} - D_a \qquad (23)$$

$$\tau\,\frac{dE}{dt} = \tanh(\lambda_g e) \qquad (24)$$

$$I^{Ne}_{ij} = E - \frac{N}{2} \qquad (25)$$

where $D_a$ is the actual number of active units, N is the number of neurons in the GPe layer, and e denotes the discrepancy between the actual number of active units, $D_a$, and the DNe level at any given instant. This discrepancy is accumulated in E. The parameter $\lambda_g$ (= 10) controls the slope of the 'tanh' function and $\tau$ is the time constant of the feedback loop. $I^{Ne}_{ij}$ is the input to the GPe neuron, as shown in Eq. (17).
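A compact sketch of one step of the feedback loop of Eqs. (22)–(25) is given below; the time constant and the Euler step size are illustrative assumptions.

```python
import numpy as np

def ne_feedback_step(U_gpe, E, D_ne, tau=10.0, lam_g=10.0, dt=0.1):
    """One Euler step of the norepinephrine feedback loop, Eqs. (22)-(25).
    U_gpe: (n, n) array of GPe outputs in (-1, 1); E: accumulated discrepancy."""
    N = U_gpe.size
    D_a = 0.5 * np.sum(U_gpe + 1.0)          # Eq. (22): number of active units
    e = D_ne - D_a                           # Eq. (23): discrepancy
    E = E + dt * np.tanh(lam_g * e) / tau    # Eq. (24): accumulate discrepancy
    I_ne = E - N / 2.0                       # Eq. (25): drive to every GPe cell
    return I_ne, E

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    U = np.tanh(rng.normal(size=(10, 10)))
    I_ne, E = ne_feedback_step(U, E=50.0, D_ne=50.0)
    print("I_ne =", round(float(I_ne), 3), " E =", round(float(E), 3))
```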

From Eqs. (18) and (19), note that $U^{GPe}_{ij}$ is bounded (within (−1, 1)) and therefore $U^{STN}_{ij}$, which simply follows $-U^{GPe}_{ij}$ with a delay, also ends up being bounded. Therefore, $U^{STN}_{ij}$ approximately keeps switching between the limits of 1 and −1 (not necessarily periodically).

The value of DNe controls not only the average activity of the GPe layer but also its "complexity." For example, let us assume two 10×10 grids of neurons representing the GPe and STN layers, respectively, and assume DNe = 50. It was mentioned earlier that DNe controls the number of active neurons in the GPe (neurons in the 'ON' state) at any given time. So DNe = 50 results in any 50 neurons out of the 100 neurons being in the 'ON' state. The number of different states the loop can assume is highest for this value of DNe, viz. 50, because $C^{100}_{50} > C^{100}_{n}$ for any other n. Note that here $C^{n}_{m}$ denotes the number of combinations of m items taken out of n. The network travels through these states in a pseudo-random fashion. But how many of these states are actually visited by the network depends on the dynamic regime in which the network functions. When the network is in the uncorrelated mode, it can visit a larger number of states than when it is in the traveling-wave mode or in the clustering mode. Accordingly, $C^{100}_{50}$ is the highest number of states that the network can access when DNe = 50, though in actual practice it may never visit many of those states. Thus, with DNe less than 50, the number of possible states the network can visit is less than that possible with DNe = 50. Simulation results, showing how DNe influences the complexity of the STN–GPe dynamics, are shown in Fig. 9. We use a parameter called effective dimension (E-dim) to quantify the randomness of the activity in the STN–GPe layer (see Appendix B for details).
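Appendix B, which defines E-dim, is not reproduced in this excerpt. As a purely illustrative stand-in, one common PCA-style notion of the effective dimension of a set of activity snapshots is sketched below; it is not necessarily the measure used by the authors.

```python
import numpy as np

def effective_dimension(activity, energy=0.95):
    """Illustrative effective dimension: number of principal components needed
    to capture a given fraction of the variance of activity snapshots.
    activity: (timesteps, units) array of STN or GPe outputs."""
    centred = activity - activity.mean(axis=0)
    s = np.linalg.svd(centred, compute_uv=False)
    var = s ** 2
    cum = np.cumsum(var) / var.sum()
    return int(np.searchsorted(cum, energy) + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    noisy = rng.normal(size=(500, 100))                 # uncorrelated activity
    clustered = np.outer(np.sin(np.linspace(0, 20, 500)), rng.normal(size=100))
    print("uncorrelated:", effective_dimension(noisy))
    print("clustered   :", effective_dimension(clustered
                                                + 0.01 * rng.normal(size=(500, 100))))
```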

2.9. Computation of reward

Reward is the result of the interaction of the Actor with the environment and plays a crucial role in Actor training. Accordingly, a reward of +1 is administered if the arm settles to within a threshold distance, $d_{thresh}$, from the target point. Also, a negative reward of −0.3 is given if the arm activations take on extreme values (i.e., outside the range 0.06 < ga, gb < 0.94). To speed up the learning process, in the initial few epochs of training, $d_{thresh}$ is maintained sufficiently high; its value is reduced as a function of the number of epochs as follows: $d_{thresh} = 2\exp(-n_{epochs}/50) + 0.1$.
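The reward rule of Section 2.9 in a short sketch. The reading that "extreme values" means activations outside the band 0.06–0.94, the zero reward in all other cases, and the priority of the +1 case are our assumptions about details the text leaves implicit.

```python
import numpy as np

def reward(end_effector, target, g_a, g_b, n_epochs):
    """Reward of Section 2.9: +1 within d_thresh of the target, -0.3 for
    extreme muscle activations, 0 otherwise; d_thresh shrinks with epochs."""
    d_thresh = 2.0 * np.exp(-n_epochs / 50.0) + 0.1
    if np.linalg.norm(np.asarray(end_effector) - np.asarray(target)) < d_thresh:
        return 1.0
    if not (0.06 < g_a < 0.94) or not (0.06 < g_b < 0.94):
        return -0.3
    return 0.0

if __name__ == "__main__":
    print(reward((0.5, 0.5), (0.6, 0.6), 0.5, 0.5, n_epochs=0))     # reach: +1
    print(reward((0.5, 0.5), (5.0, 5.0), 0.99, 0.5, n_epochs=200))  # extreme g_a
```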

2.10. Temporal difference error (δ)

The temporal difference (TD) error, δ(t), is calculated as follows:

$$\delta(t) = r(t) + \gamma\,Q(t) - Q(t-1) \qquad (26)$$

where r(t) is the reward obtained for the action of the Actor at time t, Q(t) is the Value assigned to the State–Action pair at time t, and Q(t−1) is the Value corresponding to the State–Action pair at time t−1. The discount factor, γ, is assumed to be 1.0. This method of calculating δ(t) is similar to SARSA(λ) [1]. Throughout the paper, δ(t) is interpreted as the firing activity of SNc and the corresponding changes in dopamine concentration.
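Eq. (26) in code, with γ = 1.0 as in the text:

```python
def td_error(r_t, q_t, q_prev, gamma=1.0):
    """Eq. (26): delta(t) = r(t) + gamma*Q(t) - Q(t-1)."""
    return r_t + gamma * q_t - q_prev

if __name__ == "__main__":
    print(td_error(r_t=1.0, q_t=0.8, q_prev=0.6))   # positive surprise: 1.2
```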

2.11. Model description

The input to the Actor is the target selection vector (e.g., xA = {1, −1, −1, −1, −1, −1, −1, −1} for target #1). The Actor is expected to output muscle activations (ga and gb for joints 'A' and 'B', respectively; Fig. 3(a)) that place the arm's end effector at the target location. Since the striatum receives both sensory and motor representations, the target selection pattern, xA, and the estimated muscle activations, (ga, gb), are presented as input to the Critic. The Critic uses this information and estimates the State–Action Value function [56]. The input xC (= [xA, ga, gb]) received by the Critic is also copied to the Explorer (i.e., the input to the Explorer is xE = xC). The Explorer generates a corrective output (Δga, Δgb), which represents "exploration" of the state space of possibilities in an attempt to determine the right muscle activations. The corrective output (Δga, Δgb) is sent to modify the estimated output (ga, gb) of the Actor. Training of the system proceeds as follows. Given a command to reach a specific target, the arm makes exploratory reaching movements. When the end effector strays close enough to the target by chance, the system is bestowed with reward, which is used for training the ACE components. With this general description of the ACE system, we now present a more detailed account of the information flow in the model.

The Critic and the Explorer help the Actor to reach the target on command. The Actor requires their help during the training period in order to reach the targets. However, once trained, the Actor, on being instructed to do so, can reach the target in an entirely feed-forward manner. The continued presence of the Critic and the Explorer is nevertheless required for the Actor to learn new goals for which it has not been trained. The Actor, Critic, and Explorer thus form a closely knit team geared to efficiently acquire the motor skill at hand.


Fig. 9. Snapshots of STN activity for various values of DNe. (a) DNe = 50, observed E-dim ≈ 96; (b) DNe = 20, observed E-dim ≈ 48; and (c) DNe = 5, observed E-dim ≈ 15. There is a consistent decrease in E-dim with decreasing DNe.

Fig. 10. Output of the Critic for different targets, with the x–y plane representing [ga, gb] values and the z-axis representing the value, Q.


2.12. Training

We split the process of ACE training into different stages. All the ACE components (Actor, Critic, and Explorer) are not trained simultaneously but are trained in an order. During the first stage, only the Critic is trained. In the next stage, the Critic is no longer trained; instead, the Actor and the Explorer are trained. Descriptions of the training methodology followed for the individual elements of the ACE model are given below.
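The staged schedule just described, in skeleton form. The helper callables (train_critic_offline, run_trial, update_actor, update_explorer) are placeholders standing in for the procedures of Sections 2.13–2.15, not functions defined in the paper.

```python
def train_ace(n_critic_epochs, n_actor_explorer_epochs,
              train_critic_offline, run_trial,
              update_actor, update_explorer):
    """Staged ACE training: first the Critic alone (offline), then the Actor
    and Explorer together, as described in Section 2.12."""
    # Stage 1: Critic only (offline, Section 2.13).
    train_critic_offline(n_critic_epochs)

    # Stage 2: Actor and Explorer in tandem, Critic frozen.
    for epoch in range(n_actor_explorer_epochs):
        for target in range(8):
            delta, traces = run_trial(target)     # one reach attempt
            update_actor(delta, traces)           # Eqs. (27)-(28)
            update_explorer(delta, traces)        # Eqs. (29)-(31)
```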

2.13. Critic training

The RBF network is trained off-line with two sets of input–output examples. The first set has as input a concatenated vector consisting of the target selection vector and the approximate muscle activations needed to reach the target (determined by a trial-and-error method); this input is paired with a desired output of 1, which represents the case of a 'high' value (i.e., Q = 1). In the second set, for the same target, muscle activations at a distance of one standard deviation, $\sigma_{tol}$ (= 2), from the activations of the first set are chosen. These inputs are paired with a desired output of 0.1, which corresponds to a 'low' value (i.e., Q = 0.1). After training, we observe that the network generalizes well, producing local peaks in the state space (Fig. 10). The network is trained offline using the neural network toolbox available in MATLAB.

The Value function modeled by the Critic in our model represents the effectiveness of the reach, i.e., the nearness of the end effector of the arm to the target position. We train the Critic off-line and use its predictions to train the Actor and the Explorer. The existence of such a "pretrained" Critic may be justified as follows. It is known that the posterior parietal cortex (PPC) contains neurons that respond when a successful reach coincides with the visual appearance of the hand with the grasped object [22]. Based on studies involving transcranial magnetic stimulation, Desmurget et al. [19] suggest that the PPC generates an internal representation of hand position and provides a dynamic reaching error, which is used by motor cortical areas for real-time correction of movement. Imaging studies on the role of the PPC in reaching also suggest that the visual and motor components of reaching may have different functional organization [29]. Functional magnetic resonance imaging studies of sequential motor learning by Bapi et al. [4] reveal that there is an early stage of visuo-spatial-based learning subserved by parietal areas and a late stage of motor-based learning subserved by motor cortical areas. Finally, projections from parietal association areas to the striatum are common knowledge of basal ganglia connectivity [62]. Thus, our Critic model is meant to represent the computations that probably occur in the striatum by processing the visuo-spatial information that comes from the parietal visuo-spatial machinery supporting reaching movements.

2.14. Actor training

The weights of the Actor are updated as per the following:

$$W_A^a = W_A^a + \eta_a\,\delta(t)\left(\Delta g_a(t)\, x_A\right) \qquad (27)$$

$$W_A^b = W_A^b + \eta_a\,\delta(t)\left(\Delta g_b(t)\, x_A\right) \qquad (28)$$

where $\eta_a$ is the learning rate for the weights of the Actor network.
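Eqs. (27)–(28) as a short sketch; the weight shapes (8-dimensional vectors over the target-selection input) and the learning-rate value are our assumptions for illustration.

```python
import numpy as np

def update_actor_weights(W_a, W_b, x_A, delta, dg_a, dg_b, eta_a=0.01):
    """Eqs. (27)-(28): reward-modulated update of the Actor weights."""
    W_a = W_a + eta_a * delta * (dg_a * x_A)
    W_b = W_b + eta_a * delta * (dg_b * x_A)
    return W_a, W_b

if __name__ == "__main__":
    x_A = -np.ones(8); x_A[0] = 1.0
    W_a = np.zeros(8); W_b = np.zeros(8)
    W_a, W_b = update_actor_weights(W_a, W_b, x_A, delta=0.5, dg_a=0.2, dg_b=-0.1)
    print(W_a[:3], W_b[:3])
```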

2.15. Explorer training

Only the weights $W_E^1$, $W_{Ea}^2$ and $W_{Eb}^2$ are adjusted during the training of the Explorer network. The internal weights of the STN–GPe loop are assumed to be constant. The connections from the STN to the output layer, $W_{Ea}^2$ and $W_{Eb}^2$, are trained using δ. The first-layer weights, $W_E^1$, are trained by equations similar to those used in the associative reward prediction (ARP) approach [6].


The weight adaptation equations are given below. For the first layer,

$$\text{If } \delta > 0:\quad W^1_{E,ijk} = W^1_{E,ijk} + \eta_1\, x_{E,k}\left(U^{GPe}_{ij} - \tanh(U^{GPe}_{ij})\right)$$
$$\text{else:}\quad W^1_{E,ijk} = W^1_{E,ijk} + \eta_2\, x_{E,k}\left(-U^{GPe}_{ij} - \tanh(U^{GPe}_{ij})\right) \qquad (29)$$

where $\eta_1$ and $\eta_2$ ($\eta_2 \ll \eta_1$) denote learning rates, $x_{E,k}$ is the kth component of the input to the Explorer, $U^{GPe}_{ij}$ is the output of the (i, j)th neuron of the GPe layer, and $W^1_{E,ijk}$ is the weight connecting the kth input component with the (i, j)th neuron in the GPe layer. Numerical values of $\eta_1$ and $\eta_2$ used in simulations are 2×10⁻³ and 2×10⁻⁶, respectively. The equation used depends on the sign of δ.

For the second layer,

$$W^2_{Ea,ij} = W^2_{Ea,ij} + \eta_3\,\delta(t)\,U^{STN}_{ij}\left(\Delta g_a(t) - \Delta g_a(t-1)\right) \qquad (30)$$

$$W^2_{Eb,ij} = W^2_{Eb,ij} + \eta_3\,\delta(t)\,U^{STN}_{ij}\left(\Delta g_b(t) - \Delta g_b(t-1)\right) \qquad (31)$$

where $\eta_3$ is the learning rate, Δga and Δgb are the outputs of the Explorer, $W^2_{Ea,ij}$ is the weight connecting the (i, j)th neuron in the STN with the output 'a', and $W^2_{Eb,ij}$ is the weight connecting the (i, j)th neuron in the STN with the output 'b'. The numerical value of $\eta_3$ used in simulations is 3×10⁻³.

Note that the STN–GPe system is treated mostly as a single unit in this work. So the weights that characterize the indirect pathway in the model are the ones from striatum→GPe and STN→GPi. Therefore, these two weight stages are trained.
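Eqs. (29)–(31) in a compact sketch. The array shapes (a 10×10 GPe/STN grid and a 10-dimensional Explorer input) are illustrative assumptions; the learning rates are the values quoted in the text.

```python
import numpy as np

def update_explorer_weights(W1, W2a, W2b, x_E, U_gpe, U_stn, delta,
                            dg, dg_prev,
                            eta1=2e-3, eta2=2e-6, eta3=3e-3):
    """Eqs. (29)-(31): updates of the Explorer's input and output weights.
    W1: (n, n, k) striatum->GPe weights; W2a, W2b: (n, n) STN->GPi weights."""
    # First layer (Eq. (29)): the sign of delta selects the rule and the rate.
    if delta > 0:
        W1 = W1 + eta1 * (U_gpe - np.tanh(U_gpe))[..., None] * x_E
    else:
        W1 = W1 + eta2 * (-U_gpe - np.tanh(U_gpe))[..., None] * x_E
    # Second layer (Eqs. (30)-(31)).
    W2a = W2a + eta3 * delta * U_stn * (dg[0] - dg_prev[0])
    W2b = W2b + eta3 * delta * U_stn * (dg[1] - dg_prev[1])
    return W1, W2a, W2b

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    n, k = 10, 10
    W1, W2a, W2b = np.zeros((n, n, k)), np.zeros((n, n)), np.zeros((n, n))
    x_E = rng.normal(size=k)
    U_gpe = np.tanh(rng.normal(size=(n, n)))
    U_stn = np.tanh(rng.normal(size=(n, n)))
    out = update_explorer_weights(W1, W2a, W2b, x_E, U_gpe, U_stn,
                                  delta=0.3, dg=(0.2, -0.1), dg_prev=(0.1, 0.0))
    print([w.shape for w in out])
```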

3. Simulations

In this section, the results of simulations, in which the BG model drives a simple arm model to reach targets under "normal" and "dopamine-deficient" conditions, are presented. All differential equations are simulated using the simple forward Euler integration method.

3.1. Effect of order of training

The performance is significantly improved if the Actor and Explorer are trained in tandem in the next stage, as opposed to training only the Explorer in the early stages and subsequently training the Actor and Explorer together (Fig. 11).

Fig. 11. Comparison between (1) training the Explorer alone for a few epochs and then combined training (upper curve) and (2) always training the Explorer and Actor in tandem (lower curve). y-axis: number of steps to reach the target.

3.2. Effect of Explorer training

We now contrast the effect of a trained Explorer on reaching performance with that of an untrained Explorer. Consider the situation wherein the arm is instructed to reach a target under the following conditions: the Explorer network has random weights, the Critic is pretrained, and the Actor is allowed to learn. In a contrasting simulation, the Explorer is also trained. Fig. 12(a) and (b) shows that training the Explorer does indeed improve the performance of the overall system. Performance is evaluated based on two metrics: (1) the average number of steps required to reach the target, and (2) the average distance traveled before the target is reached. The role of the trained Explorer seems to be to funnel the exploration into a region of the state space that is likely to lead to higher reward.

3.3. Comparing Explorer dynamics during and post-learning

In the next experiment, we compare the Explorer dynamics during and post-learning in terms of neural activity in the GPe layer and in terms of reaching performance. We observe that during the initial stages of training, as the arm explores the state space, the neural activity of the STN–GPe layer is complex, assisting this exploration (Fig. 13(a)). As training progresses, the arm makes initial guesses closer and closer to the target and the neural activity becomes less complex (Fig. 13(b)). After training, the arm moves to the target in a feed-forward manner and there is no need for exploration. Accordingly, the neural activity in the STN–GPe is regular and small clusters are observed (Fig. 13(c)). The time evolution of STN–GPe activity is shown before training (Fig. 14) and after training (Fig. 15).

However, even after learning, if the radius of tolerance, $d_{thresh}$, were to suddenly change, or the target point suddenly shifted by a small amount, resulting in the Actor no longer receiving reward where it previously used to, the output of the STN and GPe layers, which was previously exhibiting clusters, will become uncorrelated, so that the Actor can further explore the workspace and find the target. The output of the STN and GPe layers thus changes dynamically to respond to changes in the conditions of the environment.

Retraining the Actor with the changed target position is expected to progress faster when the Explorer is also trained than when the Explorer is untrained.

3.4. Simulation results for dopamine deficient conditions

We assume that reduced dopamine affects the activity of the GPe layer neurons via the quantity DNe. Therefore, to simulate dopamine deficient conditions, we shift and scale down the function that maps δ onto DNe, as shown in Fig. 16. Eq. (21) describes the relationship between DNe and δ. The parameters corresponding to normal conditions are: A1 = 45, A2 = 5, a = 3, and b = 0.33; the values corresponding to Parkinsonian conditions are: A1 = 15, A2 = 5, a = 6, and b = −0.007.

Recall that changes in DNe affect STN–GPe dynamics according to Eqs. (22)–(25). Next, we trained the network under such altered conditions of STN–GPe dynamics. The changes observed were divided into two categories: (1) primary changes, seen soon after dopamine reduction, and (2) secondary changes, observed after the network was trained under dopamine deficient conditions.

3.4.1. Primary changes

Previously, when the Actor was untrained, the output of the STN–GPe layer would be complex so as to assist the Actor in


Fig. 12. (a) Comparison between the trained (dashed line) and the untrained Explorer (solid line) in terms of the average number of trials (steps) required to reach the target, and (b) comparison between the trained (dashed line) and untrained Explorer (solid line) in terms of the average distance traveled before the target is reached.

Fig. 13. (a) The dynamics of the model before learning, E-dim = 93; (b) the dynamics of the model after learning for eight epochs, E-dim = 57; and (c) the dynamics of the model at the end of learning, E-dim = 5.

Fig. 14. Time evolution of STN–GPe activity before training.

Fig. 15. Time evolution of STN–GPe activity after training.


finding the target. However, now we observe that the number of neurons in the 'ON' state in the STN–GPe layer is much lower, and hence the complexity of the oscillatory activity is lower.

Fig. 16. The relation between norepinephrine (DNe) in the STN–GPe layer and δ. The function is shifted and scaled down to simulate dopamine deficient conditions (for the solid curve: A1 = 45, A2 = 5, a = 3, and b = 0.33; for the dashed curve: A1 = 15, A2 = 5, a = 6, and b = −0.07).

Fig. 17. Primary changes due to dopamine reduction. Exploration is confined to a narrow region (top-left). Complexity and activity levels of the STN–GPe layer are reduced, E-dim = 32 (bottom-left).

Fig. 18. Time evolution of STN–GPe activity during the stage of primary changes related to PD.

Consequently, the arm takes small steps as it explores, and exploration is confined to a narrow region (Fig. 17). The time evolution of STN–GPe activity during this stage is shown in Fig. 18.

3.4.2. Secondary changes

As the network continues to be trained under low dopamine conditions, the outputs of the STN and GPe layers begin to lose whatever complexity they had and settle down to clustered activity. This inevitably means that the complexity of the oscillatory activity is low. The Explorer tries to make up for the paucity in STN–GPe activity by increasing its output weights. This means that the output of the Explorer is of high amplitude but of low complexity, causing the Explorer to send large values of Δg to the arm, resulting in large, regular, "tremor-like" oscillations of the arm, as shown in Fig. 19. Naturally, the arm fails to learn to reach the target. The time evolution of STN–GPe activity during this stage is shown in Fig. 20.

4. Discussion

In an earlier work, we had hypothesized that the STN–GPe system in the BG was responsible for exploratory behavior [55]. In this paper, we describe a model of BG in which the role of BG in exploration is highlighted. RL-based models of basal ganglia function do not appear to give sufficient attention to the exploratory aspect of the RL framework. Though the contribution of the BG to activities that involve exploration, like foraging, has been described, a precise anatomical substrate in the BG for such


Fig. 19. Secondary changes due to dopamine reduction. Exploration is of large amplitude but of poor quality due to large oscillations of the arm (top-right). Dramatic loss of complexity in the activity of the STN–GPe layer, E-dim = 5 (bottom-right).


Fig. 20. Time evolution of STN–GPe activity during the stage of secondary changes related to PD.

Fig. 21. Actor–Critic–Explorer architecture (adapted from Mustapha et al. [44]).


exploration has not been given sufficient attention [34]. In an RL perspective of BG function, we believe that the STN–GPe part of the BG is the Explorer. We thus have an Actor–Critic–Explorer model of BG.

The ACE model is coupled to an arm model and is trained on a simple reaching task, which consists of reaching a small number of targets on instruction. The Critic supplies Value information; the Explorer explores around the initial guess provided by the Actor, and the Actor consolidates the results of successful exploration.

In the present work, special emphasis is placed on the characterization of the STN–GPe subsystem of the BG. The complexity of this system, it is suggested, provides the stochastic signal necessary for exploration in RL. Loss of complexity in the activity of this system has been reported under PD conditions.

An interesting extension is put forth in this paper to Doya's [21] suggestion that norepinephrine levels in the GP control the "temperature" parameter in RL exploration. We suggest that the norepinephrine level (DNe) in the GPe controls the activity levels of the STN–GPe layer and is in turn controlled by the dopamine level, δ. This assumption gives rise to a mechanism by which dopamine fluctuations can indirectly control the complexity of the STN–GPe layer and hence influence exploration. It also opens ways to understand how the complexity of STN–GPe activity is reduced under Parkinsonian conditions.

A considerable amount of modeling effort has been directed at changes in STN–GPe dynamics, particularly in connection with PD pathology. Terman et al. [58] simulate a variety of STN–GPe topologies and successfully produce several types of firing patterns, from complex spiking to traveling waves to synchronized bursts; their model does not describe increased rhythmic firing in STN neurons under dopamine-deficient conditions. Humphries et al. [28] present an elaborate spiking neuron model of the basal ganglia which incorporates the effect of dopamine on STN–GPe dynamics. In their model, the STN–GPe network exhibits low-frequency (<1 Hz) pacemaking activity under dopamine-deficient conditions. However, the above two models simulate synchronized activity at the network level in the STN–GPe but do not link it with behavior. A distinctive feature of the present model is that it proposes a functional/behavioral significance of complex activity in the STN–GPe and the consequences of the loss of such activity in behavior; further, the whole theory is placed within the framework of reinforcement learning. Nevertheless, it must be remembered that the interrelations between dopamine deficiency, synchronization, and PD symptoms like tremor and bradykinesia are extremely intricate and not completely understood [24]. A model that can capture all aspects of pathological synchronization in the BG and its behavioral consequences is still awaited.

Changes in PD conditions aside, our model predicts learning-dependent changes in the complexity of STN–GPe activity under normal conditions. Studies in Section 3.3 suggest that both complexity (effective dimension) and average activity are higher in the GPe layer during acquisition of a reaching skill, as compared to activity in the same system post-learning. This is an interesting prediction, which can be examined experimentally.

Interesting scenarios emerge from simulations of our model under dopamine-deficient conditions. The primary changes upon dopamine reduction include a reduction in the activity level, and in the complexity, of the STN–GPe system, resulting in inefficient, low-amplitude exploration. The secondary changes, which develop after extended training under reduced dopamine conditions, include increased activity levels in the GPe layer but with reduced complexity, thereby resulting in large-amplitude, tremor-like oscillations of the arm. Interestingly, hyperactivity in the GPi and STN neurons, secondary to dopamine deficiency, is a well-observed phenomenon in the MPTP model of PD in monkeys [18]. Morris et al. [38] describe pallidal firing patterns in a Go/No-go task in normal and MPTP monkeys. Among many things, they report the firing patterns of pallidal neurons after the trigger compared to the cue. In the present model, since there is no separate trigger and cue, it is not possible to account for such observations. However, Morris et al. [38] note that pallidal activity is uncorrelated in a normal behaving monkey but transforms into synchronized firing in the MPTP model. This general pattern is also seen in our model, as described in Section 3.

The general picture of motor learning presented in the model is one in which the basal ganglia "figure out" the correct response by combining the reward signal with exploration and pass it on to the motor cortex for learning. Thus, in the model, training first occurs in the basal ganglia, and then in the motor cortex (Actor). Our model is thus in line with the general viewpoint that learning in the frontal cortex can be led by learning in the basal ganglia [26]. The different time-courses of learning in basal ganglia and prefrontal areas (first basal ganglia and then prefrontal) observed in saccade-related behavior in monkeys provide experimental support for such a hypothesis [43]. However, we are not aware of any experimental work that supports analogous results in the case of a motor task like reaching.

In a Machine Learning context, a typical RL scheme involves interactions among the Actor, the Critic and the Explorer, as depicted in Fig. 21 [39]. This differs from the architecture of the ACE model (Fig. 2). An important difference is that in Fig. 17, the Critic outputs a value, Q(t), which serves as input to the Explorer, and the Explorer generates a correction to the actions produced by the Actor. In our architecture, the Critic and the Explorer operate in tandem: the Critic estimates the Value, while the Explorer generates exploratory corrections to the actions produced by the Actor. Another important difference between standard Machine Learning RL and our model is the trainability of the Explorer. The Explorer is not trained in the typical Machine Learning RL literature, barring a few exceptions such as metalearning techniques that train the "temperature" of exploration [3,20,23,28]. In our model, Explorer training is found to improve performance, probably because it confines exploration to a region that most probably contains a solution. Perhaps the simple reaching task studied in this paper is not sufficient to bring out the advantage of a trained Explorer. It will be interesting to see whether a trainable Explorer does indeed reduce training time and improve performance on standard RL benchmark problems such as the Cart–Pole system [56] or the Acrobot problem [56].
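To make this distinction concrete, the sketch below illustrates, for a one-step continuous-action task, the division of labour described above. It is an illustrative toy example, not the network model of this paper: the Explorer's noise amplitude is adapted in the spirit of Gullapalli's stochastic real-valued units [23], shrinking as the Critic's value estimate approaches the best attainable reward. All parameter values and variable names are assumptions made for the illustration.

```python
# Toy Actor-Critic-Explorer loop for a one-step continuous-action task.
# Illustrative sketch only; not the network model described in this paper.
import numpy as np

rng = np.random.default_rng(1)
target = 0.7              # unknown optimal action; reward peaks here
actor_out = 0.0           # Actor: mean action (a single adjustable parameter)
value = -1.0              # Critic: pessimistic initial estimate of expected reward
r_max = 0.0               # best attainable reward in this toy task
eta_actor, eta_critic = 0.1, 0.1

for trial in range(2000):
    sigma = 0.5 * max(r_max - value, 0.01)   # Explorer: noise level shrinks as value nears r_max
    noise = sigma * rng.standard_normal()    # exploratory correction to the Actor's action
    action = actor_out + noise
    reward = -(action - target) ** 2         # environment: quadratic reward around the target
    delta = reward - value                   # reward-prediction error
    value += eta_critic * delta              # Critic update
    actor_out += eta_actor * delta * noise   # Actor follows perturbations that beat expectation

print(round(actor_out, 3), round(sigma, 3))  # actor_out ends near 0.7; sigma near its floor
```

In this toy setting, "training" the Explorer simply amounts to narrowing the search once the Critic signals that little reward remains to be gained; the Explorer in the ACE model is trained by the network itself rather than by such a hand-set rule.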

The growing significance of the STN–GPe to BG function becomes evident from the evolution of deep brain stimulation (DBS) protocols used for PD treatment [48]. Early DBS techniques to control movement disorders targeted the ventral intermediate nucleus (Vim) of the thalamus [41]. In the 1990s, it was found that DBS of Vim is effective even in the treatment of Parkinsonian tremor [7]. Subsequently, the focus shifted from Vim to the GPi, an output nucleus of BG, and later to the STN. The STN is currently the preferred target for DBS in PD according to experts and has advantages over both the GPi and Vim [30]. In addition to the shift in the anatomical site for DBS, one may note a certain shift in the understanding of the functional goal of DBS. In the early stages, DBS was mainly considered an operational substitute for lesioning: instead of eliminating neural tissue, one simply blocks the signals arising from it by inhibiting it and producing hyperpolarization. But this simple description of the action of DBS on target nuclei, in terms of inhibition and hyperpolarization, came to be questioned [42]. With the evolution of DBS protocols, the language shifted from simple "hyperpolarization" and "inhibition" to more sophisticated forms of signal shaping. Accordingly, more recent literature on DBS talks about the desynchronization of activities of neural ensembles and not mere inhibition [57]. On the whole, there is a growing realization that the action of DBS on target nuclei involves much more complex mechanisms than was thought earlier. Therefore, optimization of DBS protocols via a traditional, empirical, trial-and-error style of development might take too long. Pursuing a firm functional understanding of signaling in the BG and developing accurate computational models seem indispensable for designing DBS protocols for PD with minimal side effects.

Appendix A

Existence of a limit cycle

We prove that the system governed by Eqs. (i)–(iii) has a limit cycle:

$$\frac{dx}{dt} = -x + v - s + I \qquad \mathrm{(i)}$$

$$v = \tanh(\lambda x) \qquad \mathrm{(ii)}$$

$$\frac{ds}{dt} = -s + v \qquad \mathrm{(iii)}$$

We use Liénard's theorem for the existence of a limit cycle; the steps below convert Eqs. (i)–(iii) into Liénard's equation.

Differentiating Eq. (i):

$$\ddot{x} = -\dot{x} + \lambda\,\mathrm{sech}^{2}(\lambda x)\,\dot{x} - \dot{s} \qquad \mathrm{(iv)}$$

Substituting Eqs. (ii) and (iii) in (iv):

$$\ddot{x} = -\dot{x} + \lambda\,\mathrm{sech}^{2}(\lambda x)\,\dot{x} - \left(-s + \tanh(\lambda x)\right) \qquad \mathrm{(v)}$$

Using Eqs. (i) and (v):

$$\ddot{x} = -\dot{x} + \lambda\,\mathrm{sech}^{2}(\lambda x)\,\dot{x} - \left(-\left(-\dot{x} - x + \tanh(\lambda x) + I\right) + \tanh(\lambda x)\right)$$

On rearranging,

$$\ddot{x} + \dot{x}\left(2 - \lambda\,\mathrm{sech}^{2}(\lambda x)\right) + (x - I) = 0 \qquad \mathrm{(vi)}$$

which has the form of Liénard's equation, $\ddot{x} + f(x)\dot{x} + g(x) = 0$, with $f(x) = 2 - \lambda\,\mathrm{sech}^{2}(\lambda x)$ and $g(x) = x - I$.

Checking Liénard's conditions (let us assume I = 0):
- Both f(x) and g(x) are continuously differentiable for all x.
- g(−x) = −g(x) for all x (i.e., g(x) is an odd function).
- g(x) > 0 for x > 0.
- f(−x) = f(x) for all x (i.e., f(x) is an even function).
- The odd function $F(x) = \int_0^x f(u)\,du = 2x - \tanh(\lambda x)$ has exactly one positive zero at $x = x_0$ (this holds provided $\lambda > 2$), is negative for $0 < x < x_0$, is positive and non-decreasing for $x > x_0$, and $F(x) \to \infty$ as $x \to \infty$ ($x_0$ can be estimated from the graph of F(x)).

Hence the system has a unique, stable limit cycle surrounding the origin in the phase plane.
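As an informal check of this result, the following sketch (not part of the original analysis; the parameter values λ = 3 and I = 0 are assumptions, chosen so that λ > 2) integrates Eqs. (i)–(iii) with a simple forward-Euler scheme and verifies that trajectories settle onto a periodic orbit.

```python
# Numerically integrate Eqs. (i)-(iii) and check for convergence to a limit cycle.
# Parameter values are illustrative assumptions (lambda = 3 > 2, I = 0).
import numpy as np

lam, I = 3.0, 0.0          # slope of the tanh nonlinearity, external input
dt, T = 1e-3, 200.0        # integration step and total duration
n = int(T / dt)

x, s = 0.1, 0.0            # small initial perturbation away from the origin
xs = np.empty(n)
for k in range(n):
    v = np.tanh(lam * x)               # Eq. (ii)
    dx = -x + v - s + I                # Eq. (i)
    ds = -s + v                        # Eq. (iii)
    x, s = x + dt * dx, s + dt * ds
    xs[k] = x

# After discarding a transient, successive oscillation peaks should be nearly
# identical if the trajectory has converged to a stable limit cycle.
tail = xs[n // 2:]
mid = tail[1:-1]
peaks = mid[(mid > tail[:-2]) & (mid > tail[2:])]
print("late-time peak amplitudes:", np.round(peaks[-5:], 4))
```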

Appendix B

Effective dimension as the measure of complexity

Effective dimension is a measure of the effective number of degrees of freedom of the activity, v(t), of a system. Let $\lambda_k$ and $\lambda_{\max}$ denote the kth largest and the largest eigenvalues of the autocorrelation matrix of the activity, v(t), over a duration of interest. The effective dimension is the index k for which $\lambda_k/\lambda_{\max} = 1/2$.
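A minimal computational sketch of this measure is given below. It assumes the activity is available as a (timesteps × units) array and estimates the autocorrelation matrix as the covariance of the multi-unit activity over the window of interest; the function name is illustrative only.

```python
# Sketch: effective dimension of multi-unit activity, under the assumptions above.
import numpy as np

def effective_dimension(v):
    """Return the index k (1-based) at which the kth largest eigenvalue of the
    activity's covariance matrix first drops to half of the largest eigenvalue."""
    c = np.cov(v, rowvar=False)                   # units x units covariance matrix
    eig = np.sort(np.linalg.eigvalsh(c))[::-1]    # eigenvalues in descending order
    k = int(np.argmax(eig <= 0.5 * eig[0])) + 1   # first index meeting the criterion
    return k

# Example: 20 units whose activity is driven by only 3 independent sources plus
# weak noise, so the effective dimension should come out small.
rng = np.random.default_rng(0)
sources = rng.standard_normal((1000, 3))
mixing = rng.standard_normal((3, 20))
activity = sources @ mixing + 0.05 * rng.standard_normal((1000, 20))
print(effective_dimension(activity))
```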

References

[1] G.E. Alexander, Basal ganglia, in: M.A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, 1998, pp. 139–144.
[2] R.W. Angel, et al., L-dopa and error correction time in Parkinson's disease, Neurology 21 (1971) 1255–1260.
[3] G. Aston-Jones, J. Rajkowski, P. Kubiak, T. Alexinsky, Locus coeruleus neurons in monkey are selectively activated by attended cues in a vigilance task, Journal of Neuroscience 14 (1994) 4467–4480.
[4] R.S. Bapi, K.P. Miyapuram, F.X. Graydon, K. Doya, fMRI investigation of cortical and subcortical networks in the learning of abstract and effector-specific representations of motor sequences, Neuroimage 32 (2006) 714–727.
[5] A.G. Barto, Reinforcement learning, in: M.A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, 1994, pp. 804–809.
[6] A.G. Barto, P. Anandan, Pattern recognizing stochastic learning automata, IEEE Transactions on Systems, Man and Cybernetics 15 (1985) 360–374.
[7] A.L. Benabid, P. Pollak, C. Gervason, D. Hoffmann, D.M. Gao, M. Hommel, Long-term suppression of tremor by chronic stimulation of the ventral intermediate thalamic nucleus, Lancet 337 (1991) 403–406.
[8] H. Bergman, T. Wichmann, B. Karmon, M.R. DeLong, The primate subthalamic nucleus. II. Neural activity in MPTP model of Parkinsonism, Journal of Neurophysiology 72 (1994) 507–520.
[9] G.S. Berns, T.J. Sejnowski, How the basal ganglia make decisions, in: A. Damasio, H. Damasio, Y. Christen (Eds.), The Neurobiology of Decision Making, Springer-Verlag, Berlin, 1995, pp. 101–114.
[10] G.S. Berns, T.J. Sejnowski, A computational model of how the basal ganglia produce sequences, Journal of Cognitive Neuroscience 10 (1998) 108–121.
[11] M.D. Bevan, P.J. Magill, D. Terman, J.P. Bolam, C.J. Wilson, Move to the rhythm: oscillations in the subthalamic nucleus–external globus pallidus network, Trends in Neurosciences 25 (10) (2002) 525–531.
[12] B. Bilney, M.E. Morris, S. Denisenko, Physiotherapy for people with movement disorders arising from basal ganglia dysfunction, New Zealand Journal of Physiotherapy 31 (2003) 94–100.
[13] P. Brown, A. Olivero, P. Mazzone, A. Insola, P. Tonali, V.D. Lazzaro, Dopamine dependency of oscillations between subthalamic nucleus and pallidum in Parkinson's disease, Journal of Neuroscience 21 (2001) 1033–1038.
[14] C.V. Buhusi, W.H. Meck, What makes us tick? Functional and neural mechanisms of interval timing, Nature Reviews Neuroscience 6 (2005) 755–756.
[15] N.D. Daw, J.P. O'Doherty, P. Dayan, B. Seymour, R.J. Dolan, Cortical substrates for exploratory decisions in humans, Nature 441 (2006) 876–879.


[16] P. Dayan, C.J.C.H. Watkins, Reinforcement learning, in: Encyclopedia of Cognitive Science, MacMillan Press, London, 2001.
[17] P. Dayan, B.W. Balleine, Reward, motivation and reinforcement learning, Neuron 36 (2002) 285–298.
[18] M.R. DeLong, Primate models of movement disorders of basal ganglia origin, Trends in Neurosciences 13 (1990) 281–285.
[19] M. Desmurget, C.M. Epstein, R.S. Turner, C. Prablanc, G.E. Alexander, S.T. Grafton, Role of the posterior parietal cortex in updating reaching movements to a visual target, Nature Neuroscience 2 (1999) 563–567.
[20] K. Doya, Reinforcement learning in continuous time and space, Neural Computation 12 (2000) 215–245.
[21] K. Doya, Metalearning and neuromodulation, Neural Networks 15 (2002) 495–506.
[22] E.P. Gardner, E.R. Kandel, Touch, in: E.R. Kandel, J.H. Schwartz, T.M. Jessell (Eds.), Principles of Neural Science, Fourth Edition, McGraw-Hill, New York, 2000, pp. 451–472.
[23] V. Gullapalli, A stochastic reinforcement-learning algorithm for learning real-valued functions, Neural Networks 3 (1990) 671–692.
[24] C. Hammond, H. Bergman, P. Brown, Pathological synchronization in Parkinson's disease: networks, models and treatments, Trends in Neurosciences 30 (7) (2007) 357–364.
[25] J.C. Houk, J.L. Davis, D.G. Beiser, Models of Information Processing in the Basal Ganglia, MIT Press, Cambridge, MA, 1995.
[26] J.C. Houk, S.P. Wise, Distributed modular architectures linking basal ganglia, cerebellum, and cerebral cortex: their role in planning and controlling action, Cerebral Cortex 5 (1995) 95–110.
[27] M.D. Humphries, R.D. Stewart, K.N. Gurney, A physiologically plausible model of action selection and oscillatory activity in the basal ganglia, The Journal of Neuroscience 26 (50) (2006) 12921–12942.
[28] S. Ishii, W. Yoshida, J. Yoshimoto, Control of exploitation–exploration metaparameter in reinforcement learning, Neural Networks 15 (2002).
[29] C. Kertzman, U. Schwarz, T. Zeffiro, M. Hallett, The role of posterior parietal cortex in visually guided reaching movements in humans, Experimental Brain Research 114 (1997) 170–183.
[30] P. Krack, A. Benazzouz, P. Pollak, P. Limousin, B. Piallat, D. Hoffmann, J. Xie, A.L. Benabid, Treatment of tremor in Parkinson's disease by subthalamic nucleus stimulation, Movement Disorders 13 (1998) 907–914.
[31] S. Kumar, Neural Networks: A Classroom Approach, Tata McGraw-Hill, India, 2004.
[32] A.D. Lawrence, Error correction and the basal ganglia: similar computations for action, cognition and emotion? Trends in Cognitive Sciences 4 (10) (2000) 365–367.
[33] M. Magnin, A. Morel, D. Jeanmond, Single unit analysis of the pallidum, thalamus and subthalamic nucleus in parkinsonian patients, Neuroscience 96 (2000) 549–564.
[34] P.R. Montague, P. Dayan, C. Person, T.J. Sejnowski, Bee foraging in uncertain environments using predictive Hebbian learning, Nature 376 (1995) 725–728.
[35] P.R. Montague, P. Dayan, T.J. Sejnowski, A framework for mesencephalic dopamine systems based on predictive Hebbian learning, Journal of Neuroscience 16 (1996) 1936–1947.
[36] P.R. Montague, S.E. Hyman, J.D. Cohen, Computational roles for dopamine in behavioural control, Nature 431 (2004) 760–767.
[37] J. Moody, M. Saffell, Learning to trade via direct reinforcement, IEEE Transactions on Neural Networks 12 (2001) 875–889.
[38] G. Morris, Y. Hershkovitz, A. Raz, A. Nevet, H. Bergman, Physiological studies of information processing in the normal and Parkinsonian basal ganglia: pallidal activity in GO/NO-GO task and following MPTP treatment, Progress in Brain Research 147 (2005) 285–293.
[39] S. Mustapha, G. Lachiver, A modified Actor–Critic reinforcement learning algorithm, in: Proceedings of Canadian Conference on Electrical and Computer Engineering, Halifax, NS, 2000.
[40] A. Nini, A. Feingold, H. Slovin, H. Bergman, Neurons in the globus pallidus do not show correlated activity in the normal monkey, but phase locked oscillations appear in the MPTP model of Parkinsonism, Journal of Neurophysiology 74 (1995) 1800–1805.
[41] R. Pahwa, S. Wilkinson, D. Smith, K. Lyons, E. Miyawaki, W.C. Koller, High-frequency stimulation of the globus pallidus for the treatment of Parkinson's disease, Neurology 49 (1997) 249–253.
[42] D. Panikar, A. Kishore, Deep brain stimulation for Parkinson's disease, Neurology India 51 (2003) 167–175.
[43] A. Pasupathy, E.K. Miller, Different time courses of learning-related activity in the prefrontal cortex and striatum, Nature 433 (2005) 873–876.
[44] T.J. Prescott, K. Gurney, P. Redgrave, Basal ganglia, in: M.A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, MA, 2002.
[45] A. Raz, E. Vaadia, H. Bergman, Firing patterns of spontaneous discharge of pallidal neurons in the model of Parkinsonism, Journal of Neuroscience 20 (2000) 8559–8571.
[46] P. Redgrave, T.J. Prescott, K. Gurney, Is the short-latency dopamine response too short to signal reward error? Trends in Neurosciences 22 (1999) 146–151.
[47] P. Redgrave, T.J. Prescott, K. Gurney, The basal ganglia: a vertebrate solution to the selection problem? Neuroscience 89 (1999) 1009–1023.
[48] J.N. Reynolds, B.I. Hyland, J.R. Wickens, A cellular mechanism of reward-related learning, Nature 413 (2001) 67–70.
[49] H.E. Rosvold, The prefrontal cortex and caudate nucleus: a system for effecting correction in response mechanisms, in: C. Rupp (Ed.), Mind as a Tissue, Harper & Row, 1968, pp. 21–38.
[50] V.A. Russell, R. Allin, M.C. Lamm, J.J. Taljaard, Regional distribution of monoamines and dopamine D1- and D2-receptors in the striatum of the rat, Neurochemical Research 17 (1992) 387–395.
[51] W. Schultz, P. Dayan, P.R. Montague, Neural substrate of prediction and reward, Science 275 (1997) 1593–1599.
[52] W. Schultz, Predictive reward signal of dopamine neurons, Journal of Neurophysiology 80 (1998) 1–27.
[53] S. Singh, D. Litman, M. Kearns, M. Walker, Optimizing dialogue management with reinforcement learning: experiments with the NJFun system, Journal of Artificial Intelligence Research (JAIR) 16 (2002) 105–133.
[54] M.A. Smith, et al., Motor disorder in Huntington's disease begins as a dysfunction in error feedback control, Nature 403 (2000) 544–549.
[55] D. Sridharan, P.S. Prashanth, V.S. Chakravarthy, The role of the basal ganglia in exploration in a neural model based on reinforcement learning, International Journal of Neural Systems 16 (2006) 111–124.
[56] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Massachusetts, 1998.
[57] P.A. Tass, A model of desynchronizing deep brain stimulation with a demand-controlled coordinated reset of neural subpopulations, Biological Cybernetics 89 (2003) 81–88.
[58] D. Terman, J.E. Rubin, A.C. Yew, C.J. Wilson, Activity patterns in a model for the subthalamopallidal network of the basal ganglia, Journal of Neuroscience 22 (2002) 2963–2976.
[59] G. Tesauro, Temporal difference learning and TD-Gammon, Communications of the ACM 38 (1995).
[60] M. Usher, J.D. Cohen, D. Servan-Schreiber, J. Rajkowski, G. Aston-Jones, The role of locus coeruleus in the regulation of cognitive performance, Science 283 (1999) 549–554.
[61] C.J. Vidal, G.E. Stelmach, A neural model of basal ganglia–thalamocortical relations in normal and Parkinsonian movement, Biological Cybernetics 73 (1995) 467–476.
[62] E.H. Yeterian, D.N. Pandya, Striatal connections of the parietal association cortices in rhesus monkeys, Journal of Comparative Neurology 332 (1993) 175–197.

Denny Joseph is currently pursuing his Masters degree at the Indian Institute of Technology, Madras. He received his Bachelors degree in Electrical and Electronics Engineering in 2005. His research interests include computational neuroscience, machine learning and reinforcement learning.

Gangadhar Garipelli is currently pursuing a doctoral degree at the Institute Dalle Molle d'Intelligence Artificielle Perceptive (IDIAP) Research Institute, affiliated to EPFL (École Polytechnique Fédérale de Lausanne), Switzerland. He received his Master's degree in 2006 from the Indian Institute of Technology, Madras, India. His research interests are in the areas of learning and execution of complex motor sequences (especially handwriting), machine learning, brain-computer interfaces, and robotics.

V. Srinivasa Chakravarthy received the B.Tech degree in Electrical Engineering from the Indian Institute of Technology, Madras in 1989, and M.S. and PhD degrees from the Department of Electrical Engineering, University of Texas at Austin, in 1991 and 1996, respectively. He was a postdoctoral fellow until 1997 in the Neuroscience Division of Baylor College of Medicine, Houston. He is currently an Associate Professor in the Biotechnology Department, Indian Institute of Technology, Madras. His current research interests include computational neuroscience and computational biology. He is also involved in developing algorithms for handwritten character recognition in Indian languages.