Kenji Doya [email protected]
Neural Computation Unit, Okinawa Institute of Science and Technology
MLSS 2012 in Kyoto
Brain and Reinforcement Learning


Page 1: Brain and Reinforcement Learning - Jason Yosinskiyosinski.com/mlss12/media/slides/MLSS-2012-Doya-Neural...Neural Computation Unit Okinawa ... MLSS 2012 in Kyoto Brain and Reinforcement

Kenji Doya [email protected]

Neural Computation Unit Okinawa Institute of Science and Technology

MLSS 2012 in Kyoto

Brain and Reinforcement Learning

Page 2:

Location of Okinawa
Flight times: Taipei 1.5 h, Tokyo 2.5 h, Seoul 2.5 h, Shanghai 2 h, Beijing 3 h, Manila 2.5 h

Page 3:

Okinawa Institute of Science & Technology

!  Apr. 2004: Initial research
!  President: Sydney Brenner

!  Nov. 2011: Graduate university

!  President: Jonathan Dorfan

!  Sept. 2012: Ph.D. course

!  20 students/year

Page 4:

Our Research Interests
!  How to build adaptive, autonomous systems
!   robot experiments
!  How the brain realizes robust, flexible adaptation
!   neurobiology

Page 5:

Learning to Walk (Doya & Nakano, 1985)

!  Action: cycle of 4 postures
!  Reward: speed sensor output

!  Problem: a long jump followed by a fall

Need for long-term evaluation of action

Page 6:

Reinforcement Learning

!  Learn action policy s → a to maximize rewards
!  Value function: expected future rewards
!  V(s(t)) = E[ r(t) + γr(t+1) + γ2r(t+2) + γ3r(t+3) + … ]
!  0 ≤ γ ≤ 1: discount factor
!  Temporal difference (TD) error:
!   δ(t) = r(t) + γV(s(t+1)) - V(s(t))

[Diagram: agent-environment loop; the agent sends action a, the environment returns state s and reward r, and γV(s(t+1)) feeds the TD error]
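The definitions above are enough to implement the critic side. Below is a minimal sketch of tabular TD(0) learning on a made-up deterministic chain of five states with a single reward at the end; the environment, episode count, and learning rate are invented for illustration:

```python
# Tabular TD(0) value learning on a made-up 5-state chain:
# the agent steps right from state 0 to state 5 (terminal) and
# receives r = 1 only on the final transition.
gamma = 0.9          # discount factor, 0 <= gamma <= 1
alpha = 0.1          # learning rate
N = 5                # states 0..4 are non-terminal; state 5 is terminal
V = [0.0] * (N + 1)  # V[5] stays 0 (no reward after termination)

for episode in range(3000):
    s = 0
    while s < N:
        s_next = s + 1                        # deterministic policy: step right
        r = 1.0 if s_next == N else 0.0       # reward on reaching the end
        delta = r + gamma * V[s_next] - V[s]  # TD error delta(t)
        V[s] += alpha * delta                 # critic update: dV = alpha * delta
        s = s_next

# V[s] converges to gamma**(N-1-s): rewards further away are discounted more
```

Running this drives V toward [γ⁴, γ³, γ², γ, 1], i.e. the discounted distance to the reward.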

Page 7:

Example: Grid World
!  Reward field
!  Value function for γ = 0.9 vs. γ = 0.3

[Figure: reward field on a 6×6 grid and the resulting value functions for γ = 0.9 and γ = 0.3; color scale from -2 to 2]
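The γ dependence in this figure can be reproduced with a few lines of value iteration on a toy problem. The 1-D strip below (one rewarded cell; moves are left/stay/right) is an invented stand-in for the slide's grid:

```python
# Value iteration V(s) <- max_a [ r + gamma*V(s') ] on a 7-cell strip;
# moving onto cell `goal` pays r = 1, every other move pays 0.
def values(gamma, n=7, goal=3, sweeps=200):
    V = [0.0] * n
    for _ in range(sweeps):
        V = [max((1.0 if s2 == goal else 0.0) + gamma * V[s2]
                 for s2 in (max(s - 1, 0), s, min(s + 1, n - 1)))
             for s in range(n)]
    return V

V_far = values(0.9)   # large gamma: value spreads far from the reward
V_near = values(0.3)  # small gamma: value stays near the reward
```

A larger γ propagates value much farther from the rewarded cell, which is exactly the difference between the two panels of the slide.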

Page 8:

Cart-Pole Swing-Up
!  Reward: height of pole
!  Punishment: collision

!  Value in 4D state space

Page 9:

Learning to Stand Up (Morimoto & Doya, 2000)

!  State: joint/head angles, angular velocity
!  Action: torques to motors

!  Reward: head height – tumble

Page 10:

Learning to Survive and Reproduce
!  Catch battery packs → survival
!  Copy 'genes' by IR ports → reproduction, evolution

Page 11:

Markov Decision Process (MDP)

!  Markov decision process
!  state s ∈ S
!  action a ∈ A
!  policy p(a|s)
!  reward r(s,a)
!  dynamics p(s'|s,a)

!  Optimal policy: maximize cumulative reward
!  finite horizon: E[ r(1) + r(2) + r(3) + ... + r(T) ]
!  infinite horizon: E[ r(1) + γr(2) + γ2r(3) + … ], with 0 ≤ γ ≤ 1 the temporal discount factor
!  average reward: E[ r(1) + r(2) + ... + r(T) ]/T, T → ∞

[Diagram: agent-environment loop with state s, action a, reward r]

Page 12:

Solving MDPs

Dynamic Programming: p(s'|s,a) and r(s,a) are known
!  Solve the Bellman equation: V(s) = maxa E[ r(s,a) + γV(s') ]
!  V(s): value function = expected reward from state s
!  Apply the optimal policy: a = argmaxa E[ r(s,a) + γV*(s') ]
!  Value iteration
!  Policy iteration

Reinforcement Learning: p(s'|s,a) and r(s,a) are unknown
!  Learn from actual experience {s, a, r, s, a, r, …}
!  Monte Carlo
!  SARSA
!  Q-learning
!  Actor-Critic
!  Policy gradient
!  Model-based: learn p(s'|s,a) and r(s,a), then do DP
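The model-based route can be sketched end to end: collect experience, estimate p(s'|s,a) and r(s,a) by counting, then solve the Bellman equation on the learned model. The 2-state, 2-action world below is invented for illustration:

```python
import random

random.seed(0)
gamma = 0.9
nS, nA = 2, 2

def step(s, a):
    # made-up deterministic dynamics: a=1 toggles the state, a=0 stays;
    # reward 1 for being in state 1
    s2 = 1 - s if a == 1 else s
    r = 1.0 if s == 1 else 0.0
    return s2, r

# 1) estimate the model from random experience
counts = [[[0] * nS for _ in range(nA)] for _ in range(nS)]
rew_sum = [[0.0] * nA for _ in range(nS)]
visits = [[0] * nA for _ in range(nS)]
s = 0
for _ in range(500):
    a = random.randrange(nA)
    s2, r = step(s, a)
    counts[s][a][s2] += 1
    rew_sum[s][a] += r
    visits[s][a] += 1
    s = s2

P = [[[counts[s][a][s2] / visits[s][a] for s2 in range(nS)]
      for a in range(nA)] for s in range(nS)]
R = [[rew_sum[s][a] / visits[s][a] for a in range(nA)] for s in range(nS)]

# 2) dynamic programming (value iteration) on the learned model
V = [0.0] * nS
for _ in range(300):
    V = [max(R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(nS))
             for a in range(nA)) for s in range(nS)]
policy = [max(range(nA), key=lambda a, s=s: R[s][a] + gamma *
              sum(P[s][a][s2] * V[s2] for s2 in range(nS)))
          for s in range(nS)]
```

The resulting greedy policy moves to the rewarding state and stays there, as DP on the true model would.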

Page 13:

Actor-Critic and TD Learning
!  Actor: parameterized policy P(a|s; w)
!  Critic: learn value function

V(s(t)) = E[ r(t) + γr(t+1) + γ2r(t+2) +…]

!   in a table or a neural network

!  Temporal Difference (TD) error:

! δ(t) = r(t) + γ V(s(t+1)) - V(s(t))

!  Update

!  Critic: ΔV(s(t)) = α δ(t)

!  Actor: Δw = α δ(t) ∂P(a(t)|s(t);w)/∂w

… reinforce a(t) by δ(t)
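A minimal actor-critic in code, on an invented one-state, two-action task. Note one deliberate deviation from the slide: the actor uses the log-policy gradient δ ∂log P(a|s;w)/∂w, the common softmax form of the update the slide writes with ∂P/∂w:

```python
import math, random

# Actor-critic on a made-up one-state, two-action task:
# the critic tracks V = expected reward; the actor holds softmax
# preferences w, reinforced by the TD error delta = r - V.
random.seed(0)
alpha_c, alpha_a = 0.1, 0.1
p_reward = {"L": 0.8, "R": 0.2}     # invented reward probabilities
w = {"L": 0.0, "R": 0.0}            # actor parameters
V = 0.0                             # critic (single state)

def policy():
    z = sum(math.exp(v) for v in w.values())
    return {a: math.exp(w[a]) / z for a in w}

for t in range(5000):
    pi = policy()
    a = "L" if random.random() < pi["L"] else "R"
    r = 1.0 if random.random() < p_reward[a] else 0.0
    delta = r - V                   # TD error (episode ends after one step)
    V += alpha_c * delta            # critic update
    for b in w:                     # actor update: log-likelihood gradient
        grad = (1.0 if b == a else 0.0) - pi[b]
        w[b] += alpha_a * delta * grad

pi = policy()
```

Actions that produce a positive TD error get reinforced, so the policy concentrates on the better option.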

Page 14:

SARSA and Q Learning

!  Action value function
!  Q(s,a) = E[ r(t) + γr(t+1) + γ2r(t+2) + … | s(t)=s, a(t)=a ]

!  Action selection

!  ε-greedy: a = argmaxa Q(s,a) with prob. 1-ε, random action with prob. ε

!  Boltzmann: P(ai|s) = exp[βQ(s,ai)] / Σj exp[βQ(s,aj)]

!  SARSA: on-policy update

!  ΔQ(s(t),a(t)) = α{ r(t) + γQ(s(t+1),a(t+1)) - Q(s(t),a(t)) }

!  Q-learning: off-policy update

! ΔQ(s(t),a(t)) = α{r(t)+γmaxa’Q(s(t+1),a’)-Q(s(t),a(t))}
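The selection rules and the two update targets can be checked numerically; all Q-values, β, ε, and the transition below are made up:

```python
import math

# Made-up Q-values for one state with two actions
Q = {"a1": 1.0, "a2": 0.5}

# Boltzmann (softmax) selection: P(ai|s) proportional to exp(beta*Q(s,ai))
beta = 2.0
Z = sum(math.exp(beta * q) for q in Q.values())
P_boltz = {a: math.exp(beta * Q[a]) / Z for a in Q}

# epsilon-greedy: best action with prob 1-eps, uniform otherwise
eps = 0.1
best = max(Q, key=Q.get)
P_eps = {a: eps / len(Q) + (1 - eps if a == best else 0.0) for a in Q}

# One-step backup targets on a made-up transition (r = 1, two actions at s'):
gamma = 0.9
Q_next = {"a1": 2.0, "a2": 0.5}
a_taken = "a2"                                   # what the policy actually did
sarsa_target = 1.0 + gamma * Q_next[a_taken]     # on-policy (SARSA)
q_target = 1.0 + gamma * max(Q_next.values())    # off-policy (Q-learning)
```

The two targets differ exactly when the policy's next action is not the greedy one, which is the on-policy/off-policy distinction.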

Page 15:

"Lose to Gain" Task
!  N states, 2 actions
!  if r2 >> r1, then it is better to take a2

[Diagram: states s1-s4 in a chain; action a1 yields +r1 and steps back toward s1, action a2 yields -r1 per step toward s4 and +r2 on the final transition]
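A sketch of this task, with transitions reconstructed from the diagram (so treat the dynamics as an assumption): a1 pays +r1 and steps back toward s1; a2 pays -r1 per step toward s4, where it pays +r2 and restarts. Value iteration then shows why long-term evaluation matters: the greedy choice at the first state flips with γ:

```python
# "Lose to gain" chain, 4 states, dynamics inferred from the diagram
r1, r2 = 1.0, 10.0   # illustrative values with r2 >> r1
N = 4

def step(s, a):
    if a == "a1":
        return max(s - 1, 0), +r1          # small immediate gain
    if s < N - 1:
        return s + 1, -r1                  # small loss on the way
    return 0, +r2                          # big gain at the end, restart

def greedy_action(gamma, sweeps=500):
    V = [0.0] * N
    for _ in range(sweeps):                # value iteration
        V = [max(r + gamma * V[s2]
                 for s2, r in (step(s, "a1"), step(s, "a2")))
             for s in range(N)]
    qa1 = step(0, "a1")[1] + gamma * V[step(0, "a1")[0]]
    qa2 = step(0, "a2")[1] + gamma * V[step(0, "a2")[0]]
    return "a2" if qa2 > qa1 else "a1"

myopic = greedy_action(0.3)      # short horizon: grab the small reward
farsighted = greedy_action(0.9)  # long horizon: accept losses to gain
```

With γ = 0.3 the agent keeps taking a1; with γ = 0.9 it accepts the -r1 steps to reach +r2, the "need for long-term evaluation of action" from the walking robot slide.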

Page 16:

Reinforcement Learning

!  Predict reward: value function
!  V(s) = E[ r(t) + γr(t+1) + γ2r(t+2) + … | s(t)=s ]
!  Q(s,a) = E[ r(t) + γr(t+1) + γ2r(t+2) + … | s(t)=s, a(t)=a ]

!  Select action
!  greedy: a = argmax Q(s,a)
!  Boltzmann: P(a|s) ∝ exp[ βQ(s,a) ]

!  Update prediction: TD error
!  δ(t) = r(t) + γV(s(t+1)) - V(s(t))
!  ΔV(s(t)) = α δ(t)
!  ΔQ(s(t),a(t)) = α δ(t)

How to implement these steps?

How to tune these parameters?

Page 17:

[Figure: traces of reward r, value V, and TD error δ in the three conditions below]

Dopamine Neurons Code TD Error δ(t) = r(t) + γV(s(t+1)) - V(s(t))

unpredicted

predicted

omitted

(Schultz et al. 1997)

[Excerpt of an embedded journal page from W. Schultz (J. Neurophysiol., 1998), describing Fig. 2: dopamine neurons report rewards according to an error in reward prediction. Top: an unpredicted drop of liquid constitutes a positive prediction error and activates the neuron. Middle: a conditioned stimulus predicts the reward; the neuron is activated by the stimulus but not by the fully predicted reward. Bottom: the predicted reward fails to occur; activity is depressed exactly at the time the reward would have occurred, revealing an internal process of reward expectation. The depressions reflect an expectation based on an internal clock tracking the precise time of predicted reward. About 55-70% of dopamine neurons are activated by conditioned, reward-predicting visual and auditory stimuli; they respond preferentially to appetitive stimuli and do not distinguish sensory modalities. Reprinted from Schultz et al. (1997) with permission by the American Association for the Advancement of Science.]

Page 18:

Basal Ganglia for Reinforcement Learning? (Doya 2000, 2007)

!  Cerebral cortex: state/action coding
!  Striatum: reward prediction
!  Pallidum: action selection
!  Dopamine neurons: TD signal
!  Thalamus

[Diagram labels: δ, V(s), Q(s,a), state, action]

Page 19:

Monkey Free Choice Task (Samejima et al., 2005)

P(reward | Left) = QL
P(reward | Right) = QR

[Figure: choice behavior across reward-probability blocks (%, left vs. right): 90-50, 10-50, 50-10, 50-90, 50-50]

!  Dissociation of action and reward

Page 20:

Action Value Coding in Striatum (Samejima et al., 2005)

[Figure: firing rates from -1 to 0 sec before action, sorted by QL and QR]

!  QL neuron

!  -QR neuron

Page 21:

Forced and Free Choice Task (Makoto Ito)

[Trial sequence: poking Center → Cue tone (0.5-1 s) → Left/Right poke (1-2 s) → Reward tone / No-reward tone → Pellet at pellet dish]

Cue tones and reward probabilities (L, R):
!  Left tone (900 Hz): fixed (50%, 0%)
!  Right tone (6500 Hz): fixed (0%, 50%)
!  Free-choice tone (white noise): varied among (90%, 50%), (50%, 90%), (50%, 10%), (10%, 50%)

Page 22:

Time Course of Choice

[Figure: probability of choosing left, PL, over trials; block sequence 10-50, 10-50, 10-50, 50-10, 90-50, 50-90, 90-50, 50-10, 50-90, 50-10; reward probabilities P(r|a=L) and P(r|a=R) stepping among 0.9, 0.5, 0.1; tone-B trials and left/right choices for tone A are marked]

Page 23:

Generalized Q-learning Model (Ito & Doya, 2009)

!  Action selection: P(a(t)=L) = exp QL(t) / (exp QL(t) + exp QR(t))

!  Action value update, i ∈ {L, R}:
Qi(t+1) = (1-α1)Qi(t) + α1κ1   if a(t)=i, r(t)=1
Qi(t+1) = (1-α1)Qi(t) - α1κ2   if a(t)=i, r(t)=0
Qi(t+1) = (1-α2)Qi(t)          if a(t)≠i

!  Parameters
!  α1: learning rate
!  α2: forgetting rate
!  κ1: reward reinforcement
!  κ2: no-reward aversion
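The update rule above, as a runnable sketch on an invented (90, 50) two-choice block; the parameter values are illustrative, not fitted values from the paper:

```python
import math, random

# Generalized Q-learning: a1 = learning rate, a2 = forgetting rate,
# k1 = reward reinforcement, k2 = no-reward aversion (values invented).
a1, a2, k1, k2 = 0.3, 0.05, 1.0, 0.5

def update(Q, a, r):
    for i in Q:
        if i == a:
            Q[i] = (1 - a1) * Q[i] + (a1 * k1 if r else -a1 * k2)
        else:
            Q[i] = (1 - a2) * Q[i]      # unchosen value decays toward 0

def p_left(Q):
    return math.exp(Q["L"]) / (math.exp(Q["L"]) + math.exp(Q["R"]))

random.seed(1)
Q = {"L": 0.0, "R": 0.0}
p_reward = {"L": 0.9, "R": 0.5}         # e.g. a (90, 50) block
n_left = 0
for t in range(1000):
    a = "L" if random.random() < p_left(Q) else "R"
    r = random.random() < p_reward[a]
    update(Q, a, r)
    n_left += a == "L"
```

The forgetting term keeps the unchosen value drifting toward zero, which is what lets the model track block changes in the task.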

Page 24:

Model Fitting by Particle Filter

[Figure: trial-by-trial estimates of QL, QR and of the parameters α1, α2 over the (90 50), (50 90), (50 10) blocks; choices marked as left/right with reward or no-reward]

Page 25:

Model Fitting

!  Generalized Q-learning
!  α1: learning
!  α2: forgetting
!  κ1: reinforcement
!  κ2: aversion
!  standard: α2 = κ2 = 0
!  forgetting: κ2 = 0

[Figure: normalized likelihood of candidate models (number of parameters in parentheses): 1st-4th order Markov models (4, 16, 64, 256); standard Q (const) (2); F-Q (const) (3); DF-Q (const) (4); local matching law (1); standard Q (variable) (2); F-Q (variable) (2); DF-Q (variable) (2); asterisks mark significant differences]

Page 26:

Neural Activity in the Striatum

!  Dorsolateral

!  Dorsomedial

!  Ventral


Page 27:

Information of Action and Reward

[Figure: mutual information (bits/sec) about action (out of center hole) and reward (into choice hole) over time, for DL (122), DM (56), and NA (59) neurons; task events: poking C, tone, poking L, poking R, pellet dish]

Page 28:

Action Value Coded by a DLS Neuron

[Figure: firing rate during tone presentation (blue) across trials, alongside the action value for left estimated by FQ-learning; panels Q, FSA, DLS]

Page 29:

State Value Coded by a VS Neuron

[Figure: firing rate during tone presentation (blue) across trials, alongside the value estimated by FQ-learning; panels Q, FSA, VS]

Page 30:

Hierarchy in Cortico-Striatal Network

!  Dorsolateral striatum - motor
!  early action coding
!  what action to take?

!  Dorsomedial striatum - frontal
!  action value
!  in what context?

!  Ventral striatum - limbic
!  state value
!  whether worth doing?

(Voorn et al., 2004)

Page 31:

Specialization by Learning Algorithms (Doya, 1999)

!  Cerebellum: supervised learning; input-output mapping trained by an error signal (target - output) via the IO
!  Basal ganglia: reinforcement learning; input-output mapping trained by a reward signal via the SN
!  Cerebral cortex: unsupervised learning on its input-output stream

[Diagram: cortex, basal ganglia, and cerebellum connected through the thalamus]

Page 32:

Multiple Action Selection Schemes

!  Model-free
!  a = argmaxa Q(s,a)

!  Model-based
!  a = argmaxa [ r + V(f(s,a)) ]
!  forward model: f(s,a)

!  Lookup table
!  a = g(s)

[Diagram: Q maps (s,a) to a value; f predicts s' from (s,a) and V evaluates it; g maps s directly to a]
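The three schemes side by side in a toy snippet; the forward model, values, and cached policy below are all invented and chosen to be mutually consistent:

```python
# Toy contrast of the three selection schemes on an invented 2-state world
# where the forward model f(s, a) simply returns the chosen state.
gamma = 0.9
actions = (0, 1)

def f(s, a):                        # forward model (assumed known)
    return a

def r(s, a):                        # immediate reward (invented)
    return 1.0 if a == 1 else 0.0

V = {0: 5.0, 1: 10.0}               # state values, e.g. produced by a critic
Q = {(0, 0): 4.5, (0, 1): 10.0}     # cached action values
g = {0: 1}                          # lookup-table policy ("habit")

s = 0
a_model_based = max(actions, key=lambda a: r(s, a) + gamma * V[f(s, a)])
a_model_free = max(actions, key=lambda a: Q[(s, a)])
a_lookup = g[s]
```

Model-based selection simulates one step ahead with f and V; model-free reads the cached Q; the lookup table skips evaluation entirely, trading flexibility for speed.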

Page 33:

'Grid Sailing' Task (Alan Fermin)
!  Move a cursor to the goal
!  100 points for the shortest path
!  -5 points per excess step
!  Keymap
!  only 3 directions
!  non-trivial path planning
!  Immediate or delayed start
!  4 to 6 sec for planning
!  timeout in 6 sec

[Diagram: keys 1, 2, 3 mapped to three movement directions on the grid]

Page 34:

Task Conditions�

Neural Mechanisms for Model-Free and Model-Based Strategies in Humans Performing a Multi-Step Navigation Task
Alan Fermin, Takehiko Yoshida, Makoto Ito, Junichiro Yoshimoto, Kenji Doya
Nara Institute of Science and Technology; Okinawa Institute of Science and Technology; ATR Computational Neuroscience Laboratories

INTRODUCTION

Humans learn to select actions from scratch, by trial-and-error, or by using knowledge from past experiences. As learning goes on, action selection may rely on predictive mechanisms, such as retrieval of motor memories from well-established sensorimotor maps, or construction and use of internal models in newly encountered environments. Reinforcement Learning (RL), a computational theory of adaptive optimal control, suggests two action selection methods that resemble real human behavior: the Model-Free (MF) method uses action value functions to predict future rewards based on current states and available actions, and the Model-Based (MB) method uses a forward model to predict the future states reached by hypothetical actions. Recent computational models (Doya, 1999; Daw et al., 2005) have hypothesized that MB and MF are implemented in the human brain through cortico-basal ganglia loops, including the prefrontal cortex and dorsal striatum. To test whether and how humans utilize MB and MF strategies, we performed an fMRI experiment using a "grid sailing task".

METHODS

Subjects: 18 (2 female), healthy, non-musician, right-handed. Training (3 blocks, 240 trials) and Test (2 blocks, 180 trials) sessions were separated by a night's sleep interval. Subjects were randomly assigned to one of three groups with the same task conditions. Subjects had to move a cursor (triangle) from the start (red) to the target goal (blue) by sequentially pressing three keys. The instruction was to learn a single optimal action sequence (shortest pathway) for each of the KM-SG (key-map, start-goal) pairs, which were constant throughout the experiment. The start position color specified whether to start a response immediately or after a delay.

[Figure: task layout. Buttons and fingers: INDEX, MIDDLE, RING. Key-maps KM1-KM3 and start-goal positions SG1-SG5; training session pairs KM1-SG1, KM2-SG2, KM3-SG3. Test conditions: Cond 1, New Key-Map; Cond 2, Learned Key-Maps; Cond 3, Learned KM-SG.]

[Figure: sequence of trial events. ITI 3~5 sec; IMMEDIATE or DELAY start cue (4~6 sec); up to 6 sec to reach the goal; reward feedback 2 sec. Scoring: Optimal 100, Fail -5, Time-over 0. Training schedule: blocks TB1-TB3 of immediate/delayed trials in random order, 240 trials; test schedule: blocks TB1-TB2, 180 trials.]

[Table: KM-SG assignments per group (6 subjects each), counterbalancing key-maps KM1-KM3 and start-goal positions SG1-SG5 across Groups 1-3; the test session pairs KM1-SG1, KM2-SG2, KM3-SG3.]

COMPUTATIONAL STRATEGIES, TASK CONDITIONS, AND EXPECTED PERFORMANCE

!  MODEL-FREE (Q-learning, Q(s,a), starting from a random policy): trial-and-error learning for key-map and action sequence; expected learning stage: slow learning. Task condition 1: New Key-Map (KM1).
!  MODEL-BASED (knowledge of the key-map; planning with P(s',r'|s,a)): forward model for planning future actions; expected learning stage: boosted learning. Task condition 2: Known Key-Map.
!  SEQUENCE MEMORY / habit (learned policy, P(a|s)): sequential motor memory for fast action selection; expected learning stage: high performance. Task condition 3: Learned KM-SG.

[Sketches: expected reward (0-100) vs. time for the three conditions]

RESULTS 1: TRAINING

[Figure: distribution of trial types (Optimal, Suboptimal, Fail; % of trials) for Immediate and Delay starts across Blocks 1-3]

[Figure: reaction times (ms) over trials for each action sequence (A-2, A-1, A-1, A-3, A-3) in Conditions 1 and 2]

fMRI DELAY-PERIOD ACTIVITY
Contrasts: Condition 1 > Condition 3; Condition 2 > Condition 3; Condition 3 > [Condition 1, Condition 2]

[Activation maps, left and right hemispheres; Condition 1 (red), Condition 2 (blue). Regions: DLPFC (BA46), ventral PMC (BA44/45), lateral PMC, pre-SMA, proper-SMA, anterior and posterior basal ganglia, anterior and posterior cerebellum, somatosensory cortex, superior parietal cortex.]

REGION OF INTEREST ANALYSIS

http://www.nc.irp.oist.jp/

We confirmed the learning-related improvements and the formation of habit sequences by significant improvements in all performance measures, with an increase in reward acquisition and a decrease in the number of moves to reach the target goal, as well as in the time to start and finish an action sequence. Stable performance was also observed in the short practice before the fMRI test, and after one night's sleep.

A series of 3-way ANOVAs showed a significant effect of Start Time, Task Condition, and Test Block (p < 0.001). A multiple-comparison analysis showed that Condition 1 had the lowest performance, Condition 2 was intermediate, and Condition 3 was the best.

Also, Condition 2 showed a significant improvement in reward acquisition after the delay period, which might indicate a beneficial effect of the delay period on the implementation of a forward-model mechanism for action selection.

Performance in Condition 1 remained constant irrespective of the start time condition.

[Panels: Condition 1, Condition 2, Condition 3]

CONCLUSION: We have developed a "grid-sailing task" to test how humans utilize different strategies for the learning of action selection. Distinct performance profiles were found for each task condition, and each also required specific neural networks for solving the task. Our next step is to perform a reinforcement-learning model-based analysis to examine whether subjects' behavior under each task condition fits well to specific action selection strategies (MF, MB) and the brain areas implementing them.

RESULTS 2: TEST

Task conditions differentially modulated the BOLD signal during the delay period. Condition 2 showed sustained and stronger activation in DLPFC, VPMC, LPMC, anterior BG, and posterior CB. Condition 3 activated more motor-related structures, and Condition 1 showed the lowest, near-baseline activation patterns.

!"#$%&'()*$+,-./&'0)1234*!(

!"#$%&'(")*%!+,(,'-.$'(./"&0-$""'%!/'(./"&01%,"/'

,2$%2"3+",'+!'*#(%!,'4"$-.$(+!3'%'(#&2+0,2"4'!%5+3%2+.!'2%,6%789'-:;<=9>?@?'28A:B=AC DCEB=F8>?@?'(8ACGC'+GC@?'HI9=JB=;C DCEB=<CGC>?@?'6:9K='/CL8>?@?M

>!8;8'+9EG=GIG:'CN',J=:9J:'89F'2:JB9C7COL?'@.A=98P8'+9EG=GIG:'CN',J=:9J:'89F'2:JB9C7COL?'M%2$')C<QIG8G=C987'!:I;CEJ=:9J:'&8RC;8GC;=:E

56789:);<8=9)>?):<;<@>)8@>A?9:)B=?7):@=8>@CD)EF)>=A8;489G4<==?=D)?=)EF)6:A9H)

I9?J;<GH<)B=?7)K8:>)<LK<=A<9@<:M)N:);<8=9A9H)H?<:)?9D)8@>A?9):<;<@>A?9)78F)

=<;F)?9)K=<GA@>AO<)7<@C89A:7:D):[email protected])8:)=<>=A<O8;)?B)7?>?=)7<7?=A<:)B=?7)J<;;)

<:>8E;A:C<G):<9:?=A7?>?= 78K:D)?=)@?9:>=6@>A?9)89G)6:<)?B)A9><=98;)7?G<;:)A9)

9<J;F)<9@?69><=<G)<9OA=?97<9>:M)"<A9B?=@<7<9>)P<8=9A9H)Q"PRD)8)

@?7K6>8>A?98;)>C<?=F)?B)8G8K>AO<)?K>A78;)@?9>=?;D):6HH<:>:)>J?)8@>A?9)

:<;<@>A?9)7<>C?G:)>C8>)=<:<7E;<)=<8;)C6789)E<C8OA?=S)2?G<;4T=<<)Q2TR)

7<>C?G)6:<:)8@>A?9)O8;6<)B69@>A?9:)>?)K=<GA@>)B6>6=<)=<J8=G:)E8:<G)?9)@6==<9>)

:>8><:)89G)8O8A;8E;<)8@>A?9:D)89G)2?G<;4+8:<G)Q2+R)7<>C?G)6:<:)8)B?=J8=G)

7?G<;)>?)K=<GA@>)>C<)B6>6=<):>8><:)=<[email protected]<G)EF)CFK?>C<>[email protected];)8@>A?9:M)"<@<9>)

@?7K6>8>A?98;)7?G<;:)QU?F8D)0VVVD)U8J <>)8;D)(''WR)C8O<)CFK?>C<:AX<G)>C8>)

2+)89G)2T)8=<)A7K;<7<9><G)A9)>C<)C6789)E=8A9)>C=?6HC)@?=>A@?4E8:8;)H89H;A8)

;??K:D)A9@;6GA9H)>C<)K=<B=?9>8;)@?=><L)89G)G?=:8;):>=A8>67M)/?)><:>)JC<>C<=)89G)

C?J)C6789:)6>A;AX<)2+)89G)2T):>=8><HA<:D)J<)K<=B?=7<G)89)B2"Y <LK<=A7<9>)

6:A9H)8)ZH=AG):8A;A9H)>8:I[M

+!2$./#)2+.!

("2*./,*6E\<@>:S)0])Q()B<78;<R)C<8;>CFD)9?9476:[email protected])=AHC>4C89G<GD)8H<)(W!3F?M))

/=8A9A9H)Q3)E;[email protected]:D)(^')>=A8;:R)89G)/<:>)Q()E;[email protected]:D)0]')>=A8;:R):<::A?9:)J<=<)

:<K8=8><G)EF)8)9AHC>):;<<K)A9><=O8;M)*6E\<@>:)J<=<)=89G?7;F)8::AH9<G)>?)?9<)?B)

>C=<<)H=?6K:)JA>C)>C<):87<)>8:I)@?9GA>A?9:M)*6E\<@>:)C8G)>?)7?O<)8)@6=:?=)

Q>=A89H;<R)B=?7)>C<):>8=>)Q=<GR)>?)>C<)>8=H<>)H?8;)QE;6<R)EF):<_6<9>A8;;F)K=<::A9H)

>C=<<)I<F:M)Y9:>=6@>A?9)J8:)>?);<8=9)8):A9H;<)?K>A78;)8@>A?9):<_6<9@<)Q:C?=><:>)

K8>CJ8FR)B?=)<[email protected])?B)>C<)124*!)K8A=:D)[email protected])J<=<)@?9:>89>)>C=?6HC?6>)>C<)

<LK<=A7<9>M)/C<):>8=>)K?:A>A?9)@?;?=):K<@ABA<G)JC<>C<=)>?):>8=>)8)=<:K?9:<)

A77<GA8><;F)?=)8B><=)8)G<;8F)Q`:<@RM

INDEX MIDDLE RING

Buttons & Fingers

Key-Maps

Start-Goal Positions

SG1 SG2 SG3

2

SG4 SG5

KM1

KM2

SG1

SG2SG4

SG5

KM1 SG1

KM2 SG2

KM3 SG3

Training Session Test Session

KM-SG combinations

Cond 2: Learned Key-Maps

Cond 3: Learned KM-SG

Test Conditions

Cond 1: New Key-Map

31

S

ITI DELAY

Goal Reach

3~5 sec 4~6 sec 6 sec

2 sec

Reward

IMMEDIATE

No Reach

S

S

S S

S

Training Schedule Test Schedule

TB1 TB2

=<<:F=8G: F:78L:F

Random Random

1 2

4 trials

240 trials

4 trials=<<:F=8G: F:78L:F

Random Random

19 20

4 trials 4 trials=<<:F=8G: F:78L:F

Random Random

1 2

9 trials 9 trials=<<:F=8G: F:78L:F

Random Random

9 10

9 trials 9 trials

TB1 TB2 180 trials

Sequence of Trial Events

S

T

>U

VU

U

Time-over: 0 Fail: -5 Optimal: 100

TB3

Group 1

(6 subj.)

KM2

KM3

SG3

SG2SG5

SG4

Group 2

(6 subj.)

KM1

KM3

SG3

SG1SG5

SG4

Group 3

(6 subj.)

KM1 SG1

KM2 SG2

KM3 SG3

KM1 SG1

KM2 SG2

KM3 SG3

KM1 KM2 KM3

[Computational strategies and learning stages. MODEL-FREE: sequence memory (habit) and Q-learning of Q(s,a); MODEL-BASED: planning with a learned model P(s',r'|s,a) and a learned policy P(a|s). Learning stages progress from RANDOM (EXPLORATION) through Q-LEARNING to KNOWLEDGE (EXPLOITATION), with rising expected performance.

Computational strategies: trial-and-error learning for key-map and action sequence; forward model for planning future actions; sequential motor memory for fast action selection.

Task conditions: Condition 1 (New Key-Map), Condition 2 (Known Key-Map), Condition 3 (Learned KM-SG), each defined by specific KM-SG pairings.

Predicted reward-over-time profiles: Condition 1: SLOW LEARNING; Condition 2: BOOSTED LEARNING; Condition 3: HIGH PERFORMANCE.]
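The contrast between Q-learning of Q(s,a) and model-based planning with P(s',r'|s,a) can be made concrete with a toy example. The 4-state key-press chain below is hypothetical (not the actual grid-sailing task); it shows both strategies converging on the same optimal key sequence:

```python
import random

# Hypothetical deterministic chain: states 0..3, actions 0/1;
# action 1 advances toward the goal (reward 100 at state 3), action 0 stays.
N_STATES, GOAL = 4, 3

def step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else s
    r = 100 if s2 == GOAL and s != GOAL else 0
    return s2, r

# Model-free: tabular Q-learning from trial and error.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in (0, 1)}
alpha, gamma = 0.5, 0.9
random.seed(0)
for _ in range(500):
    s = 0
    for _ in range(10):
        a = random.choice((0, 1))            # pure exploration
        s2, r = step(s, a)
        best_next = max(Q[(s2, 0)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# Model-based: planning with the known model (value iteration).
V = [0.0] * N_STATES
for _ in range(100):
    V = [max(r + gamma * V[s2] for s2, r in (step(s, 0), step(s, 1)))
         for s in range(N_STATES)]

mf_policy = [max((0, 1), key=lambda a: Q[(s, a)]) for s in range(GOAL)]
mb_policy = [max((0, 1), key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
             for s in range(GOAL)]
print(mf_policy, mb_policy)
```

Model-free learning needs many trial-and-error visits to each state, while the model-based planner extracts the same policy from the known transition model in a few sweeps, matching the panel's slow-learning vs. planning distinction.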

RESULTS - TRAINING

[Trial-type distribution (% Optimal / Suboptimal / Fail) across training Blocks 1-3, shown separately for Immediate and Delay start trials.]

[Reaction times (ms) by trial and by action sequence (A-2, A-1, A-1, A-3, A-3) for each condition.]

fMRI - DELAY PERIOD ACTIVITY

[Contrasts: Condition 1 > Condition 3; Condition 2 > Condition 3; Condition 3 > (Condition 1, Condition 2). Activated regions (left and right hemispheres): DLPFC (BA46), ventral PMC (BA44/45), lateral PMC, pre-SMA, SMA-proper, superior parietal cortex, somatosensory cortex, anterior and posterior basal ganglia, anterior and posterior cerebellum.]

REGION OF INTEREST ANALYSIS

[Left and right hemisphere ROI time courses for Condition 1 and Condition 2.]

We confirmed the learning-related improvements and formation of habit sequences by significant improvements in all performance measures, with an increase in reward acquisition and a decrease in the number of moves to reach the target goal, as well as in the time to start and finish an action sequence. The stable performance was also observed in the short practice before the fMRI test, and after one night sleep.

A series of 3-way ANOVAs showed a significant effect of Start Time, Task Conditions, and Test Blocks (p < 0.001).

A multi-comparison analysis showed that Condition 1 had the lowest performance, Condition 2 was intermediate, and Condition 3 was the best.

Also, Condition 2 showed a significant improvement in reward acquisition after the delay period, which might indicate a beneficial effect of the delay period on the implementation of a forward model mechanism for action selection.

Performance in Condition 3 remained constant and high, irrespective of the start time condition.

CONDITION 1 / CONDITION 2 / CONDITION 3

CONCLUSION - We have developed a "grid-sailing task" to test how humans utilize different strategies for the learning of action selection. Distinct performance profiles for each task condition were found, and each also required specific neural networks for solving the task. Our next step is to perform a reinforcement learning model-based analysis to seek whether subjects' behavior under each task condition fits well to specific action selection strategies (MF, MB) and the brain areas implementing them.

RESULTS - TEST

Task conditions differentially modulated the BOLD signal during the delay period. Condition 2 showed sustained and stronger activation in DLPFC, ventral PMC, lateral PMC, anterior basal ganglia, and posterior cerebellum. Condition 3 activated more motor-related structures, and Condition 1 showed the lowest activation, near to baseline patterns.

Page 35: Brain and Reinforcement Learning (MLSS 2012 in Kyoto)

Effect of Pre-start Delay Time (Condition 1: New key-map; Condition 2: Learned key-map; Condition 3: Learned KM-SG)

[Test results: mean reward gain per test block (Blocks 1-2) for Conditions 1-3 (** significant); % optimal goal reach over trials 1-10; % reward score by trial type (Error, Suboptimal, Optimal).]

Page 36

[Figure panels: (A) Condition 1 vs. Condition 3; (B) Condition 2 vs. Condition 3; (C), (D).]

Delay Period Activity

!  Condition 2 !  DLPFC

!  PMC

!  parietal

!  anterior striatum

!   lateral cerebellum

Page 37

POMDP by Cortex-Basal Ganglia (Rao, 2010)

!  Belief state update by the cortex

Frontiers in Computational Neuroscience, www.frontiersin.org, November 2010 | Volume 4 | Article 146

Rao - Decision making under uncertainty

where v denotes the vector of output firing rates, o denotes the input observation vector, f1 is a potentially non-linear function describing the feedforward transformation of the input, M is the matrix of recurrent synaptic weights, and g1 is a dendritic filtering function. The above differential equation can be rewritten in discrete form as:

v_t(i) = f(o_t) + g(Σ_j M(i,j) v_{t-1}(j))    (4)

where v_t(i) is the ith component of the vector v_t, f and g are functions derived from f1 and g1, and M(i,j) is the synaptic weight value in the ith row and jth column of M. To make the connection between Eq. (4) and Bayesian inference, note that the belief update Eq. (2) requires a product of two sources of information (current observation and feedback) whereas Eq. (4) involves a sum of observation- and feedback-related terms. This apparent divide can be bridged by performing belief updates in the log domain:

log b_t(i) = log P(o_t|s_t = i) + log Σ_j T(j, a_{t-1}, i) b_{t-1}(j) + log k

This suggests that Eq. (4) could neurally implement Bayesian inference over time as follows: the log likelihood log P(o_t|s_t = i) is computed by the feedforward term f(o_t), while the feedback g(Σ_j M(i,j) v_{t-1}(j)) conveys the log of the predicted distribution, i.e., log Σ_j T(j, a_{t-1}, i) b_{t-1}(j), for the current time step. The latter is computed from the activities v_{t-1}(j) from the previous time step and the recurrent weights M(i,j), which is defined for each action a. The divisive normalization in Eq. (2) reduces in the equation above to the log k term, which is subtractive and could therefore be implemented via inhibition.
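The product-vs-sum argument can be checked numerically. The sketch below fixes a single action (so the transition matrix is constant) and verifies that the log-domain update with subtractive normalization reproduces Eq. (2); the matrices and vectors are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
T = rng.random((n, n))
T /= T.sum(axis=0, keepdims=True)      # T[i, j] = P(s_t = i | s_{t-1} = j)
lik = rng.random(n)                    # P(o_t | s_t = i)
b_prev = rng.random(n)
b_prev /= b_prev.sum()                 # b_{t-1}

# Probability domain, Eq. (2): b_t(i) = k * P(o|s=i) * sum_j T(i,j) b_{t-1}(j)
b = lik * (T @ b_prev)
b /= b.sum()

# Log domain: log b_t(i) = log P(o|s=i) + log sum_j T(i,j) b_{t-1}(j) + log k
log_b = np.log(lik) + np.log(T @ b_prev)
log_b -= np.log(np.exp(log_b).sum())   # subtractive normalization ("inhibition")
assert np.allclose(np.exp(log_b), b)
```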

A neural model as sketched above for approximate Bayesian infer-ence but using a linear recurrent network was first explored in (Rao, 2004). Here we have followed the slightly different implementation in (Rao, 2005) that uses the non-linear network given by Eq. (3). As shown in (Rao, 2005), if one interprets Eq. (3) as the membrane potential dynamics in a stochastic integrate-and-fire neuron model,

Before proceeding to the model, we note that the space of beliefs is continuous (each component of the belief state vector is a prob-ability between 0 and 1) and typically high-dimensional (number of dimensions is one less than the number of states). This makes the problem of finding optimal policies very difficult. In fact, finding exact solutions to general POMDP problems has been proved to be a computationally hard problem (e.g., the finite-horizon case is “PSPACE-hard”; Papadimitriou and Tsitsiklis, 1987). However, one can typically find approximate solutions, many of which work well in practice. Our model is most closely related to a popular class of approximation algorithms known as point-based POMDP solvers (Hauskrecht, 2000; Pineau et al., 2003; Spaan and Vlassis, 2005; Kurniawati et al., 2008). The idea is to discretize the belief space with a finite set of belief points and compute value for these belief points rather than the entire belief space. For learning value, our model relies on the temporal-difference (TD) framework (Sutton and Barto, 1981; Sutton, 1988; Sutton and Barto, 1998) in rein-forcement learning theory, a framework that has also proved useful in understanding dopaminergic responses in the primate brain (Schultz et al., 1997).

Neural computation of belief

A prerequisite for a neural POMDP model is being able to compute the belief state b_t in neural circuitry. Several models have been proposed for neural implementation of Bayesian inference (see Rao, 2007 for a review). We focus here on one potential implementation. Recall that the belief state is updated at each time step according to the following equation:

b_t(i) = k P(o_t|s_t = i) Σ_j T(j, a_{t-1}, i) b_{t-1}(j)    (2)

where b_t(i) is the ith component of the belief vector b_t and represents the posterior probability of state i. Equation (2) combines information from the current observation (P(o_t|s_t)) with feedback from the past time step (b_{t-1}), suggesting a neural implementation based on a recurrent network, for example, a leaky integrator network:

dv/dt = -v + f1(o) + g1(Mv)    (3)

[Figure 1 schematic: (A) Agent and World: the World T(s',a,s) returns observation o and reward r for action a. (B) The Agent's belief state estimator SE computes belief b_t from observation and reward, and the policy maps b_t to an action.]

FIGURE 1 | The POMDP model. (A) When the animal executes an action a in the state s', the environment ("World") generates a new state s according to the transition probability T(s',a,s). The animal receives an observation o of the new state according to P(o|s) and a reward r = R(s',a). (B) In order to solve the POMDP problem, the animal maintains a belief b_t which is a probability distribution over states of the world. This belief is computed iteratively using Bayesian inference by the belief state estimator SE. An action for the current time step is provided by the learned policy π, which maps belief states to actions.


We model the task using a POMDP as follows: there are two underlying hidden states representing the two possible directions of coherent motion (leftward or rightward). In each trial, the experi-menter chooses one of these hidden states (either leftward or right-ward) and provides the animal with observations of this hidden state in the form of an image sequence of random dots at the chosen coherence. Note that the hidden state remains the same until the end of the trial. Using only the sequence of observed images seen so far, the animal must choose one of the following actions: sample one more time step (to reduce uncertainty), make a leftward eye move-ment (indicating choice of leftward motion), or make a rightward eye movement (indicating choice of rightward motion).

We use the notation S_L to represent the state corresponding to leftward motion and S_R to represent rightward motion. Thus, at any given time t, the state s_t can be either S_L or S_R (although within a trial, the state once selected remains unchanged). The animal receives noisy measurements or observations o_t of the hidden state based on P(o_t|s_t, c_t), where c_t is the current coherence value. We assume the coherence value is randomly chosen for each trial from a set {C_1, C_2, …, C_Q} of possible coherence values, and remains the same within a trial. At each time step t, the animal must choose from one of three actions {A_S, A_L, A_R} denoting sample, leftward eye movement, and rightward eye movement respectively.

The animal receives a reward for choosing the correct action, i.e., action A_L when the true state is S_L and action A_R when the true state is S_R. We model this reward as a positive number (e.g., between 10 and 30; here, 20). An incorrect choice produces a large penalty (e.g., between -100 and -400; here, -400) simulating the time-out used for errors in monkey experiments (Roitman and Shadlen, 2002). We assume the animal is motivated by hunger or thirst to make a decision as quickly as possible. This is modeled using a small negative reward (penalty of -1) for each time step spent sampling. We have experimented with a range of reward/punishment values and found that the results remain qualitatively the same as long as there is a large penalty for incorrect decisions, a moderate positive reward for correct decisions, and a small penalty for each time step spent sampling.

The transition probabilities P(s_t|s_{t-1}, a_{t-1}) for the task are as follows: the state remains unchanged (self-transitions have probability 1) as long as the sample action A_S is executed. Likewise, P(c_t|c_{t-1}, A_S) = 1. When the animal chooses A_L or A_R, a new trial begins, with a new state (S_L or S_R) and a new coherence C_k (from {C_1, C_2, …, C_Q}) chosen uniformly at random.

In the first set of experiments, we trained the model on 6000 trials of leftward or rightward motion. Inputs o_t were generated according to P(o_t|s_t, c_t) based on the current coherence value and state (direction). For these simulations, o_t was one of two values O_L and O_R corresponding to observing leftward and rightward motion respectively. The probability P(o = O_L|s = S_L, c = C_k) was fixed to a value between 0.5 and 1 based on the coherence value C_k, with 0.5 corresponding to 0% coherence and 1 corresponding to 100%. The probability P(o = O_R|s = S_R, c = C_k) was defined similarly.

The belief state b_t over the unknown direction of motion was computed using a slight variant of Eq. (2) using the current input o_t, the known coherence value c_t = C_k, the known transition

responses. Interestingly, results from such an experiment have recently been published by Nomoto et al. (2010). We compare their results to the model’s predictions in a section below.

RESULTS

THE RANDOM DOTS TASK

We tested the neural POMDP model derived above in the well-known random dots motion discrimination task used to study decision making in primates (Shadlen and Newsome, 2001). We focus specifically on the reaction-time version of the task (Roitman and Shadlen, 2002) where the animal can choose to make a decision at any time. In this task, the stimulus consists of an image sequence showing a group of moving dots, a fixed fraction of which are randomly selected at each frame and moved in a fixed direction (for example, either left or right). The rest of the dots are moved in random directions. The fraction of dots moving in the same direction is called the motion strength or coherence of the stimulus.

The animal’s task is to decide the direction of motion of the coherently moving dots for a given input sequence. The animal learns the task by being rewarded if it makes an eye movement to a target on the left side of its fixation point if the motion is to the left, and to a target on the right if the motion is to the right. A wealth of data exists on the psychophysical performance of humans and monkeys on this task, as well as the neural responses observed in brain areas such as MT and LIP in monkeys performing this task (see Roitman and Shadlen, 2002; Shadlen and Newsome, 2001 and references therein).

EXAMPLE I: RANDOM DOTS WITH KNOWN COHERENCE

In the first set of experiments, we illustrate the model using a simplified version of the random dots task where the coherence value chosen at the beginning of the trial is known. This reduces the problem to that of deciding from noisy observations the underlying direction of coherent motion, given a fixed known coherence. We tackle the case of unknown coherence in a later section.
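A minimal sketch of this belief computation, assuming binary observations and a fixed known coherence as in the text (the specific coherence value and trial length are arbitrary choices):

```python
import numpy as np

# With known coherence and a fixed hidden state, the transition term in
# Eq. (2) is the identity, so the update is plain Bayes' rule per observation.
rng = np.random.default_rng(0)
coherence = 0.6
p_OL_given_SL = 0.5 + coherence / 2    # 0.5 at 0% coherence, 1.0 at 100%
true_state = "SL"                      # the experimenter's choice for this trial

b = np.array([0.5, 0.5])               # belief over (SL, SR)
for _ in range(30):
    p_OL = p_OL_given_SL if true_state == "SL" else 1 - p_OL_given_SL
    o = "OL" if rng.random() < p_OL else "OR"
    lik = np.array([p_OL_given_SL, 1 - p_OL_given_SL])   # P(o = OL | s)
    if o == "OR":
        lik = 1 - lik
    b = lik * b
    b /= b.sum()                       # the normalization constant k

print(np.round(b, 4))
```

With coherence well above zero the belief typically concentrates on the true direction within a few tens of samples; lower coherence slows the accumulation, mirroring the longer reaction times in the task.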

[Figure 3 schematic: Cortex (Belief) projects to Striatum (basis belief points b_i*, Value) and STN; SNc/VTA carries DA (TD error); GPe and GPi/SNr, with the Thalamus, select the Action.]

FIGURE 3 | Suggested mapping of elements of the model to components of the cortex-basal ganglia network. STN, subthalamic nucleus; GPe, globus pallidus, external segment; GPi, Globus pallidus, internal segment; SNc, substantia nigra pars compacta; SNr, substantia nigra pars reticulata; VTA, ventral tegmental area; DA, dopaminergic responses. The model suggests that cortical circuits compute belief states, which are provided as input to the striatum (and STN). Cortico-striatal synapses are assumed to maintain a compact learned representation of cortical belief space (“basis belief points”). The dopaminergic responses of SNc/VTA neurons are assumed to represent reward prediction (TD) errors based on striatal activations. The GPe-GPi/SNr network, in conjunction with the thalamus, is assumed to implement action selection based on the compact belief representation in the striatum/STN.


g_i(t) = exp(-(t - t_i*)² / (2σ_i²))

where σ_i² is a variance parameter. In the simulations, σ_i² was set to progressively larger values for larger t_i*, loosely inspired by the fact that an animal’s uncertainty about time increases with elapsed time (Leon and Shadlen, 2003). Specifically, both t_i* and σ_i² were set arbitrarily to 1.25i for i = 1, …, 65. The number of hidden units for direction, coherence, and time was 65. The deadline T was set to 20 time steps, with a large penalty (-2000) if a left/right decision was not reached before the deadline. Other parameters included: +20 reward for correct decisions, -400 for errors, -1 for each sampling action, and remaining parameter settings of 2.5 × 10⁻⁵, 4 × 10⁻⁸, 1 × 10⁻⁵, 1, 1.5, and 0.08. The model was trained on 6000 trials, with motion direction (Left/Right) and coherence (Easy/Hard) selected uniformly at random for each trial.
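A direct transcription of the temporal basis, with both t_i* and σ_i² set to 1.25i as stated in the text:

```python
import math

# Gaussian temporal basis: g_i(t) = exp(-(t - t_i*)^2 / (2 * sigma_i^2)),
# with t_i* and sigma_i^2 both set to 1.25*i for i = 1..65.
def g(i, t):
    t_star = 1.25 * i
    var = 1.25 * i              # later-preferring neurons are more broadly tuned
    return math.exp(-(t - t_star) ** 2 / (2 * var))

assert abs(g(4, 5.0) - 1.0) < 1e-12   # peak response at the preferred time t_4* = 5
assert g(40, 51.0) > g(4, 6.0)        # one unit off-peak: late units respond more broadly
```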

Figures 18B,C show the learned value function for the beginning (t = 1) and near the end of a trial (t = 19) before the deadline at t = 20. The shape of the value function remains approximately the same, but the overall value drops noticeably over time. Figure 18D illustrates this progressive drop in value over time for a slice through the value function at Belief(E) = 1.

Finally, in the case of an error trial, the model predicts that the absence of reward (or presence of a negative reward/penalty as in the simulations) should cause a negative reward prediction error and this error should be slightly larger for the higher coherence case due to its higher expected value (see Figure 13). This prediction is shown in Figure 17C, which compares average reward prediction (TD) error at the end of error trials for the 60% coherence case (left) and the 8% coherence case (right). The population dopamine responses for error trials are depicted by red traces in Figure 17B.

EXAMPLE III: DECISION MAKING UNDER A DEADLINE

Our final set of results illustrates how the model can be extended to learn time-varying policies for tasks with a deadline. Suppose a task has to be solved by time T (otherwise, a large penalty is incurred). We will examine this situation in the context of a random dots task where the animal has to make a decision by time T in each trial (in contrast, both the experimental data and simulations discussed above involved the dots task with no deadline).

Figure 18A shows the network used for learning the value function. Note the additional input node representing elapsed time t, measured from the start of the trial. The network includes basis neurons for elapsed time, each neuron preferring a particular time t_i*. The activation function is the same as before:

[Figure 16 panels: (A, B) Model: average TD error vs. time steps; Monkey: average firing rate (sp/s) vs. time (ms); Easy (60%) vs. Hard (8%) coherence in the model, 50% vs. 5% in the monkey.]

FIGURE 16 | Reward prediction error and dopamine responses in the random dots task. (A) The plot on the left shows the temporal evolution of reward prediction (TD) error in the model, averaged over trials with “Easy” motion coherence (coherence = 60%). The dotted line shows the average reaction time. The plot on the right shows the average firing rate of dopamine neurons in SNc in a monkey performing the random dots task at 50% motion coherence. The dashed line shows average time of saccade onset. The gray trace and line show the case where the amount of reward was reduced (not modeled here; see Nomoto et al., 2010). (B) (Left) Reward prediction (TD) error in the model averaged over trials with “Hard” motion coherence (coherence = 8%). (Right) Average firing rate of the same dopamine neurons as in (A) but for trials with 5% motion coherence. (Dopamine plots adapted from Nomoto et al., 2010).

[Figure 17 panels: (A) average TD error vs. time steps from reward, 60% and 8% coherence; (B) average firing rate vs. time steps from reward tone, 50%, 15%, and 5% coherence; (C) average TD error vs. time steps from error.]

FIGURE 17 | Reward prediction error at the end of a trial. (A) Average reward prediction (TD) error at the end of correct trials for the 60% coherence case (left) and the 8% coherence case (right). The vertical line denotes the time of reward delivery. Right after reward delivery, TD error is larger for the 8% coherence case due to its smaller expected value (see Figure 13). (B) Population dopamine responses of SNc neurons in the same monkey as in Figure 16 but after the monkey has made a decision and a feedback tone indicating reward or no reward is presented. (Plots adapted from Nomoto et al., 2010). The black and red traces show dopamine response for correct and error trials respectively. The gray traces show the case where the amount of reward was reduced (see (Nomoto et al., 2010) for details). (C) Average reward prediction (TD) error at the end of error trials for the 60% coherence case (left) and the 8% coherence case (right). The absence of reward (or negative reward/penalty in the current model) causes a negative reward prediction error, with a slightly larger error for the higher coherence case due to its higher expected value (see Figure 13).

Page 38

Embodied Evolution (Elfwing et al., 2011)

!  Population: physical robots plus 15-25 virtual agents

!  Genes: weights for top-layer NN (w1, w2, …, wn), weights shaping rewards (v1, v2, …, vn), and meta-parameters (α, γ, λ, τ_k, τ_0)

Figure 1. Two physical robots with six energy sources and the neural network controller. (a) The Cyber Rodent robots used in the experiments were equipped with infrared communication for the exchange of genotypes and cameras for visual detection of energy sources (blue), tail-lamps of other robots (green), and faces of other robots (red). (b) The control architecture consisted of a linear artificial neural network. The output of the network was the weighted sum (Σ_i w_i x_i) of the five network inputs (x_i) and the five evolutionarily tuned neural network weights (w_i). In each time step, if the output was less than or equal to zero then the foraging module was selected; otherwise the mating module was selected. The basic behaviors were learned by reinforcement learning with the aid of evolutionarily tuned additional reward signals and meta-parameters. The foraging module learned a foraging behavior for capturing energy sources. The mating module learned both a mating behavior for the exchange of genotypes, when a face of another robot was visible, and a waiting behavior, when no face was visible.
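The module-selection rule described in the caption reduces to a sign test on a weighted sum. The weights and inputs below are made-up illustrative values, not the evolved ones:

```python
# Top-layer controller sketch: choose the behavior module by the sign of
# the weighted sum of the five inputs (weights/inputs are hypothetical).
def select_module(w, x):
    out = sum(wi * xi for wi, xi in zip(w, x))
    return "mating" if out > 0 else "foraging"

w = [0.8, -0.5, 1.2, 0.1, -0.3]      # evolutionarily tuned weights (hypothetical)
print(select_module(w, [0, 0, 0, 0, 0]))   # output <= 0 -> foraging
print(select_module(w, [0, 0, 1, 0, 0]))   # positive output -> mating
```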

Page 39

Temporal Discount Factor γ

!   Large γ !  reach for far reward

!   Small γ !  only to near reward

Page 40

Temporal Discount Factor γ

!  V(t) = E[ r(t) + γr(t+1) + γ²r(t+2) + γ³r(t+3) + …] !  controls the ‘character’ of an agent

[Plots: value over steps 1-4 for two reward sequences (-20, -20, -20, +100; and +50 at the first step with -100 at the fourth), each under large and small γ.]

γ large: V = 18.7 (“no pain, no gain!”), V = -22.9 (“stay away from danger”)

γ small: V = -25.1 (“better stay idle”: Depression?), V = 47.3 (“can’t resist temptation”: Impulsivity?)
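The four V values on the slide can be reproduced from the definition of V(t). The reward sequences and the γ values (0.9 for "γ large", 0.3 for "γ small") are inferred from the printed numbers rather than stated on the slide:

```python
# V = E[r(t) + γ r(t+1) + γ² r(t+2) + ...] for the two 4-step sequences.
def V(rewards, gamma):
    return sum(r * gamma ** t for t, r in enumerate(rewards))

pain_then_gain = [-20, -20, -20, 100]    # "no pain, no gain"
tempt_then_danger = [50, 0, 0, -100]     # immediate gain, later large loss

print(round(V(pain_then_gain, 0.9), 1))      # large γ: 18.7
print(round(V(tempt_then_danger, 0.9), 1))   # large γ: -22.9
print(round(V(pain_then_gain, 0.3), 1))      # small γ: -25.1
print(round(V(tempt_then_danger, 0.3), 1))   # small γ: 47.3
```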

Page 41

Neuromodulators for Metalearning (Doya, 2002)

!  Metaparameter tuning is critical in RL !   How does the brain tune them?

Dopamine: TD error δ; Acetylcholine: learning rate α; Noradrenaline: exploration β; Serotonin: temporal discount γ
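A sketch of where each proposed neuromodulator would enter a standard RL update (the task and numbers are placeholders; the mapping itself is the hypothesis of Doya, 2002):

```python
import math, random

# dopamine ~ TD error delta; acetylcholine ~ learning rate alpha;
# noradrenaline ~ inverse temperature beta; serotonin ~ discount gamma.
def softmax_choice(qs, beta, rng):
    """Boltzmann action selection; larger beta -> greedier (less exploration)."""
    ps = [math.exp(beta * q) for q in qs]
    r, acc = rng.random() * sum(ps), 0.0
    for a, p in enumerate(ps):
        acc += p
        if r <= acc:
            return a
    return len(ps) - 1

def td_update(Q, s, a, r, s2, alpha, gamma):
    """One Q-learning step; returns the TD error (the 'dopamine' signal)."""
    delta = r + gamma * max(Q[s2]) - Q[s][a]
    Q[s][a] += alpha * delta
    return delta

Q = [[0.0, 0.0], [0.0, 0.0]]
delta = td_update(Q, 0, 1, 1.0, 1, alpha=0.5, gamma=0.9)
print(delta, Q[0][1])
```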

Page 42

Markov Decision Task (Tanaka et al., 2004)

!  Stimulus and response: +100 yen reward displays; trial timing 2 s / 1 s / 1 s (or 1 s / 0.5 s / 0.5 s); two actions, a1 and a2

!  State transition and reward functions over three states s1-s3: one condition yields only small rewards and losses (+20/-20) at each step, while the other requires repeated -20 steps to obtain a large +100 reward (with +20 steps leading to a -100 loss)
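The γ-dependence of the optimal policy can be illustrated on a hypothetical reconstruction of the LONG condition; the exact transition structure below is an assumption for illustration, not taken from the paper:

```python
# Assumed 3-state cyclic task (indices 0,1,2 stand for s1,s2,s3): action a2
# advances s1 -> s2 -> s3 -> s1 with rewards -20, -20, +100; action a1 goes
# backward s2 -> s1 and s3 -> s2 with +20, and s1 -> s3 with -100.
R  = {(0, "a1"): -100, (0, "a2"): -20,
      (1, "a1"):  +20, (1, "a2"): -20,
      (2, "a1"):  +20, (2, "a2"): +100}
S2 = {(0, "a1"): 2, (0, "a2"): 1,
      (1, "a1"): 0, (1, "a2"): 2,
      (2, "a1"): 1, (2, "a2"): 0}

def greedy_at_s2(gamma, iters=300):
    """Value iteration, then the greedy action in state s2 (index 1)."""
    V = [0.0, 0.0, 0.0]
    for _ in range(iters):
        V = [max(R[s, a] + gamma * V[S2[s, a]] for a in ("a1", "a2"))
             for s in range(3)]
    return max(("a1", "a2"), key=lambda a: R[1, a] + gamma * V[S2[1, a]])

print(greedy_at_s2(0.9))   # far-sighted: accept -20 now to reach +100 later
print(greedy_at_s2(0.3))   # myopic: grab the immediate +20
```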

Page 43

Block-Design Analysis

!  Different areas for immediate/future reward prediction

SHORT vs. NO Reward (p < 0.001 uncorrected): OFC, Insula, Striatum, Cerebellum

LONG vs. SHORT (p < 0.0001 uncorrected): Cerebellum, Striatum, Dorsal raphe, DLPFC, VLPFC, IPC, PMd

Page 44

Model-based Explanatory Variables

!  Reward prediction V(t)

!  Reward prediction error δ(t)

[Regressor time courses over trials, computed at γ = 0, 0.3, 0.6, 0.8, 0.9, and 0.99; the dynamic range of both variables grows with γ.]
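One way such regressors can be generated is to run a TD(0) learner over the trial sequence and record its value estimate and TD error for each candidate γ; the 3-state cyclic task and its rewards below are assumed for illustration:

```python
import numpy as np

# A TD(0) learner estimates state values online; its running estimate V(t)
# and TD error delta(t) = r(t) + gamma*V(t+1) - V(t) serve as regressors,
# computed once per candidate gamma.
def td_regressors(states, rewards, n_states, gamma, alpha=0.1):
    V = np.zeros(n_states)
    v_reg, d_reg = [], []
    for t in range(len(states) - 1):
        s, s2, r = states[t], states[t + 1], rewards[t]
        delta = r + gamma * V[s2] - V[s]
        v_reg.append(V[s])
        d_reg.append(delta)
        V[s] += alpha * delta
    return np.array(v_reg), np.array(d_reg)

states = [0, 1, 2] * 500                  # repeated s1 -> s2 -> s3 cycle
rewards = [-20, -20, 100] * 500
for gamma in (0.0, 0.3, 0.6, 0.8, 0.9, 0.99):
    v_reg, d_reg = td_regressors(states, rewards, 3, gamma)
    # TD errors shrink as the value estimate is learned
    assert np.mean(np.abs(d_reg[-3:])) < np.mean(np.abs(d_reg[:3]))
```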

Page 45

Regression Analysis

!  Reward prediction V(t)

!  Reward prediction error δ(t)

mPFC (x = -2 mm), Insula (x = -42 mm), Striatum (z = 2)

Page 46

Tryptophan Depletion/Loading (Tanaka et al., 2007)

!  Tryptophan: precursor of serotonin !  depletion/loading affects central serotonin levels (e.g. Bjork et al. 2001, Luciana et al. 2001)

!  100 g of amino acid mixture !  experiments after 6 hours

Day 1: Tr- (no tryptophan: Depletion); Day 2: Tr0 (2.3 g of tryptophan: Control); Day 3: Tr+ (10.3 g of tryptophan: Loading)

Page 47

Behavioral Result (Schweighofer et al., 2008)

!  Extended sessions outside scanner (Control, Loading, Depletion)

Page 48

Modulation by Tryptophan Levels

Page 49

Changes in Correlation Coefficient

!  ROI (region of interest) analysis

γ = 0.6 (28, 0, -4); γ = 0.99 (16, 2, 28)

Tr- < Tr+: correlation with V at large γ in dorsal Putamen

Tr- > Tr+: correlation with V at small γ in ventral Putamen

[Bar plots: regression slopes by tryptophan condition for each ROI.]

Page 50

Microdialysis Experiment (Miyazaki et al., 2011 EJN)

[Histology: microdialysis probe placement in the dorsal raphe; scale bar 2 mm]

Page 51

Dopamine and Serotonin Responses

[Panels: serotonin and dopamine concentration time courses]

Page 52

Average over 30 minutes !   Serotonin (n=10) !   Dopamine (n=8)

Page 53

Serotonin vs. Dopamine

!  No significant positive or negative correlation (Delayed, Intermittent, and Immediate reward conditions)

Page 54

Effect of Serotonin Suppression (Miyazaki et al., 2012 JNS)

!  5-HT1A agonist in DRN !  Wait errors in long-delayed reward condition

[Panels: choice error; wait error]

Page 55

Dorsal Raphe Neuron Recording (Miyazaki et al., 2011 JNS)

!  Putative 5-HT neurons !  wider spikes

!   low firing rate

!  suppression by 5-HT1A agonist

Page 56

Delayed Tone-Food-Tone-Water Task

!  Tone, Food, Tone, Water sites !  2 ~ 20 sec delays

Page 57

[Raster plots and firing-rate histograms of a putative serotonin neuron at the food site, the water site, and for the tones before food and before water.]

Page 58

Error Trials in Extended Delay

!  Sustained firing !  Drop before wait error

[Panels a-f: spike rasters and firing rates (spikes s⁻¹) at the tone, food, and water sites over extended delays; firing during the first vs. last 2 s of reward wait and around reward-site entry, compared for Success vs. Error trials (n = 23, 43, 26, 23; *** significant, n.s. not significant), with firing dropping before the end of the reward wait on error trials.]

Page 59

Serotonin Experiments: Summary

! Microdialysis !  higher release for delayed reward !  no apparent opponency with dopamine !   lower release causes waiting errors

Consistent with the discounting hypothesis

!  Serotonin neuron recording !  higher firing during waiting !   firing stops before giving up

A more dynamic variable than a ‘parameter’

!  Question: regulation of serotonin neurons !  algorithm for regulation of patience

Page 60

Collaborators !   ATR → Tamagawa U

!  Kazuyuki Samejima ! NAIST→CalTech→ATR→Osaka U

! Saori Tanaka !   CREST → USC

!  Nicolas Schweighofer !   OIST

!  Makoto Ito !  Kayoko Miyazaki !  Katsuhiko Miyazaki !  Takashi Nakano !  Jun Yoshimoto ! Eiji Uchibe !  Stefan Elfwing

!   Kyoto PUM !  Minoru Kimura ! Yasumasa Ueda !   Hiroshima U ! Shigeto Yamawaki ! Yasumasa Okamoto !  Go Okada ! Kazutaka Ueda ! Shuji Asahi !  Kazuhiro Shishida

!  U Otago → OIST !  Jeff Wickens