TRANSCRIPT
Neural Reinforcement Learning
Peter Dayan, Gatsby Computational Neuroscience Unit
thanks to Yael Niv for some slides
Marrian Conditioning
• Ethology
  – optimality
  – appropriateness
• Psychology
  – classical/operant conditioning
• Computation
  – dynamic programming
  – Kalman filtering
• Algorithm
  – TD/delta rules
  – simple weights
• Neurobiology
  – neuromodulators; midbrain; sub-cortical; cortical structures

prediction: of important events
control: in the light of those predictions
Plan
• prediction; classical conditioning
  – Rescorla-Wagner/TD learning
  – dopamine
• action selection; instrumental conditioning
  – direct/indirect immediate actors
  – trajectory optimization
  – multiple mechanisms
  – vigour
• Bayesian decision theory and computational psychiatry
CS = Conditioned Stimulus
US = Unconditioned Stimulus
UR / CR = Unconditioned Response (reflex) / Conditioned Response (reflex)
Animals learn predictions
Ivan Pavlov
Animals learn predictions
Ivan Pavlov
[Figure: acquisition and extinction curves: % CRs over blocks of 10 trials]
very general across species, stimuli, behaviors
But do they really?
temporal contiguity is not enough - need contingency
1. Rescorla’s control
P(food | light) > P(food | no light)
But do they really?
contingency is not enough either… need surprise
2. Kamin’s blocking
But do they really?
seems like stimuli compete for learning
3. Reynolds' overshadowing
Theories of prediction learning: Goals
• Explain how the CS acquires “value”
• When (under what conditions) does this happen?
• Basic phenomena: gradual learning and extinction curves
• More elaborate behavioral phenomena
• (Neural data)
P.S. Why are we looking at old-fashioned Pavlovian conditioning?
it is the perfect uncontaminated test case for examining prediction learning on its own
error-driven learning: change in value is proportional to the difference between actual and predicted outcome
Assumptions:
1. learning is driven by error (formalizes notion of surprise)
2. summation of predictors is linear
A simple model - but very powerful!
– explains: gradual acquisition & extinction, blocking, overshadowing, conditioned inhibition, and more...
– predicted overexpectation
note: US as “special stimulus”
Rescorla & Wagner (1972)
ΔV(CS_i) = η ( r_US − Σ_j V(CS_j) )
• how does this explain acquisition and extinction?
• what would V look like with 50% reinforcement? e.g. 1 1 0 1 0 0 1 1 1 0 0
– what would V be on average after learning?
– what would the error term be on average after learning?
Rescorla-Wagner learning
V(t+1) = V(t) + η ( r(t) − V(t) )
how is the prediction on trial (t) influenced by rewards at times (t-1), (t-2), …?
Rescorla-Wagner learning
V(t+1) = V(t) + η ( r(t) − V(t) )   ⇒   V(t+1) = Σ_{i=1..t} η (1 − η)^(t−i) r(i)
[Figure: weights η (1 − η)^(t−i) on rewards from trials t−10 ... t−1, decaying exponentially into the past]
V(t+1) = (1 − η) V(t) + η r(t)
recent rewards weigh more heavily
why is this sensible?
learning rate = forgetting rate!
the R-W rule estimates expected reward using a weighted average of past rewards
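A minimal numerical sketch of this rule, assuming a 50% partial-reinforcement schedule as in the question above; the learning rate η = 0.1 and trial count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1                       # learning rate (illustrative value)
r = rng.binomial(1, 0.5, 500)   # 50% partial reinforcement schedule

V = np.zeros(len(r) + 1)
for t in range(len(r)):
    delta = r[t] - V[t]         # prediction error (surprise)
    V[t + 1] = V[t] + eta * delta

print(V[-1])                    # hovers around 0.5, the expected reward

# the same estimate, written as an exponentially weighted average of past rewards
weights = eta * (1 - eta) ** np.arange(len(r))[::-1]
print(weights @ r)              # identical to V[-1], since V starts at 0
```

After learning, V fluctuates around 0.5 and the error term averages to zero while bouncing between roughly +0.5 and −0.5 from trial to trial.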
Summary so far
Predictions are useful for behavior
Animals (and people) learn predictions (Pavlovian conditioning = prediction learning)
Prediction learning can be explained by an error-correcting learning rule (Rescorla-Wagner): predictions are learned from experiencing the world and comparing predictions to reality
Marr:
• computational level: the goal is V(CS) = E[ r_US ], i.e. minimize E[ ( r_US − V(CS) )² ]
• algorithmic level: the delta rule  ΔV(CS_i) = η ( r_US − Σ_j V(CS_j) )
But: second order conditioning
[Figure: second-order conditioning: conditioned responding vs. number of phase 2 pairings]
animals learn that a predictor of a predictor is also a predictor of reward!
not interested solely in predicting immediate reward
phase 1:
phase 2:
test: ? (what do you think will happen?)
what would Rescorla-Wagner learning predict here?
let's start over: this time from the top
Marr’s 3 levels:
• The problem: optimal prediction of future reward
• what’s the obvious prediction error?
• what’s the obvious problem with this?
V(t) = E[ Σ_{i=t..T} r(i) ]
want to predict expected sum of
future reward in a trial/episode
(N.B. here t indexes time within a trial)
obvious prediction error:  δ(t) = Σ_{i=t..T} r(i) − V(t)      (compare R-W:  δ = r_US − V(CS))
let's start over: this time from the top
Marr’s 3 levels:
• The problem: optimal prediction of future reward
V(t) = E[ Σ_{i=t..T} r(i) ]
want to predict expected sum of
future reward in a trial/episode
V(t) = E[ r(t) + r(t+1) + r(t+2) + ... + r(T) ]
     = E[ r(t) ] + E[ r(t+1) + r(t+2) + ... + r(T) ]
     = E[ r(t) + V(t+1) ]
Bellman equation for policy evaluation
let's start over: this time from the top
Marr’s 3 levels:
• The problem: optimal prediction of future reward
• The algorithm: temporal difference learning
V(t) = E[ r(t) + V(t+1) ]
V(t) ← (1 − η) V(t) + η ( r(t) + V(t+1) )
V(t) ← V(t) + η ( r(t) + V(t+1) − V(t) )
temporal difference prediction error δ(t)
compare to R-W:  V(t+1) = V(t) + η ( r(t) − V(t) )
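A minimal sketch of tabular TD(0) on a single trial type, assuming one reward delivered at the end of each 10-step trial; all numbers are illustrative:

```python
import numpy as np

T, eta, n_trials = 10, 0.2, 200    # trial length, learning rate, trials (illustrative)
V = np.zeros(T + 1)                # V[T] = 0: nothing is predicted beyond the trial

for _ in range(n_trials):
    r = np.zeros(T)
    r[-1] = 1.0                    # a single reward at the end of the trial
    for t in range(T):
        delta = r[t] + V[t + 1] - V[t]    # TD prediction error
        V[t] += eta * delta

print(np.round(V[:T], 2))          # every time point comes to predict the future reward of 1
```

Over trials the prediction propagates backwards from the time of the reward towards the start of the trial.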
Dopamine
Rewards rather than Punishments
no prediction | prediction, reward | prediction, no reward
TD error:  δ(t) = r(t) + V(t+1) − V(t)
dopamine cells in VTA/SNc (Schultz et al.)
Risk Experiment
[Figure: trial timeline: < 1 sec, 0.5 sec, 5 sec ISI, outcome "You won 40 cents", 2-5 sec ITI]
5 stimuli: 40¢, 20¢, 0/40¢, 0¢, 0¢
19 subjects (dropped 3 non-learners, N=16); 3T scanner, TR=2 sec, interleaved
234 trials: 130 choice, 104 single stimulus; randomly ordered and counterbalanced
Neural results: Prediction Errors
what would a prediction error look like (in BOLD)?
Neural results: Prediction errors in NAC
unbiased anatomical ROI in nucleus accumbens (marked per subject*)
* thanks to Laura deSouza
raw BOLD (avg over all subjects)
can actually decide between different neuroeconomic models of risk
Plan
• prediction; classical conditioning
  – Rescorla-Wagner/TD learning
  – dopamine
• action selection; instrumental conditioning
  – direct/indirect immediate actors
  – trajectory optimization
  – multiple mechanisms
  – vigour
• Bayesian decision theory and computational psychiatry
Action Selection
• Evolutionary specification
• Immediate reinforcement:
  – leg flexion
  – Thorndike puzzle box
  – pigeon; rat; human matching
• Delayed reinforcement:
  – these tasks
  – mazes
  – chess
Immediate Reinforcement
• stochastic policy: p(L), p(R)
• based on action values: m_L ; m_R
Indirect Actor
use RW rule on the chosen action's value
p_L = 0.05 ; p_R = 0.25   (r_L, r_R switch every 100 trials)
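A sketch of the indirect actor on this task; the sigmoid-of-value-difference policy and the parameters β, ε are assumptions for illustration, while the reward probabilities and 100-trial switches follow the slide:

```python
import numpy as np

rng = np.random.default_rng(1)
beta, eps = 2.0, 0.1               # inverse temperature, learning rate (assumed)
m = np.zeros(2)                    # action values m_L, m_R
p_reward = np.array([0.05, 0.25])  # p_L, p_R; switched every 100 trials

for trial in range(400):
    if trial > 0 and trial % 100 == 0:
        p_reward = p_reward[::-1]                      # contingencies reverse
    P_L = 1.0 / (1.0 + np.exp(-beta * (m[0] - m[1])))  # stochastic policy from action values
    a = 0 if rng.random() < P_L else 1
    r = float(rng.random() < p_reward[a])
    m[a] += eps * (r - m[a])                           # RW rule on the chosen action's value

print(np.round(m, 2))              # values track the (current) reward probabilities
```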
Direct Actor
E(m) = P[L] r_L + P[R] r_R
∂P[L]/∂m_L = β P[L] P[R] = −∂P[R]/∂m_L
∂E(m)/∂m_L = β P[L] P[R] ( r_L − r_R ) = β P[L] ( r_L − E(m) )
sampled version: ∂E(m)/∂m_L ≈ β ( r_L − E(m) )  if L is chosen
update when action a is taken:  m_b ← m_b + ε ( 1[a = b] − P[b] ) ( r_a − E(m) ),  b = L, R
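A matching sketch of the direct actor on the same task; the propensities follow the sampled policy gradient, with a slowly updated running average standing in for E(m) (that substitution, and all parameters, are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
beta, eps = 2.0, 0.1               # inverse temperature, learning rate (assumed)
m = np.zeros(2)                    # action propensities m_L, m_R
r_bar = 0.0                        # running estimate of E(m), used as the baseline
p_reward = np.array([0.05, 0.25])

for trial in range(400):
    if trial > 0 and trial % 100 == 0:
        p_reward = p_reward[::-1]
    P_L = 1.0 / (1.0 + np.exp(-beta * (m[0] - m[1])))
    probs = np.array([P_L, 1.0 - P_L])
    a = rng.choice(2, p=probs)
    r = float(rng.random() < p_reward[a])
    # chosen propensity moves with (1 - P[a]); the other moves with -P[b]
    m += eps * (np.eye(2)[a] - probs) * (r - r_bar)
    r_bar += 0.05 * (r - r_bar)    # slow update of the average-reward baseline

print(np.round(m, 2))              # the propensity of the currently better action grows
```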
Direct Actor
Action at a (Temporal) Distance
• learning an appropriate action at S1:
  – depends on the actions at S2 and S3
  – gains no immediate feedback
• idea: use prediction as surrogate feedback
[Figure: maze with start state S1 branching to S2 and S3; leaf rewards 0, 4, 2, 2]
Direct Action Propensities
[Figure: acquisition of values and prediction errors at S1, S2, S3]
δ(t) = r(t) + V(t+1) − V(t)
start with policy: p(L; S) from propensities m(S; L), m(S; R); evaluate it: V(S1), V(S2), V(S3)
improve it: adjust m(S1; L) vs m(S1; R) using δ; thus choose L more frequently than R
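A sketch of this actor-critic loop on the three-state maze, assuming the leaf rewards 0, 4, 2, 2 from the figure are assigned as (S2: L=0, R=4) and (S3: L=2, R=2); this assignment, and all parameters, are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
eps, beta = 0.2, 1.0
rewards = {("S2", 0): 0.0, ("S2", 1): 4.0,        # assumed assignment of the leaf rewards
           ("S3", 0): 2.0, ("S3", 1): 2.0}
V = {s: 0.0 for s in ("S1", "S2", "S3")}          # critic: state values
m = {s: np.zeros(2) for s in ("S1", "S2", "S3")}  # actor: action propensities (L, R)

def policy(s):
    p = np.exp(beta * m[s])
    return p / p.sum()

for _ in range(2000):
    # at S1: L leads to S2, R leads to S3, no immediate reward
    probs = policy("S1")
    a1 = rng.choice(2, p=probs)
    s_next = "S2" if a1 == 0 else "S3"
    delta = 0.0 + V[s_next] - V["S1"]             # prediction serves as surrogate feedback
    V["S1"] += eps * delta
    m["S1"] += eps * delta * (np.eye(2)[a1] - probs)
    # at S2 or S3: the terminal choice delivers the leaf reward
    probs = policy(s_next)
    a2 = rng.choice(2, p=probs)
    delta = rewards[(s_next, a2)] - V[s_next]
    V[s_next] += eps * delta
    m[s_next] += eps * delta * (np.eye(2)[a2] - probs)

print({s: round(v, 1) for s, v in V.items()})     # V(S2) -> ~4, V(S3) -> 2, V(S1) -> ~V(S2)
```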
Policy
• spiraling links between striatum and dopamine system: ventral to dorsal
if δ > 0:
• value is too pessimistic
• action is better than average
[Figure: actor-critic loop over S1, S2, S3; cortical-striatal spiral: vmPFC; OFC/dACC; dPFC; SMA; Mx]
Variants: SARSA
Q*(1, C) = E[ r(t) + V*(x(t+1)) | x(t) = 1, u(t) = C ]
Q(1, C) ← Q(1, C) + ε ( r(t) + Q(2, u_actual) − Q(1, C) )
Morris et al, 2006
Variants: Q learning
Q*(1, C) = E[ r(t) + V*(x(t+1)) | x(t) = 1, u(t) = C ]
Q(1, C) ← Q(1, C) + ε ( r(t) + max_u Q(2, u) − Q(1, C) )
Roesch et al, 2007
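A side-by-side sketch of the two update rules on a two-step task with states labelled 1 and 2 as in the equations above; rewards, the softmax policy, and all parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
eps, beta = 0.1, 3.0
rewards = np.array([0.0, 1.0])     # reward for each action taken in state 2 (assumed)

def softmax_choice(q_row):
    p = np.exp(beta * q_row)
    p /= p.sum()
    return rng.choice(len(q_row), p=p)

def run(q_learning):
    Q = np.zeros((3, 2))           # rows: dummy terminal state 0, state 1, state 2
    for _ in range(2000):
        a1 = softmax_choice(Q[1])  # state 1: no reward, always moves to state 2
        a2 = softmax_choice(Q[2])  # state 2: reward depends on the action, then trial ends
        if q_learning:
            target = Q[2].max()    # Q-learning: best available action in the next state
        else:
            target = Q[2, a2]      # SARSA: the action actually taken next
        Q[1, a1] += eps * (0.0 + target - Q[1, a1])
        Q[2, a2] += eps * (rewards[a2] - Q[2, a2])
    return Q[1:]

print("Q-learning:\n", np.round(run(True), 2))
print("SARSA:\n", np.round(run(False), 2))
```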
Summary
• prediction learning
  – Bellman evaluation:  V*(1) = E[ r(t) + V*(x(t+1)) | x(t) = 1 ]
• actor-critic
  – asynchronous policy iteration
• indirect method (Q learning)
  – asynchronous value iteration:  Q*(1, C) = E[ r(t) + V*(x(t+1)) | x(t) = 1, u(t) = C ]
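For contrast with the learning rules, a sketch of asynchronous value iteration applied directly to the Bellman optimality equation on a tiny, made-up deterministic MDP (states, rewards, and transitions are all illustrative):

```python
# tiny episodic MDP: from states 1 and 2 each of two actions moves deterministically
# to a next state (0 = terminal) and pays an immediate reward (all numbers assumed)
next_state = {1: [2, 0], 2: [0, 0]}             # next_state[s][a]
reward     = {1: [0.0, 0.5], 2: [2.0, 1.0]}     # reward[s][a]

V = {0: 0.0, 1: 0.0, 2: 0.0}
for _ in range(20):
    for s in (2, 1):                            # in-place ("asynchronous") backward sweep
        V[s] = max(reward[s][a] + V[next_state[s][a]] for a in (0, 1))

print(V)    # V[2] = 2.0, V[1] = max(0 + 2.0, 0.5 + 0) = 2.0
```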
Three Decision Makers
• tree search
• position evaluation
• situation memory
Tree-Search/Model-Based System
• Tolmanian forward model
• forwards/backwards tree search
• motivationally flexible
• OFC; dlPFC; dorsomedial striatum; BLA?
• statistically efficient
• computationally catastrophic
Habit/Model-Free System
• works by minimizing inconsistency:
  V(x(t)) ≈ r(t) + r(t+1) + ...  so  V(x(t)) ≈ r(t) + V(x(t+1))
  δ(t) = r(t) + V(x(t+1)) − V(x(t))
• model-free, cached
• dopamine; CEN?
• dorsolateral striatum
• motivationally insensitive
• statistically inefficient
• computationally congenial
Need a Canary...
• if a → c and c → £££, then do more of a or b?
  – MB: b
  – MF: a (or even no effect)
[Figure: actions a and b; outcome c]
Behaviour
• action values depend on both systems:
  Q_tot(x, a) = w Q_MB(x, a) + (1 − w) Q_MF(x, a)
• expect that the weight w will vary by subject (but be fixed)
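Read concretely (the weight symbol w is an assumption, since the original symbol did not survive extraction), the mixture is just:

```python
def q_total(q_mb, q_mf, w):
    """Per-subject mixture of model-based and model-free action values;
    w is the fixed, subject-specific weight on the model-based system."""
    return w * q_mb + (1.0 - w) * q_mf

# e.g. a strongly model-based subject (w = 0.8; numbers are illustrative):
print(q_total(q_mb=1.0, q_mf=0.2, w=0.8))   # 0.84
```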
Neural Prediction Errors (12)
• note that MB RL does not use this prediction error – training signal?
[Figure: right ventral striatum ROI (anatomical definition)]
Pavlovian Control
Keay & Bandler, 2001
Vigour
• Two components to choice:
  – what:
    • lever pressing
    • direction to run
    • meal to choose
  – when/how fast/how vigorous
    • free operant tasks
    • real-valued DP
The model
[Figure: free-operant task. In each state (S0, S1, S2, ...) the animal chooses (action, τ), e.g. (LP, τ1) then (LP, τ2), from LP, NP, Other; each choice incurs costs (unit cost C_u, vigour cost C_v depending on how fast) and earns rewards (unit reward U_R with probability P_R at the goal)]
The model
[Figure: same free-operant task as above: choose (action, τ) at S0, S1, S2, with costs and rewards per choice]
Goal: Choose actions and latencies to maximize the average rate of return (rewards minus costs per time)
ARL (average reward RL)
Compute differential values of actions
Differential value of taking action L
with latency τ when in state x
ρ = average rewards
minus costs, per unit time
• steady state behavior (not learning dynamics)
(Extension of Schwartz 1993)
Q_{L,τ}(x) = Rewards – Costs + Future Returns
Average Reward RL
Q_{L,τ}(x) = Rewards − C_u − C_v/τ − τρ + V(x')
Average Reward Cost/benefit Tradeoffs

1. Which action to take?
⇒ choose the action with the largest expected reward minus cost

2. How fast to perform it?
• slow → less costly (vigour cost)
• slow → delays (all) rewards; net rate of rewards = cost of delay (opportunity cost of time)
⇒ choose the rate that balances vigour and opportunity costs
explains faster (irrelevant) actions under hunger, etc
masochism
Optimal response rates
[Figure: experimental data (Niv, Dayan, Joel, unpublished) vs. model simulation: probability of the 1st nose poke vs. seconds since reinforcement, and LP / NP response rates per minute]
Effects of motivation (in the model): RR25
[Figure: mean latency of LP and Other responses under low vs. high utility: energizing effect]
Q(x, u, τ) = p_r U_r − C_u − C_v/τ − τρ + V(x')
∂Q(x, u, τ)/∂τ = C_v/τ² − ρ = 0   ⇒   τ_opt = √( C_v / ρ )
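A sketch of the resulting latency choice; the function just evaluates τ_opt = √(C_v / ρ), with illustrative numbers:

```python
import math

def optimal_latency(vigour_cost, avg_reward_rate):
    """tau_opt = sqrt(C_v / rho): latency balancing the vigour cost against the
    opportunity cost of time (illustrative parameterization)."""
    return math.sqrt(vigour_cost / avg_reward_rate)

for rho in (0.5, 1.0, 2.0):        # a higher net reward rate energizes responding
    print(rho, round(optimal_latency(vigour_cost=1.0, avg_reward_rate=rho), 2))
```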
Relation to Dopamine
Phasic dopamine firing = reward prediction error
What about tonic dopamine?
Tonic dopamine hypothesis
Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992
[Figure: firing rate; reaction time]
…also explains effects of phasic dopamine on response times
Plan
• prediction; classical conditioning
  – Rescorla-Wagner/TD learning
  – dopamine
• action selection; instrumental conditioning
  – direct/indirect immediate actors
  – trajectory optimization
  – multiple mechanisms
  – vigour
• Bayesian decision theory and computational psychiatry
The Computational Level
• (approximate) Bayesian decision theory
  – states (sensory cortex; PFC)
    • ignorant: have to represent and infer
    • combine priors & likelihoods
  – actions (striatum; (pre-)motor cortex; PFC)
    • catastrophic calculations
    • short-cuts from habitization; evolutionary programming
  – utilities (hypothalamus; VTA/SNc; OFC)
    • state-dependent
    • some approximations can't keep up
choice: action that maximises the expected utility over the states
How Does the Brain Fail to Work?
• wrong problem
  – priors; likelihoods; utilities
• right problem, wrong inference
  – e.g., early/over-habitization
  – over-reliance on Pavlovian mechanisms
• right problem, right inference, wrong environment
  – learned helplessness
  – discounting from inconsistency
not completely independent: e.g., miscalibration
Conditioning
• Ethology
  – optimality
  – appropriateness
• Psychology
  – classical/operant conditioning
• Computation
  – dynamic programming
  – Kalman filtering
• Algorithm
  – TD/delta rules
  – simple weights
• Neurobiology
  – neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

prediction: of important events
control: in the light of those predictions