reinforcement learning 2: action selection peter dayan (thanks to nathaniel daw)
TRANSCRIPT
Reinforcement learning 2: action selection
Peter Dayan
(thanks to Nathaniel Daw)
Global plan
• Reinforcement learning I (Wednesday): prediction
  – classical conditioning
  – dopamine
• Reinforcement learning II: dynamic programming; action selection
  – sequential sensory decisions
  – vigor
  – Pavlovian misbehaviour
• Chapter 9 of Theoretical Neuroscience
Learning and Inference
• Learning: predict; control
  ∆ weight = (learning rate) x (error) x (stimulus)
  – dopamine: phasic prediction error for future reward
  – serotonin: phasic prediction error for future punishment
  – acetylcholine: expected uncertainty boosts learning
  – norepinephrine: unexpected uncertainty boosts learning
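The ∆weight rule above is easy to make concrete. A minimal sketch in Python (the variable names, stimulus coding and learning rate are illustrative assumptions, not from the slides):

```python
import numpy as np

def delta_rule_update(w, x, r, learning_rate=0.1):
    """One delta-rule / Rescorla-Wagner step.

    w: weight vector (one weight per stimulus element)
    x: stimulus vector (1 if a stimulus element is present, else 0)
    r: observed reward/outcome on this trial
    """
    prediction = w @ x                       # linear prediction from the stimuli
    error = r - prediction                   # prediction error
    return w + learning_rate * error * x     # d(weight) = rate * error * stimulus

# illustrative trials: a single stimulus repeatedly paired with reward 1
w = np.zeros(1)
for _ in range(20):
    w = delta_rule_update(w, np.array([1.0]), 1.0)
print(w)   # approaches 1 as the prediction error shrinks
```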
Action Selection
• Evolutionary specification
• Immediate reinforcement:
  – leg flexion
  – Thorndike puzzle box
  – pigeon; rat; human matching
• Delayed reinforcement:
  – these tasks
  – mazes
  – chess
Bandler; Blanchard
Immediate Reinforcement
• stochastic policy: P[L; m] = σ(β(m_L − m_R)) = 1/(1 + exp(−β(m_L − m_R)))
• based on action values: m_L; m_R

Indirect Actor
use the RW rule on the value of the chosen action:
  m_L → m_L + ε(r_L − m_L) if L is chosen (and likewise for m_R)
reward probabilities p_L = 0.05; p_R = 0.25, switching every 100 trials

Direct Actor
expected reward: E(m) = P[L]⟨r_L⟩ + P[R]⟨r_R⟩
∂E(m)/∂m_L = β P[L] P[R] (⟨r_L⟩ − ⟨r_R⟩)
stochastic gradient ascent on E(m):
  m_L → m_L + ε(1 − P[L]) r_L  if L is chosen
  m_L → m_L − ε P[L] r_R       if R is chosen
Direct Actor
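Both actors can be sketched on the two-armed bandit just described (p_L = 0.05, p_R = 0.25, switching every 100 trials). The softmax temperature β, learning rate ε and trial count below are assumed values for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_bandit(actor, n_trials=400, eps=0.1, beta=2.0):
    """Two-armed bandit whose reward probabilities switch every 100 trials
    (p_L = 0.05, p_R = 0.25 as on the slide). Returns the sequence of choices."""
    m = np.zeros(2)                      # action values / propensities: [L, R]
    choices = np.zeros(n_trials, int)
    for t in range(n_trials):
        p = (0.05, 0.25) if (t // 100) % 2 == 0 else (0.25, 0.05)
        p_L = 1.0 / (1.0 + np.exp(-beta * (m[0] - m[1])))   # sigmoidal policy
        a = 0 if rng.random() < p_L else 1                   # 0 = L, 1 = R
        r = 1.0 if rng.random() < p[a] else 0.0
        if actor == "indirect":
            # Rescorla-Wagner rule on the value of the chosen action only
            m[a] += eps * (r - m[a])
        else:
            # direct actor: stochastic policy-gradient step on the propensities
            probs = np.array([p_L, 1 - p_L])
            m += eps * r * (np.eye(2)[a] - probs)
        choices[t] = a
    return choices

for actor in ("indirect", "direct"):
    c = run_bandit(actor)
    print(actor, "fraction of R choices in the first block:", c[:100].mean())
```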
Could we Tell?
• correlate past rewards, actions with present choice
• indirect actor (separate clocks):
• direct actor (single clock):
Matching: Concurrent VI-VI
Lau, Glimcher, Corrado, Sugrue, Newsome
Matching
• income not return
• approximately exponential in r
• alternation choice kernel
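One way to see matching to income on a concurrent VI-VI schedule is a toy simulation in which choice is allocated in proportion to a leaky (approximately exponential) estimate of recent income on each side. The schedule rates and time constant are assumptions, and this is an illustration of matching rather than the Lau/Glimcher/Corrado/Sugrue/Newsome analyses themselves:

```python
import numpy as np

rng = np.random.default_rng(1)

def concurrent_vi(rates=(0.05, 0.15), n_trials=20000, tau=20.0):
    """Concurrent VI-VI: each side 'baits' independently with the given per-step
    probability and holds the reward until collected.  The agent chooses in
    proportion to an exponentially weighted estimate of recent income on each
    side (time constant tau, an assumed value)."""
    baited = np.zeros(2, bool)
    income = np.zeros(2)              # leaky average of obtained rewards (income)
    n_choices = np.zeros(2)
    n_rewards = np.zeros(2)
    for _ in range(n_trials):
        baited |= rng.random(2) < rates           # the VI schedules keep baiting
        p = (income + 1e-6) / (income.sum() + 2e-6)
        a = rng.choice(2, p=p)
        r = 1.0 if baited[a] else 0.0
        baited[a] = False
        income *= (1 - 1 / tau)
        income[a] += r / tau
        n_choices[a] += 1
        n_rewards[a] += r
    return n_choices, n_rewards

choices, rewards = concurrent_vi()
print("choice ratio :", choices[1] / choices.sum())
print("income ratio :", rewards[1] / rewards.sum())   # the two ratios approximately agree (matching)
```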
Action at a (Temporal) Distance
• learning an appropriate action at u=1:
  – depends on the actions at u=2 and u=3
  – gains no immediate feedback
• idea: use prediction as surrogate feedback
Action Selection
TD error: δ(t) = r(t) + v(t+1) − v(t)
start with policy: P[L; u] = σ(β(m_L(u) − m_R(u)))
evaluate it: v(1), v(2), v(3)
improve it: use δ(t) to adjust the propensities m_L(u), m_R(u)
thus m_L grows, and L is chosen more frequently than R
[figure: three-state maze with values around 0.125, −0.125 and 0.025-0.175]
Policy
if δ > 0:
• the value is too pessimistic
• the action is better than average
• so increase both the value v and the propensity of the chosen action (and conversely for δ < 0)
actor/critic
[figure: actor/critic architecture: a critic learning values v and an actor with propensities m1, m2, m3, …, mn]
dopamine signals to both motivational & motor striatum appear, surprisingly, the same
suggestion: training both values & policies (see the sketch below)
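A minimal actor/critic sketch of this policy-iteration idea, on a hypothetical three-state decision tree standing in for the maze. The states, rewards and parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical tree: from state 1, L leads to state 2 and R to state 3;
# from states 2 and 3 each action yields an (assumed) terminal reward.
NEXT = {1: {"L": 2, "R": 3}}
REWARD = {2: {"L": 4.0, "R": 0.0}, 3: {"L": 0.0, "R": 2.0}}

def softmax_policy(m, beta=1.0):
    e = np.exp(beta * (m - m.max()))
    return e / e.sum()

def actor_critic(n_episodes=2000, eps=0.05, beta=1.0):
    v = {s: 0.0 for s in (1, 2, 3)}               # critic: state values
    m = {s: np.zeros(2) for s in (1, 2, 3)}       # actor: propensities [L, R]
    for _ in range(n_episodes):
        s = 1
        while True:
            p = softmax_policy(m[s], beta)
            a = rng.choice(2, p=p)
            action = "LR"[a]
            if s in REWARD:                                    # terminal transition
                r, v_next, done = REWARD[s][action], 0.0, True
            else:
                r, v_next, done = 0.0, v[NEXT[s][action]], False
            delta = r + v_next - v[s]                          # TD error
            v[s] += eps * delta                                # critic update
            m[s][a] += eps * delta * (1 - p[a])                # actor update (chosen action)
            m[s][1 - a] -= eps * delta * p[1 - a]              #   and the other action
            if done:
                break
            s = NEXT[s][action]
    return v, m

v, m = actor_critic()
print({s: round(x, 2) for s, x in v.items()})
print("P[L at state 1] ≈", round(softmax_policy(m[1])[0], 2))   # should favour L
```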
Variants: SARSA
Q*(1,C) = E[ r_t + v(u_{t+1}) | u_t = 1, a_t = C ]
Q(1,C) → Q(1,C) + ε [ r_t + Q(2, a_actual) − Q(1,C) ]
Morris et al, 2006
Variants: Q learning
Q*(1,C) = E[ r_t + v(u_{t+1}) | u_t = 1, a_t = C ]
Q(1,C) → Q(1,C) + ε [ r_t + max_a Q(2,a) − Q(1,C) ]
Roesch et al, 2007
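The two update rules differ only in how they bootstrap from the next state: SARSA uses the action actually taken (on-policy), Q learning the best available action (off-policy). A sketch of the tabular updates, undiscounted as in the formulas above; the toy states, actions and reward are assumed:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, eps=0.1):
    """On-policy update: bootstrap from the action actually taken next."""
    target = r + Q[s_next, a_next]
    Q[s, a] += eps * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, eps=0.1):
    """Off-policy update: bootstrap from the best available next action."""
    target = r + Q[s_next].max()
    Q[s, a] += eps * (target - Q[s, a])

# toy usage: 4 states x 2 actions, one hypothetical transition 1 -(C=0)-> 2 with r=1
Q_sarsa, Q_q = np.zeros((4, 2)), np.zeros((4, 2))
sarsa_update(Q_sarsa, s=1, a=0, r=1.0, s_next=2, a_next=1)
q_learning_update(Q_q, s=1, a=0, r=1.0, s_next=2)
print(Q_sarsa[1], Q_q[1])
```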
Summary
• prediction learning– Bellman evaluation
• actor-critic– asynchronous policy iteration
• indirect method (Q learning)– asynchronous value iteration
v*(1) = E[ r_t + v*(u_{t+1}) | u_t = 1 ]
Q*(1,C) = E[ r_t + v*(u_{t+1}) | u_t = 1, a_t = C ]
Sensory Decisions as Optimal Stopping
• consider listening to:
• decision: choose, or sample
Optimal Stopping
• the equivalent of state u=1 is the first observation n_1
• and of states u=2,3 is the accumulated evidence (n_1 + n_2)/2
[figure: decision tree with actions L, R (choose) and C (sample again), with the associated rewards r]
Transition Probabilities
Evidence Accumulation
Gold & Shadlen, 2007
Current Topics
• Vigour & tonic dopamine
• Priors over decision problems (LH)
• Pavlovian-instrumental interactions
– impulsivity
– behavioural inhibition
– framing
• Model-based, model-free and episodic control
• Exploration vs exploitation
• Game theoretic interactions (inequity aversion)
Vigour
• Two components to choice:
  – what:
    • lever pressing
    • direction to run
    • meal to choose
  – when / how fast / how vigorous
• free operant tasks
• real-valued DP
The model
[figure: free-operant task. From state S0 the agent chooses (action, latency), e.g. choose(action, τ) = (LP, τ1), then again at S1 and S2; each choice incurs costs (a unit cost UC plus a vigour cost C that is larger the faster the response) and can yield rewards; lever presses (LP) compete with nose pokes (NP) and Other actions, and a reward of utility UR is delivered probabilistically (PR) at the goal]
The model
[figure repeated: choose(action, τ) = (LP, τ1) and then (LP, τ2), with the associated costs and rewards, through states S0, S1, S2]
Goal: Choose actions and latencies to maximize the average rate of return (rewards minus costs per time)
Average Reward RL (ARL)
Compute differential values of actions
Differential value of taking action L with latency τ when in state x
ρ = average rewards minus costs, per unit time
• steady state behavior (not learning dynamics)
(Extension of Schwartz 1993)
Average Reward RL:
Q_{L,τ}(x) = Rewards − Costs + Future Returns = R − C_u − C_v/τ − ρτ + V(x')
Average Reward Cost/benefit Tradeoffs
1. Which action to take?
   • choose the action with the largest expected reward minus cost
2. How fast to perform it?
   • slow → less costly (lower vigour cost)
   • but slow → delays (all) rewards
   • net rate of rewards = cost of delay (opportunity cost of time)
   • choose the rate that balances the vigour and opportunity costs
explains faster (irrelevant) actions under hunger, etc.
masochism
Optimal response rates
[figure: experimental data (Niv, Dayan, Joel, unpublished) vs model simulation: probability of the 1st nose poke, and response rates per minute (1st NP, LP), as a function of seconds since reinforcement]
Effects of motivation (in the model)
[figure: energizing effect on an RR25 schedule: mean latencies of LP and Other responses for low vs high utility]
Q(S, a, τ) = p_r U_r − C_u − C_v/τ − ρτ + V(S')
∂Q(S, a, τ)/∂τ = C_v/τ² − ρ = 0  ⇒  τ_opt = √(C_v / ρ)
so the optimal latency depends on the vigour cost and the average reward rate ρ: higher motivation (larger ρ) energizes all responses
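The τ_opt = √(C_v/ρ) result can be checked numerically; the cost and rate values below are made-up illustrations:

```python
import numpy as np

def optimal_latency(C_v, rho):
    """tau* = sqrt(C_v / rho): the latency that balances the vigour cost C_v/tau
    against the opportunity cost rho*tau of time."""
    return np.sqrt(C_v / rho)

def q_value(tau, p_r=1.0, U_r=1.0, C_u=0.1, C_v=0.5, rho=0.2, V_next=0.0):
    """Differential value of responding with latency tau (assumed example numbers)."""
    return p_r * U_r - C_u - C_v / tau - rho * tau + V_next

C_v = 0.5
for rho in (0.1, 0.2, 0.4):      # higher average reward rate, e.g. higher motivation
    tau = optimal_latency(C_v, rho)
    print(f"rho={rho:.1f}  tau*={tau:.2f}  Q(tau*)={q_value(tau, C_v=C_v, rho=rho):.2f}")
# tau* shrinks as rho grows: the energizing effect of motivation
```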
Effects of motivation (in the model)
[figure: response rate per minute vs seconds from reinforcement for UR 50% and RR25 schedules, low vs high utility, showing (1) a directing effect on which response is made and (2) an energizing effect on the mean LP and Other latencies]
What about tonic dopamine?
Relation to Dopamine
• Phasic dopamine firing = reward prediction error
• Tonic dopamine = average reward rate
• NB: the phasic signal remains the RPE for choice/value learning
Aberman and Salamone 1999
[figure: # LPs in 30 minutes as a function of the ratio requirement (1, 4, 16, 64), control vs DA-depleted rats, alongside a matching model simulation]
1. explains pharmacological manipulations
2. dopamine control of vigour through BG pathways
• eating time confound
• context/state dependence (motivation & drugs?)
• less switching = perseveration
Tonic dopamine hypothesis
[figure: firing rate vs reaction time (Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992)]
…also explains effects of phasic dopamine on response times
Pavlovian & Instrumental Conditioning
• Pavlovian
  – learning values and predictions
  – using the TD error
• Instrumental
  – learning actions:
    • by reinforcement (leg flexion)
    • by the (TD) critic
  – (actually different forms: goal-directed & habitual)
Pavlovian-Instrumental Interactions
• synergistic
  – conditioned reinforcement
  – Pavlovian-instrumental transfer
    • Pavlovian cue predicts the instrumental outcome
    • behavioural inhibition to avoid aversive outcomes
• neutral
  – Pavlovian-instrumental transfer
    • Pavlovian cue predicts an outcome with the same motivational valence
• opponent
  – Pavlovian-instrumental transfer
    • Pavlovian cue predicts the opposite motivational valence
  – negative automaintenance
-ve Automaintenance in Autoshaping
• simple choice task
  – N: nogo gives reward r=1
  – G: go gives reward r=0
• learn three quantities
  – the average value
  – the Q value for N
  – the Q value for G
• the instrumental propensity is based on the Q-value advantages
-ve Automaintenance in Autoshaping
• Pavlovian action
  – assert: the Pavlovian impetus towards G is v(t)
  – weight the Pavlovian and instrumental advantages by ω, the competitive reliability of Pavlov
• new propensities
• new action choice
-ve Automaintenance in Autoshaping
• basic –ve automaintenance effect (μ=5)
• lines are theoretical asymptotes
• equilibrium probabilities of action
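A toy sketch of how the ω-weighted competition can produce negative automaintenance: the Pavlovian impetus towards "go" (proportional to the learned value v) trades off against the instrumental Q values. All numbers are assumptions and the update rules are simplified relative to the actual model:

```python
import numpy as np

rng = np.random.default_rng(4)

def neg_automaintenance(omega, n_trials=5000, eps=0.05, beta=5.0):
    """'go' (G) earns nothing, 'nogo' (N) earns r=1, but a Pavlovian impetus
    proportional to the predicted value v pushes towards G; omega weights the
    Pavlovian against the instrumental influences."""
    v = 0.0                    # average (Pavlovian) value prediction
    q = np.array([0.0, 0.0])   # instrumental Q values: [G, N]
    p_go_trace = []
    for _ in range(n_trials):
        propensity = (1 - omega) * q + omega * np.array([v, 0.0])
        p_go = 1.0 / (1.0 + np.exp(-beta * (propensity[0] - propensity[1])))
        a = 0 if rng.random() < p_go else 1
        r = 0.0 if a == 0 else 1.0
        q[a] += eps * (r - q[a])        # instrumental learning
        v += eps * (r - v)              # Pavlovian value learning
        p_go_trace.append(p_go)
    return np.mean(p_go_trace[-1000:])

for omega in (0.0, 0.2, 0.5, 0.8):
    print(f"omega={omega:.1f}  asymptotic P(go) ≈ {neg_automaintenance(omega):.2f}")
# larger omega -> more Pavlovian 'go' responding, even though it forgoes reward
```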
Impulsivity & Hyperbolic Discounting
• humans (and animals) show impulsivity in:
  – diets
  – addiction
  – spending, …
• intertemporal conflict between short- and long-term choices
• often explained via hyperbolic discount functions
• an alternative is a Pavlovian imperative towards an immediate reinforcer
• framing, trolley dilemmas, etc.
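The preference-reversal signature of hyperbolic (but not exponential) discounting can be shown with made-up amounts and delays:

```python
def exponential(value, delay, gamma=0.9):
    return value * gamma**delay

def hyperbolic(value, delay, k=1.0):
    return value / (1.0 + k * delay)

# small-sooner vs large-later reward (assumed example amounts and delays)
small, large = 2.0, 5.0
for t in (0, 2, 6, 10):                  # time remaining until the small reward
    d_small, d_large = t, t + 3          # the large reward arrives 3 steps later
    exp_pref = "small" if exponential(small, d_small) > exponential(large, d_large) else "large"
    hyp_pref = "small" if hyperbolic(small, d_small) > hyperbolic(large, d_large) else "large"
    print(f"delay {t:2d}: exponential prefers {exp_pref}, hyperbolic prefers {hyp_pref}")
# exponential discounting never reverses its preference; hyperbolic discounting
# flips to the small-sooner reward once it is imminent -- one account of impulsivity
```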
Kalman Filter
• Markov random walk (or OU process)
• no punctate changes
• additive model of combination
• forward inference

Kalman Posterior
[figure/equation: the Gaussian posterior over the weights, whose uncertainty shrinks as evidence accumulates]
Assumed Density KF
• Rescorla-Wagner error correction
• competitive allocation of learning
  – P&H, M
Blocking
• forward blocking: error correction
• backward blocking: negative off-diagonal terms in the posterior covariance
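A minimal Kalman-filter sketch of backward blocking (the drift and observation-noise variances are assumed): train A and B together, then A alone; the negative off-diagonal covariance acquired in stage 1 pulls the weight of B down in stage 2 even though B is absent:

```python
import numpy as np

def kalman_step(w, S, x, r, tau2=0.01, sigma2=0.5):
    """One Kalman-filter update for a linear-Gaussian conditioning model:
    r = w.x + noise, with the weights w drifting as a random walk.
    w: posterior mean over the weights; S: posterior covariance."""
    S = S + tau2 * np.eye(len(w))          # diffusion of the weights
    x = np.asarray(x, float)
    k = S @ x / (x @ S @ x + sigma2)       # Kalman gain (per-stimulus learning rates)
    w = w + k * (r - w @ x)                # error-correcting update
    S = S - np.outer(k, x @ S)             # posterior covariance shrinks
    return w, S

w, S = np.zeros(2), np.eye(2)
for _ in range(20):                        # stage 1: A and B together -> reward
    w, S = kalman_step(w, S, [1, 1], 1.0)
for _ in range(20):                        # stage 2: A alone -> reward
    w, S = kalman_step(w, S, [1, 0], 1.0)
print(np.round(w, 2))   # backward blocking: w_B falls below its stage-1 value,
                        # via the negative off-diagonal term of S
```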
Mackintosh vs P&H
• under the diagonal approximation:
• for slow learning,
  – effect like Mackintosh
predictor competition
Summary
stimulus/correlation re-representation (Daw)
• Kalman filter models many standard conditioning paradigms
• elements of RW, Mackintosh, P&H
• but:
  – downwards unblocking
  – negative patterning: L→r; T→r; L+T→·
  – recency vs primacy (Kruschke)
Uncertainty (Yu)
• expected uncertainty - ignorance
  – amygdala, cholinergic basal forebrain for conditioning
  – ?basal forebrain for top-down attentional allocation
• unexpected uncertainty - 'set' change
  – noradrenergic locus coeruleus
• part opponent; part synergistic interaction
ACh & NE have distinct behavioral effects:
• ACh boosts learning to stimuli with uncertain consequences
• NE boosts learning upon encountering global changes in the environment
(e.g. Bear & Singer, 1986; Kilgard & Merzenich, 1998)
ACh & NE have similar physiological effects
• suppress recurrent & feedback processing
• enhance thalamocortical transmission
• boost experience-dependent plasticity(e.g. Gil et al, 1997)
(e.g. Kimura et al, 1995; Kobayashi et al, 2000)
Experimental Data
(e.g. Bucci, Holland, & Gallagher, 1998)
(e.g. Devauges & Sara, 1990)
Model Schematics
[figure: sensory inputs x feed bottom-up processing, context z feeds top-down processing, and both converge on cortical processing of y for prediction, learning, ...; ACh reports expected uncertainty and NE unexpected uncertainty, setting the balance between the two streams]

Attention
attentional selection for (statistically) optimal processing, above and beyond the traditional view of resource constraint
Example 1: Posner’s Task
[figure: trial sequence of cue, target and response (intervals 0.2-0.5s, 0.1s, 0.1s, 0.15s); the cue carries sensory input about the stimulus location with high or low validity (Phillips, McAlonan, Robb, & Brown, 2000)]
generalize to the case that cue identity changes with no notice
Formal Framework
• cues: vestibular, visual, ... (c1, c2, c3, c4)
• target S: stimulus location, exit direction, ...
• variability in quality of the relevant cue → ACh
• variability in identity of the relevant cue → NE
• Sensory Information: avoid representing the full uncertainty
[equations: approximate (assumed-density) posterior over the validity of the relevant cue and over which cue is relevant]
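A highly simplified sketch of the Posner-task logic (not the full Yu & Dayan model): the cue sets a prior whose strength reflects the assumed cue validity, noisy evidence then accumulates to a decision threshold, and the validity effect is the difference in decision times between invalid and valid trials. Treating higher ACh as a lower assumed validity shrinks the validity effect. All parameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def reaction_time(cue_correct, assumed_validity, d=0.4, threshold=0.95, max_t=200):
    """Steps of noisy evidence needed to reach a decision about the target side.
    The cue sets the prior P(cued side) = assumed_validity; higher ACh is
    modelled here as a lower assumed validity (more expected uncertainty)."""
    log_odds = np.log(assumed_validity / (1 - assumed_validity))
    log_odds = log_odds if cue_correct else -log_odds
    for t in range(1, max_t + 1):
        log_odds += 2 * d * rng.normal(d, 1.0)     # evidence for the true side
        p_true = 1 / (1 + np.exp(-log_odds))
        if p_true > threshold or p_true < 1 - threshold:
            return t                               # decision reached (correct or not)
    return max_t

def validity_effect(assumed_validity, n=2000):
    rt_valid = np.mean([reaction_time(True, assumed_validity) for _ in range(n)])
    rt_invalid = np.mean([reaction_time(False, assumed_validity) for _ in range(n)])
    return rt_invalid - rt_valid

for gamma in (0.9, 0.75, 0.6):      # lower assumed validity ~ higher ACh
    print(f"assumed validity {gamma:.2f}: validity effect ≈ {validity_effect(gamma):.1f} steps")
# the validity effect shrinks as the assumed cue validity falls (ACh increases)
```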
Simulation Results: Posner’s Task
• vary cue validity ↔ vary ACh; fix the relevant cue ↔ low NE
• in the model, the validity effect (VE) decreases with both the ACh and NE levels
[figure: simulated validity effect as ACh is increased or decreased (60-140% of the normal level), compared with nicotine and scopolamine dose-response data (Phillips, McAlonan, Robb, & Brown, 2000)]
Maze Task (example 2: attentional shift)
[figure: maze with two cues marking reward: initially cue 1 is relevant and cue 2 irrelevant; the assignment then reverses (Devauges & Sara, 1990)]
no issue of validity
Simulation Results: Maze Navigation
• fix cue validity; no explicit manipulation of ACh
• change the relevant cue ↔ NE
[figure: % rats reaching criterion vs number of days after the shift from the spatial to the visual task: experimental data (Devauges & Sara, 1990) and model data]
Simulation Results: Full Model
• true & estimated relevant stimuli
• neuromodulation in action
• validity effect (VE) across trials

Simulated Psychopharmacology
• 50% NE
• 50% ACh/NE
• ACh compensation
• NE can nearly catch up
Inter-trial Uncertainty
• single framework for understanding ACh, NE and some aspects of attention
• ACh/NE as expected/unexpected uncertainty signals
• experimental psychopharmacological data replicated by model simulations
• implications from complex interactions between ACh & NE
• predictions at the cellular, systems, and behavioral levels
• activity vs weight vs neuromodulatory vs population representations of uncertainty
Aston-Jones: Target Detection
detect and react to a rare target amongst common distractors
• elevated tonic activity for reversal
• activated by rare targets (and reverses)
• not reward/stimulus related? more response related?
• no reason to persist as unexpected uncertainty
• argue for an intra-trial uncertainty explanation
Vigilance Model
• variable time in the start state
• η controls confusability
• one single run; the cumulative trace is clearer
• exact inference
• effect of the 80% prior
Phasic NE
• NE reports uncertainty about the current state
  – state in the model, not state of the model
  – divisively related to the prior probability of that state
• NE measured relative to the default state sequence: start → distractor
  – temporal aspect: start → distractor
  – structural aspect: target versus distractor
Phasic NE
• onset response from timing uncertainty (SET)
• growth as P(target)/0.2 rises
• act when P(target)=0.95
• stop if P(target)=0.01
• arbitrarily set NE=0 after 5 timesteps (small probability of reflexive action)
Four Types of Trial
[figure: proportions of the four trial types: 77%, 19%, 1.5%, 1%]
• the fall is rather arbitrary
Response Locking
slightly flatters the model, since there is no further response variability
Task Difficulty
• set η=0.65 rather than 0.675
• information accumulates over a longer period
• hits more affected than correct rejections
• timing not quite right
Interrupts
[figure: LC and PFC/ACC interrupt circuit]
Intra-trial Uncertainty
• phasic NE as unexpected state change within a model
• relative to prior probability; against default
• interrupts ongoing processing
• tie to ADHD?
• close to alerting (AJ) – but not necessarily tied to behavioral output (onset rise)
• close to behavioural switching (PR) – but not DA
• farther from optimal inference (EB)
• phasic ACh: aspects of known variability within a state?
Learning and Inference
• Learning: predict; control
  ∆ weight = (learning rate) x (error) x (stimulus)
  – dopamine: phasic prediction error for future reward
  – serotonin: phasic prediction error for future punishment
  – acetylcholine: expected uncertainty boosts learning
  – norepinephrine: unexpected uncertainty boosts learning
Conditioning
• Ethology
  – optimality
  – appropriateness
• Psychology
  – classical/operant conditioning
• Computation
  – dynamic programming
  – Kalman filtering
• Algorithm
  – TD/delta rules
  – simple weights
• Neurobiology
  – neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum
prediction: of important events
control: in the light of those predictions
Computational Neuromodulation
• dopamine
  – phasic: prediction error for reward
  – tonic: average reward (vigour)
• serotonin
  – phasic: prediction error for punishment?
• acetylcholine
  – expected uncertainty?
• norepinephrine
  – unexpected uncertainty; neural interrupt?
Markov Decision Process
class of stylized tasks with states, actions & rewards
– at each timestep t the world takes on state s_t and delivers reward r_t, and the agent chooses an action a_t
World: You are in state 34.Your immediate reward is 3. You have 3 actions.
Robot: I’ll take action 2.
World: You are in state 77.Your immediate reward is -7. You have 2 actions.
Robot: I’ll take action 1.
World: You’re in state 34 (again).Your immediate reward is 3. You have 3 actions.
Markov Decision Process
Markov Decision Process
Stochastic process defined by:
– reward function: r_t ~ P(r_t | s_t)
– transition function: s_{t+1} ~ P(s_{t+1} | s_t, a_t)
Markov Decision Process
Stochastic process defined by:
– reward function: r_t ~ P(r_t | s_t)
– transition function: s_{t+1} ~ P(s_{t+1} | s_t, a_t)
Markov property
– the future is conditionally independent of the past, given s_t
The optimal policy
Definition: a policy such that at every state, its expected value is better than (or equal to) that of all other policies
Theorem: For every MDP there exists (at least) one deterministic optimal policy.
by the way, why is the optimal policy just a mapping from states to actions? couldn’t you earn more reward by choosing a different action depending on last 2 states?
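To make the MDP definition and the optimal-policy theorem concrete, here is a tiny made-up MDP (in the spirit of the World/Robot dialogue, but with invented transition probabilities) and plain synchronous value iteration with an assumed discount factor; the greedy policy it returns is a deterministic mapping from states to actions, as the theorem promises:

```python
import numpy as np

# Hypothetical MDP: P[a][s, s'] are transition probabilities, R[s] is the reward
# delivered in state s (all numbers invented for illustration).
n_states, n_actions = 3, 2
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],   # action 0
    [[0.0, 1.0, 0.0], [0.3, 0.0, 0.7], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([3.0, -7.0, 1.0])
gamma = 0.9   # assumed discount factor

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup until the values converge."""
    V = np.zeros(len(R))
    while True:
        Q = R[:, None] + gamma * np.einsum("ast,t->sa", P, V)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)                      # values, greedy policy
        V = V_new

V, policy = value_iteration(P, R, gamma)
print("V*      :", np.round(V, 2))
print("policy* :", policy)    # a deterministic mapping from states to actions
```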