reinforcement learning 2: action selection peter dayan (thanks to nathaniel daw)
TRANSCRIPT
Reinforcement learning 2: action selection
Peter Dayan
(thanks to Nathaniel Daw)
Global plan
• Reinforcement learning I (Wednesday): prediction
  – classical conditioning
  – dopamine
• Reinforcement learning II: dynamic programming; action selection
  – sequential sensory decisions
  – vigor
  – Pavlovian misbehaviour
• Chapter 9 of Theoretical Neuroscience
Learning and Inference
• Learning: predict; control
  ∆ weight = (learning rate) x (error) x (stimulus)
  – dopamine: phasic prediction error for future reward
  – serotonin: phasic prediction error for future punishment
  – acetylcholine: expected uncertainty boosts learning
  – norepinephrine: unexpected uncertainty boosts learning
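The ∆weight rule above is easy to make concrete. A minimal sketch in Python (the variable names, stimulus coding and learning rate are illustrative assumptions, not from the slides):

```python
import numpy as np

def delta_rule_update(w, x, r, learning_rate=0.1):
    """One delta-rule / Rescorla-Wagner step.

    w: weight vector (one weight per stimulus element)
    x: stimulus vector (1 if a stimulus element is present, else 0)
    r: observed reward/outcome on this trial
    """
    prediction = w @ x                       # linear prediction from the stimuli
    error = r - prediction                   # prediction error
    return w + learning_rate * error * x     # d(weight) = rate * error * stimulus

# illustrative trials: a single stimulus repeatedly paired with reward 1
w = np.zeros(1)
for _ in range(20):
    w = delta_rule_update(w, np.array([1.0]), 1.0)
print(w)   # approaches 1 as the prediction error shrinks
```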
Action Selection
• Evolutionary specification
• Immediate reinforcement:
  – leg flexion
  – Thorndike puzzle box
  – pigeon; rat; human matching
• Delayed reinforcement:
  – these tasks
  – mazes
  – chess
Bandler; Blanchard
Immediate Reinforcement
• stochastic policy: P[L; m] = σ(β(m_L − m_R)) = 1/(1 + exp(−β(m_L − m_R)))
• based on action values: m_L; m_R

Indirect Actor
use the RW rule on the value of the chosen action:
  m_L → m_L + ε(r_L − m_L) if L is chosen (and likewise for m_R)
reward probabilities p_L = 0.05; p_R = 0.25, switching every 100 trials

Direct Actor
expected reward: E(m) = P[L]⟨r_L⟩ + P[R]⟨r_R⟩
∂E(m)/∂m_L = β P[L] P[R] (⟨r_L⟩ − ⟨r_R⟩)
stochastic gradient ascent on E(m):
  m_L → m_L + ε(1 − P[L]) r_L  if L is chosen
  m_L → m_L − ε P[L] r_R       if R is chosen
Direct Actor
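Both actors can be sketched on the two-armed bandit just described (p_L = 0.05, p_R = 0.25, switching every 100 trials). The softmax temperature β, learning rate ε and trial count below are assumed values for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_bandit(actor, n_trials=400, eps=0.1, beta=2.0):
    """Two-armed bandit whose reward probabilities switch every 100 trials
    (p_L = 0.05, p_R = 0.25 as on the slide). Returns the sequence of choices."""
    m = np.zeros(2)                      # action values / propensities: [L, R]
    choices = np.zeros(n_trials, int)
    for t in range(n_trials):
        p = (0.05, 0.25) if (t // 100) % 2 == 0 else (0.25, 0.05)
        p_L = 1.0 / (1.0 + np.exp(-beta * (m[0] - m[1])))   # sigmoidal policy
        a = 0 if rng.random() < p_L else 1                   # 0 = L, 1 = R
        r = 1.0 if rng.random() < p[a] else 0.0
        if actor == "indirect":
            # Rescorla-Wagner rule on the value of the chosen action only
            m[a] += eps * (r - m[a])
        else:
            # direct actor: stochastic policy-gradient step on the propensities
            probs = np.array([p_L, 1 - p_L])
            m += eps * r * (np.eye(2)[a] - probs)
        choices[t] = a
    return choices

for actor in ("indirect", "direct"):
    c = run_bandit(actor)
    print(actor, "fraction of R choices in the first block:", c[:100].mean())
```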
Could we Tell?
• correlate past rewards, actions with present choice
• indirect actor (separate clocks):
• direct actor (single clock):
Matching: Concurrent VI-VI
Lau, Glimcher, Corrado, Sugrue, Newsome
Matching
• income not return
• approximately exponential in r
• alternation choice kernel
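One way to see matching to income on a concurrent VI-VI schedule is a toy simulation in which choice is allocated in proportion to a leaky (approximately exponential) estimate of recent income on each side. The schedule rates and time constant are assumptions, and this is an illustration of matching rather than the Lau/Glimcher/Corrado/Sugrue/Newsome analyses themselves:

```python
import numpy as np

rng = np.random.default_rng(1)

def concurrent_vi(rates=(0.05, 0.15), n_trials=20000, tau=20.0):
    """Concurrent VI-VI: each side 'baits' independently with the given per-step
    probability and holds the reward until collected.  The agent chooses in
    proportion to an exponentially weighted estimate of recent income on each
    side (time constant tau, an assumed value)."""
    baited = np.zeros(2, bool)
    income = np.zeros(2)              # leaky average of obtained rewards (income)
    n_choices = np.zeros(2)
    n_rewards = np.zeros(2)
    for _ in range(n_trials):
        baited |= rng.random(2) < rates           # the VI schedules keep baiting
        p = (income + 1e-6) / (income.sum() + 2e-6)
        a = rng.choice(2, p=p)
        r = 1.0 if baited[a] else 0.0
        baited[a] = False
        income *= (1 - 1 / tau)
        income[a] += r / tau
        n_choices[a] += 1
        n_rewards[a] += r
    return n_choices, n_rewards

choices, rewards = concurrent_vi()
print("choice ratio :", choices[1] / choices.sum())
print("income ratio :", rewards[1] / rewards.sum())   # the two ratios approximately agree (matching)
```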
Action at a (Temporal) Distance
• learning an appropriate action at u=1:
  – depends on the actions at u=2 and u=3
  – gains no immediate feedback
• idea: use prediction as surrogate feedback
Action Selection
TD error: δ(t) = r(t) + v(t+1) − v(t)
start with policy: P[L; u] = σ(β(m_L(u) − m_R(u)))
evaluate it: v(1), v(2), v(3)
improve it: use δ(t) to adjust the propensities m_L(u), m_R(u)
thus m_L grows, and L is chosen more frequently than R
[figure: three-state maze with values around 0.125, −0.125 and 0.025-0.175]
Policy
if δ > 0:
• the value is too pessimistic
• the action is better than average
• so increase both the value v and the propensity of the chosen action (and conversely for δ < 0)
actor/critic
[figure: actor/critic architecture: a critic learning values v and an actor with propensities m1, m2, m3, …, mn]
dopamine signals to both motivational & motor striatum appear, surprisingly, the same
suggestion: training both values & policies (see the sketch below)
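A minimal actor/critic sketch of this policy-iteration idea, on a hypothetical three-state decision tree standing in for the maze. The states, rewards and parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical tree: from state 1, L leads to state 2 and R to state 3;
# from states 2 and 3 each action yields an (assumed) terminal reward.
NEXT = {1: {"L": 2, "R": 3}}
REWARD = {2: {"L": 4.0, "R": 0.0}, 3: {"L": 0.0, "R": 2.0}}

def softmax_policy(m, beta=1.0):
    e = np.exp(beta * (m - m.max()))
    return e / e.sum()

def actor_critic(n_episodes=2000, eps=0.05, beta=1.0):
    v = {s: 0.0 for s in (1, 2, 3)}               # critic: state values
    m = {s: np.zeros(2) for s in (1, 2, 3)}       # actor: propensities [L, R]
    for _ in range(n_episodes):
        s = 1
        while True:
            p = softmax_policy(m[s], beta)
            a = rng.choice(2, p=p)
            action = "LR"[a]
            if s in REWARD:                                    # terminal transition
                r, v_next, done = REWARD[s][action], 0.0, True
            else:
                r, v_next, done = 0.0, v[NEXT[s][action]], False
            delta = r + v_next - v[s]                          # TD error
            v[s] += eps * delta                                # critic update
            m[s][a] += eps * delta * (1 - p[a])                # actor update (chosen action)
            m[s][1 - a] -= eps * delta * p[1 - a]              #   and the other action
            if done:
                break
            s = NEXT[s][action]
    return v, m

v, m = actor_critic()
print({s: round(x, 2) for s, x in v.items()})
print("P[L at state 1] ≈", round(softmax_policy(m[1])[0], 2))   # should favour L
```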
Variants: SARSA
Q*(1,C) = E[ r_t + v(u_{t+1}) | u_t = 1, a_t = C ]
Q(1,C) → Q(1,C) + ε [ r_t + Q(2, a_actual) − Q(1,C) ]
Morris et al, 2006
Variants: Q learning
Q*(1,C) = E[ r_t + v(u_{t+1}) | u_t = 1, a_t = C ]
Q(1,C) → Q(1,C) + ε [ r_t + max_a Q(2,a) − Q(1,C) ]
Roesch et al, 2007
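The two update rules differ only in how they bootstrap from the next state: SARSA uses the action actually taken (on-policy), Q learning the best available action (off-policy). A sketch of the tabular updates, undiscounted as in the formulas above; the toy states, actions and reward are assumed:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, eps=0.1):
    """On-policy update: bootstrap from the action actually taken next."""
    target = r + Q[s_next, a_next]
    Q[s, a] += eps * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, eps=0.1):
    """Off-policy update: bootstrap from the best available next action."""
    target = r + Q[s_next].max()
    Q[s, a] += eps * (target - Q[s, a])

# toy usage: 4 states x 2 actions, one hypothetical transition 1 -(C=0)-> 2 with r=1
Q_sarsa, Q_q = np.zeros((4, 2)), np.zeros((4, 2))
sarsa_update(Q_sarsa, s=1, a=0, r=1.0, s_next=2, a_next=1)
q_learning_update(Q_q, s=1, a=0, r=1.0, s_next=2)
print(Q_sarsa[1], Q_q[1])
```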
Summary
• prediction learning– Bellman evaluation
• actor-critic– asynchronous policy iteration
• indirect method (Q learning)– asynchronous value iteration
v*(1) = E[ r_t + v*(u_{t+1}) | u_t = 1 ]
Q*(1,C) = E[ r_t + v*(u_{t+1}) | u_t = 1, a_t = C ]
Sensory Decisions as Optimal Stopping
• consider listening to:
• decision: choose, or sample
Optimal Stopping
• the equivalent of state u=1 is the first observation n_1
• and of states u=2,3 is the accumulated evidence (n_1 + n_2)/2
[figure: decision tree with actions L, R (choose) and C (sample again), with the associated rewards r]
Transition Probabilities
Evidence Accumulation
Gold & Shadlen, 2007
Current Topics
• Vigour & tonic dopamine
• Priors over decision problems (LH)
• Pavlovian-instrumental interactions
– impulsivity
– behavioural inhibition
– framing
• Model-based, model-free and episodic control
• Exploration vs exploitation
• Game theoretic interactions (inequity aversion)
Vigour
• Two components to choice:
  – what:
    • lever pressing
    • direction to run
    • meal to choose
  – when / how fast / how vigorous
• free operant tasks
• real-valued DP
The model
[figure: free-operant task. From state S0 the agent chooses (action, latency), e.g. choose(action, τ) = (LP, τ1), then again at S1 and S2; each choice incurs costs (a unit cost UC plus a vigour cost C that is larger the faster the response) and can yield rewards; lever presses (LP) compete with nose pokes (NP) and Other actions, and a reward of utility UR is delivered probabilistically (PR) at the goal]
The model
[figure repeated: choose(action, τ) = (LP, τ1) and then (LP, τ2), with the associated costs and rewards, through states S0, S1, S2]
Goal: Choose actions and latencies to maximize the average rate of return (rewards minus costs per time)
Average Reward RL (ARL)
Compute differential values of actions
Differential value of taking action L with latency τ when in state x
ρ = average rewards minus costs, per unit time
• steady state behavior (not learning dynamics)
(Extension of Schwartz 1993)
Average Reward RL:
Q_{L,τ}(x) = Rewards − Costs + Future Returns = R − C_u − C_v/τ − ρτ + V(x')
Average Reward Cost/benefit Tradeoffs
1. Which action to take?
   • choose the action with the largest expected reward minus cost
2. How fast to perform it?
   • slow → less costly (lower vigour cost)
   • but slow → delays (all) rewards
   • net rate of rewards = cost of delay (opportunity cost of time)
   • choose the rate that balances the vigour and opportunity costs
explains faster (irrelevant) actions under hunger, etc.
masochism
Optimal response rates
[figure: experimental data (Niv, Dayan, Joel, unpublished) vs model simulation: probability of the 1st nose poke, and response rates per minute (1st NP, LP), as a function of seconds since reinforcement]
Effects of motivation (in the model)
[figure: energizing effect on an RR25 schedule: mean latencies of LP and Other responses for low vs high utility]
Q(S, a, τ) = p_r U_r − C_u − C_v/τ − ρτ + V(S')
∂Q(S, a, τ)/∂τ = C_v/τ² − ρ = 0  ⇒  τ_opt = √(C_v / ρ)
so the optimal latency depends on the vigour cost and the average reward rate ρ: higher motivation (larger ρ) energizes all responses
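The τ_opt = √(C_v/ρ) result can be checked numerically; the cost and rate values below are made-up illustrations:

```python
import numpy as np

def optimal_latency(C_v, rho):
    """tau* = sqrt(C_v / rho): the latency that balances the vigour cost C_v/tau
    against the opportunity cost rho*tau of time."""
    return np.sqrt(C_v / rho)

def q_value(tau, p_r=1.0, U_r=1.0, C_u=0.1, C_v=0.5, rho=0.2, V_next=0.0):
    """Differential value of responding with latency tau (assumed example numbers)."""
    return p_r * U_r - C_u - C_v / tau - rho * tau + V_next

C_v = 0.5
for rho in (0.1, 0.2, 0.4):      # higher average reward rate, e.g. higher motivation
    tau = optimal_latency(C_v, rho)
    print(f"rho={rho:.1f}  tau*={tau:.2f}  Q(tau*)={q_value(tau, C_v=C_v, rho=rho):.2f}")
# tau* shrinks as rho grows: the energizing effect of motivation
```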
Effects of motivation (in the model)
[figure: response rate per minute vs seconds from reinforcement for UR 50% and RR25 schedules, low vs high utility, showing (1) a directing effect on which response is made and (2) an energizing effect on the mean LP and Other latencies]
What about tonic dopamine?
Relation to Dopamine
• Phasic dopamine firing = reward prediction error
• Tonic dopamine = average reward rate
• NB: the phasic signal remains the RPE for choice/value learning
Aberman and Salamone 1999
[figure: # LPs in 30 minutes as a function of the ratio requirement (1, 4, 16, 64), control vs DA-depleted rats, alongside a matching model simulation]
1. explains pharmacological manipulations
2. dopamine control of vigour through BG pathways
• eating time confound
• context/state dependence (motivation & drugs?)
• less switching = perseveration
Tonic dopamine hypothesis
[figure: firing rate vs reaction time (Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992)]
…also explains effects of phasic dopamine on response times
Pavlovian & Instrumental Conditioning
• Pavlovian
  – learning values and predictions
  – using the TD error
• Instrumental
  – learning actions:
    • by reinforcement (leg flexion)
    • by the (TD) critic
  – (actually different forms: goal-directed & habitual)
Pavlovian-Instrumental Interactions
• synergistic
  – conditioned reinforcement
  – Pavlovian-instrumental transfer
    • Pavlovian cue predicts the instrumental outcome
    • behavioural inhibition to avoid aversive outcomes
• neutral
  – Pavlovian-instrumental transfer
    • Pavlovian cue predicts an outcome with the same motivational valence
• opponent
  – Pavlovian-instrumental transfer
    • Pavlovian cue predicts the opposite motivational valence
  – negative automaintenance
-ve Automaintenance in Autoshaping
• simple choice task
  – N: nogo gives reward r=1
  – G: go gives reward r=0
• learn three quantities
  – the average value
  – the Q value for N
  – the Q value for G
• the instrumental propensity is based on the Q-value advantages
-ve Automaintenance in Autoshaping
• Pavlovian action
  – assert: the Pavlovian impetus towards G is v(t)
  – weight the Pavlovian and instrumental advantages by ω, the competitive reliability of Pavlov
• new propensities
• new action choice
-ve Automaintenance in Autoshaping
• basic –ve automaintenance effect (μ=5)
• lines are theoretical asymptotes
• equilibrium probabilities of action
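A toy sketch of how the ω-weighted competition can produce negative automaintenance: the Pavlovian impetus towards "go" (proportional to the learned value v) trades off against the instrumental Q values. All numbers are assumptions and the update rules are simplified relative to the actual model:

```python
import numpy as np

rng = np.random.default_rng(4)

def neg_automaintenance(omega, n_trials=5000, eps=0.05, beta=5.0):
    """'go' (G) earns nothing, 'nogo' (N) earns r=1, but a Pavlovian impetus
    proportional to the predicted value v pushes towards G; omega weights the
    Pavlovian against the instrumental influences."""
    v = 0.0                    # average (Pavlovian) value prediction
    q = np.array([0.0, 0.0])   # instrumental Q values: [G, N]
    p_go_trace = []
    for _ in range(n_trials):
        propensity = (1 - omega) * q + omega * np.array([v, 0.0])
        p_go = 1.0 / (1.0 + np.exp(-beta * (propensity[0] - propensity[1])))
        a = 0 if rng.random() < p_go else 1
        r = 0.0 if a == 0 else 1.0
        q[a] += eps * (r - q[a])        # instrumental learning
        v += eps * (r - v)              # Pavlovian value learning
        p_go_trace.append(p_go)
    return np.mean(p_go_trace[-1000:])

for omega in (0.0, 0.2, 0.5, 0.8):
    print(f"omega={omega:.1f}  asymptotic P(go) ≈ {neg_automaintenance(omega):.2f}")
# larger omega -> more Pavlovian 'go' responding, even though it forgoes reward
```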
Impulsivity & Hyperbolic Discounting
• humans (and animals) show impulsivity in:
  – diets
  – addiction
  – spending, …
• intertemporal conflict between short- and long-term choices
• often explained via hyperbolic discount functions
• an alternative is a Pavlovian imperative towards an immediate reinforcer
• framing, trolley dilemmas, etc.
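The preference-reversal signature of hyperbolic (but not exponential) discounting can be shown with made-up amounts and delays:

```python
def exponential(value, delay, gamma=0.9):
    return value * gamma**delay

def hyperbolic(value, delay, k=1.0):
    return value / (1.0 + k * delay)

# small-sooner vs large-later reward (assumed example amounts and delays)
small, large = 2.0, 5.0
for t in (0, 2, 6, 10):                  # time remaining until the small reward
    d_small, d_large = t, t + 3          # the large reward arrives 3 steps later
    exp_pref = "small" if exponential(small, d_small) > exponential(large, d_large) else "large"
    hyp_pref = "small" if hyperbolic(small, d_small) > hyperbolic(large, d_large) else "large"
    print(f"delay {t:2d}: exponential prefers {exp_pref}, hyperbolic prefers {hyp_pref}")
# exponential discounting never reverses its preference; hyperbolic discounting
# flips to the small-sooner reward once it is imminent -- one account of impulsivity
```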
Kalman Filter
• Markov random walk (or OU process)
• no punctate changes
• additive model of combination
• forward inference

Kalman Posterior
[figure/equation: the Gaussian posterior over the weights, whose uncertainty shrinks as evidence accumulates]
Assumed Density KF
• Rescorla-Wagner error correction
• competitive allocation of learning
  – P&H, M
Blocking
• forward blocking: error correction
• backward blocking: negative off-diagonal terms in the posterior covariance
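A minimal Kalman-filter sketch of backward blocking (the drift and observation-noise variances are assumed): train A and B together, then A alone; the negative off-diagonal covariance acquired in stage 1 pulls the weight of B down in stage 2 even though B is absent:

```python
import numpy as np

def kalman_step(w, S, x, r, tau2=0.01, sigma2=0.5):
    """One Kalman-filter update for a linear-Gaussian conditioning model:
    r = w.x + noise, with the weights w drifting as a random walk.
    w: posterior mean over the weights; S: posterior covariance."""
    S = S + tau2 * np.eye(len(w))          # diffusion of the weights
    x = np.asarray(x, float)
    k = S @ x / (x @ S @ x + sigma2)       # Kalman gain (per-stimulus learning rates)
    w = w + k * (r - w @ x)                # error-correcting update
    S = S - np.outer(k, x @ S)             # posterior covariance shrinks
    return w, S

w, S = np.zeros(2), np.eye(2)
for _ in range(20):                        # stage 1: A and B together -> reward
    w, S = kalman_step(w, S, [1, 1], 1.0)
for _ in range(20):                        # stage 2: A alone -> reward
    w, S = kalman_step(w, S, [1, 0], 1.0)
print(np.round(w, 2))   # backward blocking: w_B falls below its stage-1 value,
                        # via the negative off-diagonal term of S
```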
Mackintosh vs P&H
• under the diagonal approximation:
• for slow learning,
  – effect like Mackintosh
predictor competition
Summary
stimulus/correlation re-representation (Daw)
• Kalman filter models many standard conditioning paradigms
• elements of RW, Mackintosh, P&H
• but:
  – downwards unblocking
  – negative patterning: L→r; T→r; L+T→·
  – recency vs primacy (Kruschke)
Uncertainty (Yu)
• expected uncertainty - ignorance
  – amygdala, cholinergic basal forebrain for conditioning
  – ?basal forebrain for top-down attentional allocation
• unexpected uncertainty - 'set' change
  – noradrenergic locus coeruleus
• part opponent; part synergistic interaction
ACh & NE have distinct behavioral effects:
• ACh boosts learning to stimuli with uncertain consequences
• NE boosts learning upon encountering global changes in the environment
(e.g. Bear & Singer, 1986; Kilgard & Merzenich, 1998)
ACh & NE have similar physiological effects
• suppress recurrent & feedback processing
• enhance thalamocortical transmission
• boost experience-dependent plasticity(e.g. Gil et al, 1997)
(e.g. Kimura et al, 1995; Kobayashi et al, 2000)
Experimental Data
(e.g. Bucci, Holland, & Gallagher, 1998)
(e.g. Devauges & Sara, 1990)
Model Schematics
[figure: sensory inputs x feed bottom-up processing, context z feeds top-down processing, and both converge on cortical processing of y for prediction, learning, ...; ACh reports expected uncertainty and NE unexpected uncertainty, setting the balance between the two streams]

Attention
attentional selection for (statistically) optimal processing, above and beyond the traditional view of resource constraint
Example 1: Posner’s Task
[figure: trial sequence of cue, target and response (intervals 0.2-0.5s, 0.1s, 0.1s, 0.15s); the cue carries sensory input about the stimulus location with high or low validity (Phillips, McAlonan, Robb, & Brown, 2000)]
generalize to the case that cue identity changes with no notice
Formal Framework
• cues: vestibular, visual, ... (c1, c2, c3, c4)
• target S: stimulus location, exit direction, ...
• variability in quality of the relevant cue → ACh
• variability in identity of the relevant cue → NE
• Sensory Information: avoid representing the full uncertainty
[equations: approximate (assumed-density) posterior over the validity of the relevant cue and over which cue is relevant]
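A highly simplified sketch of the Posner-task logic (not the full Yu & Dayan model): the cue sets a prior whose strength reflects the assumed cue validity, noisy evidence then accumulates to a decision threshold, and the validity effect is the difference in decision times between invalid and valid trials. Treating higher ACh as a lower assumed validity shrinks the validity effect. All parameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def reaction_time(cue_correct, assumed_validity, d=0.4, threshold=0.95, max_t=200):
    """Steps of noisy evidence needed to reach a decision about the target side.
    The cue sets the prior P(cued side) = assumed_validity; higher ACh is
    modelled here as a lower assumed validity (more expected uncertainty)."""
    log_odds = np.log(assumed_validity / (1 - assumed_validity))
    log_odds = log_odds if cue_correct else -log_odds
    for t in range(1, max_t + 1):
        log_odds += 2 * d * rng.normal(d, 1.0)     # evidence for the true side
        p_true = 1 / (1 + np.exp(-log_odds))
        if p_true > threshold or p_true < 1 - threshold:
            return t                               # decision reached (correct or not)
    return max_t

def validity_effect(assumed_validity, n=2000):
    rt_valid = np.mean([reaction_time(True, assumed_validity) for _ in range(n)])
    rt_invalid = np.mean([reaction_time(False, assumed_validity) for _ in range(n)])
    return rt_invalid - rt_valid

for gamma in (0.9, 0.75, 0.6):      # lower assumed validity ~ higher ACh
    print(f"assumed validity {gamma:.2f}: validity effect ≈ {validity_effect(gamma):.1f} steps")
# the validity effect shrinks as the assumed cue validity falls (ACh increases)
```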
Simulation Results: Posner’s Task
• vary cue validity ↔ vary ACh; fix the relevant cue ↔ low NE
• in the model, the validity effect (VE) decreases with both the ACh and NE levels
[figure: simulated validity effect as ACh is increased or decreased (60-140% of the normal level), compared with nicotine and scopolamine dose-response data (Phillips, McAlonan, Robb, & Brown, 2000)]
Maze Task (example 2: attentional shift)
[figure: maze with two cues marking reward: initially cue 1 is relevant and cue 2 irrelevant; the assignment then reverses (Devauges & Sara, 1990)]
no issue of validity
Simulation Results: Maze Navigation
• fix cue validity; no explicit manipulation of ACh
• change the relevant cue ↔ NE
[figure: % rats reaching criterion vs number of days after the shift from the spatial to the visual task: experimental data (Devauges & Sara, 1990) and model data]
Simulation Results: Full Model
• true & estimated relevant stimuli
• neuromodulation in action
• validity effect (VE) across trials

Simulated Psychopharmacology
• 50% NE
• 50% ACh/NE
• ACh compensation
• NE can nearly catch up
Inter-trial Uncertainty
• single framework for understanding ACh, NE and some aspects of attention
• ACh/NE as expected/unexpected uncertainty signals
• experimental psychopharmacological data replicated by model simulations
• implications from complex interactions between ACh & NE
• predictions at the cellular, systems, and behavioral levels
• activity vs weight vs neuromodulatory vs population representations of uncertainty
Aston-Jones: Target Detection
detect and react to a rare target amongst common distractors
• elevated tonic activity for reversal
• activated by rare targets (and reverses)
• not reward/stimulus related? more response related?
• no reason to persist as unexpected uncertainty
• argue for an intra-trial uncertainty explanation
Vigilance Model
• variable time in the start state
• η controls confusability
• one single run; the cumulative trace is clearer
• exact inference
• effect of the 80% prior
Phasic NE
• NE reports uncertainty about the current state
  – state in the model, not state of the model
  – divisively related to the prior probability of that state
• NE measured relative to the default state sequence: start → distractor
  – temporal aspect: start → distractor
  – structural aspect: target versus distractor
Phasic NE
• onset response from timing uncertainty (SET)
• growth as P(target)/0.2 rises
• act when P(target)=0.95
• stop if P(target)=0.01
• arbitrarily set NE=0 after 5 timesteps (small probability of reflexive action)
Four Types of Trial
[figure: proportions of the four trial types: 77%, 19%, 1.5%, 1%]
• the fall is rather arbitrary
Response Locking
slightly flatters the model, since there is no further response variability
Task Difficulty
• set η=0.65 rather than 0.675
• information accumulates over a longer period
• hits more affected than correct rejections
• timing not quite right
Interrupts
[figure: LC and PFC/ACC interrupt circuit]
Intra-trial Uncertainty
• phasic NE as unexpected state change within a model
• relative to prior probability; against default
• interrupts ongoing processing
• tie to ADHD?
• close to alerting (AJ) – but not necessarily tied to behavioral output (onset rise)
• close to behavioural switching (PR) – but not DA
• farther from optimal inference (EB)
• phasic ACh: aspects of known variability within a state?
Learning and Inference
• Learning: predict; control
  ∆ weight = (learning rate) x (error) x (stimulus)
  – dopamine: phasic prediction error for future reward
  – serotonin: phasic prediction error for future punishment
  – acetylcholine: expected uncertainty boosts learning
  – norepinephrine: unexpected uncertainty boosts learning
Conditioning
• Ethology
  – optimality
  – appropriateness
• Psychology
  – classical/operant conditioning
• Computation
  – dynamic programming
  – Kalman filtering
• Algorithm
  – TD/delta rules
  – simple weights
• Neurobiology
  – neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum
prediction: of important events
control: in the light of those predictions
Computational Neuromodulation
• dopamine
  – phasic: prediction error for reward
  – tonic: average reward (vigour)
• serotonin
  – phasic: prediction error for punishment?
• acetylcholine
  – expected uncertainty?
• norepinephrine
  – unexpected uncertainty; neural interrupt?
Markov Decision Process
class of stylized tasks with states, actions & rewards
– at each timestep t the world takes on state s_t and delivers reward r_t, and the agent chooses an action a_t
World: You are in state 34.Your immediate reward is 3. You have 3 actions.
Robot: I’ll take action 2.
World: You are in state 77.Your immediate reward is -7. You have 2 actions.
Robot: I’ll take action 1.
World: You’re in state 34 (again).Your immediate reward is 3. You have 3 actions.
Markov Decision Process
Markov Decision Process
Stochastic process defined by:
– reward function: r_t ~ P(r_t | s_t)
– transition function: s_{t+1} ~ P(s_{t+1} | s_t, a_t)
Markov Decision Process
Stochastic process defined by:
– reward function: r_t ~ P(r_t | s_t)
– transition function: s_{t+1} ~ P(s_{t+1} | s_t, a_t)
Markov property
– the future is conditionally independent of the past, given s_t
The optimal policy
Definition: a policy such that at every state, its expected value is better than (or equal to) that of all other policies
Theorem: For every MDP there exists (at least) one deterministic optimal policy.
by the way, why is the optimal policy just a mapping from states to actions? couldn’t you earn more reward by choosing a different action depending on last 2 states?
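To make the MDP definition and the optimal-policy theorem concrete, here is a tiny made-up MDP (in the spirit of the World/Robot dialogue, but with invented transition probabilities) and plain synchronous value iteration with an assumed discount factor; the greedy policy it returns is a deterministic mapping from states to actions, as the theorem promises:

```python
import numpy as np

# Hypothetical MDP: P[a][s, s'] are transition probabilities, R[s] is the reward
# delivered in state s (all numbers invented for illustration).
n_states, n_actions = 3, 2
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],   # action 0
    [[0.0, 1.0, 0.0], [0.3, 0.0, 0.7], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([3.0, -7.0, 1.0])
gamma = 0.9   # assumed discount factor

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup until the values converge."""
    V = np.zeros(len(R))
    while True:
        Q = R[:, None] + gamma * np.einsum("ast,t->sa", P, V)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)                      # values, greedy policy
        V = V_new

V, policy = value_iteration(P, R, gamma)
print("V*      :", np.round(V, 2))
print("policy* :", policy)    # a deterministic mapping from states to actions
```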