
The Role of Dopamine in Information-Seeking

A thesis by Ethan S. Bromberg-Martin

Thesis Committee:
Okihide Hikosaka, National Eye Institute, Advisor
Barry J. Richmond, National Institute of Mental Health, Chair
David L. Sheinberg, Brown University
Daeyeol Lee, Yale University, Outside Reader

Defended June 2, 2009

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Neuroscience at Brown University

May 2010

Copyright 2009 by Ethan S. Bromberg-Martin. All rights reserved.

This dissertation by Ethan S. Bromberg-Martin is accepted in its present form by the Division of Biology and Medicine as satisfying the dissertation requirement of the degree of Doctor of Philosophy.

Date___________ ___________________________________
Okihide Hikosaka, M.D., Ph.D., Advisor
Laboratory of Sensorimotor Research, National Eye Institute, National Institutes of Health

Recommended to the Graduate Council

Date___________ ___________________________________
David L. Sheinberg, Ph.D., Reader
Department of Neuroscience, Brown University

Date___________ ___________________________________
Barry J. Richmond, M.D., Reader
Laboratory of Neuropsychology, National Institute of Mental Health, National Institutes of Health

Date___________ ___________________________________
Daeyeol Lee, Ph.D., Reader
Department of Neurobiology, Yale University

Approved by the Graduate Council

Date___________ ___________________________________
Sheila Bonde, Ph.D., Dean of the Graduate School, Brown University

CURRICULUM VITAE

Ethan S. Bromberg-Martin
Born April 27, 1983

Contact Information
Laboratory of Sensorimotor Research
49 Convent Drive, Room 2A50
Bethesda, MD 20892
Email: [email protected]

Education
Summer 2005 to current: Candidate for Ph.D. in Neuroscience at Brown University, as part of the Graduate Partnership Program with the National Institutes of Health. Graduate Thesis: The Role of Dopamine in Information-Seeking (Mentor: Dr. Okihide Hikosaka, National Eye Institute)
Fall 2001 to Spring 2005: Bachelor of Science in Computational Biology at Brown University, magna cum laude. Honors Thesis: Partial-Order Alignment of RNA Structures (Mentor: Dr. Franco Preparata, Computer Science Department)

Other Employment
Summer 2002, 2003, 2004: Internship at the Laboratory of Experimental and Computational Biology, National Cancer Institute, at NCI-Frederick in Ft. Detrick, Maryland
Summer 2001: Internship at the Laboratory of Biochemical Genetics, National Heart, Lung, and Blood Institute, at the NIH in Bethesda, Maryland

Professional Memberships
Society for Neuroscience

Peer-Reviewed Publications
Bromberg-Martin ES, Hikosaka O. (2009) Midbrain dopamine neurons signal preference for advance information about upcoming rewards. (Neuron, in press)
Hikosaka O, Bromberg-Martin E, Hong S, Matsumoto M. (2008) New insights on the subcortical representation of reward. Curr. Opin. Neurobiol. 18, 203-208.

Conference Presentations
Bromberg-Martin E, Matsumoto M, Nakamura K, Nakahara H, Hikosaka O. (2009) Multiple timescales of reward memory in lateral habenula and midbrain dopamine neurons. Frontiers in Systems Neuroscience. Conference Abstract: Computational and systems neuroscience.
Bromberg-Martin ES, Hikosaka O. (2008) Midbrain dopamine neurons signal preference for advance information about upcoming rewards. Society for Neuroscience Abstract, 34: 691.23.
Bromberg-Martin ES, Matsumoto M, Hikosaka O. (2007) Lateral habenula neurons respond to the negative value of rewards and the positive value of elapsed time. Society for Neuroscience Abstract, 33: 530.9.
Bromberg-Martin E, Mar Jonsson A, Marai GE, McGuire M. (2004) Hybrid Billboard Clouds for Model Simplification. SIGGRAPH Poster Abstract.

PREFACE

Aristotle began his great work, Metaphysics, with the observation that "All men by nature desire to know." In this thesis my aim is to take a first step toward understanding how the brain motivates itself to seek information about the surrounding world. Chapter 1 introduces current knowledge about the behavioral phenomenon of information-seeking and about the neural basis of reward seeking, in particular the reward-signaling midbrain dopamine neurons that are the focus of this thesis. In Chapter 2, I present the results of experiments showing that macaque monkeys seek advance information about the size of upcoming water rewards. Further, the same midbrain dopamine neurons that signal the expected amount of water also signal the expectation of advance information, in a manner that is correlated with behavioral preferences. In Chapter 3, I present an analysis of an existing dataset showing that in a task in which reward information is delivered at a predictable time, dopamine neurons can use two distinct prediction strategies: one based on a one-trial memory of past rewards, expressed at the start of the trial, and another based on a long-timescale memory, expressed at the time when new reward information is revealed. The same pattern is present in a major source of input to dopamine neurons, the reward-predicting cells of the lateral habenula. Chapter 4 summarizes the main results of this thesis and suggests new avenues for future research.

ACKNOWLEDGEMENTS

This thesis was completed in the Laboratory of Sensorimotor Research in the National Eye Institute (NEI) at the National Institutes of Health (NIH) and at Brown University between June 2005 and May 2009. It was made possible by the generous support of the NIH intramural research program, and by the Graduate Partnerships Program (GPP) between the NIH and the Brown University Department of Neuroscience.

I thank my advisor, Okihide Hikosaka, for his continual support and excellent mentorship, especially his keen judgment of experimental design and advice about building a program of research. He is as good an example of a scientist and a gentleman as a budding neuroscientist could hope to have as a mentor. Most of all, he makes the process of scientific discovery enjoyable. Once a visitor asked me why I was in the lab so late, and suggested that I should spend less time working. I replied, "It's not working, it's FUNNING!" If it wasn't for Okihide, I wouldn't have been able to say that!

I thank my thesis committee members Barry J. Richmond and David L. Sheinberg for always being available to discuss matters related to my thesis and for their insightful input, and my outside reader Daeyeol Lee for valuable advice and a keen eye for detail. I thank my colleagues at the NIH, especially Masayuki Matsumoto for his collaboration and experimental and research advice, and Kae Nakamura, Long Ding, Masaki Isoda, Simon Hong, Masaharu Yasuda, Shinya Yamamoto, and Yoshihisa Tachibana; fellow fellows including Ilya Monosov, Alexander Lerchner, Giancarlo la Camera, Christian Quaia, Seiji Tanabe, Ralf Haefner, Hendrikje Nienborg, Rebecca Berman, Jason Trageser, Wilsaan Joiner, Kerry McAlonan, and James Cavanaugh; intrepid investigators including Bruce Cumming, Kirk Thompson, and Lance Optican; and excellent experts including Mitchell Smith, John McClurkin, Art Hays, Altah Nichols, Tom Ruffner, Leonard Jensen, Ginger Tansey, Denise Parker, Mark Lawson, and Bethany Nagy. I also thank Veit Stuphorn for advice about behavioral experiments, and I thank my voluminous correspondent and insightful collaborator, Hiroyuki Nakahara.

I thank my colleagues in the Brown-NIH GPP program, especially Avinash Pujala and Ilya Monosov for their scholarly and other communications, and Emmette Huchison, Jodi Gilman, Lindsay Hayes, Cole Graydon, Caroline Graybeal, and Shau-Ming Wei for their support and advice. And my colleagues at Brown, especially Arjun Bansal, Luke Woloszyn, Ryan Mruczek, Jedediah Singer, and Carlos Vargas-Irwin. I thank those who supported me in the Brown-NIH Graduate Partnership Program, including Diane Lipscombe, Jerome Sanes, John Isaac, Katherine Roche, Caroline Duffy, Mary DeLong, and Sharon Milgram. I also thank my professors and teachers at Brown who showed me how exciting cognitive brain science can be, including Michael Black, Frank Wood, Lucien Elie Bienenstock, Mayank Mehta, Stuart Geman, John Stein, Michael Paradiso, David Sheinberg, Michael Tarr, Fulvio Domini, and Steven Sloman.

Finally, I thank my parents, Rob and Laura; my brother, Brett; and my parents of parents, Richard, Helen, Henry, and Arlene. For your love, support, advice, understanding, and general awesomeness, too immense and profound to be contained within words on a printed page.

TABLE OF CONTENTS

Curriculum Vitae
Preface
Acknowledgements
Table of Contents
List of Figures

Chapter 1: Introduction
    1.1 Information-seeking in humans and animals
    1.2 Reward-predictive signals of midbrain dopamine neurons
    1.3 Reward-predictive signals of lateral habenula neurons
    1.4 Summary

Chapter 2: Midbrain dopamine neurons signal preference for advance information about upcoming rewards
    2.1 Introduction
    2.2 Results
        Behavioral preference for advance information
        Dopamine neurons signal advance information
    2.3 Discussion
        Implications of information-seeking for attitudes toward uncertainty
        Information signals in the dopaminergic reward system
        Why do dopamine neurons treat information as a reward?
    2.4 Methods
    2.5 Acknowledgements
    Note 2.A: Aversion to information by reinforcement learning algorithms based on TD(λ)

Chapter 3: Multiple timescales of memory in a network for reward prediction
    3.1 Introduction
    3.2 Results
        Behavioral task and optimal timescale of memory
        Behavioral and neural memory for a single previous reward outcome
        Multiple timescales of memory
        Multiple timescales of memory in tonic neural activity
        Timecourse of the timescale of memory
    3.3 Discussion
    3.4 Methods
    3.5 Acknowledgements
    Note 3.A: Optimal reward prediction in the pseudorandom reward schedule
    Note 3.B: Formal model of reward expectation during the task
    Note 3.C: Single neurons accessed both short- and long-timescale memories
    Note 3.D: Reward memory in lateral habenula response to predictable reward omission
    Note 3.E: Rough independence of lateral habenula tonic and phasic memory effects
    Note 3.F: V-Shaped timecourse of the timescale of reward memory
    Note 3.G: No evidence for memory-change-induced prediction-errors

Chapter 4: Summary and Conclusions
    4.1 Chapter 2: Summary and future directions
    4.2 Chapter 3: Summary and future directions
    4.3 Summary of contributions

Bibliography

LIST OF FIGURES

Figure 1.1: A dopamine neuron following the prediction-error theory
Figure 1.2: Lateral habenula neurons carry negative reward signals
Figure 2.1: Behavioral preference for advance information
Figure 2.2: Control for water volume of the informative and random options
Figure 2.3: Behavioral preference for immediate delivery of information
Figure 2.4: Dopamine neurons signal information
Figure 2.5: Analysis of the dopamine neuron population
Figure 2.6: Correlation between neural discrimination and behavioral preference
Figure 2.7: Aversion to information by a model using SARSA(λ) and softmax action selection
Figure 3.1: Reward-biased saccade task
Figure 3.2: Behavioral performance and memory for a single past reward outcome
Figure 3.3: Neural responses to task events and memory for a single previous outcome
Figure 3.4: Multiple timescales of memory
Figure 3.5: Quantifying the timescale of memory
Figure 3.6: Timescale of memory in tonic neural activity
Figure 3.7: Timecourse of the timescale of memory
Figure 3.8: Optimal reward prediction in the pseudorandom reward schedule
Figure 3.9: Formal model of reward expectation during the task
Figure 3.10: Single neurons accessed both short- and long-timescale memories
Figure 3.11: Memory effect in lateral habenula response to predictable reward omission
Figure 3.12: Rough independence of lateral habenula tonic and phasic memory effects
Figure 3.13: V-Shaped timecourse of the timescale of reward memory
Figure 3.14: No evidence for memory-change-induced prediction-errors during the fixation period
Figure 4.1: Hypothetical dopamine neuron encoding of information prediction-errors
Figure 4.2: Necessity for advance information in solving the temporal credit-assignment problem
Figure 4.3: Hypothetical measurement of the neural and behavioral change between short- and long-timescale memories

CHAPTER 1

Introduction

In recent years neuroscientists have learned much about how the brain represents objects of desire, also known as rewards. We now know that rewards are represented by neurons in practically every cortical and subcortical area, from motivational brain systems where rewards are anticipated and predicted, to motor systems where rewards influence action selection and movement control, to sensory systems where rewards enhance the neural representation of desirable objects as they are seen, heard, touched, smelled, and tasted. Unfortunately, a major limitation of this research is that it is very rarely possible to record the activity of single neurons in behaving human beings, and instead recordings are largely restricted to experimental animals. So, with a few notable exceptions (e.g. Williams et al., 2004; Klein et al., 2008; Zaghloul et al., 2009), our knowledge about the neuronal basis of rewarding experience is largely limited to basic rewards like food and water (Schultz, 2000). The neural pathways that regulate eating and drinking are highly specialized and evolutionarily ancient, a testament to their importance (Lenard and Berthoud, 2008; Denton et al., 1996); on the other hand, this may mean that these neural pathways are not flexible enough to represent the more abstract, cognitive rewards that are at the heart of our mental lives. In this thesis my goal is to take a first step toward understanding the neuronal basis of cognitive rewards by studying a form of cognitive reward-seeking that is shared between humans and animals: the desire for advance information about future events.

1.1 Information-seeking in humans and animals

In 350 B.C., Aristotle began his great work, Metaphysics, with the assertion that "All men by nature desire to know." This desire is not simply because knowledge can be put to use in achieving one's goals, but because knowledge is intrinsically desirable in its own right: "For not only with a view to action, but even when we are not going to do anything, we prefer seeing (one might say) to everything else. The reason is that this, most of all the senses, makes us know and brings to light many differences between things." (Aristotle and Ross, 2006). This philosophical assertion appeals to intuition: surely it is better to seek knowledge than to seek ignorance! Yet it was more than two millennia before this concept of information-seeking became the subject of scientific study.

This topic was first introduced to experimental psychology by L. Benjamin Wyckoff during his studies of discrimination learning in pigeons (Wyckoff, 1952). He coined the term "observing response" to mean any action that brought an animal into contact with a discriminative stimulus (a stimulus used to choose between a set of possible responses). For example, a pigeon in a Skinner box might turn its head toward a colored light (observing response), which would then tell the pigeon which button to peck to get a food reward (discriminative response). In this case the information is obviously beneficial because it helps the animal gain a greater amount of food. However, several researchers found that rats and pigeons persistently observe reward-indicating cues even when their observations have no effect on the reward rate (Prokasy, 1956; Hendry, 1969; Daly, 1992; Roper and Zentall, 1999). In a prototypical "observing behavior" experiment a rat is given a choice between two chambers, after which the rat is confined in the chosen chamber for 30 seconds and then given a food reward with 50% probability. In one chamber, the color of the walls is informative, telling the rat whether the trial will be rewarded; in the other chamber the color of the walls is uninformative, entirely randomized. Under these conditions, rats express a strong preference for the informative chamber (Prokasy, 1956; Daly, 1992). The preference for informative cues is strongest when the possibility of receiving a reward is most attractive: for example, when the rats are hungry, when rewards are scarce, and when the potential reward size is large (Wehling and Prokasy, 1962; McMichael et al., 1967; Mitchell et al., 1965). Rats also prefer cues that are informative about upcoming punishments (aversive events), showing the strongest preference when the intensity of the possible punishment is largest (Badia et al., 1979). Thus, animals do not seek all kinds of information equally, but specifically seek information about future events that have a large positive or negative value.

An independent line of inquiry[1] into information-seeking began in the field of economics with a theoretical paper by Kreps and Porteus, introducing a concept they called "temporal resolution of uncertainty" (Kreps and Porteus, 1978). According to their theory, decision-makers may express a preference between two options that have an identical set of possible outcomes but differ in the time when the outcome becomes known, that is, the time when the decision-maker's uncertainty about the outcome is resolved. According to this theory, decision-makers can prefer early resolution of uncertainty (behaving like the rats and pigeons in the above experiments), or late resolution of uncertainty (ignoring information for as long as possible). Later experiments showed that humans generally express a strong preference for early resolution of uncertainty, seeking advance information about monetary gains and losses even when they cannot use this knowledge to change the outcome (Chew and Ho, 1994; Ahlbrecht and Weber, 1996; Lovallo and Kahneman, 2000; Eliaz and Schotter, 2007; Luhmann et al., 2008).

[1] I cannot find any paper from the literature on observing behavior that references any paper from the literature on temporal resolution of uncertainty, or vice versa. Apparently this thesis is the first publication to link these two lines of inquiry.
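To make the concept precise, the Kreps-Porteus formulation can be sketched as follows (the notation here is mine, not the thesis's). Utility is defined recursively over time,

    $$ U_t = W\bigl(c_t,\; \mathbb{E}_t[U_{t+1}]\bigr), $$

where $c_t$ is the payoff consumed at time $t$ and $W$ is an aggregator function. Under this formulation, a decision-maker prefers early resolution of uncertainty whenever $W(c,\cdot)$ is convex in its second argument, prefers late resolution when it is concave, and is indifferent (as in standard expected utility) when it is linear.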

Thus, both humans and animals seek advance information about future rewards, as if information was itself rewarding. Yet advance information is fundamentally different from conventional forms of reward. Most rewards are spoken of as discrete physical entities that can be delivered and "consumed" at a specific moment in time. A reward is an object, like an apple that can be eaten, or an attractive picture that can be consumed by visual inspection. But consider the behavioral experiment described above, in which rats chose between informative and uninformative chambers that led to probabilistic rewards. Both options presented exactly the same two physical stimuli: a chamber that could be white or black, and a reward that could be present or absent. Thus, the animals' preference cannot be explained in terms of a rewarding stimulus that was present for one option and absent for the other. The only difference between the options was that for one chamber the color indicated the reward size, whereas for the other chamber it did not. Crucially, this informative relationship could not be detected on any single trial, but had to be inferred through repeated experience. Thus, the differential reward value between the two options did not originate from any single external event, but from an inference or realization, an internal event or sequence of events in the brain. In this sense the reward value of advance information has a highly abstract, cognitive origin.

How could such an abstract form of reward be generated by the brain? Is it created by the same neural networks that assign value to basic rewards like food and water? Or is it handled by an entirely separate system? Much is known about how the brain controls the physical act of consuming food and water, but what about the analogous mental act of consuming information?

In this thesis I will take a first step toward answering these questions by recording the activity of dopamine-releasing neurons in the midbrain. Dopamine neurons are an ideal candidate for this purpose because they are one of the brain's major sources of food and water reward signals and are theorized to have a central role in reward learning. A crucial test of this theory will be whether dopamine neurons carry informational reward signals as well (Chapter 2). Furthermore, dopamine neuron activity is thought to serve as a reinforcement signal that teaches the brain which actions are "good" and which are "bad." Thus, if these neurons carry information-related signals, then they are likely to play a causal role in creating the informational preference seen in behavior. Finally, dopamine neurons do not act as simple reward detectors, but instead appear to respond to the first new information about rewards in the future. Thus, they provide an ideal opportunity to study how reward-signaling neurons in the brain adapt their activity at the time when new reward information is expected and consumed (Chapter 3).

In the next section I will describe the nature of reward signals in dopamine neurons, along with the theoretical framework which I will use for interpreting their activity (Section 1.2). I will also describe the activity pattern of neurons in one of their major input nuclei, the lateral habenula, which I will analyze in detail in Chapter 3 in order to better understand the source of dopamine neuron reward signals (Section 1.3).

1.2 Reward-predictive signals of midbrain dopamine neurons

Two regions of the midbrain, the ventral tegmental area (VTA) and the substantia nigra pars compacta (SNc), contain a large number of neurons that release the neurotransmitter dopamine.[2] These neurons send axons to many structures involved in motor control, cognition, emotion, and memory: SNc neurons project primarily to the dorsal striatum, while VTA neurons project to many areas including the ventral striatum, amygdala, hippocampus, and wide swaths of frontal cortex (Oades and Halliday, 1987; Graybiel, 1990). These neurons are well-studied partly because they are involved in a variety of neurological disorders; in particular, the degeneration of SNc dopamine neurons is the major cause of Parkinson's disease (Dauer and Przedborski, 2003). In this thesis I will focus on another aspect of dopamine neurons, their role in reward learning.

[2] For convenience, they will here be referred to as "dopamine neurons" rather than the more complete phrase "midbrain dopamine neurons." In fact there are several populations of dopamine neurons in the nervous system, including cells in the retina, olfactory bulb, and hypothalamus (Bjorklund and Dunnett, 2007, Dopamine neuron systems in the brain: an update, Trends in Neurosciences 30, 194-202).

Manipulations of dopamine neurotransmission cause profound effects on reward-seeking. For example, when animals are infused with dopamine antagonists they gradually reduce the effort they put forth to get food and water rewards, behaving as though the reward size had suddenly decreased (Wise, 2004). Even on later days when the drug is no longer administered, animals still continue to put forth reduced effort, as if the suppression of dopamine neurotransmission had caused them to learn that the task had a low reward value. In some cases an animal's reward learning can be impaired or even fully blocked by interfering with dopamine transmission in very localized regions of the brain (e.g. Liu et al., 2004; Dalley et al., 2005; Nakamura and Hikosaka, 2006). Dopamine release is not only linked to natural rewards but also seems to be crucial for many 'unnatural' ones as well, as most addictive drugs potentiate dopamine release in one way or another (Kauer, 2004), and humans taking dopamine agonists as a medical treatment sometimes develop addiction-like traits such as compulsive gambling and hypersexuality (Ondo and Lai, 2008; Bostwick et al., 2009). Dopamine release appears to act as a reward or reinforcement signal, as animals can be trained to perform simple tasks in exchange for direct burst stimulation of dopamine neuron axons (Tsai et al., 2009).

Consistent with the importance of dopamine in reward learning, a seminal series of studies conducted by Wolfram Schultz and his colleagues in Fribourg found that these neurons emitted phasic bursts of spikes triggered by food and water rewards (Schultz, 1998). Neurons fired bursts when monkeys drank a squirt of juice, when they touched or saw a morsel of food, and when they observed visual or auditory cues indicating the availability of these rewards. Perplexingly, each of these responses was present in some tasks but absent in others. Furthermore, as monkeys gained practice at a task, some of these bursts became stronger while other responses entirely disappeared, or could even be converted into brief inhibitions. A number of theories have been proposed to account for this puzzling pattern of dopamine neuron activity.

In this thesis I will discuss dopamine neurons primarily in the framework of the most successful of these theories, based on the computational theory of temporal-difference learning (TD learning) (Sutton, 1988; Sutton and Barto, 1998). According to this theory, we continuously predict the expected future reward value available from our present situation. Whenever we receive new information about the world we re-evaluate our situation, reacting to good news by revising our reward prediction upwards and reacting to bad news by revising it downwards. This mental re-evaluation triggers what is known as a "prediction-error" or "TD-error" signal, reflecting the difference between the new predicted value and the old predicted value (the "temporal difference"). Dopamine neuron activity is hypothesized to represent this prediction-error signal. Specifically, if new information indicates that the situation is better than originally predicted, then it triggers a positive prediction-error and dopamine neurons should burst; if the situation is exactly as good as predicted, then there is zero prediction-error and dopamine neurons should not respond; and if the situation takes a turn for the worse, it triggers a negative prediction-error and dopamine neurons should pause (Figure 1.1). Thus dopamine activity serves as the "first derivative" of value, signaling how the value of a situation changes over time (Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997).
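To illustrate the arithmetic of this account, here is a minimal sketch (my own toy example, not the thesis's model) of the prediction errors expected in a trial like those of Figure 1.1, assuming a binary big/small reward and no discounting within the trial:

    def td_errors(p_big, reward_big=1.0, reward_small=0.2, got_big=True):
        """TD errors at cue onset and at reward delivery.

        p_big is the probability of a big reward signaled by the cue
        (1.0, 0.5, or 0.0); before the cue, both sizes are equally likely.
        """
        value_before_cue = 0.5 * reward_big + 0.5 * reward_small
        value_after_cue = p_big * reward_big + (1.0 - p_big) * reward_small
        outcome = reward_big if got_big else reward_small
        delta_cue = value_after_cue - value_before_cue   # error when the cue appears
        delta_outcome = outcome - value_after_cue        # error when the reward arrives
        return delta_cue, delta_outcome

    print(td_errors(p_big=1.0, got_big=True))   # (+0.4, 0.0): burst at cue, no outcome response
    print(td_errors(p_big=0.5, got_big=True))   # (0.0, +0.4): no cue response, burst at big reward
    print(td_errors(p_big=0.5, got_big=False))  # (0.0, -0.4): no cue response, pause at small reward

The signed errors reproduce the qualitative pattern in Figure 1.1: informative cues shift the response to the time of the cue, while uninformative cues shift it to the outcome.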

Figure 1.1. A dopamine neuron following the prediction-error theory. A monkey was presented with a visual cue, and 2.25 seconds later a big or a small water reward was delivered. Different cues indicated a 100%, 50%, or 0% probability of a big reward. This neuron was recorded for the study described in Chapter 2. Each plot shows the dopamine neuron's activity for a particular cue and reward outcome. Top: the neuron's mean firing rate at each time during the trial. Bottom: the neuron's spikes on the first 36 trials recorded in each condition. Each row is a trial, and each dot is a spike. When the cues indicated the upcoming reward size, the neuron fired a burst of spikes when the monkey predicted a big reward and paused when the monkey predicted a small reward (top two panels). The neuron then had little response to the predictable reward outcome. When the cues were uninformative about the upcoming reward size (50% cue), the neuron had little response to the cue but then fired a burst of spikes for an unpredicted big reward and paused for an unpredicted small reward. Thus, this neuron signaled changes in the animal's prediction about the trial's reward value.

This theory provides a simple explanation for a wide variety of findings about dopamine neuron activity (Schultz et al., 1997; Schultz, 1998). It explains why dopamine neuron responses to a reward are relative to the reward's predicted size; when two different reward sizes are potentially available, neurons burst in response to the large reward and pause in response to the small reward (Figure 1.1, bottom right). If a reward has the same value as originally predicted, dopamine neurons have little or no response (Figure 1.1, top right). Similarly, when two different reward-indicating cues could potentially be shown, neurons burst for the big-reward cue and pause for the small-reward cue (Figure 1.1, top left). In general, responses to reward-indicating cues are positively related to the cued reward value, while responses to the later reward outcome are negatively related to the cued value, as if the neurons were computing the prediction-error, Actual Value - Predicted Value (Fiorillo et al., 2003; Satoh et al., 2003; Morris et al., 2004; Nakahara et al., 2004). This neural prediction of reward value tracks behavioral measures of reward prediction during a large number of experimental situations, including learning and extinguishing cue-reward relationships (Schultz et al., 1993; Hollerman and Schultz, 1998; Takikawa et al., 2004; Day et al., 2007), estimating the reward delivery time (Fiorillo et al., 2008; Kobayashi and Schultz, 2008), judging the reward probability using complex cue-combination rules (Waelti et al., 2001; Tobler et al., 2003), and estimating the reward value on the current trial based on the outcomes of previous trials (Satoh et al., 2003; Nakahara et al., 2004; Bayer and Glimcher, 2005).

The TD learning theory provides a very good account of reward-related dopamine neuron activity, good enough that in this thesis I will use the theory as the major framework for describing my results. However, I should emphasize that this theory is not a perfect or ultimate description of dopamine neuron activity. Indeed, the most important results in Chapters 2 and 3 are strikingly unexpected from a TD learning point of view. Therefore, it is wise to bear in mind the limitations of this framework. In particular, although there is a good match between theory and experiment in the realm of rewards, the match is less good for neutral and aversive stimuli. Dopamine neurons can respond to neutral but surprising or startling stimuli such as bright lights and loud tones (Steinfels et al., 1983; Strecker and Jacobs, 1985; Horvitz et al., 1997; Horvitz, 2000; Redgrave and Gurney, 2006). The function of this activity is unknown. In addition, some dopamine neurons respond to aversive events by being strongly inhibited (as expected from the theory), but other dopamine neurons are strongly excited (Trulson and Preussler, 1984; Guarraci and Kapp, 1999; Ungless et al., 2004; Joshua et al., 2008; Brischoux et al., 2009; Matsumoto and Hikosaka, 2009). In this thesis I cannot tell how many of the neurons I studied belong to this aversive-excited subpopulation, because the neurons were only tested with reward-related events. A second limitation of the theory is that it is only designed to account for the mean pattern of dopamine neuron activity, as if all dopamine neurons carried the same signal. Several studies have reported that there is more neuron-to-neuron variability in response patterns than this simple account would suggest (Schultz, 1986; Nishino et al., 1987; Magarinos-Ascone et al., 1992; Kiyatkin and Rebec, 1998; Ravel and Richmond, 2006; Wilson and Bowman, 2006).

Finally, an important limitation of dopamine neuron recording in behaving animals is that there is no definitive test to tell whether a recorded neuron truly releases the neurotransmitter dopamine. In this thesis I analyzed neurons which met standard electrophysiological criteria thought to be good indicators that a neuron is dopaminergic (Schultz, 1986) and which also carried positive reward signals like those that have been reported in previous studies (Schultz, 1998). So when I say "dopamine neurons did X" it should be strictly interpreted as "reward-signaling presumed dopamine neurons did X." There is evidence that the standard criteria are not perfect: they may exclude certain subpopulations of dopamine neurons, and may sometimes include nearby GABA-releasing interneurons (Ungless et al., 2004; Margolis et al., 2006; Margolis et al., 2008; Lammel et al., 2008). However, the major findings in this thesis are not based on any single neuron, but on the predominant patterns of activity seen in the population as a whole. Thus the new findings in this thesis can be interpreted as the typical response properties of the majority of reward-responsive dopamine neurons.

1.3 Reward-predictive signals of lateral habenula neurons

Given that dopamine neurons have an important role in reward-seeking, it is crucial to discover how dopamine neurons compute their reward-predictive signals. Dopamine neurons receive input from many brain areas, including the pedunculopontine tegmental nucleus, dorsal and ventral striatum, ventral pallidum, central amygdala, dorsal raphe, lateral hypothalamus, and lateral habenula (Schultz, 1998; Fudge and Haber, 2000; Geisler et al., 2007). In this thesis I will analyze the activity of neurons in one of these areas, the lateral habenula, which is an especially strong candidate for sending value signals to dopamine neurons.

The lateral habenula is a small nucleus in the habenular complex of the epithalamus, located on the dorsal surface of the thalamus (Parent et al., 1981; Herkenham and Nauta, 1977, 1979). The habenula sends its outputs through the fasciculus retroflexus, a bundle of fibers which travels anteriorly and ventrally before branching off into different regions of the brain, including the VTA and SNc. The full mixture of neurotransmitters released by this pathway is unknown, although at least some of the projection neurons are glutamatergic (Geisler et al., 2007). The lateral habenula receives inputs from a number of brain regions, including reciprocal connections from VTA dopamine neurons (Oades and Halliday, 1987), but receives its most prominent anatomical input from the lateral hypothalamus and globus pallidus (Parent et al., 1981; Sutherland, 1982; Hong and Hikosaka, 2008).

How does the lateral habenula influence midbrain dopamine neurons? In an early study in anesthetized rats, Christoph and colleagues demonstrated that electrical stimulation of the lateral habenula caused a powerful short-latency inhibition of more than 85% of SNc and VTA dopamine neurons (Christoph et al., 1986). This inhibition is likely to be disynaptic, such that lateral habenula neurons excite GABAergic interneurons in the midbrain, which in turn inhibit dopamine neurons (Ji and Shepard, 2007). Single-neuron recordings in behaving animals provide strong evidence that lateral habenula neurons make use of this inhibitory influence over dopamine neurons in order to help dopamine neurons compute their prediction-error signals (Matsumoto and Hikosaka, 2007). Lateral habenula and SNc dopamine neurons had mirror-image responses to reward-related task events: whereas SNc dopamine neurons were excited by reward-predicting cues and unexpected rewards, habenula neurons were inhibited; and whereas dopamine neurons were inhibited by reward-omission cues and unexpected omissions, habenula neurons were strongly excited (Figure 1.2). The response to reward-omission cues occurred about 40 milliseconds earlier in habenula neurons than in dopamine neurons, suggesting that habenula excitations could be the source of their negative reward signals. Indeed, the dopamine neurons that were most inhibited by reward-omission cues were also most strongly inhibited by lateral habenula electrical stimulation. Thus, lateral habenula neurons appear to carry a negative prediction-error signal that is likely to be the source of the positive prediction-error signals of dopamine neurons, at least the inhibitory dopamine response to negative events.

Figure 1.2. Lateral habenula neurons carry negative reward signals. A monkey was presented with a visual target (cue). The monkey made a saccade to the target, which was followed 200 ms later by a water reward or by no reward (outcome). The position of the target on the screen indicated whether the trial would be rewarded. On 4% of trials the position-reward mapping was reversed and so an unpredicted reward outcome was delivered (e.g. the cue indicated that a reward was coming but instead no reward was delivered). This plot shows the population average activity of reward-responsive lateral habenula and dopamine neurons from the dataset analyzed in Chapter 3. Arrows indicate cue or outcome onset. The firing rate was smoothed with a Gaussian kernel (σ = 15 ms). Lateral habenula neurons were inhibited by reward cues (left) and unpredicted reward deliveries (right); they were excited by no-reward cues (left) and unpredicted reward omissions (right); and they had less response to reward outcomes that were cued and highly predictable (middle). Dopamine neurons carried a mirror-image signal.
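For readers who want to reproduce this style of firing-rate trace, here is a minimal sketch (an illustration under my own assumptions, not the thesis's analysis code) of smoothing a spike train with a Gaussian kernel of σ = 15 ms:

    import numpy as np

    def smoothed_rate(spike_times, t_start, t_stop, sigma=0.015, dt=0.001):
        """Estimate firing rate (spikes/s) by convolving 1-ms binned spike
        counts with a Gaussian kernel normalized to integrate to 1."""
        edges = np.arange(t_start, t_stop + dt, dt)
        counts, _ = np.histogram(spike_times, bins=edges)
        k_t = np.arange(-4 * sigma, 4 * sigma + dt, dt)
        kernel = np.exp(-0.5 * (k_t / sigma) ** 2)
        kernel /= kernel.sum() * dt  # unit area, so each spike contributes a bump integrating to 1
        return np.convolve(counts, kernel, mode="same")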

1.4 Summary

Humans and animals prefer to view informative cues that indicate the value of an upcoming reward in advance of its delivery. Midbrain dopamine neurons signal changes in an animal's expectation of food and water rewards. Their food- and water-predictive activity may originate in lateral habenula neurons, which carry a sign-reversed signal. Dopamine release is thought to act as a reinforcer that teaches the brain which actions are good and should be repeated and which are bad and should be avoided. If dopamine neurons truly have such a fundamental role in reward learning, then they should not only signal the value of basic rewards but should also signal the value of advance information about those rewards. The information-predictive signals of dopamine neurons would then be a major candidate for creating the informational preferences seen in behavior.

CHAPTER 2

Midbrain dopamine neurons signal preference for advance information about upcoming rewards

SUMMARY

The desire to know what the future holds is a powerful motivator in everyday life, but it is unknown how this desire is created by neurons in the brain. Here we show that when macaque monkeys are offered a water reward of variable magnitude, they seek advance information about its size. Furthermore, the same midbrain dopamine neurons that signal the expected amount of water also signal the expectation of information, in a manner that is correlated with the strength of the animal's preference. Our data show that single dopamine neurons process both primitive and cognitive rewards, and suggest that current theories of reward-seeking must be revised to include information-seeking.[3]

[3] The major portion of this chapter is from the paper "Midbrain dopamine neurons signal preference for advance information about upcoming rewards" by Ethan S. Bromberg-Martin and Okihide Hikosaka, currently accepted by Neuron.

2.1 INTRODUCTION

Dopamine-releasing neurons located in the substantia nigra pars compacta and ventral tegmental area are thought to play a crucial role in reward learning (Wise, 2004). Their activity bears a remarkable resemblance to "prediction errors" signaling changes in a situation's expected value (Schultz et al., 1997; Montague et al., 2004). When a reward or reward-predictive cue is more valuable than expected, dopamine neurons fire a burst of spikes; if it has the same value as expected, they have little or no response; and if it is less valuable than expected, they are briefly inhibited. Based on these findings, many theories invoke dopamine neuron activity to explain human learning and decision-making (Holroyd and Coles, 2002; Montague et al., 2004) and symptoms of neurological disorders (Redish, 2004; Frank et al., 2004), inspired by the idea that these neurons could encode the full range of rewarding experiences, from the primitive to the sublime. However, their activity has almost exclusively been studied for basic forms of reward such as food and water. It is unknown whether the same neurons that process these basic, primitive rewards are involved in processing more abstract, cognitive rewards (Schultz, 2000).

We therefore chose to study a form of cognitive reward that is shared between humans and animals. When people anticipate the possibility of a large future gain (such as an exciting new job, a generous raise, or having their research published in a prestigious scientific journal) they do not like to be held in suspense about their future fate. They want to find out now. In other words, even when people cannot take any action to influence the final outcome, they often prefer to receive advance information about upcoming rewards. Related concepts have been arrived at independently in several fields of study. Economists have studied "temporal resolution of uncertainty" (Kreps and Porteus, 1978), and have shown that humans often prefer their uncertainty to be resolved earlier rather than later (Chew and Ho, 1994; Ahlbrecht and Weber, 1996; Eliaz and Schotter, 2007; Luhmann et al., 2008). Experimental psychologists have studied "observing behavior" (Wyckoff, 1952), and have shown that a class of observing behavior that produces reward-predictive cues can be a powerful motivator for rats, pigeons, and humans (Wyckoff, 1952; Prokasy, 1956; Daly, 1992; Lieberman et al., 1997).

To date, however, there has not been a rigorous test of this preference in non-human primates, the animals in which the reward-predicting activity of dopamine neurons has been best described (Schultz, 2000; Schultz et al., 1997; Montague et al., 2004). Unfortunately, previous non-human primate studies required animals to work for rewards by making a costly physical response, so that information about when rewards were available allowed animals to save considerable physical effort (Kelleher, 1958; Steiner, 1967; Steiner, 1970; Lieberman, 1972; Woods and Winger, 2002). Other studies used an unbalanced design in which animals pulled a lever to view informative cues but were not offered a control lever for viewing uninformative cues, leaving it ambiguous whether animals pulled the lever because the cues conveyed information or because of some other stimulus property (Steiner, 1967; Schrier et al., 1980). To perform a rigorous test, it is best to use a symmetrical choice procedure in which both informative and uninformative options are available, in which they are both selected using the same physical response, and in which both are matched for the timing and physical properties of both cue stimuli and rewards (Daly, 1992; Roper and Zentall, 1999).

To this end, we developed a simple decision task allowing rhesus macaque monkeys to choose whether to receive advance information about the size of an upcoming water reward. We found that monkeys expressed a strong behavioral preference, preferring information to its absence and preferring to receive the information as soon as possible. Furthermore, midbrain dopamine neurons which signaled the monkeys' expectation of water rewards also signaled the expectation of advance information, in a manner that was correlated with the animals' preference. These results show that the dopaminergic reward system processes both primitive and cognitive rewards, and suggest that current theories of reward-seeking must be revised to include information-seeking.

2.2 RESULTS

Behavioral preference for advance information

We trained two monkeys to perform a simple decision task ("information choice task," Figure 2.1A). On each trial two colored targets appeared on the left and right sides of a screen, and the monkey had to choose between them by making a saccadic eye movement. Then, after a delay of a few seconds, the monkey received either a big or a small water reward. The monkey's choice had no effect on the reward size: both reward sizes were always equally probable. However, choosing one of the colored targets produced an informative cue, a cue whose shape indicated the size of the upcoming reward. Choosing the other color produced a random cue, a cue whose shape was randomized and therefore had no meaning. The positions of the targets were randomized on each trial. To familiarize monkeys with the two options, we interleaved choice trials with forced-information trials and forced-random trials, in which only one of the targets was available.

After only a few days of training, both monkeys expressed a strong preference to view informative cues (Figure 2.1B). Monkey Z chose information about 80% of the time, and monkey V's choice rate was even higher, close to 100%. Their preference for advance information cannot be explained by a difference in the amount of water reward, because information did not allow monkeys to obtain extra water from the reward-delivery apparatus and had little effect on whether they completed a trial successfully (< 2% error rate for each target, Figure 2.2).

Figure 2.1. Behavioral preference for advance information. (A) Information choice task. Fractions represent probabilities of different trial types. (B) Percent choice of information for each monkey. Each dot represents a single day of training. The mean number of choice trials per session was 152 for monkey V (range: 71-203) and 161 for monkey Z (range: 39-285). The gray region is the Clopper-Pearson 95% confidence interval for each day.

Figure 2.2. Control for water volume of the informative and random options. (A) Each monkey was tested in six behavioral sessions, alternating between days when only random cues were available ("random sessions") and days when only informative cues were available ("information sessions"). Each dot represents the amount of water delivered during a single session, expressed as a percentage of the theoretical amount that should have been delivered if monkeys were unable to influence the water volume (circles: monkey V; squares: monkey Z). Inset: the difference in water delivered between the two types of sessions was close to zero (unpaired t-test, P = 0.28, 95% CI = -3.2% to +1.0%). (B) Error rates are shown for each trial type, combining all behavioral sessions except the early learning stage (the first eight sessions for each monkey). Both monkeys made slightly more errors on random trials than information trials, compatible with the notion that monkeys were less motivated to complete random trials because they were the least desirable trial type in the task (a phenomenon seen in many other studies, e.g. Shidara and Richmond, 2002; Lauwereyns et al., 2002a; Roesch and Olson, 2004; Kobayashi et al., 2006). However, the overall error rate was very low (< 2%). A 2% difference in reward rate is far too small to explain the strong behavioral preferences we observed.
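As an aside for readers reproducing Figure 2.1B, the Clopper-Pearson interval for a day's choice percentage can be computed from beta-distribution quantiles; a brief sketch (my own illustration, with made-up counts):

    from scipy.stats import beta

    def clopper_pearson(k, n, alpha=0.05):
        """Exact (Clopper-Pearson) confidence interval for k successes in n trials."""
        lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
        return lo, hi

    # e.g. a hypothetical day with 128 information choices out of 160 choice trials:
    print(clopper_pearson(128, 160))  # about (0.73, 0.86)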

An important concern is that advance information might have allowed monkeys to extract a greater amount of subjective value from the water reward by physically preparing for its delivery, for instance by tensing their cheek muscles to swish water around in their mouths in a more pleasurable fashion (Perkins, 1955). We therefore introduced a second task that equalized the opportunity for simple physical preparation (Mitchell et al., 1965) ("information delay task," Figure 2.3A). Monkeys again chose between informative and random cues, but afterward a second cue appeared that was always informative on every trial. Thus, information was always available well in advance of reward delivery; the choice was between receiving the information immediately, or after a delay.

Soon after being exposed to the new task, both monkeys expressed a clear preference for immediate information, comparable to their preference in the original task (Figure 2.3B). We then reversed the relationship between cue colors and information content, and monkeys switched their choices to the newly informative color (Figure 2.3B). We conclude that monkeys treated information about rewards as if it was a reward in itself, preferring information to its absence and preferring to receive it as soon as possible.

Figure 2.3. Behavioral preference for immediate delivery of information. (A) Information delay task. The fixation point and target configurations (not shown here) were the same as in the information choice task shown in Figure 2.1A. (B) Percent choice of immediate information. Conventions as in Figure 2.1B. The vertical line labeled "reversal" marks the time when the informative and random cue colors were switched. The mean number of choice trials per session was 151 for monkey V (range: 50-222) and 111 for monkey Z (range: 35-176). The behavioral preference started below 50% because the cue colors were re-used from a pilot experiment; the informative color had been previously trained as random, and vice versa (data not shown).

Dopamine neurons signal advance information

To understand the neural basis of the rewarding value of information, we recorded the activity of 47 presumed midbrain dopamine neurons while monkeys performed the information choice task shown in Figure 2.1. As in previous studies, we focused on neurons that were presumed to be dopaminergic based on standard electrophysiological criteria and that signaled the value of water rewards (henceforth referred to as "dopamine neurons") (Methods). Figure 2.4 shows an example neuron that carried a strong water reward signal. On trials when the monkey viewed informative cues, the neuron was phasically excited by the cue indicating a large reward and inhibited by the cue indicating a small reward. In contrast, on trials when the monkey was forced to view uninformative random cues, the neuron had little response to the cues but was strongly responsive to the later reward outcome, excited when the reward was large and inhibited when it was small. Thus, consistent with previous studies, this neuron signaled changes in the monkey's expectation of water rewards.

The same neuron also responded to the targets indicating the availability of information. On forced trials when only one target was available, the neuron was excited by the informative-cue target and inhibited by the random-cue target. On choice trials when both targets were available, the monkey always chose to receive information, and the neuron responded much as it did when the informative-cue target was presented alone. Thus, this dopamine neuron signaled changes in both the expectation of water and the expectation of information.

Figure 2.4. Dopamine neurons signal information. Top: firing rate of an example neuron. Trials are sorted separately for each task event, as follows. Target: forced-information (red), choice-information (pink), forced-random (blue). Cue: informative cues (red) indicating that the reward is big (solid) or small (dashed); random cues (blue) with the same shape as informative cues for big (solid, cross shape) or small (dashed, wave shape) rewards. Reward: informative (red) are the same trials as for the cue response; random (blue) are trials where the reward was big (solid) or small (dashed). The firing rate was smoothed with a Gaussian kernel, σ = 20 ms. Bottom: rasters for individual trials. Each row is a trial, and each dot is a spike. Colors are the same as in the firing rate display, except that dark colors correspond to dashed lines.

This pattern of responses was quite common in dopamine neurons. We measured each neuron's discrimination between targets, cues, and rewards using the area under the receiver operating characteristic (Figure 2.5B-D, Methods). This measure ranges from 0.5 at chance levels to 0.0 or 1.0 for perfect discrimination. As in the example, neurons discriminated strongly between informative reward-predicting cues and between randomly-sized rewards, but only weakly between uninformative random cues and between fully predictable rewards (Figure 2.5C,D). The same neurons also discriminated between the targets, with clear preferential activation by the target that predicted advance information (Figure 2.5B). The discrimination was highly similar when measured using either forced-information or choice-information trials in independent data sets (rho = 0.68, P < 10^-4; Methods), indicating that the neural preference for information was reproducible and consistent across different stimulus configurations. The same pattern occurred in both monkeys (Figure 2.6) and could be seen in the population average firing rate (Figure 2.5A).
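The ROC-area measure has a simple interpretation: it is the probability that a spike count drawn from one condition exceeds a count drawn from the other, counting ties as half. A minimal sketch (my own illustration, not the thesis's analysis code):

    import numpy as np

    def roc_area(counts_a, counts_b):
        """Probability that a value from counts_a exceeds one from counts_b,
        with ties counted as 0.5; equals the area under the ROC curve."""
        a = np.asarray(counts_a, dtype=float)[:, None]
        b = np.asarray(counts_b, dtype=float)[None, :]
        return (a > b).mean() + 0.5 * (a == b).mean()

    # e.g. hypothetical spike counts on informative-target vs. forced-random trials:
    print(roc_area([12, 15, 9, 14], [7, 10, 8, 6]))  # ~0.94: prefers the informative target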

    Figure 2.5. Analysis of the dopamine neuron population (A) Population average firing rate. Conventions as in Figure 2.4. Gray bars indicate the time windows used for the ROC analysis. Colored bars indicate time points with a significant difference between selected pairs of task conditions (P < 0.01, Wilcoxon signed rank test), as follows. Target: force-info vs. force-rand (red), choice-info vs. force-rand (pink); Cue: info-big vs. info-small (red), rand-cross vs. rand-wave (blue); Reward: info-big vs. info-small (red), rand-big vs. rand-small (blue).


(B-D) Neural discrimination between task conditions in response to the targets (B), cues (C), and rewards (D). Each dot's (x,y) coordinates represent a single neuron's ROC area for discriminating between the pairs of task conditions listed on the x and y axes. A discrimination of 1 indicates perfect preference for the condition listed next to 1 (e.g. Choice info); a discrimination of 0 indicates perfect preference for the condition listed next to 0 (e.g. Force rand). Note that in (B) the x and y coordinates were both calculated using the same set of forced-random trials. Colored dots indicate neurons with significant discrimination between the conditions listed on the y-axis (red), x-axis (blue), or both axes (magenta) (P < 0.05, Wilcoxon rank-sum test).

    There was also a tendency for neurons to have a weak initial excitation for each task

event (Figure 2.5A). This nonspecific response is probably due to the animal's initial

    uncertainty about the stimulus identity (Kakade and Dayan, 2002; Day et al., 2007) or

    stimulus timing (Fiorillo et al., 2008; Kobayashi and Schultz, 2008). We did not observe

    a predominant tendency for neurons to have anticipatory tonic increases in activity before

    the delivery of probabilistic rewards, a phenomenon that has been reported in one study

    (Fiorillo et al., 2003) but not others (Satoh et al., 2003; Morris et al., 2006; Bayer and

    Glimcher, 2005; Matsumoto and Hikosaka, 2007; Joshua et al., 2008). This may be due

    to differences in task design such as the size of the reward or the manner in which the

    reward was signaled (Fiorillo et al., 2003).

    An important question is whether dopamine neurons signal the presence of

    information per se, or whether they truly signal how much it is preferred. In the latter

    case, there should be a correlation between the neural preference for information,

    expressed as the neural discrimination between the informative-cue target and the

    random-cue target, and the behavioral preference for information, expressed as a choice

    percentage. Such correlations were indeed present, both between-monkeys and within-

    monkey. Between-monkeys, monkey V expressed a stronger behavioral preference for

    information than monkey Z (Figure 2.1B), and also expressed a stronger neural


    preference (P = 0.02, Figure 2.6A). Within-monkey, during the sessions in which

monkey Z's behavioral preference was strongest, the neural preference was enhanced

    (rho = 0.44, P = 0.02, Figure 2.6D). On the other hand, behavioral preferences for

    information were not significantly correlated with neural discrimination between water-

    related cues or water rewards (all P > 0.25, Figure 2.6B,C,E,F). Thus, consistent with

    evidence that dopamine neurons signal the subjective value of liquid rewards (Morris et

    al., 2006; Roesch et al., 2007; Kobayashi and Schultz, 2008), they may also signal the

    subjective value of information.

    Figure 2.6. Correlation between neural discrimination and behavioral preference (A) Histogram of single-neuron target response discrimination between all informative trials (choice and forced trials combined) versus forced-random trials, separately for monkey V (left) and monkey Z (right). Arrows, numbers, and horizontal lines indicate the


mean discrimination, and the width of the arrows represents the 95% bootstrap confidence interval. Red indicates statistical significance. (B,C) Same as (A), for discrimination between informative big-reward and small-reward cues (B) or between random big and small rewards (C). (D) Plot of behavioral choice percentage against single-neuron discrimination between all informative trials versus forced-random trials in response to the target. The line was fitted by least-squares regression. Text shows Spearman's rank correlation (rho), and red indicates statistical significance. The data is from monkey Z only, because monkey V almost exclusively chose the informative target and therefore had no behavioral variability. (E-F) Same as (D), but for discrimination between informative big-reward and small-reward cues (E) or between random big and small rewards (F).

    2.3 DISCUSSION

    Here we have shown that macaque monkeys prefer to receive advance information

    about future rewards, and that their behavioral preference is paralleled by the neural

    preference of midbrain dopamine neurons. Thus, the same dopamine neurons that signal

    primitive rewards like food and water also signal the cognitive reward of advance

    information.

    Monkeys expressed a strong preference for advance information even though it had

    no effect on the final reward outcome. This is consistent with the intuitive belief that, all

    things being equal, it is better to seek knowledge than to seek ignorance. It also provides

    an explanation for the puzzling fact that the brain devotes a great deal of neural effort to

    processing reward information even when this is not required to perform the task at hand.

    For example, many studies use passive classical conditioning tasks in which informative

    cues are followed by rewards with no requirement for the subject to take any action. In

    these tasks the brain could simply ignore the cues and wait passively for rewards to

    arrive. Yet even after extensive training, many neurons continue to use the cue

    information to predict the size, probability, and timing of reward delivery (e.g. (Tobler et

    al., 2003; Joshua et al., 2008)). In other tasks, neurons persist in predicting rewards even


    when the act of prediction is harmful, causing maladaptive behavior that interferes with

    reward consumption (e.g. refusing to perform trials with low predicted value (Shidara and

    Richmond, 2002; Lauwereyns et al., 2002a)). These observations suggest that the act of

    prediction has a special status, an intrinsic motivational or rewarding value of its own.

    Our data provides strong evidence for this hypothesis. When given an explicit choice,

    monkeys actively sought out the advance information that was necessary to make

    accurate reward predictions at the earliest possible opportunity.

    A limitation of our study is that it does not determine the precise psychological

    mechanism by which value is assigned to information. There are several possibilities.

    Theories from experimental psychology suggest that in our task the value of viewing

    informative cues would simply be the sum of the conditioned reinforcement generated by

    the individual big-reward and small-reward cues. In this view, the preference for

    information implies that the conditioned reinforcement is weighted nonlinearly, so that

    the benefit of strong reinforcement from the big-reward cue outweighs the drawback of

    weak reinforcement from the small-reward cue (Wyckoff, 1959; Fantino, 1977;

    Dinsmoor, 1983), akin to the nonlinear weighting of rewards that produces risk seeking

    (von Neumann and Morgenstern, 1944). On the other hand, theories in economics

    suggest that preference is not due to independent contributions of individual cues but

    instead comes from considering the full probability distribution of future events. In this

    view, information-seeking is due to an explicit preference for early resolution of

    uncertainty (Kreps and Porteus, 1978) or an implicit preference induced by psychological

    factors such as anticipatory emotions (Caplin and Leahy, 2001). In addition, just as the

    value assigned to conventional forms of reward (e.g. food) depends on the internal state


    of the subject (e.g. hunger), the value assigned to information is likely to depend on

    psychological factors such as personality (Miller, 1987), emotions like hope and anxiety

    (Chew and Ho, 1994; Wu, 1999) and attitudes toward uncertainty (Lovallo and

    Kahneman, 2000; Platt and Huettel, 2008).
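To make the conditioned-reinforcement account above concrete, here is a minimal numeric sketch (our own illustration; the weighting function and all values are hypothetical, not fit to data) of how convex weighting alone can favor the informative target even though both targets deliver the same expected amount of water:

    # Illustrative sketch: convex weighting of conditioned reinforcement
    # can make the informative target preferred (all numbers hypothetical).

    def weight(v, gamma=2.0):
        """Hypothetical convex weighting of a cue's conditioned value."""
        return v ** gamma

    v_big, v_small = 1.0, 0.1  # conditioned cue values, arbitrary units

    # Informative target: the big- and small-reward cues each appear on
    # half of trials, and each is weighted separately before averaging.
    value_informative = 0.5 * weight(v_big) + 0.5 * weight(v_small)

    # Random target: every cue predicts the same 50/50 gamble, so it
    # carries the plain average value, weighted as a whole.
    value_random = weight(0.5 * v_big + 0.5 * v_small)

    print(value_informative, value_random)  # 0.505 vs. 0.3025

With gamma > 1 the strong reinforcement from the big-reward cue outweighs the weak reinforcement from the small-reward cue, reproducing information-seeking; with gamma = 1 the two targets are valued equally.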

    Implications of information-seeking for attitudes toward uncertainty

    In the framework of decision-making under uncertainty, advance information

    reduces the amount of reward uncertainty by narrowing down the set of potential reward

    outcomes. Our data therefore suggests that in our task rhesus macaque monkeys preferred

    to reduce their reward uncertainty at the earliest possible moment, as though the

    experience of uncertainty was aversive.

    Interestingly, several previous studies using similar saccadic decision tasks came to a

    seemingly opposite conclusion: macaque monkeys appeared to prefer uncertainty,

    choosing an uncertain, variable-size reward instead of a certain, fixed-size reward

    (McCoy and Platt, 2005; Platt and Huettel, 2008). How can these results be reconciled?

    One possibility is that they can be explained by a common principle; for instance,

    perhaps monkeys treat the offer of a variable-size reward as a source of uncertainty to be

    confronted and resolved. An important point, however, is that a preference for reward

variance can be caused by factors unrelated to uncertainty; most notably, it can be

    caused by an explicit preference over the probability distribution of reward outcomes, for

    instance due to disproportionate salience of large rewards (Hayden and Platt, 2007) or a

    nonlinear utility function (Platt and Huettel, 2008). In contrast, the choice of information

    has no influence on the reward outcome; it only affects the amount of time spent in a


    state of uncertainty before the reward outcome is revealed. In this sense the preference

    for advance information is a relatively pure measurement of attitudes toward uncertainty.

    Information signals in the dopaminergic reward system

    Dopamine neuron activity is thought to teach the brain to seek basic goals like food

    and water, reinforcing and punishing actions by adjusting synaptic connections between

    neurons in cortical and subcortical brain structures (Wise, 2004; Montague et al., 2004).

    Our data suggests that the same neural system also teaches the brain to seek advance

    information, selectively reinforcing actions that lead to knowledge about rewards in the

    future. Thus, the behavioral preference for information could be created by the

    dopaminergic reward system. At the neural level, neurons which gain sensitivity to

    rewards through a dopamine-mediated reinforcement process would come to represent

    both rewards and advance information about those rewards in a common currency,

    particularly neurons involved in reward timing, conditioned reinforcement, and decision-

    making under risk (Kim et al., 2008; Seo and Lee, 2009; Platt and Huettel, 2008). In turn,

    these signals could ultimately feed back to dopamine neurons to influence their value

    signals.

    An important goal for future research will therefore be to discover how dopamine

    neurons measure information and assign its rewarding value. One possibility is that

    dopamine neurons receive information-related input from specialized brain areas, distinct

    from those that compute the value of traditional rewards like food and water. Indeed,

    signals encoding the amount and timing of reward information, and dissociated from

    preference coding of traditional rewards, have been found in several cortical areas


    (Nakamura, 2006; Behrens et al., 2007; Luhmann et al., 2008). How these information

    signals could be translated into a behavioral preference, and whether they are

    communicated to dopamine neurons, is unknown.

    Another possibility is that dopamine neurons receive information signals from the

    same brain areas that contribute to their food- and water-related signals, such as the

    lateral habenula (Matsumoto and Hikosaka, 2007). In this case, dopamine neurons would

    receive a highly processed input, with different forms of rewards already converted into a

    common currency by upstream brain areas. We are currently testing this possibility in

    further experiments.

    Why do dopamine neurons treat information as a reward?

    The preference for advance information, despite its intuitive appeal, is not predicted

    by current computational models of dopamine neuron function (Schultz et al., 1997;

    Montague et al., 2004), which are widely viewed as highly efficient algorithms for

    reinforcement learning. This raises an important question: could the information-

    predictive activity of dopamine neurons be a harmful bug that impairs the efficiency of

    reward learning? Or is it a useful feature that improves over existing computational

    models? Here we present our hypothesis that the positive value of advance information is

    a feature with a fundamental role in reinforcement learning.

    Specifically, modern theories of reinforcement learning recognize that animals learn

    from two types of reinforcement: primary reinforcement generated by rewards

    themselves, and secondary reinforcement generated predictively, by observing sensory

    cues in advance of reward delivery. Predictive reinforcement greatly enhances the speed


    and reliability of learning, as demonstrated most strikingly by temporal-difference

learning algorithms (Sutton and Barto, 1998), which have produced influential accounts of

    animal behavior (Sutton and Barto, 1981) and dopamine neuron activity (Schultz et al.,

    1997; Montague et al., 2004). This implies that animals should treat predictive

    reinforcement as an object of desire, making an active effort to seek out environments

    where reward-predictive sensory cues are plentiful. If an animal was trapped in an

    impoverished environment where reward-predictive cues were unavailable, the

    consequences would be devastating: the animal's sophisticated predictive reinforcement

    learning algorithms would be reduced to impotence. This can be seen clearly in our

    dopamine neuron data (Figure 2.5A). When an action produces informative cues,

    dopamine neurons signal its value immediately, a predictive reinforcement signal; but

    when an action produces uninformative cues, dopamine neurons must wait to signal its

    value until the reward outcome arrives, acting as little more than a primitive reward

    detector. Thus, predictive reinforcement depends entirely on obtaining advance

    information about upcoming rewards.

    In light of these considerations, we propose that any learning system driven by the

    engine of predictive reinforcement must actively seek out its fuel of advance

    information. In this view, current models of neural reinforcement learning present a

    curious paradox: their learning algorithms are vitally dependent on advance information,

    but they treat information as valueless and make no effort to obtain it. These models do

    include a form of knowledge-seeking by exploring unfamiliar actions, but they make no

    effort to obtain informative cues that would maximize learning from these new

experiences. In fact, models using the popular TD(λ) algorithm (Sutton and Barto, 1998)


are actually averse to advance information (Note 2.A). Our data shows that a new class of models is necessary, one that assigns information a positive value, perhaps representing the future reward the animal expects to receive as a result of obtaining better fuel for its

    learning algorithm. This would be akin to the concept of intrinsically motivated

    reinforcement learning (Barto et al., 2004), in that dopamine neurons would assign an

    intrinsic value to information because it could help the animal learn to better predict and

    control its environment (Barto et al., 2004; Redgrave and Gurney, 2006). Also, although

    dopamine neurons have been best studied in the realm of rewards, they can also respond

    to salient non-rewarding stimuli (Horvitz, 2000; Redgrave and Gurney, 2006; Joshua et

    al., 2008; Matsumoto and Hikosaka, 2009). This suggests that dopamine neurons might

be able to signal the value of information about neutral and punishing events (Herry et al., 2007; Badia et al., 1979; Fanselow, 1979; Tsuda et al., 1989), as part of a more

    general role in motivating animals to learn about the world around them.

    2.4 METHODS

    Subjects. Subjects were two male rhesus macaque monkeys (Macaca mulatta), monkey

    V (9.3 kg) and monkey Z (8.7 kg). All procedures for animal care and experimentation

were approved by the Institutional Animal Care and Use Committee and complied with the

    Public Health Service Policy on the humane care and use of laboratory animals. A plastic

    head holder, scleral search coils, and plastic recording chambers were implanted under

    general anesthesia and sterile surgical conditions.


    Behavioral Tasks. Behavioral tasks were under the control of the REX program (Hays et

    al., 1982) adapted for the QNX operating system. Monkeys sat in a primate chair, facing

a frontoparallel screen 31 cm from the monkey's eyes in a sound-attenuated and

    electrically shielded room. Eye movements were monitored using a scleral search coil

    system with 1 ms resolution. Stimuli generated by an active matrix liquid crystal display

    projector (PJ550, ViewSonic) were rear-projected on the screen.

    In the information choice task (Figure 2.1), each trial began with the appearance of a

central spot of light (1° diameter), which the monkey was required to fixate. After 800

    ms, the spot disappeared and two colored targets appeared on the left and right sides of

the screen (2.5° diameter, 10-15° eccentricity). (On forced-information and forced-

    random trials, only a single target appeared). The monkey had 710 ms to saccade to and

    fixate the chosen target, after which the non-chosen target immediately disappeared. At

the end of the 710 ms response window, a cue (14° diameter) of the chosen color was presented. For the informative color, the cue was a cross on large-reward trials or a wave

pattern on small-reward trials. For the random color, the cue's shape was chosen

    pseudorandomly on each trial (see below). The colors were green and orange, chosen to

have similar luminance, and were counterbalanced across monkeys. Monkeys were not required

    to fixate the cue. After 2250 ms of display time, the cue disappeared and simultaneously

    a 200 ms tone sounded and reward delivery began. The inter-trial interval was 3850-4850

    ms beginning from the disappearance of the cue. Water rewards were delivered using a

    gravity-based system (Crist Instruments). Reward delivery lasted 50 ms on small-reward

    trials (0.04 ml) and 700 ms (0.88 ml, monkey V) or 825 ms (1.05 ml, monkey Z) on


    large-reward trials. To minimize the effects of physical preparation, licking the water

    spout was not required to obtain rewards; water was delivered directly into the mouth.

    The task proceeded in blocks of 24 trials, each block containing a randomized

sequence of all 3 × 2 × 2 × 2 combinations of choice type (forced-information, forced-

    random, choice), reward size (large, small), random cue shape (cross, waves), and

    informative target location (left, right). Thus, the random cues were actually quasi-

    random and could theoretically yield a small amount of information about reward size,

    but extracting that information would require a very difficult feat of working memory.
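For concreteness, the block structure just described can be sketched as follows (our own illustration; the names are ours, and the actual experiment ran under the REX system rather than Python):

    # Minimal sketch of one 24-trial block as described above
    # (illustrative only; not the original REX task code).
    import itertools
    import random

    choice_types = ["forced-information", "forced-random", "choice"]
    reward_sizes = ["large", "small"]
    random_cue_shapes = ["cross", "waves"]
    info_target_sides = ["left", "right"]

    def make_block(rng=random):
        """Return all 3 x 2 x 2 x 2 = 24 combinations in shuffled order."""
        block = list(itertools.product(choice_types, reward_sizes,
                                       random_cue_shapes, info_target_sides))
        rng.shuffle(block)
        return block

    assert len(make_block()) == 24

Because every combination appears exactly once per block, the random cue shapes are balanced within each block, which is what makes them quasi-random rather than truly random.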

    If monkeys made an error (broke fixation on the central spot, or failed to choose a

    target, or broke fixation on the chosen target before the cue appeared), then the trial

    terminated, an error tone sounded, an additional 3 seconds were added to the inter-trial

    interval, and the trial was repeated (correction trial). If the error occurred after the

    choice, only the chosen target was available on the correction trial.

    The information delay task (Figure 2.3) was identical to the information choice

    task except the cue colors and shapes were different, and a third set of always-

informative gray cues lasting for 1500 ms was appended to the cue period. (There were

    also minor differences in the task parameters for monkey Z: the duration of the first cue

    was 2000 ms, and the big reward volume was ~1.29 ml). The 1500 ms duration of the

    always-informative cue was chosen to allow near-optimal physical preparation for

    rewards. With a shorter cue duration (e.g. < 750 ms), there might not be enough time to

    discriminate the cue and make a physical response (e.g. compare to the latency of

    anticipatory licking in (Tobler et al., 2003)). With a longer cue duration (e.g. > 2

    seconds), physical preparation for reward delivery begins to be impaired by timing errors


    (e.g. compare to the timecourse of anticipatory licking in (Fiorillo et al., 2008; Kobayashi

    and Schultz, 2008)). To perform a reversal (vertical lines in Figure 2.3B), the colors of

    the informative and random cues were switched.

    Neural Recording. Midbrain dopamine neurons were recorded using techniques

    described previously (Matsumoto and Hikosaka, 2007). A recording chamber was placed

    over fronto-parietal cortex, tilted laterally by 35 degrees, and aimed at the substantia

    nigra. The recording sites were determined using a grid system, which allowed recordings

    at 1 mm spacing. Single-neuron recording was performed using tungsten electrodes

    (Frederick Haer) that were inserted through a stainless steel guide tube and advanced by

    an oil-driven micro-manipulator (MO-97A, Narishige). Single neurons were isolated on-

    line using custom voltage-time window discrimination software (the MEX program

    (Hays et al., 1982) adapted for the QNX operating system).

    Neurons were recorded in and around the substantia nigra pars compacta and ventral

    tegmental area. We targeted this region based on anatomical atlases and magnetic

    resonance imaging (4.7T, Bruker). During recording sessions, we identified this region

    based on recording depth and using landmarks including the somatosensory and motor

    thalamus, subthalamic nucleus, substantia nigra pars reticulata, red nucleus, and

    oculomotor nerve. Presumed dopamine neurons were identified by their irregular tonic

    firing at 0.5-10 Hz and broad spike waveforms. We focused our recordings on presumed

    dopamine neurons that responded to the task and appeared to carry positive reward

    signals. Occasional dopamine-like neurons that upon examination showed no differential

    response to the cues and no differential response to the reward outcomes were not


    recorded further. We then analyzed all neurons that were recorded for at least 60 trials

    and that had positive reward discrimination for both informative cues and random

    outcomes, or positive reward discrimination for cues and no discrimination for outcomes,

    or positive reward discrimination for outcomes and no discrimination for cues (P < 0.05,

    Wilcoxon rank-sum test). We were able to examine the response properties of 108

    neurons, 84 of which met our criteria for presumed dopaminergic firing rate, pattern, and

    spike waveform, and 47 of which also met our criteria for trial count and significant

    reward signals. This yielded 20 neurons from monkey V (right hemisphere) and 27

    neurons from monkey Z (left hemisphere) for our analysis.

    Data Analysis. All statistical tests were two-tailed. The neural analysis excluded error

    trials and correction trials. We analyzed neural activity in time windows 150-500 ms after

    target onset (targets), 150-300 ms after cue onset (cues), and 200-450 ms after cue offset

    (rewards). These were chosen to include the major components of the average neural

    response. The neural discrimination between a pair of task conditions was defined as the

    area under the receiver operating characteristic (ROC), which can be interpreted as the

    probability that a randomly chosen single-trial firing rate from the first condition was

    greater than a randomly chosen single-trial firing rate from the second condition (Green

    and Swets, 1966). We observed the same results using other measures of neural

    discrimination such as the signal-to-baseline ratio and signal-to-noise ratio. Confidence

    intervals and significance of the population averages of single-neuron ROC areas (Figure

    2.6A-C) were computed using a bootstrap test with 200,000 resamples (Efron and

    Tibshirani, 1993). Consistent with previous studies of reward coding (Schultz and Romo,


    1990; Kawagoe et al., 2004; Roesch et al., 2007; Matsumoto and Hikosaka, 2007) we

    observed similar neural coding of behavioral preferences for both of the target locations

    on the screen (average ROC area, forced information vs. forced random: ipsilateral =

0.60, P < 10^-4, contralateral = 0.62, P < 10^-4; choice information vs. forced random: ipsilateral = 0.58, P < 10^-4, contralateral = 0.62, P < 10^-4), so for all analyses the data were

    combined. We could not analyze activity on choice-random trials due to their rarity (< 3

    trials for most neurons). All correlations were computed using Spearman's rho (rank

    correlation). To compare neural discrimination measured using either forced-information

    or choice-information trials in independent data sets, we calculated the correlation

    between two values, the discrimination between forced-information trials vs. even-

    numbered forced-random trials, and the discrimination between choice-information trials

vs. odd-numbered forced-random trials (rho = 0.68, P < 10^-4). Significance of

    correlations, and of the difference in mean ROC area between the two monkeys (Figure

    2.6), were computed using permutation tests (200,000 permutations) (Efron and

    Tibshirani, 1993).
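As a concrete illustration, the ROC-area measure defined above can be computed as follows (a minimal sketch of our own, not the original analysis code):

    # Minimal sketch of the ROC-area discrimination measure defined above
    # (our own illustration, not the original analysis pipeline).
    import numpy as np

    def roc_area(rates_a, rates_b):
        """Probability that a randomly chosen single-trial firing rate from
        condition A exceeds one from condition B, counting ties as 1/2
        (the Mann-Whitney U / ROC-area equivalence)."""
        a = np.asarray(rates_a, dtype=float)[:, None]
        b = np.asarray(rates_b, dtype=float)[None, :]
        return ((a > b).sum() + 0.5 * (a == b).sum()) / (a.size * b.size)

    # Example: firing rates (spikes/s) on info-big vs. info-small cue trials.
    print(roc_area([22, 30, 25, 28], [12, 15, 18, 14]))  # 1.0 = perfect
    print(roc_area([20, 21, 19, 22], [20, 21, 19, 22]))  # 0.5 = chance

A value of 0.5 corresponds to chance and 0 or 1 to perfect discrimination, matching the convention used throughout this chapter.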

    2.5 ACKNOWLEDGEMENTS

    We thank M. Matsumoto, G. La Camera, B. J. Richmond, D. Sheinberg, and V. Stuphorn

    for valuable discussions, and G. Tansey, D. Parker, M. Lawson, B. Nagy, J.W.

    McClurkin, A.M. Nichols, T.W. Ruffner, L.P. Jensen, and M.K. Smith for technical

    assistance. This work was supported by the intramural research program at the National

    Eye Institute.


    NOTE 2.A

Aversion to information by reinforcement learning algorithms based on TD(λ)

    It may be surprising that models based on temporal-difference learning (TD

    learning), which is formally indifferent to information, could show any preference in our

    task. However, TD learning is only a method for reward prediction; it does not specify

    how to use that knowledge to take action. When TD learning is coupled to a mechanism

    for action-selection (as is necessary in models of animal behavior), new behavior can

    emerge. Here we will describe the formalism of TD learning (Sutton and Barto, 1998),

    explain a simple case in which it leads to biases in selection between risky and safe

    actions, and extend this reasoning to our case of selection between informative and

    uninformative actions.

TD learning formalism: definition of reward value

    The goal of TD learning is to learn a value function that takes as input the current

    state of the environment s and predicts the amount of expected reward r that will be

    received in the future. We assume that at time t a learning agent observes the state of the

environment s_t, takes an action a_t, and then receives feedback in the form of a reward r_{t+1} and a new state of the environment s_{t+1}. The agent-environment interaction can be

    described as a Markov decision process, which is fully specified by three functions

(π, R, T): the agent's policy for choosing actions π(s,a) = Pr(take action a | in state s), and the environment's mapping from (state, action) pairs to rewards and new states, the reward function R(s,a,r) = Pr(receive reward r | in state s and took action a) and the state transition function T(s,a,s′) = Pr(new state is s′ | in state s and took action a). Given these


specifications, we can derive the full probability distribution over the agent's future experiences starting from the current state s_t and action a_t and leading to any future state s_{t+τ} and reward r_{t+τ}, e.g. written as p_{π,R,T}(s_{t+τ}, r_{t+τ} | s_t, a_t). We can then define the value function V as the expectation of the sum of future rewards, starting from the state s_t at the current moment in time t and considering all future states s_{t+τ}:

V(s_t) = E_{π,R,T} [ Σ_{τ=1}^{∞} r_{t+τ} | s_t ]
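To ground this definition, here is a minimal tabular TD(0) sketch of value learning in this formalism (a generic textbook-style implementation in the spirit of Sutton and Barto (1998); the state names and parameters are illustrative, not a model from this thesis):

    # Minimal tabular TD(0) sketch of the value function defined above
    # (a generic textbook-style implementation; all names illustrative).
    import random

    def td0_episode(V, episode, alpha=0.1):
        """Run one episode of (state, reward) steps through tabular TD(0).

        episode: list of (s_t, r_{t+1}) pairs; the episode ends in a
        terminal state of value 0. The TD error
        r_{t+1} + V(s_{t+1}) - V(s_t) is the quantity often compared
        with phasic dopamine responses.
        """
        steps = episode + [(None, 0.0)]  # append terminal state
        for (s, r), (s_next, _) in zip(episode, steps[1:]):
            td_error = r + V.get(s_next, 0.0) - V.get(s, 0.0)
            V[s] = V.get(s, 0.0) + alpha * td_error
        return V

    # Informative-cue trials: the cue state identifies the outcome, so
    # with learning the TD error migrates from reward time to cue time.
    V = {}
    for _ in range(2000):
        big = random.random() < 0.5
        cue = "cue-big" if big else "cue-small"
        td0_episode(V, [("target-info", 0.0), (cue, 1.0 if big else 0.0)])
    print(V)  # cue-big near 1.0, cue-small near 0.0, target-info near 0.5

After training, the large prediction error occurs at the time of the informative cue rather than at reward delivery, which is the predictive reinforcement signal discussed in the main text.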