Proceedings of the 2007 IEEE Symposium on Foundations of Computational Intelligence (FOCI 2007)

Designing the Reward System: Computational and Biological Principles

Kenji Doya
Okinawa Institute of Science and Technology

Abstract- In the standard framework of optimal control and reinforcement learning, a "cost" or "reward" function is given a priori, and an intelligent agent is supposed to optimize it. In real life, control engineers and machine learning researchers are often faced with the problem of how to design a good objective function to let the agent attain an intended goal efficiently and reliably. Such meta-level tuning of adaptive processes is the major challenge in bringing intelligent agents to real-world applications.

While the development of theoretical frameworks for 'meta-learning' of adaptive agents is an urgent engineering problem, it is an important biological question to ask what kind of meta-learning mechanisms our brain implements to enable robust and flexible control and learning. We are putting together computational, neurobiological, and robotic approaches to attack these dual problems of the computational theory and the biological implementation of meta-learning.

On the computational front, our major strategies have been parallelism and evolution. We derived a parallel learning framework, CLIS (concurrent learning by importance sampling), in which multiple adaptive agents sharing the same behavioral experience (i.e., many brains sharing one body) compete to take charge of actions, but learn concurrently from fellow agents' successes and failures through appropriate weighting. We also developed an embodied evolution scheme in which reward functions for multiple objectives are optimized to match environmental constraints. We are verifying these computational frameworks using a robotic platform called "Cyber Rodents": robots that realize self-preservation by catching battery packs and self-reproduction in software by IR transmission of their genetic codes.
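To make the weighting idea concrete, the following is a minimal sketch, not the published CLIS algorithm: the tabular Q-learning modules, the softmax policies, the env_step interface, and the externally given controller index are all illustrative assumptions. Only the core idea follows the description above: every module updates from the shared transition, scaled by the ratio of its own action probability to the behavior policy's.

import math
import random

class Module:
    """One adaptive agent: a tabular Q-learner with a softmax policy."""
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, beta=2.0):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.alpha, self.gamma, self.beta = alpha, gamma, beta

    def policy(self, s):
        # Softmax action probabilities in state s.
        exps = [math.exp(self.beta * q) for q in self.q[s]]
        z = sum(exps)
        return [e / z for e in exps]

    def update(self, s, a, r, s2, w):
        # Off-policy TD update, scaled by the importance weight w.
        target = r + self.gamma * max(self.q[s2])
        self.q[s][a] += self.alpha * w * (target - self.q[s][a])

def shared_step(modules, controller, s, env_step):
    # The module currently "in charge" chooses the action for the shared body.
    behavior = modules[controller].policy(s)
    a = random.choices(range(len(behavior)), weights=behavior)[0]
    r, s2 = env_step(s, a)  # assumed environment interface: returns (reward, next state)
    # Every module learns from the same experience; the importance weight
    # corrects for the action having been drawn from the controller's policy.
    for m in modules:
        w = m.policy(s)[a] / behavior[a]
        m.update(s, a, r, s2, w)
    return s2

With this weighting, a module that would rarely have taken the observed action discounts that experience, while the controller itself always learns with w = 1. How control is handed between competing modules is left out, since the abstract does not specify it.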

On the biological side, our major focus is on the selection of an appropriate temporal horizon for prediction and learning. In our lives, whether to enjoy an immediate reward or to look ahead for long-term profit is often a difficult decision (e.g., relax on the beach or work on the lecture). In reinforcement learning theory, this trade-off is dealt with by the setting of the temporal discounting parameter. After reviewing a wealth of neurobiological studies, we conjectured that serotonin, one of the major neuromodulators, regulates the temporal discounting parameter of the brain. In our functional brain imaging studies, we found that the brain has parallel pathways for predicting immediate and future rewards and that they are differentially modulated by serotonergic activity. Our recent neural recording studies in rats also showed that serotonin neurons increase their firing during delay periods when rats wait for expected future rewards. These findings are consistent with our conjecture.
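For readers outside reinforcement learning, the trade-off has a precise standard form: the agent maximizes the expected discounted return

V^{\pi}(s) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0} = s \Big], \qquad 0 \le \gamma < 1,

where gamma is the temporal discounting parameter. A reward t steps ahead is weighted by gamma^t, so the effective horizon scales roughly as 1/(1 - gamma): gamma = 0.9 emphasizes rewards within about ten steps, while gamma = 0.99 stretches the horizon to about a hundred. The conjecture above amounts to saying that serotonergic activity sets gamma.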

To conclude, theoretical development for computational intelligence is critical for understanding the highly adaptive mechanisms of the brain, and more knowledge of the brain, as a real instantiation of fluid intelligence, sets higher goals for theoretical development.

REFERENCES

[1] Doya, K. (2002). Metalearning and neuromodulation. Neural Networks, 15(4-6), 495-506.
[2] Doya, K., Uchibe, E. (2005). The Cyber Rodent project: Exploration of adaptive mechanisms for self-preservation and self-reproduction. Adaptive Behavior, 13, 149-160.
[3] Tanaka, S. C., Doya, K., Okada, G., Ueda, K., Okamoto, Y., Yamawaki, S. (2004). Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nature Neuroscience, 7, 887-893.