A toy model of human cognition:
Utilizing fluctuation in uncertain and non-stationary environments

Tatsuji Takahashi¹, Yu Kohno¹,²
¹Tokyo Denki University, ²JSPS (from Apr. 2014)

Seminar on science of complex systems (organized by Yukio-Pegio Gunji),
Yukawa Institute for Theoretical Physics, Kyoto University, Jan. 20, 2014
Contents

The loosely symmetric (LS) model
Cognitive properties or cognitive biases
Analysis of reconstruction of LS
Result: Efficacy in reinforcement learning
Utilization of fluctuation in non-stationary environments
A toy model of human cognition

Modeling that focuses on deviations from rational standards: cognitive biases
  (the differences from "machines")
Principal properties implemented in as simple a form as possible,
  so that the model can be analyzed and applied easily
The intuition of human beings,
  again kept simple: not the policies (or strategies) learnt through education and culture
LS as a toy model of cognition

We treat the loosely symmetric (LS) model proposed by Shinohara (2007). LS:
  models cognitive biases;
  is merely a function over the co-occurrence information between two events;
  faithfully describes the causal intuition of humans,
    which forms the basis of decision-making and action for adaptation in the world.
The loosely symmetric (LS) model

A quasi-probability function LS(-|-), analogous to conditional probability P(-|-).
Defined over the co-occurrence information of events p and q.
The relationship from p to q: LS(q|p).
LS describes the causal intuition of human beings the most faithfully (among more than 40 existing models).

Contingency table (rows: prior event; columns: posterior event):

        q    ¬q
  p     a    b
  ¬p    c    d

$$P(q|p) = \frac{a}{a+b}$$

$$\mathrm{LS}(q|p) = \frac{a + \frac{b}{b+d}\,d}{a + \frac{b}{b+d}\,d + b + \frac{a}{a+c}\,c}$$
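For concreteness, here is a minimal Python sketch of the formula above. The division-by-zero guards and the example numbers are my additions, not from the slides:

```python
def ls(a, b, c, d):
    """Loosely symmetric strength LS(q|p) for the 2x2 table
    (p,q)=a, (p,~q)=b, (~p,q)=c, (~p,~q)=d."""
    bd = b * d / (b + d) if b + d else 0.0   # (b/(b+d)) * d
    ac = a * c / (a + c) if a + c else 0.0   # (a/(a+c)) * c
    den = a + bd + b + ac
    return (a + bd) / den if den else 0.5    # guard for an empty table

def cp(a, b):
    """Ordinary conditional probability P(q|p)."""
    return a / (a + b) if a + b else 0.5

print(cp(1, 3))         # 0.25
print(ls(1, 3, 5, 15))  # ~0.477: unlike P(q|p), LS also uses the ¬p row (c, d)
```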
The loosely symmetric (LS) model: Inductive inference of causal relationship

How do humans form the intensity of the causal relationship from p to q,
when p is the candidate cause of the effect q in focus?
The question is the functional form of f(a, b, c, d) behind human causal intuition.

Meta-analysis as in Hattori & Oaksford (2007):

  Experiment  | AS95 | BCC03.1 | BCC03.3 | H03  | H06  | LS00 | W03.2 | W03.6
  r for LS    | 0.95 | 0.98    | 0.98    | 0.98 | 0.97 | 0.85 | 0.95  | 0.85
  r for ΔP    | 0.88 | 0.92    | 0.84    | 0.00 | 0.71 | 0.88 | 0.28  | 0.46
  r² for LS   | 0.90 | 0.96    | 0.96    | 0.97 | 0.94 | 0.73 | 0.91  | 0.72
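The r values above are correlations between a model's predicted strengths and mean human causal ratings across the stimuli of each experiment. A generic sketch of that computation; the arrays below are placeholders, not the data of AS95, BCC03, etc.:

```python
import numpy as np

human_ratings = np.array([0.82, 0.55, 0.31, 0.64, 0.12])      # placeholder values
model_predictions = np.array([0.79, 0.60, 0.25, 0.70, 0.18])  # placeholder values

# Pearson correlation between model predictions and human ratings
r = np.corrcoef(human_ratings, model_predictions)[0, 1]
print(f"r = {r:.2f}, r^2 = {r*r:.2f}")
```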
In 2-armed bandit problems

LS used as the value function in reinforcement learning:
  The agent evaluates the actions according to the causal intuition of humans.
  Very good adaptation to the environment, both in the short term and in the long term.

[Figure: accuracy rate vs. steps (1 to 1000); LS compared with CP, ToW, and LSM variants.]

(More on bandit problems later.)
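As an illustration of this setup, a minimal sketch of a greedy LS agent on a 2-armed Bernoulli bandit. Assumption (mine, for illustration): each arm's own wins and losses serve as a and b, and the other arm's as c and d, which matches a common two-armed instantiation of LS; the exact scheme used in the slides' experiments may differ.

```python
import random

def ls(a, b, c, d):
    bd = b * d / (b + d) if b + d else 0.0
    ac = a * c / (a + c) if a + c else 0.0
    den = a + bd + b + ac
    return (a + bd) / den if den else 0.5

def run(p=(0.6, 0.4), steps=1000, seed=0):
    rng = random.Random(seed)
    wins, losses = [0, 0], [0, 0]
    best = max(range(2), key=lambda i: p[i])
    hits = 0
    for _ in range(steps):
        # value each arm by LS, using the other arm's record as c, d
        v = [ls(wins[i], losses[i], wins[1 - i], losses[1 - i]) for i in range(2)]
        arm = rng.randrange(2) if v[0] == v[1] else (0 if v[0] > v[1] else 1)
        if rng.random() < p[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        hits += arm == best
    return hits / steps

print(run())  # fraction of steps on which the optimal arm was chosen
```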
The loosely symmetric (LS) model

From the analysis of LS, we found the following cognitive properties:
  Ground-invariance (like visual attention; Takahashi et al., 2010)
  Comparative valuation
    psychology: Tversky & Kahneman, Science, 1974
    brain science: Daw et al., Nature, 2006
  Idiosyncratic, asymmetric risk attitude, as in prospect theory
    Kahneman & Tversky, Am. Psych., 1984; Boorman et al., Neuron, 2009
  Satisficing
    Simon, Psych. Rev., 1956; Kolling et al., Science, 2012
Principal human cognitive biases

Humans:
  Satisficing: do not optimize but satisfice;
    become satisfied when the outcome is better than the reference level.
  Comparative valuation: evaluate states and actions in a relative manner.
  Asymmetric risk attitude: asymmetrically recognize gain and loss.
Satisficing
  When all arms are over the reference level: no further pursuit of arms over it.
  When all arms are under the reference level: search hard for an arm over it.
  [Diagram: arms A1 and A2 relative to a reference level.]

Risk attitude (reliability consideration): the reflection effect
  Compare past win (○) / lose (×) records with equal empirical means,
  e.g., 75% = 75% (3/4 vs. 15/20) and 25% = 25% (1/4 vs. 5/20),
  taking reliability into account.
  Risk-seeking under the reference: gamble on 1/4 rather than 5/20.
  Risk-avoiding over the reference: choose 15/20 rather than 3/4.
  (Checked numerically in the sketch below.)
  [Diagram: win/lose sequences depicting the 15/20 and 5/20 records.]

Comparative evaluation (see-saw)
  Choose A1 and lose: arms other than A1 are tried, by comparative valuation.
  [Diagram: absolute vs. comparative values of A1 and A2.]
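The reflection effect claimed above can be checked numerically with the LS formula itself, assuming (as in the two-option instantiation sketched earlier) that c and d are the rival record's counts:

```python
def ls(a, b, c, d):
    bd = b * d / (b + d) if b + d else 0.0
    ac = a * c / (a + c) if a + c else 0.0
    den = a + bd + b + ac
    return (a + bd) / den if den else 0.5

# Under the reference (25% vs. 25%): the less reliable 1/4 is preferred
# over 5/20 (risk-seeking).
print(ls(1, 3, 5, 15), ls(5, 15, 1, 3))    # ~0.477 > ~0.321
# Over the reference (75% vs. 75%): the more reliable 15/20 is preferred
# over 3/4 (risk-avoiding).
print(ls(3, 1, 15, 5), ls(15, 5, 3, 1))    # ~0.523 < ~0.679
```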
The generalized LS with variable reference (LSVR)

LSVR is a generalization of LS with an autonomously adjusted reference parameter.
n-armed bandit problem (nABP)

The simplest framework in reinforcement learning, exhibiting the exploration-exploitation dilemma and the speed-accuracy tradeoff.
The task is to maximize the total reward acquired from n actions (sources) with unknown reward distributions.
A one-armed bandit is a slot machine that gives a reward (win) or not (lose).
An n-armed bandit is a slot machine with n arms that have different probabilities of winning.
Performance indices for nABP

Accuracy:
  the average percentage of choosing the optimal action.
Regret (expected loss):
  the difference of the actually acquired accumulated reward from that of the best possible sequence of actions (where accuracy = 1.0 throughout the trial).
  (See the formulation below.)
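In symbols, a standard formulation that matches the description above, assuming Bernoulli arms:

```latex
% Accuracy and regret after T steps, for arms with success probabilities p_i,
% optimal arm a^* = \arg\max_i p_i, and chosen arm a_t:
\mathrm{Accuracy}(T) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}[a_t = a^*],
\qquad
\mathrm{Regret}(T) = \sum_{t=1}^{T} \bigl( p_{a^*} - p_{a_t} \bigr)
```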
Result (stationary)
n = 100; the reward probability of each action is drawn uniformly from [0, 1].

[Figures: accuracy rate and expected loss vs. steps (up to 10^6) for LS, LS-VR, and UCB1-tuned, plus their γ = 0.999 variants.]

Accuracy: highest. Regret: smallest.
The more actions there are, the better the performance of LSVR becomes.
(Kohno & Takahashi, 2012; in prep.)
Non-stationary bandits

The reward probabilities change while playing.
Result in non-stationary environment 1
n = 16; the reward probabilities are drawn from [0, 1], and all are reset every 10,000 steps.

[Figures: accuracy rate and expected loss vs. steps (up to 50,000) for LS, LS-VR, and UCB1-tuned, plus their γ = 0.999 variants.]

Accuracy: highest. Regret: smallest.
(Kohno & Takahashi, in prep.)
Result in non-stationary environment 2
Accuracy: the rate of choosing the action that is optimal at that time.
n = 20; the initial probabilities are drawn from [0, 1]; the probability of each action is reset with probability 0.0001.

[Figure: accuracy rate vs. steps (up to 50,000) for LS, LS-VR, and UCB1-tuned, plus their γ = 0.999 variants.]

Even when a little-tried action becomes the new optimum, the model can switch to that optimal action.
If the reward were given deterministically, this would be impossible:
efficient search utilizes uncertainty and fluctuation in non-stationary environments.
(Both reset schemes are sketched in code below.)
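The two non-stationary settings can be reproduced in a few lines. A sketch under stated assumptions: synchronous redraws every 10,000 steps for environment 1, and (assuming the 0.0001 reset is per step and per arm) independent redraws for environment 2:

```python
import random

class NonStationaryBandit:
    """Bernoulli bandit whose arm probabilities drift by resets."""

    def __init__(self, n, reset_every=None, reset_prob=None, seed=0):
        self.rng = random.Random(seed)
        self.p = [self.rng.random() for _ in range(n)]
        self.reset_every, self.reset_prob, self.t = reset_every, reset_prob, 0

    def pull(self, arm):
        self.t += 1
        # Environment 1: all probabilities redrawn every reset_every steps
        if self.reset_every and self.t % self.reset_every == 0:
            self.p = [self.rng.random() for _ in range(len(self.p))]
        # Environment 2: each arm independently redrawn with prob. reset_prob
        if self.reset_prob:
            for i in range(len(self.p)):
                if self.rng.random() < self.reset_prob:
                    self.p[i] = self.rng.random()
        return 1 if self.rng.random() < self.p[arm] else 0

env1 = NonStationaryBandit(n=16, reset_every=10_000)
env2 = NonStationaryBandit(n=20, reset_prob=0.0001)
```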
Results

The more options there are, the better the performance of LSVR becomes.

[Figure (stationary): accuracy rate vs. steps up to 10^6.]
[Figure (non-stationary, synchronous resets): accuracy rate vs. steps up to 50,000.]
LSVR can trace the change in non-stationary environments.

[Figure (non-stationary 2, independent resets): accuracy rate vs. steps up to 50,000.]
LSVR can trace the unobserved change, amplifying fluctuation.
Discussion

The cognitive biases of humans, when combined:
  work effectively for adaptation under uncertainty;
  conflate an action and the set of actions through comparative valuation;
  symbolize the whole situation into a virtual action;
  utilize fluctuation arising from uncertainty and enable adaptation to non-stationary environments.
Conflating part and whole

Comparative valuation conflates the information of a single action with that of the whole set of actions.
This is universal in living systems, from slime molds (Latty & Beekman, 2011) to neurons (Royer & Paré, 2003) to animals and human beings.
Relative evaluation is especially important

★ Relative evaluation:
★ is what even slime molds and real neural networks (conservation of synaptic weights) do; behavioral economics found that humans evaluate actions and states comparatively.
★ weakens the dilemma between exploitation and exploration through a see-saw-like competition among arms:
★ Through failure (low reward), choosing the greedy action may quickly trigger choosing the previously second-best, non-greedy arm next.
★ Through success (high reward), choosing the greedy action may quickly focus choice on the currently greedy action, lessening the probability of choosing non-greedy arms by decreasing their values.

[Diagram: choose A1 and lose. Under relative evaluation (see-saw), the value of A1 falls and the value of A2 rises, so arms other than A1 are tried; under absolute evaluation, the value of A2 is unchanged.]
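The see-saw is visible in the numbers: a loss on A1 raises the LS value of A2 even though A2's own record is untouched, because A2's c and d change. Same two-option instantiation as in the earlier sketches; the records are hypothetical:

```python
def ls(a, b, c, d):
    bd = b * d / (b + d) if b + d else 0.0
    ac = a * c / (a + c) if a + c else 0.0
    den = a + bd + b + ac
    return (a + bd) / den if den else 0.5

# Hypothetical records: A1 has 6 wins / 4 losses, A2 has 3 wins / 3 losses.
w1, l1, w2, l2 = 6, 4, 3, 3
print(ls(w1, l1, w2, l2), ls(w2, l2, w1, l1))          # before: ~0.563, ~0.485
# A1 is chosen and loses: A1's value falls, A2's value rises (see-saw).
print(ls(w1, l1 + 1, w2, l2), ls(w2, l2, w1, l1 + 1))  # after:  ~0.529, ~0.494
```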
Symbolization of the whole and comparative valuation with multiple actions

[Diagram: slot machines A1, A2, ..., An, and a virtual machine Ag representing the whole.]

Comparative valuation with a virtual action representing the whole:
each arm Ai is compared with Ag ("greater or less?").
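One plausible way to instantiate the virtual action Ag for n arms is to give each arm, as its c and d, the pooled win/loss record of all the other arms, so that every arm is compared against a surrogate of the whole. This is an assumed construction for illustration; the slides do not spell out the exact definition:

```python
def ls(a, b, c, d):
    bd = b * d / (b + d) if b + d else 0.0
    ac = a * c / (a + c) if a + c else 0.0
    den = a + bd + b + ac
    return (a + bd) / den if den else 0.5

def values(wins, losses):
    """LS value of each arm against a virtual action pooling all other arms."""
    W, L = sum(wins), sum(losses)
    return [ls(w, l, W - w, L - l) for w, l in zip(wins, losses)]

print(values([5, 3, 1], [2, 2, 4]))  # one LS value per arm vs. "the rest"
```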
Conclusion

The cognitive biases that look irrational are, when appropriately combined together as in humans, actually rational for adapting to uncertain environments and for survival through evolution.
Applicable in engineering: machine learning and robot control.
Implications for brain science (the brain as a machine-learning device):
  modeling PFC and vmPFC.
Brain science and the three cognitive biases:
  Satisficing: Kolling et al., Science, 2012.
  Comparative valuation of state-action value: Daw et al., Nature, 2006.
  Idiosyncratic risk evaluation: Boorman et al., Neuron, 2009.
Applications of bandit problems

★ Monte-Carlo tree search (game-tree search, as in Go AI)
★ Online advertisement (e.g., A/B testing)
★ Design of medical treatment
★ Reinforcement learning
Robotic motion learning

Learning giant-swing motion with no prior knowledge, under coarse-grained states, through trial and error.

[Figures: real robot and simulator (two links; 1st joint free, 2nd joint active); acquired reward per 1000 steps vs. learning steps for LS-Q vs. Q, typical case and average of 100 trials; discretized position states P0-P23 over [-3π, 3π], velocity states W0-W6 over [-4.0, 4.0] rad/s, posture states R0-R4 over [0, 5/6 π] rad; actions A0-A2; reward r ∈ {0, 1, |θ_p / π|}.]

Uragami, D., Takahashi, T., Matsuo, Y.: Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control. BioSystems, 116, 1-9 (2014).