A toy model of human cognition:
Utilizing fluctuation in uncertain and non-stationary environments

Tatsuji Takahashi¹, Yu Kohno¹,²
¹Tokyo Denki University, ²JSPS (from Apr. 2014)

Seminar on science of complex systems (organized by Yukio-Pegio Gunji),
Yukawa Institute for Theoretical Physics, Kyoto University, Jan. 20, 2014
Contents

The loosely symmetric (LS) model
Cognitive properties or cognitive biases
Analysis of reconstruction of LS
Result: Efficacy in reinforcement learning
Utilization of fluctuation in non-stationary environments
A toy model of human cognition

Modeling that focuses on deviations from rational standards: cognitive biases
  (the differences from "machines")
Principal properties implemented in as simple a form as possible,
  so that the model can be analyzed and applied easily
The intuition of human beings,
  again kept simple: not the policies (or strategies) learnt through education and culture
LS as a toy model of cognition

We treat the loosely symmetric (LS) model proposed by Shinohara (2007). LS:
  models cognitive biases;
  is merely a function over the co-occurrence information between two events;
  faithfully describes the causal intuition of humans,
    which forms the basis of decision-making and action for adaptation in the world.
The loosely symmetric (LS) model

A quasi-probability function LS(-|-), analogous to conditional probability P(-|-).
Defined over the co-occurrence information of events p and q.
The relationship from p to q: LS(q|p).
LS describes the causal intuition of human beings the most faithfully (among more than 40 existing models).

Contingency table (rows: prior event; columns: posterior event):

        q    ¬q
  p     a    b
  ¬p    c    d

$$P(q|p) = \frac{a}{a+b}$$

$$\mathrm{LS}(q|p) = \frac{a + \frac{b}{b+d}\,d}{a + \frac{b}{b+d}\,d + b + \frac{a}{a+c}\,c}$$
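For concreteness, here is a minimal Python sketch of the formula above. The division-by-zero guards and the example numbers are my additions, not from the slides:

```python
def ls(a, b, c, d):
    """Loosely symmetric strength LS(q|p) for the 2x2 table
    (p,q)=a, (p,~q)=b, (~p,q)=c, (~p,~q)=d."""
    bd = b * d / (b + d) if b + d else 0.0   # (b/(b+d)) * d
    ac = a * c / (a + c) if a + c else 0.0   # (a/(a+c)) * c
    den = a + bd + b + ac
    return (a + bd) / den if den else 0.5    # guard for an empty table

def cp(a, b):
    """Ordinary conditional probability P(q|p)."""
    return a / (a + b) if a + b else 0.5

print(cp(1, 3))         # 0.25
print(ls(1, 3, 5, 15))  # ~0.477: unlike P(q|p), LS also uses the ¬p row (c, d)
```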
The loosely symmetric (LS) model: Inductive inference of causal relationship

How do humans form the intensity of the causal relationship from p to q,
when p is the candidate cause of the effect q in focus?
The question is the functional form of f(a, b, c, d) behind human causal intuition.

Meta-analysis as in Hattori & Oaksford (2007):

  Experiment  | AS95 | BCC03.1 | BCC03.3 | H03  | H06  | LS00 | W03.2 | W03.6
  r for LS    | 0.95 | 0.98    | 0.98    | 0.98 | 0.97 | 0.85 | 0.95  | 0.85
  r for ΔP    | 0.88 | 0.92    | 0.84    | 0.00 | 0.71 | 0.88 | 0.28  | 0.46
  r² for LS   | 0.90 | 0.96    | 0.96    | 0.97 | 0.94 | 0.73 | 0.91  | 0.72
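The r values above are correlations between a model's predicted strengths and mean human causal ratings across the stimuli of each experiment. A generic sketch of that computation; the arrays below are placeholders, not the data of AS95, BCC03, etc.:

```python
import numpy as np

human_ratings = np.array([0.82, 0.55, 0.31, 0.64, 0.12])      # placeholder values
model_predictions = np.array([0.79, 0.60, 0.25, 0.70, 0.18])  # placeholder values

# Pearson correlation between model predictions and human ratings
r = np.corrcoef(human_ratings, model_predictions)[0, 1]
print(f"r = {r:.2f}, r^2 = {r*r:.2f}")
```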
In 2-armed bandit problems

LS used as the value function in reinforcement learning:
  The agent evaluates the actions according to the causal intuition of humans.
  Very good adaptation to the environment, both in the short term and in the long term.

[Figure: accuracy rate vs. steps (1 to 1000); LS compared with CP, ToW, and LSM variants.]

(More on bandit problems later.)
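As an illustration of this setup, a minimal sketch of a greedy LS agent on a 2-armed Bernoulli bandit. Assumption (mine, for illustration): each arm's own wins and losses serve as a and b, and the other arm's as c and d, which matches a common two-armed instantiation of LS; the exact scheme used in the slides' experiments may differ.

```python
import random

def ls(a, b, c, d):
    bd = b * d / (b + d) if b + d else 0.0
    ac = a * c / (a + c) if a + c else 0.0
    den = a + bd + b + ac
    return (a + bd) / den if den else 0.5

def run(p=(0.6, 0.4), steps=1000, seed=0):
    rng = random.Random(seed)
    wins, losses = [0, 0], [0, 0]
    best = max(range(2), key=lambda i: p[i])
    hits = 0
    for _ in range(steps):
        # value each arm by LS, using the other arm's record as c, d
        v = [ls(wins[i], losses[i], wins[1 - i], losses[1 - i]) for i in range(2)]
        arm = rng.randrange(2) if v[0] == v[1] else (0 if v[0] > v[1] else 1)
        if rng.random() < p[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        hits += arm == best
    return hits / steps

print(run())  # fraction of steps on which the optimal arm was chosen
```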
The loosely symmetric (LS) model

From the analysis of LS, we found the following cognitive properties:
  Ground-invariance (like visual attention; Takahashi et al., 2010)
  Comparative valuation
    psychology: Tversky & Kahneman, Science, 1974
    brain science: Daw et al., Nature, 2006
  Idiosyncratic, asymmetric risk attitude, as in prospect theory
    Kahneman & Tversky, Am. Psych., 1984; Boorman et al., Neuron, 2009
  Satisficing
    Simon, Psych. Rev., 1956; Kolling et al., Science, 2012
Principal human cognitive biases

Humans:
  Satisficing: do not optimize but satisfice;
    become satisfied when the outcome is better than the reference level.
  Comparative valuation: evaluate states and actions in a relative manner.
  Asymmetric risk attitude: asymmetrically recognize gain and loss.
Satisficing
  When all arms are over the reference level: no further pursuit of arms over it.
  When all arms are under the reference level: search hard for an arm over it.
  [Diagram: arms A1 and A2 relative to a reference level.]

Risk attitude (reliability consideration): the reflection effect
  Compare past win (○) / lose (×) records with equal empirical means,
  e.g., 75% = 75% (3/4 vs. 15/20) and 25% = 25% (1/4 vs. 5/20),
  taking reliability into account.
  Risk-seeking under the reference: gamble on 1/4 rather than 5/20.
  Risk-avoiding over the reference: choose 15/20 rather than 3/4.
  (Checked numerically in the sketch below.)
  [Diagram: win/lose sequences depicting the 15/20 and 5/20 records.]

Comparative evaluation (see-saw)
  Choose A1 and lose: arms other than A1 are tried, by comparative valuation.
  [Diagram: absolute vs. comparative values of A1 and A2.]
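The reflection effect claimed above can be checked numerically with the LS formula itself, assuming (as in the two-option instantiation sketched earlier) that c and d are the rival record's counts:

```python
def ls(a, b, c, d):
    bd = b * d / (b + d) if b + d else 0.0
    ac = a * c / (a + c) if a + c else 0.0
    den = a + bd + b + ac
    return (a + bd) / den if den else 0.5

# Under the reference (25% vs. 25%): the less reliable 1/4 is preferred
# over 5/20 (risk-seeking).
print(ls(1, 3, 5, 15), ls(5, 15, 1, 3))    # ~0.477 > ~0.321
# Over the reference (75% vs. 75%): the more reliable 15/20 is preferred
# over 3/4 (risk-avoiding).
print(ls(3, 1, 15, 5), ls(15, 5, 3, 1))    # ~0.523 < ~0.679
```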
The generalized LS with variable reference (LSVR)

LSVR is a generalization of LS with an autonomously adjusted reference parameter.
n-armed bandit problem (nABP)

The simplest framework in reinforcement learning, exhibiting the exploration-exploitation dilemma and the speed-accuracy tradeoff.
The task is to maximize the total reward acquired from n actions (sources) with unknown reward distributions.
A one-armed bandit is a slot machine that gives a reward (win) or not (lose).
An n-armed bandit is a slot machine with n arms that have different probabilities of winning.
Performance indices for nABP

Accuracy:
  the average percentage of choosing the optimal action.
Regret (expected loss):
  the difference of the actually acquired accumulated reward from that of the best possible sequence of actions (where accuracy = 1.0 throughout the trial).
  (See the formulation below.)
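In symbols, a standard formulation that matches the description above, assuming Bernoulli arms:

```latex
% Accuracy and regret after T steps, for arms with success probabilities p_i,
% optimal arm a^* = \arg\max_i p_i, and chosen arm a_t:
\mathrm{Accuracy}(T) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}[a_t = a^*],
\qquad
\mathrm{Regret}(T) = \sum_{t=1}^{T} \bigl( p_{a^*} - p_{a_t} \bigr)
```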
Result (stationary)
n = 100; the reward probability of each action is drawn uniformly from [0, 1].

[Figures: accuracy rate and expected loss vs. steps (up to 10^6) for LS, LS-VR, and UCB1-tuned, plus their γ = 0.999 variants.]

Accuracy: highest. Regret: smallest.
The more actions there are, the better the performance of LSVR becomes.
(Kohno & Takahashi, 2012; in prep.)
Non-stationary bandits

The reward probabilities change while playing.
Result in non-stationary environment 1
n = 16; the reward probabilities are drawn from [0, 1], and all are reset every 10,000 steps.

[Figures: accuracy rate and expected loss vs. steps (up to 50,000) for LS, LS-VR, and UCB1-tuned, plus their γ = 0.999 variants.]

Accuracy: highest. Regret: smallest.
(Kohno & Takahashi, in prep.)
Result in non-stationary environment 2
Accuracy: the rate of choosing the action that is optimal at that time.
n = 20; the initial probabilities are drawn from [0, 1]; the probability of each action is reset with probability 0.0001.

[Figure: accuracy rate vs. steps (up to 50,000) for LS, LS-VR, and UCB1-tuned, plus their γ = 0.999 variants.]

Even when a little-tried action becomes the new optimum, the model can switch to that optimal action.
If the reward were given deterministically, this would be impossible:
efficient search utilizes uncertainty and fluctuation in non-stationary environments.
(Both reset schemes are sketched in code below.)
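The two non-stationary settings can be reproduced in a few lines. A sketch under stated assumptions: synchronous redraws every 10,000 steps for environment 1, and (assuming the 0.0001 reset is per step and per arm) independent redraws for environment 2:

```python
import random

class NonStationaryBandit:
    """Bernoulli bandit whose arm probabilities drift by resets."""

    def __init__(self, n, reset_every=None, reset_prob=None, seed=0):
        self.rng = random.Random(seed)
        self.p = [self.rng.random() for _ in range(n)]
        self.reset_every, self.reset_prob, self.t = reset_every, reset_prob, 0

    def pull(self, arm):
        self.t += 1
        # Environment 1: all probabilities redrawn every reset_every steps
        if self.reset_every and self.t % self.reset_every == 0:
            self.p = [self.rng.random() for _ in range(len(self.p))]
        # Environment 2: each arm independently redrawn with prob. reset_prob
        if self.reset_prob:
            for i in range(len(self.p)):
                if self.rng.random() < self.reset_prob:
                    self.p[i] = self.rng.random()
        return 1 if self.rng.random() < self.p[arm] else 0

env1 = NonStationaryBandit(n=16, reset_every=10_000)
env2 = NonStationaryBandit(n=20, reset_prob=0.0001)
```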
Results

The more options there are, the better the performance of LSVR becomes.

[Figure (stationary): accuracy rate vs. steps up to 10^6.]
[Figure (non-stationary, synchronous resets): accuracy rate vs. steps up to 50,000.]
LSVR can trace the change in non-stationary environments.

[Figure (non-stationary 2, independent resets): accuracy rate vs. steps up to 50,000.]
LSVR can trace the unobserved change, amplifying fluctuation.
Discussion

The cognitive biases of humans, when combined:
  work effectively for adaptation under uncertainty;
  conflate an action and the set of actions through comparative valuation;
  symbolize the whole situation into a virtual action;
  utilize fluctuation arising from uncertainty and enable adaptation to non-stationary environments.
Conflating part and whole

Comparative valuation conflates the information of a single action with that of the whole set of actions.
This is universal in living systems, from slime molds (Latty & Beekman, 2011) to neurons (Royer & Paré, 2003) to animals and human beings.
Relative evaluation is especially important

★ Relative evaluation:
★ is what even slime molds and real neural networks (conservation of synaptic weights) do; behavioral economics found that humans evaluate actions and states comparatively.
★ weakens the dilemma between exploitation and exploration through a see-saw-like competition among arms:
★ Through failure (low reward), choosing the greedy action may quickly trigger choosing the previously second-best, non-greedy arm next.
★ Through success (high reward), choosing the greedy action may quickly focus choice on the currently greedy action, lessening the probability of choosing non-greedy arms by decreasing their values.

[Diagram: choose A1 and lose. Under relative evaluation (see-saw), the value of A1 falls and the value of A2 rises, so arms other than A1 are tried; under absolute evaluation, the value of A2 is unchanged.]
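The see-saw is visible in the numbers: a loss on A1 raises the LS value of A2 even though A2's own record is untouched, because A2's c and d change. Same two-option instantiation as in the earlier sketches; the records are hypothetical:

```python
def ls(a, b, c, d):
    bd = b * d / (b + d) if b + d else 0.0
    ac = a * c / (a + c) if a + c else 0.0
    den = a + bd + b + ac
    return (a + bd) / den if den else 0.5

# Hypothetical records: A1 has 6 wins / 4 losses, A2 has 3 wins / 3 losses.
w1, l1, w2, l2 = 6, 4, 3, 3
print(ls(w1, l1, w2, l2), ls(w2, l2, w1, l1))          # before: ~0.563, ~0.485
# A1 is chosen and loses: A1's value falls, A2's value rises (see-saw).
print(ls(w1, l1 + 1, w2, l2), ls(w2, l2, w1, l1 + 1))  # after:  ~0.529, ~0.494
```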
Symbolization of the whole and comparative valuation with multiple actions

[Diagram: slot machines A1, A2, ..., An, and a virtual machine Ag representing the whole.]

Comparative valuation with a virtual action representing the whole:
each arm Ai is compared with Ag ("greater or less?").
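One plausible way to instantiate the virtual action Ag for n arms is to give each arm, as its c and d, the pooled win/loss record of all the other arms, so that every arm is compared against a surrogate of the whole. This is an assumed construction for illustration; the slides do not spell out the exact definition:

```python
def ls(a, b, c, d):
    bd = b * d / (b + d) if b + d else 0.0
    ac = a * c / (a + c) if a + c else 0.0
    den = a + bd + b + ac
    return (a + bd) / den if den else 0.5

def values(wins, losses):
    """LS value of each arm against a virtual action pooling all other arms."""
    W, L = sum(wins), sum(losses)
    return [ls(w, l, W - w, L - l) for w, l in zip(wins, losses)]

print(values([5, 3, 1], [2, 2, 4]))  # one LS value per arm vs. "the rest"
```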
Conclusion

The cognitive biases that look irrational are, when appropriately combined together as in humans, actually rational for adapting to uncertain environments and for survival through evolution.
Applicable in engineering: machine learning and robot control.
Implications for brain science (the brain as a machine-learning device):
  modeling PFC and vmPFC.
Brain science and the three cognitive biases:
  Satisficing: Kolling et al., Science, 2012.
  Comparative valuation of state-action value: Daw et al., Nature, 2006.
  Idiosyncratic risk evaluation: Boorman et al., Neuron, 2009.
Applications of bandit problems

★ Monte-Carlo tree search (game-tree search, as in Go AI)
★ Online advertisement (e.g., A/B testing)
★ Design of medical treatment
★ Reinforcement learning
Robotic motion learning

Learning giant-swing motion with no prior knowledge, under coarse-grained states, through trial and error.

[Figures: real robot and simulator (two links; 1st joint free, 2nd joint active); acquired reward per 1000 steps vs. learning steps for LS-Q vs. Q, typical case and average of 100 trials; discretized position states P0-P23 over [-3π, 3π], velocity states W0-W6 over [-4.0, 4.0] rad/s, posture states R0-R4 over [0, 5/6 π] rad; actions A0-A2; reward r ∈ {0, 1, |θ_p / π|}.]

Uragami, D., Takahashi, T., Matsuo, Y.: Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control. BioSystems, 116, 1-9 (2014).