A toy model of human cognition: utilizing fluctuation in uncertain and non-stationary environments


DESCRIPTION

http://www.yukawa.kyoto-u.ac.jp/contents/seminar/detail.php?SNUM=51633
Tatsuji Takahashi [1], Yu Kohno [1,2]. Seminar on science of complex systems (organized by Yukio-Pegio Gunji), Yukawa Institute for Theoretical Physics, Kyoto University, Jan. 20, 2014. [1] Tokyo Denki University, [2] JSPS (from Apr. 2014)

TRANSCRIPT

A toy model of human cognition:
Utilizing fluctuation in uncertain and non-stationary environments

Tatsuji Takahashi [1], Yu Kohno [1,2]
Seminar on science of complex systems (organized by Yukio-Pegio Gunji), Yukawa Institute for Theoretical Physics, Kyoto University, Jan. 20, 2014
[1] Tokyo Denki University, [2] JSPS (from Apr. 2014)

Contents

The loosely symmetric (LS) model

Cognitive properties or cognitive biases

Analysis of reconstruction of LS

Result: efficacy in reinforcement learning

Utilization of fluctuation in non-stationary environments

A toy model of human cognition

Modeling that focuses on deviations from rational standards: cognitive biases

the differences from "machines"

Principal properties implemented in a form as simple as possible

so that it can be analyzed and applied easily

Intuition of human beings

as simple, again: not the policy (or strategy) that is learnt through education and culture

LS as a toy model of cognition

We treat the loosely symmetric (LS) model proposed by Shinohara (2007). LS:

models cognitive biases

is merely a function over co-occurrence information between two events

faithfully describes the causal intuition of humans

which forms the basis of decision-making and action for adaptation in the world

The loosely symmetric (LS) model

A quasi-probability function LS(-|-), like conditional probability P(-|-).

Defined over the co-occurrence information of events p and q:

                  posterior event
                  q       ¬q
prior event  p    a       b
             ¬p   c       d

The relationship from p to q: LS(q|p).

LS describes the causal intuition of human beings the most faithfully (among more than 40 existing models).

P(q|p) = a / (a + b)

LS(q|p) = (a + bd/(b+d)) / (a + bd/(b+d) + b + ac/(a+c))
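As a concrete reading of the two formulas above, here is a minimal sketch in Python (our own illustration, not code from the talk); the conventions for empty rows or columns (treating bd/(b+d) or ac/(a+c) as 0, and returning a neutral 0.5 when there is no data at all) are our assumptions:

    def ls(a, b, c, d):
        """Loosely symmetric value LS(q|p) from the 2x2 co-occurrence table:
        a: p & q, b: p & not-q, c: not-p & q, d: not-p & not-q."""
        bd = b * d / (b + d) if (b + d) > 0 else 0.0  # assumed convention for b + d = 0
        ac = a * c / (a + c) if (a + c) > 0 else 0.0  # assumed convention for a + c = 0
        den = a + bd + b + ac
        return (a + bd) / den if den > 0 else 0.5     # assumed neutral value with no data

    def p_cond(a, b):
        """Ordinary conditional probability P(q|p) = a / (a + b)."""
        return a / (a + b) if (a + b) > 0 else 0.5

    # Example: p occurred with q 8 times and without q 2 times;
    # not-p occurred with q 3 times and without q 7 times.
    print(p_cond(8, 2))    # 0.8
    print(ls(8, 2, 3, 7))  # about 0.70: LS also reflects the not-p row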

The loosely symmetric (LS) model

Inductive inference of causal relationship:

How do humans form the intensity of the causal relationship from p to q, when p is the candidate cause of the effect q in focus?

We seek the functional form f(a, b, c, d) of human causal intuition.

Meta-analysis as in Hattori & Oaksford (2007):

Experiment    AS95   BCC03.1  BCC03.3  H03    H06    LS00   W03.2  W03.6
r for LS      0.95   0.98     0.98     0.98   0.97   0.85   0.95   0.85
r for ΔP      0.88   0.92     0.84     0.00   0.71   0.88   0.28   0.46
r² for LS     0.90   0.96     0.96     0.97   0.94   0.73   0.91   0.72
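For reference, the ΔP row in the table is the standard contingency measure ΔP = P(q|p) - P(q|¬p); a small self-contained sketch of that quantity (our illustration, using the same example table as in the earlier sketch):

    def delta_p(a, b, c, d):
        """Delta-P: P(q|p) - P(q|not-p) from the 2x2 co-occurrence table."""
        p_q_given_p = a / (a + b) if (a + b) > 0 else 0.0
        p_q_given_notp = c / (c + d) if (c + d) > 0 else 0.0
        return p_q_given_p - p_q_given_notp

    print(delta_p(8, 2, 3, 7))  # 0.8 - 0.3 = 0.5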

In 2-armed bandit problems

LS used as the value function in reinforcement learning:

The agent evaluates the actions according to the causal intuition of humans.

Very good adaptation to the environment, both in the short term and the long term.

[Figure: accuracy rate vs. step (1 to 1000) for LS and comparison models.]

(More on bandit problems later.)
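The slides do not spell out the bookkeeping, but one natural reading for two arms is: each arm keeps its own win/loss counts as (a, b), the other arm's counts fill the (c, d) row, and the agent greedily plays the arm with the higher LS value. A sketch under those assumptions (the arm probabilities, tie-breaking, and all names are ours):

    import random

    def ls(a, b, c, d):
        bd = b * d / (b + d) if (b + d) > 0 else 0.0
        ac = a * c / (a + c) if (a + c) > 0 else 0.0
        den = a + bd + b + ac
        return (a + bd) / den if den > 0 else 0.5

    def run_two_armed(p=(0.6, 0.4), steps=1000, seed=0):
        rng = random.Random(seed)
        wins, losses = [0, 0], [0, 0]
        best = 0 if p[0] >= p[1] else 1
        optimal_choices = 0
        for _ in range(steps):
            # LS value of each arm: its own counts against the other arm's counts
            values = [ls(wins[i], losses[i], wins[1 - i], losses[1 - i]) for i in range(2)]
            arm = max(range(2), key=lambda i: (values[i], rng.random()))  # greedy, random ties
            reward = 1 if rng.random() < p[arm] else 0
            wins[arm] += reward
            losses[arm] += 1 - reward
            optimal_choices += (arm == best)
        return optimal_choices / steps

    print(run_two_armed())  # accuracy of the greedy LS agent on a 2-armed Bernoulli bandit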

The loosely symmetric (LS) model

From the analysis of LS, we found the following cognitive properties:

Ground-invariance (like visual attention, Takahashi et al., 2010)

Comparative valuation

psychology: Tversky & Kahneman, Science, 1974

brain science: Daw et al., Nature, 2006

Idiosyncratic, asymmetric risk attitude as in prospect theory

Kahneman & Tversky, Am. Psy., 1984; Boorman et al., Neuron, 2009

Satisficing

Simon, Psy. Rev., 1956; Kolling et al., Science, 2012

Principal human cognitive biases

Humans:

Satisficing: do not optimize but satisfice

become satisfied when an option is better than the reference level

Comparative valuation: evaluate states and actions in a relative manner

Asymmetric risk attitude: asymmetrically recognize gain and loss

Satisficing

When all arms are over the reference level: no further pursuit of arms beyond the given reference.
When all arms are under the reference level: search hard for an arm over the reference level.

Risk attitude (reliability consideration)

Risk-seeking under the reference: with equal expected value (25%), gamble on 1 win out of 4 trials rather than 5 out of 20.
Risk-avoiding over the reference: with equal expected value (75%), choose 15 wins out of 20 trials rather than 3 out of 4.
The comparison of past wins (o) and losses (x) takes reliability into account, yielding the reflection effect.

Comparative evaluation

Choosing A1 and losing lowers the value of A1; under comparative (see-saw) valuation this raises the value of A2, so arms other than A1 are tried.

The generalized LS with variable reference (LSVR)

LSVR is a generalization of LS with an autonomously adjusted reference parameter.

n-armed bandit problem (nABP)

The simplest framework in reinforcement learning, exhibiting the exploration-exploitation dilemma and the speed-accuracy tradeoff.

The task is to maximize the total reward acquired from n actions (sources) with unknown reward distributions.

A one-armed bandit is a slot machine that gives a reward (win) or not (lose).

An n-armed bandit is a slot machine with n arms that have different probabilities of winning.
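A minimal n-armed Bernoulli bandit environment matching this description (class and method names are ours; the reward probabilities are drawn uniformly from [0,1], as in the result slides below):

    import random

    class BernoulliBandit:
        """n arms; arm i pays reward 1 with probability p[i], otherwise 0."""
        def __init__(self, n, seed=None):
            self.rng = random.Random(seed)
            self.p = [self.rng.random() for _ in range(n)]  # unknown to the agent

        def pull(self, arm):
            return 1 if self.rng.random() < self.p[arm] else 0

        def best_arm(self):
            return max(range(len(self.p)), key=lambda i: self.p[i])

    env = BernoulliBandit(n=10, seed=1)
    print(env.pull(0), env.best_arm())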

Performance indices for nABP

Accuracy:

the average percentage of choosing the optimal action

Regret (expected loss):

the difference between the actually acquired accumulated reward and that of the best possible sequence of actions (where accuracy = 1.0 throughout the trial)
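Both indices can be computed from the environment's true reward probabilities and the sequence of chosen arms. A sketch using the expected per-step loss (one common way to realize the definition above for Bernoulli arms; the bookkeeping in the talk's simulations may differ):

    def accuracy(chosen, best_arm):
        """Fraction of steps on which the optimal arm was chosen."""
        return sum(1 for a in chosen if a == best_arm) / len(chosen)

    def regret(chosen, p):
        """Expected loss: cumulative gap between the best arm's mean reward
        and the mean reward of each chosen arm."""
        p_best = max(p)
        return sum(p_best - p[a] for a in chosen)

    p = [0.2, 0.5, 0.9]          # true reward probabilities (best arm is 2)
    chosen = [0, 2, 2, 1, 2]     # arms chosen over five steps
    print(accuracy(chosen, 2))   # 0.6
    print(regret(chosen, p))     # 0.7 + 0.4 = 1.1 (up to float rounding)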

Result

n = 100; the reward probability for each action is taken uniformly from [0,1].

[Figure: accuracy rate and expected loss (regret) over 10^6 steps for LS, LS-VR, and UCB1-tuned, each also with discounting γ = 0.999.]

LS-VR: accuracy highest, regret smallest.

The more actions there are, the better the performance of LSVR becomes. (Kohno & Takahashi, 2012; in prep.)
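For context, UCB1-tuned in the plots refers to the index policy of Auer et al. (2002). A plain, undiscounted sketch of that index (the γ = 0.999 discounted variant used in the figures is not reproduced here; the example numbers are ours):

    import math

    def ucb1_tuned_index(mean, var, n_j, t):
        """UCB1-tuned index for an arm with the given sample mean and variance,
        pulled n_j times out of t total pulls (Auer et al., 2002)."""
        if n_j == 0:
            return float("inf")  # force every arm to be tried once
        v = var + math.sqrt(2.0 * math.log(t) / n_j)
        return mean + math.sqrt((math.log(t) / n_j) * min(0.25, v))

    # At each step, play the arm with the largest index, e.g.:
    stats = [(0.8, 0.16, 10), (0.6, 0.24, 5)]  # (mean, variance, pulls) per arm
    t = 15
    print(max(range(2), key=lambda i: ucb1_tuned_index(*stats[i], t)))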

Non-stationary bandits

The reward probabilities change while playing.

Result in non-stationary environment 1

n = 16; the reward probability for each action is taken from [0,1]. The probabilities are totally reset every 10,000 steps.

[Figure: accuracy rate and expected loss over 50,000 steps for LS, LS-VR, and UCB1-tuned, each also with γ = 0.999.]

LS-VR: accuracy highest, regret smallest. (Kohno & Takahashi, in prep.)

Result in non-stationary environment 2

Accuracy: the rate of choosing the action that is optimal at the time.

n = 20; the initial probability of each action is taken from [0,1]. The probability of each action is reset with probability 0.0001 at each step.

[Figure: accuracy rate over 50,000 steps for LS, LS-VR, and UCB1-tuned, each also with γ = 0.999.]

Even when a not-well-tried action becomes the new optimum, the agent can switch to it. If the reward were given deterministically, this would be impossible.

Efficient search utilizing uncertainty and fluctuation in non-stationary environments.
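Both non-stationary settings can be reproduced by adding a reset rule on top of the stationary environment. A sketch with the parameters stated above (class and method names are ours; reading the 0.0001 reset as an independent per-step, per-arm event is our assumption):

    import random

    class NonStationaryBandit:
        """Bernoulli bandit whose arm probabilities change while playing.
        mode="periodic": all probabilities are redrawn every `period` steps (environment 1).
        mode="random":   each arm's probability is independently redrawn with
                         probability `reset_prob` at every step (environment 2)."""
        def __init__(self, n, mode="periodic", period=10000, reset_prob=0.0001, seed=0):
            self.rng = random.Random(seed)
            self.n, self.mode = n, mode
            self.period, self.reset_prob = period, reset_prob
            self.p = [self.rng.random() for _ in range(n)]
            self.t = 0

        def pull(self, arm):
            self.t += 1
            if self.mode == "periodic" and self.t % self.period == 0:
                self.p = [self.rng.random() for _ in range(self.n)]
            elif self.mode == "random":
                self.p = [self.rng.random() if self.rng.random() < self.reset_prob else q
                          for q in self.p]
            return 1 if self.rng.random() < self.p[arm] else 0

    env = NonStationaryBandit(n=20, mode="random")
    print(sum(env.pull(0) for _ in range(1000)))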

Results

The more options there are, the better the performance of LSVR becomes.

[Figure: accuracy rate over 10^6 steps in the stationary setting and over 50,000 steps in the two non-stationary settings (synchronous reset and random reset), for LS, LS-VR, and UCB1-tuned, each also with γ = 0.999.]

LSVR can trace the change in non-stationary environments.

LSVR can trace the unobserved change, amplifying fluctuation.

Discussion

The cognitive biases of humans, when combined:

work effectively for adaptation under uncertainty

conflate an action and the set of actions through comparative valuation

symbolize the whole situation into a virtual action

utilize fluctuation arising from uncertainty and enable adaptation to non-stationary environments

Conflating part and whole

Comparative valuation conflates the information of an action with that of the whole set of actions.

Universal in living systems, from slime molds (Latty & Beekman, 2011) to neurons (Royer & Paré, 2003) to animals and human beings.

Relative evaluation is especially important

★ Relative evaluation:
★ is what even slime molds and real neural networks (conservation of synaptic weights) do. Behavioral economics found that humans comparatively evaluate actions and states.
★ weakens the dilemma between exploitation and exploration through a see-saw-like competition among arms:
★ Through failure (low reward), the choice of the greedy action may quickly trigger a switch to the previously second-best, non-greedy arm.
★ Through success (high reward), the choice of the greedy action may quickly trigger focusing on the currently greedy action, lessening the possibility of choosing non-greedy arms by decreasing the value of the other arms.

[Diagram: choosing A1 and losing lowers the value of A1; with relative evaluation the value of A2 rises (see-saw), so arms other than A1 are tried; with absolute evaluation the value of A2 is unchanged.]
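To make the see-saw concrete: with LS-style relative valuation, an outcome on the arm that was played also moves the value of the arm that was not played, because each arm's (c, d) cells are filled from the other arm's counts. A toy numeric check under that bookkeeping (our illustration; the talk's exact update may differ):

    def ls(a, b, c, d):
        bd = b * d / (b + d) if (b + d) > 0 else 0.0
        ac = a * c / (a + c) if (a + c) > 0 else 0.0
        den = a + bd + b + ac
        return (a + bd) / den if den > 0 else 0.5

    # A1 has 6 wins / 4 losses, A2 has 3 wins / 3 losses.
    wins, losses = [6, 3], [4, 3]

    def value(i):
        j = 1 - i
        return ls(wins[i], losses[i], wins[j], losses[j])

    before = value(1)     # value of A2 before A1 is played again
    losses[0] += 1        # play A1 and lose; A2's own counts are untouched
    after = value(1)
    print(before, after)  # A2's value rises: the see-saw of relative evaluation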

Symbolization of the whole and comparative valuation with multiple actions

[Diagram: slot machines A1, A2, ..., An, plus a virtual machine Ag representing the whole. Each arm Ai is compared ("greater or less?") against the virtual action Ag rather than against every other arm separately.]
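One way to read the virtual action Ag representing the whole, for n > 2 arms: each arm's (a, b) row is its own wins and losses, while the (c, d) row is the pooled wins and losses of all the other arms, so every arm is compared against the rest of the machine as a single virtual arm. A sketch of that valuation (our reading of the diagram; the aggregation actually used in the talk may differ):

    def ls(a, b, c, d):
        bd = b * d / (b + d) if (b + d) > 0 else 0.0
        ac = a * c / (a + c) if (a + c) > 0 else 0.0
        den = a + bd + b + ac
        return (a + bd) / den if den > 0 else 0.5

    def ls_values_vs_whole(wins, losses):
        """LS value of each arm, compared against the pooled counts of all other arms."""
        total_w, total_l = sum(wins), sum(losses)
        return [ls(w, l, total_w - w, total_l - l) for w, l in zip(wins, losses)]

    wins = [5, 2, 0, 1]
    losses = [3, 1, 2, 1]
    print(ls_values_vs_whole(wins, losses))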

Conclusion

The cognitive biases that look irrational are, when appropriately combined together as in humans, actually rational for adapting to uncertain environments and for survival through evolution.

Applicable in engineering, in machine learning and robot control.

Implications for brain science (the brain as a machine-learning device):

Modeling PFC and vmPFC

Brain science and the three cognitive biases:

Satisficing: Kolling et al., Science, 2012.

Comparative valuation of state-action value: Daw et al., Nature, 2006.

Idiosyncratic risk evaluation: Boorman et al., Neuron, 2009.

Applications of bandit problems

★ Monte-Carlo tree search (Go AI): game-tree search
★ Online advertisement (e.g., A/B testing)
★ Design of medical treatment
★ Reinforcement learning

Robotic motion learning

Learning giant-swing motion with no prior knowledge and under coarse-grained states, through trial and error.

[Figure: an underactuated two-link robot (1st joint free, 2nd joint active), real robot and simulator; the state space is coarsely discretized into position (P0-P23), velocity (W0-W6), and posture (R0-R4) states, with three actions (A0-A2) and a reward based on the swing angle; plots of acquired reward per 1000 steps over learning for LS-Q vs. Q, typical case and average of 100 trials.]

Uragami, D., Takahashi, T., Matsuo, Y., Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control, BioSystems, 116, 1-9 (2014).
