The exploration-exploitation trade-off
Pantelis Pipergias Analytis
Cornell University
February 5, 2018


Page 1: The exploration-exploitation trade-off

Pantelis Pipergias Analytis
Cornell University
February 5, 2018

Outline:
Exploration-exploitation problems
The multi-armed bandit framework
Strategies
Contextual bandits
Results from a real-world experiment
Conclusions

Page 2: Examples of exploration and exploitation in real life

1 Going to your favorite restaurant or bar vs. trying a new one.

2 Listening to music from a band you love vs. discovering new ones.

3 Preparing a meal that you have made successfully in the past and enjoyed vs. cooking up something new.

4 Reading a newspaper article from a journalist you like vs. reading something from a newcomer in the field.

5 A chimpanzee foraging in a new territory with unknown food resources as opposed to its known home territory.

6 An organization trying a new organizational structure vs. a decently working existing one.


Page 8: Multi-armed bandit (MAB) problem

Option 1: no rewards observed yet ("??"); rewards drawn from N(µ1, σ1), here N(12, 3).

Option 2: rewards observed so far: 7.4 and 17.3; rewards drawn from N(µ2, σ2), here N(15, 3).
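The two-option setting on this slide can be simulated directly. The sketch below (illustrative Python, not from the slides) draws rewards from the slide's two example distributions, N(12, 3) and N(15, 3); the agent only ever sees the sampled rewards, never the parameters:

```python
import random

class GaussianBandit:
    """K-armed bandit with normally distributed rewards."""
    def __init__(self, means, sds, seed=0):
        self.means, self.sds = means, sds
        self.rng = random.Random(seed)

    def pull(self, arm):
        # Reward drawn from N(mu_arm, sigma_arm); the parameters
        # stay hidden from the decision maker.
        return self.rng.gauss(self.means[arm], self.sds[arm])

bandit = GaussianBandit(means=[12.0, 15.0], sds=[3.0, 3.0])
rewards = [bandit.pull(1) for _ in range(10_000)]
print(sum(rewards) / len(rewards))  # close to 15, the mean of option 2
```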

Page 9: History of the problem

1 "The [MAB] problem was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied scientists that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage." — Whittle (1980)

2 The first papers and strategies on the topic were written by Thompson (1933) and Robbins (1952).

3 Bellman and Gittins provided backward-looking and forward-looking solutions to the problem.

4 Today the MAB framework is behind numerous algorithms used in the online world.

5 Note the similarities to the search problem considered last week: the problems fold into each other.


Page 13: Domains where MABs have been applied

1 Developing new medicine: clinical trials.

2 One of the steam engines for studying human (and animal) learning.

3 A very general framework for autonomous AI decision making; it is used as an alternative to A/B testing.

4 Currently used to allocate ads on the web. Companies like Criteo rely heavily on this framework.

5 Used to decide which learning algorithm to use in a specific context.

6 Used to model how companies might choose among organizational structures or technologies of unknown merit.


Page 19: Different strategies for coping with the multi-armed bandit problem

Go optimal — not always possible, and often computationally very expensive.

Go greedy — always try the best alternative so far.

Add some noise — randomize once in a while (ε-greedy).

When randomizing, choose options with higher expected return with higher probability (softmax).

Probability matching — choose actions according to their probability of being the best.

Optimism in the face of uncertainty — prefer actions that are more uncertain, as they may turn out to be really good.
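The "add some noise" strategy above can be sketched in a few lines (illustrative Python; the arm means are made up). With probability ε the agent picks a random arm, otherwise the arm with the highest running estimate:

```python
import random

def epsilon_greedy(means, epsilon=0.1, steps=5000, seed=1):
    """Sample-average epsilon-greedy on a Gaussian bandit."""
    rng = random.Random(seed)
    k = len(means)
    estimates = [0.0] * k   # value estimate per arm
    counts = [0] * k
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                            # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        # Incremental sample-average update: Q += (r - Q) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, total / steps

estimates, avg = epsilon_greedy([1.0, 2.0, 1.5])
```

Because exploration keeps visiting every arm, the estimates converge to the true means and the greedy choice settles on the best arm.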


Page 25: The Gittins index (Christian and Griffiths, cpt. 2)

Possible to calculate for Bernoulli bandits with stable discounting of future trials.

Page 26: A simple example (Sutton and Barto, cpt. 2)

Page 27: Performance of the ε-greedy algorithm
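A miniature version of the comparison behind figures like Sutton and Barto's can be run directly (a sketch; this 5-armed Gaussian testbed and its means are made up, not the book's 10-armed testbed). Pure greedy often locks onto a mediocre arm, while a little exploration pays off over the horizon:

```python
import random

def run(epsilon, means, steps=2000, runs=50, seed=0):
    """Average per-step reward of epsilon-greedy over several runs."""
    rng = random.Random(seed)
    k = len(means)
    total = 0.0
    for _ in range(runs):
        q = [0.0] * k
        n = [0] * k
        for _ in range(steps):
            if rng.random() < epsilon:
                a = rng.randrange(k)
            else:
                a = max(range(k), key=lambda i: q[i])
            r = rng.gauss(means[a], 1.0)
            n[a] += 1
            q[a] += (r - q[a]) / n[a]
            total += r
    return total / (steps * runs)

means = [0.2, 0.8, 0.5, 0.1, 0.6]
results = {eps: run(eps, means) for eps in (0.0, 0.01, 0.1)}
```

At this horizon ε = 0.1 typically earns more per step than the purely greedy ε = 0.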

Page 28: Starting optimistically
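One way to read "starting optimistically": initialize every value estimate far above any plausible reward, so even a purely greedy learner is repeatedly disappointed and driven to try all the arms before the optimism wears off. A sketch (made-up arm means; a constant step size keeps the optimistic prior influential rather than discarding it after one pull):

```python
import random

def greedy_optimistic(means, q0=5.0, alpha=0.1, steps=2000, seed=2):
    """Purely greedy agent with optimistic initial values."""
    rng = random.Random(seed)
    k = len(means)
    q = [q0] * k        # wildly optimistic initial estimates
    n = [0] * k
    for _ in range(steps):
        a = max(range(k), key=lambda i: q[i])   # greedy, no epsilon
        r = rng.gauss(means[a], 1.0)
        n[a] += 1
        # Constant step size: optimism decays geometrically
        q[a] += alpha * (r - q[a])
    return q, n

q, n = greedy_optimistic([0.2, 0.8, 0.5])
# every arm gets sampled repeatedly while the inflated estimates decay
```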

Page 29: Discussion: A/B testing and exploration-exploitation

Page 30: The softmax rule

Biases exploration towards the more promising actions.

The softmax rule grades the choice probabilities according to the options' estimated values:

P(C(t) = j) = exp(θ Ej(t)) / Σ_{k=1}^{K} exp(θ Ek(t))

where θ is an inverse-temperature parameter controlling how strongly the algorithm favors the higher-valued options.
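The rule above can be computed directly (a sketch; subtracting the maximum is a standard numerical-stability trick that leaves the probabilities unchanged):

```python
import math

def softmax_probs(estimates, theta=1.0):
    """P(choose j) proportional to exp(theta * E_j)."""
    m = max(estimates)
    exps = [math.exp(theta * (e - m)) for e in estimates]
    z = sum(exps)
    return [e / z for e in exps]

print(softmax_probs([1.0, 2.0, 1.5], theta=1.0))   # ≈ [0.19, 0.51, 0.31]
print(softmax_probs([1.0, 2.0, 1.5], theta=10.0))  # nearly all mass on the best option
```

As θ grows the rule approaches pure greed; as θ shrinks toward 0 it approaches uniform random exploration.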


Page 34: Optimism in the face of uncertainty and the upper confidence bound (UCB)

The more uncertain you are about the value of an option, the more important it is to explore it.

That option could turn out to be really good and, in the long term, improve your overall utility.

UCB: P(C = i) ∝ exp(θ mi + α √vari)
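The slide states a softmax-style version of the idea, weighting each option's estimate by an uncertainty bonus. The deterministic UCB1 variant, which simply plays the arm maximizing the estimate plus a confidence-width bonus, is a common concrete instance (a sketch; the arm means are made up):

```python
import math
import random

def ucb1(means, steps=3000, c=2.0, seed=3):
    """Deterministic UCB1: play argmax of Q_i + c * sqrt(ln t / n_i)."""
    rng = random.Random(seed)
    k = len(means)
    q = [0.0] * k
    n = [0] * k
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1    # play each arm once to initialize
        else:
            a = max(range(k),
                    key=lambda i: q[i] + c * math.sqrt(math.log(t) / n[i]))
        r = rng.gauss(means[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]
    return q, n

q, n = ucb1([0.2, 0.8, 0.5])
```

The bonus shrinks as an arm is sampled, so play concentrates on the best arm while rarely tried arms keep getting occasional checks.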

Page 35: UCB against ε-greedy

Page 36: Probability matching, changing environments and Thompson sampling

Probability matching suggests sampling alternatives according to their rewards or their probability of being the best.

Thompson sampling is an implementation of the probability-matching principle.
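Thompson sampling for Bernoulli arms takes only a few lines: keep a Beta posterior per arm, draw one sample from each posterior, and play the argmax (a sketch; the click-through-style success probabilities are made up):

```python
import random

def thompson_bernoulli(probs, steps=5000, seed=4):
    """Beta-Bernoulli Thompson sampling; returns pull counts per arm."""
    rng = random.Random(seed)
    k = len(probs)
    alpha = [1] * k   # Beta(1, 1) uniform priors
    beta = [1] * k
    pulls = [0] * k
    for _ in range(steps):
        # One posterior sample per arm; play the arm whose sample is largest,
        # so each arm is chosen with its posterior probability of being best.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        a = samples.index(max(samples))
        reward = 1 if rng.random() < probs[a] else 0
        alpha[a] += reward        # conjugate posterior update
        beta[a] += 1 - reward
        pulls[a] += 1
    return pulls

pulls = thompson_bernoulli([0.2, 0.7, 0.5])
```

As the posteriors sharpen, the sampling naturally shifts from exploration to exploitation, and it adapts gracefully if the environment changes slowly.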

Page 37: Collective exploration

Rogers' paradox — produce or scrounge?

The social learning tournament — Rendell et al. (2010)

Counter-intuitive more-or-less effects — Toyokawa et al. (2014)



The typical bandit setting is like blind tasting...



My grandma’s problem: Choosing the best place to swim


The machine learner’s problem

Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010, April). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web.


A contextual bandit experiment



A contextual bandit experiment: Results



Contextual multi-armed bandit (CMAB) problem

Option 1: rewards drawn from N(f(·), σ1), e.g. N(w1x1 + w2x2, σ), rather than a fixed N(µ1, σ1) as in the classic bandit.

Option 2: rewards drawn from N(f(·), σ2), e.g. N(w1x1 + w2x2, σ), rather than N(µ2, σ2).


Realistic decision problem...




Motivation

Why is the CMAB problem interesting?

1 It better captures the important characteristics of decisions in the wild.

2 We can study how function learning interacts with decision making, how people deal with novelty, and transfer of learning.

3 TD(λ) & the curse of dimensionality – function learning as a solution. These problems are notoriously hard to solve using optimization techniques.

4 There is no realistic framework within which we can study how people learn their preferences. CMAB might provide us with one.



CMAB task



MAB task



One-shot choices in the test phase

Three alternatives:

Dominating - highest function value.

Neutral - middle function value.

Dominated - lowest function value.



Experimental Design

Training phase

Between-subject design – CMAB or MAB

Contextual multi-armed bandit (CMAB) task – two informative features are visually displayed

Classic multi-armed bandit (MAB) task – control group, features are not visible

20 alternatives, 100 trials

Test phase

Designed to test functional knowledge.

One-shot choices, no outcome feedback.

3 arms in 70 trials.


Gaussian process (GP) based “optimal” solutions

Goal: simultaneously learn and optimize an unknown function.

y = f(x) + ε, ε ∼ N(0, σ²)

GP-based function learning process:

f(x) ∼ GP(m(x), K(x, x′))

K(x, x′) = σf² exp(−(x − x′)² / (2l²))

Two versions of the choice process:

1 Upper confidence bound (GP-UCB): argmax_i m_i + 2√var_i

2 Thompson sampling (GP-Th): draw from p(θ|D, M) for each arm, take the max.
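The GP-UCB rule above can be sketched in a few lines: fit a zero-mean GP with the squared exponential kernel, then pick the candidate maximizing mean plus 2 standard deviations. The kernel parameters and noise level below are illustrative placeholders, not the talk's fitted values.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_f=1.0, l=0.2):
    """K(x, x') = sigma_f^2 * exp(-(x - x')^2 / (2 l^2)) for 1-D inputs."""
    d = A[:, None] - B[None, :]
    return sigma_f**2 * np.exp(-d**2 / (2 * l**2))

def gp_posterior(X_train, y_train, X_test, noise=0.1, **kernel_kw):
    """Posterior mean and variance of a zero-mean GP at the test inputs."""
    K = sq_exp_kernel(X_train, X_train, **kernel_kw)
    K += noise**2 * np.eye(len(X_train))
    Ks = sq_exp_kernel(X_train, X_test, **kernel_kw)
    Kss = sq_exp_kernel(X_test, X_test, **kernel_kw)
    sol = np.linalg.solve(K, Ks)              # K^{-1} Ks
    mean = sol.T @ y_train
    var = np.diag(Kss) - np.sum(Ks * sol, axis=0)
    return mean, var

def gp_ucb_choice(X_train, y_train, candidates, beta=2.0):
    """GP-UCB: argmax_i m_i + beta * sqrt(var_i)."""
    m, v = gp_posterior(X_train, y_train, candidates)
    return int(np.argmax(m + beta * np.sqrt(np.maximum(v, 0.0))))
```

With a single observation at x = 0.5, the rule prefers a distant, still-uncertain candidate: its posterior mean is near zero but its variance term dominates, which is exactly the optimistic exploration GP-UCB encodes.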



GP prior, 1D example

[Figure: four panels over input x ∈ [0, 1] and output f(x) ∈ [−2, 2]: “Prior, Squared exponential kernel, l=1”, “GP-UCB, trial 3”, “GP-UCB, trial 5”, “GP-UCB, trial 20”.]


GP-Thompson, 1D example

[Figure: four panels over input x ∈ [0, 1] and output f(x) ∈ [−2, 2]: “GP-Thompson, trial 2”, “trial 3”, “trial 20”, “trial 100”.]


How much do people rely on knowledge of the relationships between features and alternatives’ values when making decisions?

Can we model people’s behavior using traditional machine learning models?

How do priors about functional relationships affect decision making?

Do people explore the choice set strategically, in order to learn the relationships?


Experiment 1 – Positive linear function

Experiment 1a – Amazon Mechanical Turk

Feature values x drawn from U(0.1, 0.9)

For each arm j in trial t, the payoffs Rj(t) were computed as:

Rj(t) = 2 × x1,j + 1 × x2,j + εj(t).

εj(t) drawn independently for each arm in every trial, from N(0, 0.25).

The task was to maximize the cumulative reward.

186 participants – monetary payoffs.

Experiment 1b – lab replication

Weights and noise rescaled: w1 = 20, w2 = 10, N(0, 2.5).

75 UPF lab participants – monetary payoffs.
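The Experiment 1a payoff scheme can be simulated directly; a minimal sketch, with the feature distribution, weights, and 20-arm setup taken from the slide (I treat the 0.25 in N(0, 0.25) as the standard deviation, which is an assumption):

```python
import random

random.seed(42)

def draw_arms(n_arms=20):
    # Each arm has two feature values drawn from U(0.1, 0.9).
    return [(random.uniform(0.1, 0.9), random.uniform(0.1, 0.9))
            for _ in range(n_arms)]

def payoff(arm):
    # R_j(t) = 2*x1 + 1*x2 + eps, eps ~ N(0, 0.25)
    # (assuming 0.25 is the standard deviation).
    x1, x2 = arm
    return 2 * x1 + 1 * x2 + random.gauss(0, 0.25)

arms = draw_arms()
true_means = [2 * x1 + x2 for x1, x2 in arms]
best = max(range(len(arms)), key=true_means.__getitem__)
```

Because rewards are a noisy linear function of visible features, a participant who learns the weights can identify the best arm from its features alone, whereas in the MAB control the same arm must be found by trial and error.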



Mean choice rank - Exp 1a

[Figure: mean rank of the chosen alternative (1–13) across blocks 1–5 for MAB, CMAB, and GP-UCB, with random performance as a reference; panels for Individual 1 and Individual 2.]


Mean choice rank - Exp 1b

[Figure: mean rank of the chosen alternative (1–13) across blocks 1–5 for MAB, CMAB, and GP-UCB, with random performance as a reference (lab replication).]


One-shot choices in the test phase

Three alternatives:

Dominating - highest function value.

Neutral - middle function value.

Dominated - lowest function value.



One-shot choices in the test phase

[Figure: mean proportion of choices by rank of the chosen alternative (1–3) for the CMABn group across five test conditions: Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, Weight test.]


One-shot choices – Lab replication

[Figure: mean proportion of choices by rank of the chosen alternative (1–3) for the CMABn group across the conditions Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, Weight test.]


Exploration in the feature space

[Figure: heatmaps of choice proportions over Feature 1 × Feature 2 bins ((0.1,0.3], (0.3,0.5], (0.5,0.7], (0.7,0.9]) for the MAB and CMABn groups; proportions roughly 0.1–0.3.]


Exploration in the feature space – First 10 trials

[Figure: heatmaps of choice proportions over Feature 1 × Feature 2 bins for the MAB and CMABn groups; proportions roughly 0.04–0.12.]


Exploration in the feature space – All trials

[Figure: heatmaps of choice proportions over Feature 1 × Feature 2 bins for the MAB and CMABn groups (Exp 1b, lab replication); proportions roughly 0.1–0.3.]


Inter-individual differences: Function-based and naive learners


Clustering according to the test phase performance

[Figure: mean proportion of choices by rank of the chosen alternative (1–3) for clusters 1 and 2 across the five test conditions (Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, Weight test).]


Clusters: Performance in the CMAB task

[Figure: mean rank of the chosen alternative (1–13) across blocks 1–5 for the two CMABn clusters; N1 = 43, N2 = 53; training ranks Rtr = 7 and 4.59; test ranks Rte = 1.94 and 1.24.]


Clusters: Feature space, first 10 trials

[Figure: heatmaps of choice proportions over Feature 1 × Feature 2 bins for CMABn clusters 1 and 2; proportions roughly 0.06–0.15.]


How much do people rely on knowledge of the relationships between features and alternatives’ values when making decisions?

Can we model people’s behavior using traditional machine learning models?

How do priors about functional relationships affect decision making?

Do people explore the choice set strategically, in order to learn the relationships?


Modeling user behavior

Learning: We model participants either as function learners (GP) or as tracking mean rewards (BMT).

1 Gaussian process (GP) function learning model: f(x) ∼ GP(m(x), K(x, x′)), with K(x, x′) = σf² exp(−(x − x′)² / (2l²))

2 Bayesian mean reward tracking (BMT)

Choices: Participants either use uncertainty in balancing exploration and exploitation (UCB) or not (SM).

1 Upper confidence bound (UCB): P(C = i) ∝ exp(θm_i + α√var_i)

2 Softmax (SM): P(C = i) ∝ exp(θm_i)
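The two choice rules differ only in whether an uncertainty bonus enters the softmax. A minimal sketch (function names are mine; θ and α would be fitted per participant, the defaults here are placeholders):

```python
import math

def softmax_choice_probs(means, theta=1.0):
    """SM rule: P(C = i) proportional to exp(theta * m_i)."""
    z = [math.exp(theta * m) for m in means]
    s = sum(z)
    return [v / s for v in z]

def ucb_choice_probs(means, variances, theta=1.0, alpha=1.0):
    """UCB rule: P(C = i) proportional to exp(theta * m_i + alpha * sqrt(var_i)).

    The alpha term rewards uncertain arms, producing directed exploration;
    with alpha = 0 this reduces to the plain softmax.
    """
    z = [math.exp(theta * m + alpha * math.sqrt(v))
         for m, v in zip(means, variances)]
    s = sum(z)
    return [v / s for v in z]
```

With equal means, the UCB rule shifts probability toward the higher-variance arm, which is the behavioral signature the model comparison looks for.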


Modeling user behavior

Model     BICw       N    #   C1         C2
BMT-SM    .54 (.38)  19   4   .48 (12)   .72 (7)
BMT-UCB   .05 (.55)   0   5   .04 (0)    .07 (0)
GP-SM     .27 (.38)  12   4   .34 (11)   .09 (1)
GP-UCB    .02 (.04)   0   5   .03 (0)    .01 (0)
RCM       .11 (.32)   4   0   .11 (3)    .11 (1)

There is evidence for the GP models, especially for participants who know the function well (according to the test task). Models with UCB perform poorly.


How much do people rely on knowledge of the relationships between features and alternatives’ values when making decisions?

Can we model people’s behavior using traditional machine learning models?

How do priors about functional relationships affect decision making?

Do people explore the choice set strategically, in order to learn the relationships?


Experiment 1c – Quadratic and mixed linear function

Training phase

2×2 between-subject design: type of task (CMAB, MAB) and type of function (quadratic, mixed)

Quadratic function: 1 + 60(x1 − .02)² + 60(x2 − .02)² + 30x1x2, noise N(0, 2.5)

Mixed linear function: w1 = 40, w2 = −30, noise N(0, 2.5)

376 participants – Amazon Mechanical Turk – monetary payoffs.

Test phase

Test items for the mixed linear function are the same as for the positive linear one.

Special items for the quadratic function, testing whether people detected the nonlinear nature of the relationship.


Exploration in the feature space, first 10 trials

[Figure: heatmaps of choice proportions over Feature 1 × Feature 2 bins for four groups: MAB mixed, CMAB mixed, MAB quadratic, CMAB quadratic; proportions roughly 0.05–0.10.]


Exploration in the feature space, all trials

[Figure: heatmaps of choice proportions over Feature 1 × Feature 2 bins for MAB mixed, CMAB mixed, MAB quadratic, CMAB quadratic; proportions roughly 0.1–0.3.]


How much do people rely on knowledge of the relationships between features and alternative values when making decisions?

Can we model people's behavior using traditional machine learning models?

How do priors about functional relationships affect decision making?

Do people explore the choice set strategically, to learn the relationships?

55 / 75


Experiment 2 – Function learning pretraining

Exploration to learn the function should depend on...

Uncertainty about the function.
Type of function.
Horizon.
Expecting need for generalization.

Training phase

Mixed design – two between factors, Type of function (positive linear, quadratic) × Horizon (100 or 30 trials in the CMAB phase), and one within factor (with or without a function learning phase).
Function learning task – 100 trials with a single alternative, the same two features and function, accuracy incentivized.
Same positive linear and quadratic functions as before, but alternatives now include randomly drawn intercepts!
425 participants – Amazon Mechanical Turk – monetary payoffs.

56 / 75
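The task structure described on this slide can be sketched as a small simulation. This is a hedged illustration only: the number of alternatives, the weights, the curvature, the noise level, and the intercept distribution are all assumptions, not the experiment's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_alternatives(n_alternatives=20, n_features=2):
    """Each alternative is described by two features drawn from (0.1, 0.9)."""
    return rng.uniform(0.1, 0.9, size=(n_alternatives, n_features))

def payoff(features, kind="linear", intercept=0.0, noise_sd=1.0):
    """Noisy value of one alternative under a positive linear or a
    quadratic function of its features (weights/curvature are assumed)."""
    if kind == "linear":
        base = 10.0 * features.sum()
    elif kind == "quadratic":
        base = 40.0 * np.sum((features - 0.5) ** 2)
    else:
        raise ValueError(f"unknown function type: {kind}")
    return intercept + base + rng.normal(0.0, noise_sd)

# One CMAB round: the learner sees all feature vectors, picks one
# alternative, and observes its noisy payoff. In the pretraining
# variant each alternative also carries a randomly drawn intercept.
X = make_alternatives()
intercepts = rng.normal(0.0, 2.0, size=len(X))
choice = int(rng.integers(len(X)))      # placeholder for a real policy
reward = payoff(X[choice], kind="quadratic", intercept=intercepts[choice])
```

The random intercepts are what separate the pretraining variant from the earlier experiments: the function over features stays the same, but memorizing the value of a single alternative no longer suffices.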

Mean choice ranks

[Figure: mean rank of the chosen alternative (scale 1–13, random performance marked) across 10 blocks, in linear and quadratic panels, for the CMAB, fCMAB, and fCMABs conditions.]

57 / 75

Exploration in the feature space, first 10 trials

[Figure: proportion of choices in each Feature 1 × Feature 2 bin during the first 10 trials, for the CMAB, fCMAB, and fCMABs conditions with linear and quadratic functions.]

58 / 75

Exploration in the feature space, all trials

[Figure: proportion of choices in each Feature 1 × Feature 2 bin across all trials, for the CMAB, fCMAB, and fCMABs conditions with linear and quadratic functions.]

59 / 75


Summary

People learn the function and generalize their knowledge to new decision situations.

But there are inter-individual differences – some people rely on learning the function, others are naive learners; akin to model-based vs. model-free RL.

New flavour of the exploration-exploitation trade-off – evidence that people simultaneously learn and optimize the function.

Priors about the functional relationship can hurt performance.

People do not seem to take the time horizon into account.

People exploit more aggressively when they have been pre-trained on the function.

60 / 75
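The contrast between function learners and naive learners drawn in this summary can be made concrete with two toy epsilon-greedy policies. This is a sketch under assumed exploration rates and a plain least-squares model; these are not the models actually fitted to the experimental data.

```python
import numpy as np

rng = np.random.default_rng(1)

class NaiveLearner:
    """Model-free: tracks a running mean payoff per alternative and
    ignores the features entirely."""
    def __init__(self, n_arms):
        self.means = np.zeros(n_arms)
        self.counts = np.zeros(n_arms)

    def choose(self, X, eps=0.1):
        if rng.random() < eps:
            return int(rng.integers(len(self.means)))
        return int(np.argmax(self.means))

    def update(self, arm, reward, X=None):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

class FunctionLearner:
    """Model-based: fits a linear map from features to payoff by least
    squares and greedily picks the alternative with the best prediction."""
    def __init__(self, n_features):
        self.rows, self.targets = [], []
        self.w = np.zeros(n_features + 1)   # intercept + feature weights

    def choose(self, X, eps=0.1):
        if rng.random() < eps or not self.rows:
            return int(rng.integers(len(X)))
        Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend intercept column
        return int(np.argmax(Xb @ self.w))

    def update(self, arm, reward, X):
        self.rows.append(np.concatenate([[1.0], X[arm]]))
        self.targets.append(reward)
        self.w, *_ = np.linalg.lstsq(
            np.array(self.rows), np.array(self.targets), rcond=None)
```

On a noiseless linear payoff, the FunctionLearner recovers the weights after a handful of distinct observations and can rank alternatives it has never tried, while the NaiveLearner must sample each alternative separately, which is exactly the generalization gap the summary points to.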


Summary

Challenges and future directions

The goal is to develop a function-learning-based RL model at the algorithmic level.

Moreover, it is difficult to fit function learning models without prediction data. However, asking for predictions along with choices changes the behaviour.

How do people behave in the presence of information about the alternatives and other contextual information?

61 / 75


Acknowledgments

Funding:

FPU grant, Ministry of Education, Culture and Sports, Spain

Max Planck Institute for Human Development, Berlin

Barcelona Graduate School of Economics

62 / 75


Quadratic function – An illustration

[Figure: illustration of the quadratic payoff function used in the experimental design.]

63 / 75
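The illustration itself is lost in this transcript. As a stand-in, here is a minimal sketch of what a quadratic value surface over the two features might look like; the exact shape and the curvature constant are assumptions, not the function actually used.

```python
import numpy as np

# Hypothetical quadratic payoff over two features in (0.1, 0.9): value
# grows with the squared distance from the centre of the feature space,
# so, unlike the linear case, the minimum sits in the middle and the
# maxima at the corners (the curvature of 40 is an assumed constant).
def quadratic_value(f1, f2, curvature=40.0):
    return curvature * ((f1 - 0.5) ** 2 + (f2 - 0.5) ** 2)

grid = np.linspace(0.1, 0.9, 5)   # roughly the bin midpoints of the figures
surface = np.array([[quadratic_value(a, b) for b in grid] for a in grid])
```

A learner with a linear prior would misrank the corner and centre alternatives on a surface like this, which is the kind of error the Max, Min, and Slope test items are designed to detect.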

Individual behaviour in the training phase – Experiment 1a

[Figure: rank of the chosen alternative over 100 trials for subject e2−0124, CMABn condition, LowNoise experiment; mean choice rank overlaid.]

64 / 75

Individual behaviour in the training phase – Experiment 1a

[Figure: rank of the chosen alternative over 100 trials for subject e2−0065, CMABn condition, LowNoise experiment; mean choice rank overlaid.]

65 / 75

Mean choice rank – Lab replication

[Figure: mean rank of the chosen alternative (scale 1–13, random performance marked) across 5 blocks, MAB vs. CMAB conditions.]

66 / 75

One-shot choices – Lab replication

[Figure: mean proportion of choices by rank of the chosen alternative (1–3) in the CMABn condition, for the Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, and Weight test items.]

67 / 75

Feature space, all trials – Lab replication

[Figure: proportion of choices in each Feature 1 × Feature 2 bin across all trials, MAB vs. CMABn conditions.]

68 / 75

Feature space, first 10 trials – Lab replication

[Figure: proportion of choices in each Feature 1 × Feature 2 bin during the first 10 trials, MAB vs. CMABn conditions.]

69 / 75

Mean choice rank – Mixed and quadratic

[Figure: mean rank of the chosen alternative (scale 1–13, random performance marked) across 5 blocks, in mixed and quadratic panels, for the MAB mixed, CMAB mixed, MAB quadratic, and CMAB quadratic conditions.]

70 / 75

One-shot choices – Mixed and quadratic

[Figure: mean proportion of choices by rank of the chosen alternative (1–3); CMAB mixed panels show the Easy, Difficult, and Weight test items, CMAB quadratic panels show the Max, Min, and Slope test items (versions 1 and 2).]

71 / 75

Mean choice rank – Positive and quadratic

[Figure: mean rank of the chosen alternative (scale 1–13, random performance marked) across 10 blocks, in linear and quadratic panels, for the CMAB, fCMAB, and fCMABs conditions.]

72 / 75

One-shot choices – Positive linear

[Figure: mean proportion of choices by rank of the chosen alternative (1–3) for the Easy, Difficult, and Weight test items, CMAB linear vs. fCMAB linear conditions.]

73 / 75

One-shot choices – Quadratic

[Figure: mean proportion of choices by rank of the chosen alternative (1–3) for the Max, Min, and Slope test items (versions 1 and 2), CMAB quadratic vs. fCMAB quadratic conditions.]

74 / 75