
Page 1

Institute for Theoretical Computer Science CGAIDE, Reading UK, 10th November 2004

Reinforcement Learning of Strategies for Settlers of Catan

Michael Pfeiffer ([email protected])

Institute for Theoretical Computer Science

Graz University of Technology, Austria

Page 2

Motivation

- Computer Game AI: mainly relies on prior knowledge of the AI designer; inflexible and non-adaptive
- Machine Learning in Games: successfully used for classical board games
- TD-Gammon [Tesauro 95]: self-play reinforcement learning; playing strength of human grandmasters

[Figures from Sutton, Barto: Reinforcement Learning]

Page 3

Goal of this Work

- Demonstrate self-play Reinforcement Learning (RL) for a large and complex game
- Settlers of Catan: a popular board game with more in common with commercial computer strategy games than backgammon or chess, in terms of: number of players, possibilities of actions, interaction, non-determinism, ...
- New RL methods: model-tree-based function approximation, speeding up learning
- Combination of learning and knowledge: where in the learning process can we use our prior knowledge about the game?

Page 4

Agenda

Introduction

Settlers of Catan

Method

Results

Conclusion

Page 5

The Task: Settlers of Catan

Popular modern board game (1995)

- Resources
- Production
- Construction
- Victory Points
- Trading
- Strategies

Page 6

What makes Settlers so difficult?

Huge state and action space

4 players

Non-deterministic environment

Interaction with opponents

Page 7

Agenda

Introduction

Settlers of Catan

Method

Results

Conclusion

Page 8

Reinforcement Learning

- Goal: maximize cumulative discounted rewards
- Learn the optimal state-action value function Q*(s, a)
- Learning of strategies through interaction with the environment:
  - try out actions to get an estimate of Q
  - explore new actions, exploit good actions
  - improve the currently learned policy
- Various learning algorithms: Q-Learning, SARSA, ... (update rule sketched below)
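For reference, the one-step Q-Learning update underlying these algorithms, in the standard textbook form of Sutton and Barto (not specific to this talk), with learning rate α and discount factor γ:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
$$

SARSA differs only in replacing the max over a' with the action actually taken in s_{t+1}.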

Page 9

Self Play

How to simulate opponents?

- Agent learns by playing against itself (co-evolutionary approach)
- Most successful approach for RL in games: TD-Gammon [Tesauro 95]
- Apparently works better in non-deterministic games
- Sufficient exploration must be guaranteed

Page 10

Typical Problems of RL in Games

- State space is too large -> value function approximation
- Action space is too large -> hierarchy of actions
- Learning time is too long -> suitable representation and approximation method
- Even obvious moves need to be discovered -> a-priori knowledge

Page 11

Function Approximation

- Impossible to visit the whole state space
- Need for generalization from visited states to the whole state space
- Regression task: Q(s, a) ≈ F(φ(s), a, θ), where φ(s) is a feature representation of s and θ is a finite parameter vector (e.g. the weights of linear functions or ANNs)
- Features for Settlers of Catan: 216 high-level concept features (using knowledge)

Page 12

Choice of Approximator

- Discontinuities in the value function: global smoothing is undesirable
- Local importance of certain features: impossible with linear methods
- Learning time is crucial
- Tree-based approximation techniques learn faster than ANNs in such scenarios [Sridharan and Tesauro, 00]

Page 13

Model Trees

- Q-values are the regression targets
- Partition the state space into homogeneous regions
- Learn local linear regression models in the leaves
- Generalization via pruning
- M5 learning algorithm [Quinlan, 92] (data structure sketched below)
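To make the data structure concrete, here is a minimal generic sketch (not the thesis code; the class name and fields are invented for illustration): inner nodes route a feature vector through threshold tests, and each leaf evaluates its own local linear model.

```python
import numpy as np

class ModelTreeNode:
    """Generic model-tree node: inner nodes split on one feature,
    leaves predict with a local linear regression model."""

    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 weights=None, bias=0.0):
        self.feature = feature      # split feature index (inner nodes)
        self.threshold = threshold  # split threshold (inner nodes)
        self.left, self.right = left, right
        self.weights = weights      # linear coefficients (leaves)
        self.bias = bias            # intercept of the leaf model

    def predict(self, x):
        """Route x down to a leaf, then apply that leaf's linear model."""
        if self.weights is not None:   # reached a leaf
            return float(np.dot(self.weights, x)) + self.bias
        child = self.left if x[self.feature] <= self.threshold else self.right
        return child.predict(x)
```

In this setting, the Q-function of one action would be one such tree over the 216-element feature vector; M5 grows the splits and fits the leaf models, and pruning replaces subtrees whose separate leaf models do not generalize better than a single merged one.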

Page 14

Pros and Cons of Model Trees

Pros:
+ Discrete and real-valued features
+ Ignores irrelevant features
+ Local models: feature combinations, discontinuities
+ Easy interpretation
+ Few parameters

Cons:
- Only offline learning
- Need to store all training examples
- Long training time
- Little experience in RL context
- No convergence results in RL context

Page 15

Offline Training Algorithm

One model tree approximates the Q-function for one action

1. Use current policy to play 1000 training games

2. Store game traces (states, actions, rewards, successor states) of all 4 players

3. Use current Q-function approximation (model trees) to calculate Q-values of training examples and add them to existing training set

4. Update older training examples

5. Build new model trees from the updated training set

6. Go back to step 1 (this loop is sketched below)
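The loop condenses to the following sketch (assumptions flagged: the game simulator, feature extractor, and M5 tree learner are passed in as callables, `None` marks a terminal successor state, and all names are illustrative rather than taken from the talk):

```python
def offline_training(play_games, features, build_model_tree,
                     actions, n_iterations, gamma=0.95):
    """Sketch of the offline loop above, in the style of fitted
    Q-iteration: play, relabel all stored transitions with fresh
    bootstrapped targets, refit one model tree per action."""
    trees = {a: None for a in actions}   # one model tree per action
    transitions = []                     # stored (s, a, r, s_next) traces

    def q_max(s):
        # max_a Q(s, a) under the current trees; 0.0 before the first fit
        fitted = [t for t in trees.values() if t is not None]
        return max(t.predict(features(s)) for t in fitted) if fitted else 0.0

    for _ in range(n_iterations):
        # Steps 1-2: play training games with the current policy, keep traces
        transitions += play_games(trees)

        # Steps 3-4: (re)label all stored transitions, old and new
        examples = [(features(s), a,
                     r if s_next is None else r + gamma * q_max(s_next))
                    for (s, a, r, s_next) in transitions]

        # Step 5: rebuild one model tree per action from the updated set
        for a in actions:
            data = [(x, y) for (x, act, y) in examples if act == a]
            if data:
                trees[a] = build_model_tree(data)   # e.g. M5 [Quinlan, 92]
    return trees                                    # Step 6: loop repeats
```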

Page 16

Hierarchical RL

- Division of the action space: 3-layer model
- Easier integration of a-priori knowledge
- Learned information defines primitive actions, e.g. placing roads, trading (see the control-loop sketch below)
- Independent rewards:
  - high level: winning the game
  - low level: reaching the behavior's goal
  - otherwise zero
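A minimal control-loop sketch of how such a 3-layer agent might act (all names here are hypothetical; the extra slide on page 29 adds that high-level behaviors always run until completion):

```python
def run_hierarchical_episode(env, choose_behavior, low_policies, behavior_done):
    """Illustrative 3-layer loop: the high level selects a behavior,
    and the behavior's learned low-level policy then emits primitive
    actions (e.g. placing a road, trading) until its goal is reached."""
    state = env.reset()
    while not env.game_over():
        behavior = choose_behavior(state)            # high-level decision
        while not (behavior_done(behavior, state) or env.game_over()):
            action = low_policies[behavior](state)   # low-level decision
            state = env.step(action)                 # primitive game move
    return env.winner()
```

Under the reward scheme above, `choose_behavior` would be trained on the win/lose signal, while each entry of `low_policies` is trained on whether its own behavior goal was reached.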

Page 17

Example: Trading

- Select which trades to offer / accept / reject
- Evaluation of a trade: what increase in low-level value would each trade bring? Select the highest-valued trade
- Simplification of game design: no economic model needed; the gain in value function naturally replaces prices (see the sketch below)
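A sketch of that evaluation rule (all helper names hypothetical): score each candidate trade by the gain it yields in the current low-level value function and pick the best one.

```python
def best_trade(state, candidate_trades, low_value, apply_trade):
    """Pick the trade with the largest increase in low-level value;
    decline if no trade improves on the current state."""
    if not candidate_trades:
        return None
    baseline = low_value(state)                       # value of standing pat
    scored = [(low_value(apply_trade(state, t)) - baseline, t)
              for t in candidate_trades]
    gain, trade = max(scored, key=lambda pair: pair[0])
    return trade if gain > 0 else None                # reject losing trades
```

The same comparison can price incoming offers: accept an offer exactly when its value gain is positive, which is how the value function replaces an explicit economic model.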

Page 18

Approaches

- Apply learning or prior knowledge at different stages of the learning architecture
- Pure Learning: learning at both levels
- Heuristic High-Level: simple hand-coded high-level strategy; learning of low-level policies (= AI modules)
- Guided Learning: use the hand-coded strategy only during learning; off-policy learning of the high-level policy

Page 19

Agenda

Introduction

Settlers of Catan

Method

Results

Conclusion

Page 20

Evaluation Method

- 3000-8000 training matches per approach
- Long training time: 1 day for 1000 training games, 1 day for training of the model trees
- Evaluation against: random players, the other approaches, a human player (the author); no benchmark program available

Page 21

Relative Comparison

- Combination of learning and prior knowledge yields the best results
- Low-level learning is responsible for the improvement
- Guided approach: worse than heuristic, better than pure learning

[Chart: victories of heuristic strategies against the other approaches (20 games)]

Page 22

Against Human Opponent

- 3 agents vs. the author
- Performance measure: average maximum victory points (10 VP: win every game; 8 VP: close to winning in every game)
- The heuristic policy wins 2 out of 10 matches
- The best policies are successful against opponents not encountered in training
- Demo matches confirm the results (not included here)

[Chart: victory points of different strategies against a human opponent (10 games)]

Page 23

Agenda

Introduction

Settlers of Catan

Method

Results

Conclusion

Page 24

Conclusion

- RL works in large and complex game domains: not grandmaster level like TD-Gammon, but pretty good
- Settlers of Catan is an interesting testbed, closer to commercial computer games than backgammon, chess, ...
- Model trees as a new approximation methodology for RL
- Combination of prior knowledge with RL yields promising results:
  - hierarchical learning allows incorporation of knowledge at multiple points of the learning architecture
  - learning of AI components
  - knowledge speeds up learning

Page 25

Future Work

- Opponent modeling: recognizing and beating certain opponent types
- Reward filtering: how much of the reward signal is caused by the other agents?
- Model trees: other games; improvement of the offline training algorithm (tree structure)
- Settlers of Catan as a game AI testbed: trying other algorithms, improving results

Page 26

Thank you!

Page 27

Sources

M. Pfeiffer: Machine Learning Applications in Computer Games, MSc Thesis, Graz University of Technology, 2003

J.R. Quinlan: Learning with Continuous Classes, Proceedings Australian Joint Conference on AI, 1992

M. Sridharan, G.J. Tesauro: Multi-agent Q-learning and Regression Trees for Automated Pricing Decisions, Proceedings ICML 17, 2000

R. Sutton, A. Barto: Reinforcement Learning: An Introduction, MIT Press, Cambridge, 1998

G.J. Tesauro: Temporal Difference Learning and TD-Gammon, Communications of the ACM 38, 1995

K. Teuber: Die Siedler von Catan, Kosmos Verlag, Stuttgart, 1995

Page 28

Extra Slides

Page 29

Approaches

- High-level behaviors always run until completion
- Allowing high-level switches every time-step (feudal approach) did not work
- Module-based Approach: high-level is learned; low-level is learned
- Heuristic Approach: simple hand-coded high-level strategy during training and in the game; low-level is learned; the selection of the high-level influences the primitive actions
- Guided Approach: hand-coded high-level strategy during learning; off-policy learning of the high-level strategy for the game; low-level is learned

Page 30

Comparison of Approaches

- Module-based: good low-level choices, poor high-level strategy
- Heuristic high-level: significant improvement; the learned low-level is clearly responsible for the improvement
- Guided approach: worse than heuristic, better than module-based

[Chart: victories of heuristic strategies against the other approaches (20 games)]

Page 31

Comparison of Approaches

- Comparison of the strategies in games against each other
- All significantly better than random; heuristic is best

Page 32

Against Human Opponent

- 10 games of each policy vs. the author (3 agents vs. the human)
- Average victory points as the measure of performance (10 VP: win every game; 8 VP: close to winning in every game)
- Only the heuristic policy wins: 2 out of 10 matches
- Demo matches confirm the results (not included here)

[Chart: victory points of different strategies against a human opponent (10 games)]