
INTRODUCING HEURISTIC FUNCTION EVALUATION FRAMEWORK

Nera Nešić

January 2016

Master of Science in Computer Science


INTRODUCING HEURISTIC FUNCTION EVALUATION FRAMEWORK

Nera Nešić
Master of Science
Computer Science
January 2016
School of Computer Science
Reykjavík University

M.Sc. RESEARCH THESIS
ISSN 1670-8539


Introducing Heuristic Function Evaluation Framework

by

Nera Nešić

Research thesis submitted to the School of Computer Science
at Reykjavík University in partial fulfillment of
the requirements for the degree of
Master of Science in Computer Science

January 2016

Research Thesis Committee:

Stephan Schiffel, Supervisor
Assistant Professor, Reykjavík University, Iceland

Mark Winands
Assistant Professor, Maastricht University, Netherlands

Yngvi Björnsson
Professor, Reykjavík University, Iceland


Copyright
Nera Nešić

January 2016


The undersigned hereby certify that they recommend to the School of Computer Science at Reykjavík University for acceptance this research thesis entitled Introducing Heuristic Function Evaluation Framework submitted by Nera Nešić in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Date

Stephan Schiffel, Supervisor
Assistant Professor, Reykjavík University, Iceland

Mark Winands
Assistant Professor, Maastricht University, Netherlands

Yngvi Björnsson
Professor, Reykjavík University, Iceland


The undersigned hereby grants permission to the Reykjavík University Library to reproduce single copies of this research thesis entitled Introducing Heuristic Function Evaluation Framework and to lend or sell such copies for private, scholarly or scientific research purposes only.

The author reserves all other publication and other rights in association with the copyright in the research thesis, and except as herein before provided, neither the research thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

Date

Nera Nešić
Master of Science


Introducing Heuristic Function Evaluation Framework

Nera Nešić

January 2016

Abstract

We propose a paradigm for evaluating game heuristic functions in which we define a set of metrics, each measuring an aspect of a heuristic's performance, and use them to evaluate the heuristic function by comparing the function's output against a pre-computed benchmark containing a set of states from a game and ground-truth values for each of their moves. The advantage of our approach is that it is fast (once the benchmark is computed) and focused on specific user-defined questions. While the ideal benchmark dataset would have minimax action values, these values can sometimes be too difficult to obtain, so we investigate the possibility of generating datasets using the MCTS algorithm. We compare the performance of MCTS datasets for Connect Four to the game-theoretical one, and identify a set of metrics which can be reliably used with the MCTS dataset. Finally, we present two case studies, in which we show how our framework can be used to gain a better understanding of a heuristic function's behavior.


Introducing a Framework for Evaluating Heuristic Functions

Nera Nešić

January 2016

Abstract

We propose a paradigm for evaluating heuristic functions in games. The paradigm defines metrics, each of which measures an aspect of the heuristic function's performance. The metrics are then used to compare the output of the heuristic functions against results from pre-computed benchmarks, computed from a set of game states and the ground-truth value of each move from those states. The advantage of this approach is that it is fast, once the pre-computed values have been calculated, and it is designed to answer specific user-defined questions. We then examine the possibility of using the MCTS algorithm to generate the pre-computed datasets; ideally they would contain minimax values, but those can take too long to compute. We compare the performance of MCTS datasets against game-theoretical datasets for the game of Connect Four, and identify metrics that can be used reliably with the MCTS datasets. Finally, we present two case studies in which we show how our framework can be used to gain a better understanding of the behavior of heuristic functions.


Here’s to the future, for the dreams of youth.


Acknowledgements

I would like to thank my advisor, Stephan Schiffel, for his patience and guidance through this project, for the time he spent discussing ideas with me, and for the enthusiasm with which he did it.

I would like to thank my other thesis examiners, Yngvi Björnsson and Mark Winands, for their constructive comments and valuable questions during the thesis defense.

I would like to thank Stefanía Bergljót Stefánsdóttir, Atli Sævar Guðmundsson, Arnar Freyr Bjarnason, and Ægir Már Jónsson for lending me their Connect Four heuristic to use as a case study.

Last but not least, I would like to thank all my family, friends, and colleagues, for the inexhaustible support they have given me during this project.


Contents

List of Figures

List of Tables

1 Introduction
  1.1 Related Work
  1.2 Heuristic Function Evaluation Framework
  1.3 HEF Use Example
  1.4 Monte Carlo Tree Search Dataset Generation
  1.5 Outline of the Thesis

2 Background
  2.1 General Game Playing
    2.1.1 Competitions
    2.1.2 Competition Games
    2.1.3 Game Description Language
    2.1.4 GGP and HEF
  2.2 Monte Carlo Tree Search
    2.2.1 Monte Carlo Methods
    2.2.2 Bandit Problems
    2.2.3 Monte Carlo Tree Search
    2.2.4 MCTS and HEF
  2.3 Conclusion

3 Heuristic Function Evaluation Framework Overview and Implementation
  3.1 Benchmark Data Generation Layer
    3.1.1 The Benchmark Database
    3.1.2 Benchmark State Selection
    3.1.3 Obtaining Ground Truth Action Values
  3.2 Heuristic Analysis Layer
    3.2.1 Implementation
  3.3 Data Visualization Layer
    3.3.1 Implementation
  3.4 Proposed HEF Metrics
    3.4.1 Game Behavior by Depth
    3.4.2 Best-Only Move Accuracy Metrics
    3.4.3 Move Ordering Metrics
    3.4.4 Expected Score Metric
    3.4.5 Move Categorization Metrics
  3.5 Conclusion

4 Working With MCTS Datasets
  4.1 Experimental Setup
  4.2 Identifying the Best Move
  4.3 Depth Statistics
  4.4 Best-Only Metrics
    4.4.1 K-Best Accuracy
    4.4.2 Strict-Best Accuracy
  4.5 Expected Score Metrics
  4.6 Move Ordering and Categorization Metrics
  4.7 Conclusion

5 Case Studies
  5.1 Action Heuristic Analysis
    5.1.1 The Action Heuristic Study
    5.1.2 Studying Action Heuristics with HEF
  5.2 Analyzing Student Written Connect4 Heuristics
    5.2.1 Nilli Heuristics
    5.2.2 Nilli Meets HEF
    5.2.3 Comparing Nilli and Action Heuristics
  5.3 Conclusion

6 Conclusions
  6.1 Future Work


List of Figures

2.1 Using random sampling to approximate the value of π
2.2 Outline of a Monte Carlo Tree Search algorithm
2.3 Probability of failure to select the optimal move in a game

3.1 Design of HEF benchmark database
3.2 Class hierarchy of HEF evaluators
3.3 Screenshot of HEF in action

4.1 Accuracy of MCTS datasets in identifying the best move
4.2 Performance of score differential metrics with MCTS datasets
4.3 Performance of score differential metrics using the N value on MCTS datasets
4.4 Accuracy of O-distance best move selection in MCTS datasets
4.5 Accuracy of derivative based best move selection in MCTS datasets
4.6 Performance of the K-best metric on the MCTS dataset
4.7 Performance of the strict-best metric on the MCTS dataset
4.8 Performance of expected score metrics on MCTS datasets

5.1 Best-only metrics evaluation of action heuristics
5.2 HEF analysis of Nilli heuristic
5.3 Comparison of Nilli and Action Heuristics performance against HEF metrics


List of Tables

4.1 Average number of simulations per state in MCTS datasets

5.1 Performance of GGP MCTS players using action heuristics to guide the simulation
5.2 Average number of simulations per state in MCTS datasets for analyzed games


Chapter 1

Introduction

Many computer scientists who took an Artificial Intelligence related course will remember that one assignment which consists of writing an AI player for a game. AI courses like to use it to give students a chance to implement alpha-beta search and to explore the intricacies of designing a good heuristic function to be used in conjunction with it. Verifying that the implementation of the search itself is correct is rather simple, since we know exactly how the algorithm is supposed to behave. When it comes to the heuristic part, however, things get fuzzier. Students typically start with a reasonably sensible heuristic, have their player play a few games using this heuristic, get terrible results, step through a game and observe how the heuristic behaves in a handful of states, try fixing the issues they discover, run a few more games, and repeat this process until they feel satisfied with the results.

Of course, it is precarious to base the evaluation of a heuristic function on just a few playthroughs. Heuristics are fuzzy by definition; they provide an educated guess about the quality of problem states which the solving agent is not able to evaluate in any more precise way. For this reason, many studies involving heuristics evaluate their performance empirically across numerous instances of the problems they aim to solve. For example, in General Game Playing it is usual for researchers to evaluate their results by having their player play a few hundred matches against some baseline player. This kind of intensive empirical evaluation is good for presenting the final results of a study with confidence; however, behind every such study there is a lot of trial and error during development, while the new heuristic is still being designed and fine-tuned, and every time the heuristic is changed, it is necessary to evaluate the effects of the change. While the mentioned evaluation method can also be used for such intermediate stages of a heuristic, this can be very time intensive. A match in General Game Playing, for example, can take between half an hour and several hours, and having to wait for the results of hundreds of matches to see if the latest tweak to the heuristic is doing what it should do can be a major setback in this kind of study.

Heuristic functions are employed in a wide variety of search problems, in which they are used to help steer the search in the right direction. The function provides the search algorithm with an estimate of a state's value; in A* search this can mean the approximate distance between two nodes, in alpha-beta search the heuristic is used as an indication of how close we believe a state to be to a certain goal value, and so on. In most cases it is necessary to evaluate the performance of the heuristic at development time, so finding a fast way of doing so would benefit many studies in various areas.

In this thesis we present a paradigm for the evaluation of heuristic functions, and we provide a framework that implements it. The main idea behind our evaluation paradigm is that if we create a benchmark consisting of a large selection of states with pre-computed ground-truth values, we can use it to analyze specific aspects of a heuristic's performance by defining appropriate metrics. These metrics compare a heuristic's evaluation of a benchmark state to the state's ground-truth value, and score the heuristic's performance on each individual benchmark state according to their criteria. Scores across all benchmark states are then aggregated and used for analysis. Generating the benchmark will generally be a computationally intensive job, but once the benchmark is created, users will be able to obtain valuable information about a heuristic's performance with very little additional computation, which will help them guide the development of the heuristic without having to rely on time-consuming complete empirical evaluation methods.

1.1 Related Work

As Anantharaman discusses in [1], the team behind Deep Thought (a chess player that would be succeeded by Deep Blue) also encountered the difficulty of evaluating the heuristic function for their player. The evaluation of heuristic functions for chess up to that point had relied either on benchmarks of states, which Anantharaman criticizes for often being too focused on evaluating one aspect of the gameplay (such as mating) and thus producing over-fitting, or on computer-computer matches which would rate the player using the USCF rating system. He estimates that, using the second approach, approximately 1000 matches would be required to achieve a precision of less than 10 rating points (which at the time of writing meant 40-50 days of computation).


Instead, he proposes a position-based evaluation in which he compares the performance of the heuristic coupled with minimax search (called the test program) against a reference program, consisting of the same heuristic and minimax search but allowed much longer search time. This evaluation scheme is based on the observation that in chess a program gains 100 rating points by doubling the search time allowed per move, so the heuristic function would always be evaluated against a computer player which is at least as good as itself. Anantharaman then proposes several metrics to compare the performance of the heuristic relative to the reference program, and measures the correlation between these metrics and the USCF rating system, concluding that some of the metrics can be used reliably to evaluate heuristic functions, with only one day of computation needed.

Our approach to evaluating heuristics is similar to Anantharaman's in that we focus on evaluating the heuristic's performance on individual states using user-defined metrics. Our emphasis is, however, on producing a paradigm which can recycle computation, while the approach proposed by Anantharaman still requires plenty of search time to compute the reference. Moreover, we aim for generality of use – we would like our paradigm to be applicable to functions designed to be used with a variety of search algorithms (we can, for example, analyze functions intended both for minimax and for guiding MCTS random search).

1.2 Heuristic Function Evaluation Framework

The goal of the Heuristic Function Evaluation Framework (HEF) is to provide all the necessary infrastructure to support our proposed paradigm. The framework offers utilities for benchmark generation and management, data access, metric analysis, and data visualization, allowing users to focus only on defining the metrics that fit their study. The design of HEF is divided into three layers: benchmark management, metric analysis, and visualization. The first layer maintains a benchmark database and offers dataset generation services, including selection of states to include in the dataset, ground-truth computation, and data persistence. By default, HEF operates on the General Game Playing (GGP) framework. GGP is a subfield of AI which is concerned with developing intelligent agents capable of playing whatever game they are presented with, with no need for prior knowledge about the game. What we like about GGP is that it provides a Game Description Language, which can be used to specify the rules and state description of any GGP-compliant game. This allows the various services that HEF provides to operate seamlessly on different games. This means that the default state selector and benchmark generator only need to be given the game rule sheet to generate the benchmark data. If users wish to define their own data generators, on the other hand, they only need to ensure that the data is output in the correct XML template. HEF will then store the data, and from there on the framework is format agnostic.

The metric analysis layer offers the infrastructure needed to perform the analysis of heuristic functions. While this layer is designed to be extended as needed with additional user-defined metrics, we propose a set of metrics we believe will be a good starting point for the analysis of most heuristic functions. The last layer provides a GUI through which users request analyses and manage their parameters. This layer is responsible for presenting the results of the analysis visually. The heuristic's scores are presented as a function of benchmark state depth, which allows the user to gain a better understanding of the heuristic's behavior in different game regions.

1.3 HEF Use Example

To give an example of the use of our framework, let us imagine we want to design a good heuristic function for Connect Four which evaluates the actions available to a player in a state. Before we can start with the analysis of the function, we first build a benchmark for this game, for which we select a set of states from the game and pre-compute the ground-truth values for each action available in each state. Once the benchmark is ready, we can define our metrics. A metric could, for instance, measure how accurate our heuristic is at identifying the best move available in a state. Such a metric would return 1 for every state in which the heuristic's best action corresponds to the benchmark's best action, and 0 otherwise. When the metric is ready, HEF runs it against individual benchmark states, then aggregates the scores grouped by state depth (i.e. the state's distance from the search tree's initial state) and graphs them.
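As an illustration only, the following Python sketch shows what such a metric and its depth-based aggregation might look like; the record fields (state, depth, action_values) and the function names are hypothetical and do not correspond to HEF's actual API.

from collections import defaultdict

def best_move_accuracy(truth_values, heuristic_values):
    # Return 1 if the heuristic's best action matches the benchmark's best action, 0 otherwise.
    heuristic_best = max(heuristic_values, key=heuristic_values.get)
    benchmark_best = max(truth_values, key=truth_values.get)
    return 1 if heuristic_best == benchmark_best else 0

def aggregate_by_depth(benchmark, heuristic):
    # Run the metric on every benchmark state and average the scores per state depth.
    scores = defaultdict(list)
    for record in benchmark:  # record.state, record.depth, record.action_values are hypothetical fields
        heuristic_values = {action: heuristic(record.state, action) for action in record.action_values}
        scores[record.depth].append(best_move_accuracy(record.action_values, heuristic_values))
    return {depth: sum(values) / len(values) for depth, values in scores.items()}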

Looking at the graph, the users could see, for example, that the heuristic's accuracy according to this metric is lower in the initial states, or that there are some areas in the game in which the accuracy of the heuristic plummets, and so on. After seeing the results, the users might have further questions about the heuristic. For example, if they see that the best-move accuracy is low, they may wonder how well the heuristic performs regarding the ordering of the rest of the moves. HEF was designed to allow users to easily extend it and add their own metrics and heuristic functions to it, so that any follow-up questions can be quickly implemented and evaluated.


1.4 Monte Carlo Tree Search Dataset Generation

Generating benchmark datasets is not always straightforward. In an ideal benchmark for a game, states and actions would have game-theoretical ground-truth values, which often require an immense amount of computation, optimization, and study to obtain. If a user does not have the luck of working with a game for which an efficient way of evaluating states to their game-theoretical values is available, generating a good benchmark can take an enormous amount of work, which could easily cancel out the benefits of HEF.

As one of the major points of our work, we propose the Monte Carlo Tree Search (MCTS) algorithm[2] as an alternative method for generating HEF benchmarks. Coupled with the Upper Confidence Bounds applied to Trees (UCT) policy, MCTS has been successfully employed by various game players, where it has produced good results. Its main strength lies in being an anytime algorithm that requires no external knowledge of the game to find good moves; it estimates the potential of the available moves based on simulations and its exploitation-exploration policy. As Kocsis and Szepesvári show, given enough time MCTS-UCT will converge to game-theoretical values, while still having a small error probability if stopped prematurely[3].

In our approach, we generate a set of states to be used in the benchmark, then run the MCTS-UCT algorithm on each state for an extended amount of time (an hour or more). When the algorithm terminates, it gives us two values for each move available in the state: the number of simulations run starting with the move, and the average score over all simulations achieved by that move. Unfortunately, as we will discuss, these values do not always map directly to the game-theoretical scores. We examine the reliability of the MCTS dataset for the game of Connect Four by comparing the dataset to its game-theoretical equivalent, which we can generate thanks to a pre-existing, optimized Connect Four solver.
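For illustration, a hypothetical benchmark entry for one state could then look as follows in Python; the move labels and numbers are invented, and picking the most-simulated move as the tentative best is just one common convention, not necessarily the one adopted in later chapters.

from dataclasses import dataclass

@dataclass
class MoveStats:
    n: int     # number of simulations that started with this move
    q: float   # average score achieved over those simulations

# Invented example values for a single Connect Four state:
state_entry = {
    "drop column 4": MoveStats(n=412000, q=71.4),
    "drop column 3": MoveStats(n=55000, q=48.9),
    "drop column 7": MoveStats(n=33000, q=42.1),
}

def most_simulated_move(moves):
    # The move UCT spent the most simulations on is a common proxy for the best move.
    return max(moves, key=lambda move: moves[move].n)

print(most_simulated_move(state_entry))  # "drop column 4"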

1.5 Outline of the Thesis

In Chapter 2 we introduce background information relevant to this study. We include a short overview of General Game Playing in Section 2.1 to familiarize the reader with the platform on which we based the default implementation of HEF, and which we used extensively for testing HEF. In Section 2.2 we present the MCTS-UCT algorithm, used by HEF for generating benchmarks. In Chapter 3 we discuss the implementation of HEF; in Section 3.4 we propose a set of metrics that can be used in HEF analysis. In Chapter 4 we investigate the feasibility of MCTS-generated datasets for HEF analysis. We compare the performance of HEF metrics on such a dataset for the game of Connect Four to a corresponding dataset with game-theoretical values, and discuss the possibilities and limitations of MCTS for benchmark generation. In Chapter 5 we present two case studies in which we apply HEF analysis to heuristics, and show what insight we were able to gain into their performance. The first heuristic we study is constructed automatically by a GGP player, and we evaluate its performance across several games. We have two snapshots of the second heuristic, which was designed by students as part of a homework assignment, and we study the change in performance between the two versions. In Chapter 6 we summarize our findings and propose a course for further studies.


Chapter 2

Background

In this chapter we provide some background to familiarize the reader with the subjects that this thesis builds on. First, we introduce General Game Playing (GGP), a platform centered on agents able to play any game that they receive a description of. The conventions and grammar established by GGP are useful in studies involving multiple games; having a unified syntax for encoding game states makes it easy to write metrics and other services that are game-agnostic, which is why the default implementation of our Heuristic Function Evaluation Framework (HEF) is written to integrate with the GGP platform.

In the second part of this chapter we explain in some detail how the Monte Carlo Tree Search (MCTS) algorithm works and performs. The biggest problem that our proposed heuristic function evaluation paradigm faced was that generating benchmarks with the game-theoretical action values required for this kind of analysis can be extremely difficult. Instead, we propose the MCTS algorithm for benchmark generation. The advantage of this approach is that MCTS does not need any external knowledge of the problem to be able to identify good moves; the disadvantage that comes with it is that the move values we obtain do not map directly to the game-theoretical ones (we dedicate an entire chapter to discussing the feasibility of working with MCTS benchmarks), and so some work-arounds and compromises are required in order to use these datasets with HEF.

2.1 General Game Playing

Deep Blue's victory over Kasparov was celebrated as an important milestone in the development of artificial intelligence[4]. John McCarthy, one of the founders of the discipline of artificial intelligence, however commented that "Computer chess has developed much as genetics might have if the geneticists had concentrated their efforts starting in 1910 on breeding racing Drosophila. We would have some science, but mainly we would have very fast fruit flies"[5]. The traditional approach to game AI has been aimed at one specific game at a time, investing a lot of time and energy into finding good heuristics and optimization techniques that would often not be applicable to any other games, thus not taking us any closer to understanding how to make computers tackle unseen problems. The General Game Playing field brings the focus of AI back to reasoning about general problem solving, which does not rely on any domain-specific knowledge input. The field uses yearly competitions, which have been held since 2005, as a way of evaluating the progress that has been made[6]. For such purposes, GGP specifies a communication protocol (Section 2.1.1) and a game description language (Section 2.1.3). Approaches to solving the GGP problem have been varied and colorful, most often involving search algorithms like minimax with its various optimizations, or simulation-based ones like Monte Carlo search.

2.1.1 Competitions

GGP competitions give researchers a chance to test how well their player can do against other players. Several games are chosen to be played each year, and players are pitted against each other in a series of matches which are mediated by a central server called the gamemaster.

Before the beginning of a match, the gamemaster informs the players what their respective roles will be and sends them the game rules. It also sends them two time values: the startclock, which specifies how much time there is until the match starts, and the playclock, dictating how long a turn can take at most. The players can use the startclock time for any kind of processing they need to do in preparation for the game. When they are ready, they communicate this to the gamemaster, and the match starts.

At the beginning of each turn, the gamemaster sends all players a complete description of the current state. Players then use the game rules to deduce which moves are legal during this turn, and use the time at their disposal to select the best move. They must be careful not to exceed the playclock, however; should the gamemaster not receive a reply from a player in time, it will choose a random move for it. The playclock is usually set to several minutes, and it is an important constraint in GGP development, as it forces the field to search for game playing methods which are not only accurate, but also time efficient.


When the players have selected their moves, they communicate them to the gamemaster, which then verifies whether the moves are legal and, if so, updates the official game state according to the moves. The game proceeds until a terminal state is reached, at which point the gamemaster determines the winner and assigns a score to each player as specified by the goal description[6].

2.1.2 Competition Games

The games studied in GGP are not yet completely "general". At the moment, the focus is limited to games which satisfy the following set of properties[6]:

Finiteness: GGP games define a unique start state, and terminate after a finite number of steps by reaching one of the terminal states, where each player is assigned a score. A game has a fixed, finite number of players, and each player has finitely many actions available in each state. As such, we can view these games as finite state automata.

Determinism: Given a state S and a set of player actions A, a game will always transition to the same state S'. That means that GGP is currently unable to handle games like Backgammon or Yahtzee which involve die rolling or other random events.

Synchronicity: The game proceeds in turns. All players move on all turns, meaning the games are simultaneous, and the environment updates only after receiving moves from all players. Non-simultaneous games can easily be modeled by allowing only the no-op action for players when it is not their turn.

Complete Information: Every player has access to all information about the game's state. This rules out many card games in which the players' hands and the permutation of the deck are unknown.

It is expected that in the future some of these restrictions will be lifted; a next iteration of the Game Description Language has been proposed (named GDL-II), which is able to support non-deterministic games with incomplete information with minimal modifications to the existing language definition[7].

2.1.3 Game Description Language

GGP uses the Game Description Language (GDL) to specify game rules. GDL, like Datalog from which it is derived, is a logical language[8]. It states facts which are true about the game, and it defines rules as a series of clauses which must be satisfied for a proposition to hold. Moreover, GGP games assume no domain-specific knowledge at all; their entire world is contained within the description of the game – even the simplest mathematical concepts, such as numbers and their ordering, must be defined in the rule sheet if they are to be used.

Logical Programming in GDL

To better understand the logical programming paradigm, let us consider how we would implement integer comparison in GDL. We start with facts which define what a number can be:

(number 1)
(number 2)
(number 3)
(number 4)

We need to write a similar statement for every integer we need during the game. Since GDL makes the closed world assumption, anything not explicitly specified as true is considered false – a player receiving only the four rules above would tell us that 5 is not a number.

Having defined what a number is, we implement the greater than operator by listing numbers that are each immediately greater than another:

(greater 2 1)
(greater 3 2)
(greater 4 3)

These rules are still not enough to compare integers correctly; the statement (greater 4 2) would be considered false, as no rule explicitly allows it. One solution would be to list all pairs of numbers for which the greater than relation holds. A cleaner solution is, instead, to define the greater than relationship recursively, and say that a number x is greater than a number y if there is a number z such that x is greater than z and z is greater than y. We complete our definition of the greater than operator with the following rule:

(<= (greater ?x ?y)
    (number ?z)
    (greater ?x ?z)
    (greater ?z ?y))


Describing Games

Any GGP game description needs to contain enough information to let the player know the basic properties of the game: what the roles in the game are, how terminal states are defined, what the goal values of each terminal state are, which moves are allowed in a state, and how the game transitions from one state to another. GDL reserves certain keyword propositions to define these basic game properties. Following is a list of these keywords, along with examples of how they would be employed to describe a game of Tic-Tac-Toe [6]:

distinct(<p>, <q>) simply means that the datums p and q are syntactically unequal.

role(<r>) indicates that r is a role. For Tic-Tac-Toe, we define two roles, xplayer and oplayer:

(role xplayer)
(role oplayer)

init(<p>) is used to define the initial state of the game. In Tic-Tac-Toe, the initial state is a board with nine blank cells, with oplayer in control:

(init (cell 1 1 b))
(init (cell 1 2 b))
...
(init (cell 3 3 b))
(init (control oplayer))

true(<p>) is similar to init, in that it defines what is true in the current state. However, true is used by the gamemaster at the beginning of each turn to describe to the players what the new state looks like.

legal(<r>, <a>) defines the conditions that need to be met for role r to be able to take action a. For example, a player is allowed to mark a cell if the cell is blank and it is that player's turn:

(<= (legal ?p (mark ?x ?y))
    (true (cell ?x ?y b))
    (true (control ?p)))

does(<r>, <a>), then, indicates that a player r has taken the action a.


next(<p>) is used to indicate that, given certain conditions, p will be true in the next state, and it is thus used for state transitions. In Tic-Tac-Toe, the next state needs to toggle control from one player to the other, and it needs to update the cell which was just marked to no longer be blank, but to contain the symbol of the player who marked it. All other cells remain the same.

(<= (next (cell ?m ?n x))
    (does xplayer (mark ?m ?n))
    (true (cell ?m ?n b)))

(<= (next (cell ?m ?n o))
    (does oplayer (mark ?m ?n))
    (true (cell ?m ?n b)))

(<= (next (cell ?m ?n ?w))
    (true (cell ?m ?n ?w))
    (distinct ?w b))

(<= (next (cell ?m ?n b))
    (does ?w (mark ?j ?k))
    (true (cell ?m ?n b))
    (or (distinct ?m ?j) (distinct ?n ?k)))

(<= (next (control xplayer))
    (true (control oplayer)))

(<= (next (control oplayer))
    (true (control xplayer)))

terminal describes the terminal states of the game. In Tic-Tac-Toe that happens when either there are no more blank cells left, or a player has arranged a winning combination:

(<= (terminal)
    (line x))

(<= (terminal)
    (line o))

(<= (terminal)
    (not open))

The rules of the game must, of course, also specify what line and open mean, but we omit this for brevity.

goal(<r>, <v>), finally, specifies the reward value v that role r receives. In our example, we award 50 points to each player in case of a draw; otherwise the winner gets 100 points and the loser gets 0 points. We define the goal values for xplayer as:

(<= (goal xplayer 50)
    (not (line x))
    (not (line o))
    (not open))

(<= (goal xplayer 100)
    (line x))

(<= (goal xplayer 0)
    (line o))

Extensions to GDL

The above rules are enough to describe any finite, deterministic, complete-information game. In order to be able to describe non-deterministic games with imperfect information, GDL-II adds two more keywords[7]:

random denotes a player who performs the non-deterministic actions, such as dealing cards or rolling dice.

sees(<r>, <p>) expresses under what conditions information p is available to role r.

With these two keywords, non-deterministic games – such as card games – can be modeled by introducing random event rounds in which the actual players can only play noop, while the random player deals cards, rolls dice, etc. The sees rule can then be used to determine what the players see – in a card game where the random player deals one card to each actual player, each player should only see their own card; that can be expressed as:

(<= (sees p1 c1)
    (does random (deal c1 c2)))


2.1.4 GGP and HEF

The HEF benchmarks are all stored in a database, with their states tracked via their unique serialized representations. Using GDL to serialize states makes a benchmark independent of any particular implementation of the game, which makes it much easier to reuse and share between studies – a researcher only needs the GDL rule sheet from which the benchmark was created to be able to reconstruct the states and port them to a convenient implementation. In one of the case studies in Chapter 5, in fact, we show how to analyze a heuristic function that operates on a non-GGP state representation using a GGP benchmark. Moreover, the default MCTS benchmark generator is also written to work within the GGP framework; since the MCTS algorithm does not need any external knowledge of the problem to operate, this combination allows us to generate benchmarks for many different games using the same implementation of the tool. For this reason, we believe that the GGP format is very well suited for HEF.

2.2 Monte Carlo Tree Search

Monte Carlo methods are an implementation of the well-known statistical method of random sampling. They have been immensely useful in analyzing the behavior of systems which are too complex to handle with deterministic mathematical methods, and demonstrated their utility as early as the 1940s, when they were employed in the Manhattan Project to study the behavior of neutrons relevant to atomic fission[9].

Monte Carlo methods quickly found their role in game AI, often used as a replacement for evaluation functions in minimax search. The combination of Monte Carlo evaluation with tree search resulted in the Monte Carlo Tree Search (MCTS) algorithm, which raised a lot of interest in the scientific community. Its attractions are many: it is an anytime algorithm which can be used with little or no domain-specific knowledge, and it also scales well with additional computational power. Further improvements have been made to the algorithm, such as the UCT selection policy, which made the algorithm more reliable and less prone to error [2, 10].

2.2.1 Monte Carlo Methods

As mentioned above, Monte Carlo methods (MCM) have a long history of usefulness in approximating intractable calculations. MCM define random variables which model a problem, and perform a series of random samplings from the variables' domains. The results of the samplings are then aggregated into an approximation of the problem.

Figure 2.1: Using random sampling to approximate the value of π.

A simple example of MCM at work is calculating an approximation of the value of π. This can be done by drawing a square with a circle inscribed inside it, so that the ratio of their areas is π/4. If we sample the points enclosed inside the square using a uniform distribution, we expect that the ratio of points also belonging to the circle will be π/4 (Fig. 2.1). In this example, MCM simulates random sampling through pseudo-random generation of point coordinates. Since that is an inexpensive operation, we can quickly make millions of samplings, yielding a good approximation of the real value of π [11].
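A minimal Python sketch of this estimate, sampling the unit square against a quarter circle of radius 1, which gives the same π/4 area ratio:

import random

def estimate_pi(samples=1000000):
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:  # the point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / samples  # the ratio of the areas is pi/4

print(estimate_pi())  # prints roughly 3.14, varying from run to run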

MCM have also found some interesting applications in games. Abramson, for example, argued that the classical approach to the evaluation of non-terminal states in alpha-beta search, consisting of expertly designing an evaluation function, could be substituted with an expected-outcome approach, in which simulations are used to approximate the game-theoretical values of moves. This approach will not result in perfect play, but in reality most opponents will not be playing perfectly either, and Abramson shows that his method, coupled with minimal expert guidance, can produce good results[12].

In some situations, however, we are not interested in approximating the values of each action; we only need to know what the best move is. If we spend an equal amount of time performing simulations on each move, we will waste a lot of computation time on moves which we could have identified as bad early on – and in a time-sensitive context like GGP, this becomes a major concern. We would prefer a method capable of detecting which moves are worth focusing on.


2.2.2 Bandit Problems

Move selection in game playing can be modeled as a bandit problem. Bandit problems face the challenge of optimizing the reward received after a sequence of choices among K possible actions, given that the underlying reward distribution is unknown. In other words, we find ourselves with n tokens in front of K slot machines, each one having a different reward distribution function. We want to use our tokens in the machine which yields the biggest reward; however, we first need to find out which machine that is. That is, we need to invest some of our tokens into exploring the different reward functions, while still dedicating as many tokens as possible to exploiting the best option. Formally, we say we want to minimize regret, which is defined as the difference between the maximum reward achievable with n tokens and the reward we actually obtained. It has been proven that there exists no policy with regret that grows slower than O(ln n)[13], so any policy which keeps the growth of regret within a constant factor of this rate is deemed to resolve the exploration-exploitation problem.

The simplest policy providing an upper confidence bound (UCB) on the reward is called UCB1, and it dictates playing the bandit arm j that maximizes:

\[
\mathrm{UCB1} = \bar{X}_j + \sqrt{\frac{2 \ln n}{n_j}} \qquad (2.1)
\]

where X̄_j is the average reward from arm j, n_j is the number of times arm j has been played, and n is the overall number of plays thus far. The left-hand term, X̄_j, encourages exploitation, while the right-hand term encourages playing arms which are lagging behind in terms of exploration[10].
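A small Python sketch of UCB1 arm selection following Equation 2.1 (the bookkeeping lists are our own illustration, not part of any particular library):

import math

def ucb1_select(total_rewards, play_counts):
    # total_rewards[j] is the sum of rewards received from arm j; play_counts[j] is n_j.
    n = sum(play_counts)
    # Play each arm once before applying the formula, since n_j = 0 would divide by zero.
    for j, n_j in enumerate(play_counts):
        if n_j == 0:
            return j
    def ucb1(j):
        mean = total_rewards[j] / play_counts[j]               # exploitation term
        bonus = math.sqrt(2.0 * math.log(n) / play_counts[j])  # exploration term
        return mean + bonus
    return max(range(len(play_counts)), key=ucb1)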

2.2.3 Monte Carlo Tree Search

While the UCB1 algorithm improves the performance of MCM in game playing, it still presents us with the problem that averaging a multitude of simulation results starting from a single node does not reveal anything about the structure of the game. For example, taking an action could bring us to a state in which the opponent has only one winning move, which, however, assures their victory. All other moves by the opponent result in our victory. A large portion of the game subtree below that state would therefore end in our victory, and random simulations would reflect this by showing us an average much higher than the true game-theoretical value of the state. If, for instance, the opponent has four legal replies and only one of them wins for them, uniform random simulations would score the state as a win for us roughly three quarters of the time, even though its true value is a loss.


Figure 2.2: Outline of a Monte Carlo Tree Search algorithm[14].

MCTS algorithms try to solve this problem by having the random simulations uncover and respect the underlying structure of a game. They build a best-first search tree, basing their decisions about which nodes to expand on prior simulation scores, which allows them to eventually detect deceptively good-looking states like the one we described[10].

MCTS Algorithm

A basic MCTS algorithm is illustrated in Fig. 2.2. It consists of four fundamental steps. First, it selects a node which it wants to evaluate further. Nodes are selected recursively, starting from the root of the tree, according to some selection policy. Once a leaf node of the current tree is encountered, the tree is expanded by adding one or more of its successor states to the tree, either selecting them randomly or using a heuristic to try to expand the better ones first. At this point, the algorithm simulates a single playout starting from the new node. Again, the simulation can be completely random, or guided by some move selection heuristic. Once a terminal state of the game is reached, the score is propagated backwards through the tree. Each node keeps count of the number of times it was selected, and the average of the scores achieved in playouts that it was part of. These values are then again used in the selection step to explore a new node[14, 2].
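The sketch below puts the four steps together in Python for an assumed generic game interface (legal_moves, apply, is_terminal, and score are hypothetical methods, and the handling of alternating players is deliberately left out); the selection step uses a UCB1-style rule of the kind discussed in the next subsection.

import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}        # move -> Node
        self.visits = 0
        self.total_score = 0.0

    def average(self):
        return self.total_score / self.visits if self.visits else 0.0

def select_child(node, c=1.4):
    # Pick the child maximizing average score plus an exploration bonus.
    return max(node.children.values(),
               key=lambda child: child.average()
               + c * math.sqrt(math.log(node.visits) / child.visits))

def mcts(game, root_state, iterations=10000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the current node is fully expanded.
        while node.children and len(node.children) == len(game.legal_moves(node.state)):
            node = select_child(node)
        # 2. Expansion: add one untried successor, unless the state is terminal.
        if not game.is_terminal(node.state):
            untried = [m for m in game.legal_moves(node.state) if m not in node.children]
            move = random.choice(untried)
            child = Node(game.apply(node.state, move), parent=node)
            node.children[move] = child
            node = child
        # 3. Simulation: play random moves until a terminal state is reached.
        state = node.state
        while not game.is_terminal(state):
            state = game.apply(state, random.choice(game.legal_moves(state)))
        score = game.score(state)
        # 4. Backpropagation: update visit counts and total scores up to the root.
        while node is not None:
            node.visits += 1
            node.total_score += score
            node = node.parent
    # Per-move statistics at the root: number of simulations and average score.
    return {move: (child.visits, child.average()) for move, child in root.children.items()}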

Upper Confidence Bounds Applied To Trees

Figure 2.3: Probability of failure to select the optimal move in a game[3].

In their paper introducing a new selection policy for MCTS, Kocsis and Szepesvári criticized the policies used up to that point for relying on either uniform sampling or heuristic biasing that comes with no guarantees[3]. They point out that node selection in MCTS trees compares to the bandit problem, in that it is necessary to balance exploration and exploitation of the most promising-looking subtrees. They propose using the successful UCB1 policy as the MCTS node selection policy, treating move selection at every explored internal node as a separate multi-armed bandit. However, as also noted by Coulom[2], UCB1 in its original form is not readily suited to being a node selection function in MCTS, as the policy was designed assuming that the reward distributions of the bandit arms do not change, whereas in the MCTS tree the sampling probabilities of child nodes should change according to the new score information from the simulations. Kocsis and Szepesvári therefore propose a modification of UCB1, named UCT (UCB1 applied to trees), which they show accounts for the drift in payoffs:

\[
\mathrm{UCT} = \bar{X}_j + 2C\sqrt{\frac{\ln n}{n_j}} \qquad (2.2)
\]

where C is a constant constructed to counter the payoff drift.

Kocsis and Szepesvári prove that, given enough time, the error probability of MCTS-UCT converges to 0. They also ran experiments which confirm that this property holds in practice; Fig. 2.3 shows a comparison of the performance of MCTS-UCT with alpha-beta search (AB), plain Monte Carlo planning (MC), and Monte Carlo planning with minimax value update (MMMC) on games with different branching factors (B) and depths (D). The MCTS-UCT algorithm converges to zero error probability sooner than the other algorithms, and its error function decreases very rapidly, meaning that a premature termination is still likely to identify a good move. Admittedly, the minimax algorithm employed did not use any heuristic function for non-terminal states, instead choosing a move at random in case the tree was not fully explored after a given number of simulations, which causes worse performance than what would be observed with a carefully constructed evaluation heuristic.


2.2.4 MCTS and HEF

The MCTS-UCT algorithm is growing in popularity in game playing. It has been used in some very successful implementations, the most cited of which is MoGo[15], and its independence from domain-specific knowledge is making it a popular option in GGP players as well[16]. This encouraged us to explore the possibility of using MCTS-UCT as a way to generate HEF benchmarks for games for which no efficient minimax solver exists. While we do not expect these benchmarks to hold correct minimax values for all moves, we rely on UCT's exploration-exploitation policy to identify good moves with a small error probability. That will, of course, restrict the metrics we can use with the MCTS benchmarks. One of our goals in this thesis is to investigate how usable the MCTS benchmarks are for HEF purposes, and to identify which metrics can be used reliably with such datasets.

2.3 Conclusion

In this chapter we provided an overview of two concepts central to the work presented in this thesis. Our Heuristic Function Evaluation Framework relies on concepts from the General Game Playing field in order to achieve its intended generality of use, making it easily applicable to a variety of games and problems. The Monte Carlo Tree Search algorithm is vital in further supporting this goal, as it allows us to generate benchmark datasets to be used in HEF analysis for games in which obtaining game-theoretical ground-truth values is extremely difficult.


Chapter 3

Heuristic Function Evaluation Framework Overview and Implementation

The core idea behind the Heuristic Function Evaluation Framework (HEF) is to avoid long evaluation times in the development phase of a heuristic function by investing a large amount of computational time only once to create a benchmark dataset, which can then be used to evaluate the functions quickly and with greater flexibility throughout the remainder of the study. Users can define the metrics that are best suited to answer specific questions about the heuristic's performance. The implementation of each metric needs to return a value that reflects how well a function satisfies the metric's criteria on a single state. All benchmark states are stored along with the depth at which they were encountered in the game tree, which allows us to study how the heuristic's performance changes in different areas of the game.

The framework is structured into three layers: benchmark data generation (Section 3.1), heuristic function analysis (Section 3.2), and data visualization (Section 3.3). The first layer generates a selection of states to use as a benchmark, evaluates the scores of each action in these states, and stores the obtained information for future use. The heuristic analysis layer provides the infrastructure needed to perform the analysis of the heuristic functions – including database access management and data aggregation. This layer exports the analysis data to files which are picked up by the third layer and graphed with respect to state depth. In this chapter we discuss our implementation of these three layers. Moreover, in Section 3.4 we propose an initial set of HEF metrics that we believe to be generally useful for heuristic function analysis; we will base most of our experiments and evaluations on this set of metrics.


Figure 3.1: Design of HEF benchmark database.


3.1 Benchmark Data Generation Layer

HEF’s benchmark dataset consists of a large selections of states, with ground-truth valuesspecified for each action available in a state. Each state is uniquely identified through itsserialized representation and game that it belongs to, and it is stored along with the depthat which it was first encountered. Benchmark generation starts with the selection of gamestates to constitute the dataset. This step results in a series of XML files, one for eachstate, which are then picked up by the state evaluator which produces another series ofXML files containing all relevant state information. These files are then parsed by HEFand loaded into its database, and become available for analysis.

3.1.1 The Benchmark Database

Fig. 3.1 shows HEF’s schema diagram. The schema centers on two tables: states andstates_infos. States is a list of games and states available for benchmarking. The state

column holds the serialized string representation of a state. The chosen representationmust have two properties: there must be a one-to-one correspondence between each statein a game and its representing string, and it must contain all information necessary toreconstruct the original state from its benchmark data. In case of GGP benchmarks, statesare naturally defined as a list of string fluents; to ensure the persisted representations havethe one-to-one property, it is enough to concatenate the fluents in alphabetical order. Thesecond requirement is slightly more tricky; "General" being its operand word, GGP allows


For this reason, we added another table, state_fluents, which links individual literals to their states, allowing us to reconstruct the original fluent set correctly.

The states_infos table contains the ground-truth values of the actions available in a state. In our default implementation, this table stores the results of the GGP MCTS evaluator; for every state, role, and action of that role, states_infos stores the number of simulations starting with that action (the n column), the total number of simulations run across all actions of the role (column total_n), and the average score the action achieved over the course of the simulations (column q). Moreover, HEF can hold different sets of evaluation data for the same states; these are distinguished via the run column, which can be set to a custom descriptive value to facilitate dataset selection.

3.1.2 Benchmark State Selection

HEF’s default state selector is a command-line utility that requires three parameters: GDLgame rulesheet, destination directory for generated state files, and the desired number n ofstates we would like to generate at each depth of the game. The state selection algorithmstarts collecting the states from the root. It generates all of its successor states, from whichit chooses at most n states which have not been selected yet. To increase the likelihoodthat the selected states are representative of the game region we find them in, we choosethem with probability proportional to the number of times they appear on the currentlevel. Each generated state is written to a separate XML file in the target directory, whichwill contain the state’s string serialization, the fluents necessary to rebuild it, the depthsat which it was encountered, and the game it belongs to.

3.1.3 Obtaining Ground Truth Action Values

The MCTS evaluator obtains the ground-truth values by setting the provided state as the root of the search tree and running the MCTS-UCT algorithm on it for a specified amount of time. At the end of the simulation, the ground-truth values consist of the UCT information: how many times each action was sampled, and what its average score was.

Obtaining the ground-truth scores is a computationally intensive task. Our Connect Four benchmark, for example, holds 800 states, and running the MCTS simulation on each one for an hour amounts to 800 hours of computing. Luckily, each state can be evaluated independently, and we had a compute cluster at our disposal that allowed us to do a lot of the computation in parallel.


We designed the flow of the benchmark generation phase to make such parallelization easy.

HEF’s default MCTS evaluator is built into a jar file which can be ran from command line,taking as arguments the GDL game rulesheet, target output directory, a state descriptionXML file (generated in the previous step), a unique job id, the name of the series, andthe time (in seconds) that we want to run the MCTS algorithm for. Every instance of theevaluator operates on a single input and output file, and the result files are gathered in acommon target directory. The name of the file is based on the job id, so as long as everyinstance of the evaluator is given a unique job id, they will happily run in parallel. Onceall the states have been processed, HEF offers another command line utility to parse theXML files and load the data into its database.

In case users don’t find GGP framework suitable for their study, or if they prefer a differentmethod of generating the ground-truth values, they can define their own state selector andevaluator; as long as the final result is written as an XML file following HEF’s template,they will be able to load the data in the database seamlessly.

3.2 Heuristic Analysis Layer

The heuristic analysis layer provides the infrastructure needed for working with metrics. It offers access to the underlying database through an interface that we hope will suit most use cases. It also provides an analysis pipeline which filters the benchmark states before passing them to the metric one by one, and takes care of exporting the metric results into corresponding output files used by the visualization layer. The main design objective of this layer is to make it easy for users to formulate their own metrics; to do so, they only need to implement an interface and register it in the application context, and their metric becomes available for analysis.

3.2.1 Implementation

The implementation of the heuristic analysis layer makes extensive use of dependency injection[17] to give users the flexibility to extend the layer as they need. The heuristic analysis layer is itself divided into three sub-layers, each offering a set of services. Since each service is defined by an interface, users can change the behavior of the analysis pipeline by injecting a different implementation of the service at runtime through an application context management framework.


Analysis Layers

Fig. 3.2 shows the three sub-layers of the class hierarchy of the analysis layer and the services they depend on in their default implementation. Every metric implemented in HEF is a HeuristicEvaluator. This layer is the point of contact with the data visualization layer, which uses it to request analysis with specified metric parameters. In addition, this layer provides the basic services needed by every metric: access to the benchmark data and to the heuristic function we want to evaluate. These services are implemented in the DAO and heuristic managers, and allow the user to obtain all relevant information for each state in a game, both as stored in the benchmark and as computed by the heuristic.

The second layer is intended for defining families of metrics that extend the HeuristicEvaluator. All of the metrics that we implemented are subclasses of HeuristicAccuracyEvaluator and implement the scoreMetric method which, given the list of scores that the heuristic manager produces for a state and the list of corresponding scores from the benchmark, returns a value between 0 and 1 according to its evaluation criteria. This layer also takes responsibility for implementing the data export in a way that best fits its metrics' requirements.
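As a concrete illustration of this contract, the following sketch shows a minimal metric of this kind; the class shape and signatures are simplified assumptions rather than HEF's exact API. It returns 1 when the move the heuristic ranks highest also has the best benchmark score, and 0 otherwise.

import java.util.Collections;
import java.util.List;

// Simplified sketch of a metric in this family (not HEF's exact classes). Both lists
// are assumed to be aligned by move, with the heuristic's ranking determining the order,
// so index 0 corresponds to the move the heuristic scores highest.
class TopMoveAgreementEvaluator {
    /** Returns 1.0 if the heuristic's top-ranked move also has the best benchmark score. */
    double scoreMetric(List<Double> heuristicScores, List<Double> benchmarkScores) {
        // heuristicScores is unused here because the ordering already encodes the ranking.
        double bestBenchmark = Collections.max(benchmarkScores);
        return benchmarkScores.get(0) >= bestBenchmark ? 1.0 : 0.0;
    }
}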

On the third layer reside the various implementations of accuracy metric families. HEF currently defines four such sub-families. The best-only evaluators only measure a heuristic's ability to identify the optimal moves; they use the best move selector service to identify the optimal moves in the benchmark, to which they then compare the best moves chosen by the heuristic. The expected score evaluators give us the expected score of using a sampling function (defined by implementing the ActionProbabilityFunction interface) to choose a move based on the heuristic's scores. The move ordering evaluators are concerned with how correctly a heuristic can order the moves with respect to their ground-truth scores. The move categorization metrics tell us how correctly a heuristic can map its scores to the benchmark values.

Data Access

The benchmark data is accessed through the provided data access object (DAO), which implements a wide variety of utilities for retrieving game data from the database. Two of these utilities are of particular interest when implementing new metrics: finding a list of states which satisfy certain conditions, and retrieving the score information for a selected state. By default, users can retrieve a list of all states for a game, or limit that list to states up to a given depth.


Figure 3.2: Class hierarchy of HEF evaluators.

However, we also provide a simple SQL builder to give users a wide degree of control over which states they wish to perform the analysis on. For example, if a user wants to analyze a heuristic only over states in which the maximum score is greater than 50, they can do so:

QueryBuilder qb = new QueryBuilder();
// only keep states whose maximum Q value is greater than 50
qb.where(ClauseFactory.greaterThan(Column.Q, 50));
dao.getStates(qb);

Once a selection of states is made, the DAO can be used to get the score information for each move in a state. This information is returned as a list of value objects (VOs), sorted in descending order of their N values. Each VO holds the complete information available on a given action in the state: the state and action description, its Q and N values, and the depths at which the state is found.

Heuristic Manager

The heuristic manager’s role is to implement and prepare the heuristics for evaluation.Each manager must implement the initialization method for the heuristic which does allthe work necessary to prepare it for the desired game. Once the heuristic is initialized,the heuristic manager provides a method which, given a state and a role, will return a listof actions and their scores as evaluated by the heuristic, sorted in descending order withrespect to the score.


Heuristic managers can be divided into two categories: static and dynamic. Static heuristics, like the action heuristic that we analyze in Chapters 5 and 6, are compiled once before the game starts and remain unchanged for the duration of the game. In this case, the initialization method only needs to load the heuristic and initialize the assets needed for state evaluation (in our GGP implementation, we use this step to create a state machine for the selected game).

Dynamic heuristics, such as RAVE[10], are generated and improved during gameplay, and may differ from one turn to the next. This type of heuristic is more difficult to analyze in HEF because two different gameplays may result in very different heuristics. Nonetheless, HEF allows for at least a preliminary analysis of such heuristics by simulating a playout in its initialization phase and keeping a snapshot of the heuristic at each depth level. Although we do not study dynamic heuristics here, we recognize their importance and believe that HEF can easily be extended to support them better in the future.

3.3 Data Visualization Layer

The final layer of HEF implements a GUI offering a convenient way for users to visualize the results of their metrics across various heuristics, games, and datasets. Analysis data is presented as a plot of the average metric value per depth level. The GUI lets users select the game they wish to test their heuristic on, after which HEF offers the datasets and roles available for that game. The user can then select the parameters they wish to graph: the heuristic function to examine, the metric to examine it with, and, optionally, a restriction of the analysis to a single role. Furthermore, they can control the level of aggregation of the data with the "granularity" setting, which specifies how many depth levels should be averaged together, resulting in smoother curves that can provide better insight into the trends of the analyzed data, especially in long games with many depth levels. Since users will often define metrics which complement each other and are most useful when plotted together, HEF allows any number of plots to be added to the graph, and also lets users control graphing options like line color and shape to make the visualization clearer.
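The sketch below illustrates one plausible reading of the granularity setting, averaging per-state metric scores over buckets of g consecutive depth levels; it is an illustration of the idea rather than HEF's plotting code.

import java.util.*;

// Illustration of "granularity" aggregation: metric scores grouped by depth are merged
// into buckets of g consecutive levels and averaged per bucket before plotting.
class DepthAggregator {
    static Map<Integer, Double> aggregate(Map<Integer, List<Double>> scoresByDepth, int g) {
        Map<Integer, List<Double>> buckets = new TreeMap<>();
        for (Map.Entry<Integer, List<Double>> e : scoresByDepth.entrySet()) {
            int bucketStart = (e.getKey() / g) * g;   // first depth level of the bucket
            buckets.computeIfAbsent(bucketStart, k -> new ArrayList<>()).addAll(e.getValue());
        }
        Map<Integer, Double> averaged = new TreeMap<>();
        for (Map.Entry<Integer, List<Double>> e : buckets.entrySet()) {
            double sum = 0.0;
            for (double v : e.getValue()) sum += v;
            averaged.put(e.getKey(), e.getValue().isEmpty() ? 0.0 : sum / e.getValue().size());
        }
        return averaged;
    }
}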


Figure 3.3: Screenshot of HEF in action.

3.3.1 Implementation

Dependency injection is used to build the list of metrics and heuristic functions that should be available to users from the GUI. In order to avoid having to change source code each time we want to change the parameters of a metric, or modify the list of metrics available to the user, we use the Spring Framework[17], which allows us to define and build all services that HEF should use at runtime. Spring lets us describe all of our dependencies in an XML file. For instance, suppose we have implemented the AntiFalsePositivesAccuracyEvaluator metric and wish to add it to the list of offered metrics. Being a best-only metric, this evaluator needs a BestMoveSelector injected. We want our metric to use the MCTS implementation of the BestMoveSelector, so we define it as a bean. Finally, we add our metric to the list of available evaluators, which is loaded by the GUI at startup. To do all this, we modify Spring's application context file like this:

<!-- Best move selectors -->
<bean id="mctsBestMoveSelector"
      class="com.hef.accuracy.bestonly.moveselector.MCTSBestMoveSelector" />

<!-- Accuracy metrics -->
<bean id="antiFalsePositivesAccuracyEvaluator"
      class="com.hef.accuracy.bestonly.AntiFalsePositivesAccuracyEvaluator">
    <property name="bestMoveSelector" ref="mctsBestMoveSelector" />
</bean>

<!-- Available evaluators -->
<util:map id="guiEvaluators"
          value-type="com.hef.accuracy.HeuristicAccuracyEvaluator">
    <entry key="Anti False Positive"
           value-ref="antiFalsePositivesAccuracyEvaluator" />
    <!-- ... further metrics ... -->
</util:map>

At runtime, Spring builds the application context containing all the specified metrics and services, which are then automatically injected as needed. This way, the user can easily control what is displayed in the UI and what parameters the metrics should have.

Data Export and Caching

Once the user selects the heuristic, metric, role, and dataset, the selected accuracy evaluator writes the calculated accuracy scores to a file in which each line contains the accuracy score for one state and the corresponding depth of that state. Since the accuracy calculation can take up to a minute, the results are stored in a directory-based cache to avoid recomputing them needlessly. Each metric defines its own caching structure, determined by the metric parameters used in generating each scores file. All accuracy evaluators initialize the caching path by sorting files by the related game, dataset, role, and evaluator class; each evaluator then specifies the remainder of the path according to its relevant parameters. The cached scores are persisted until the user explicitly clears them by checking the "clean cache" box in the GUI.

3.4 Proposed HEF Metrics

We now introduce a set of metrics that will be used throughout the rest of this work. They are designed for evaluating heuristic functions used in conjunction with MCTS or minimax algorithms, and we believe them to be a good starting point for the analysis of most heuristic functions. In this section we explain their evaluation criteria and the rationale behind them.


We also give a formal definition for each of them; to that end, we first define a set of functions that will be used in the following definitions.

Definition 1. Let $S$ be a game state, $R$ a role in that game, and $m$ a move available in state $S$. We define $GT$ as a function such that $GT(S,R,m) = v$, where $v$ is the ground-truth value of move $m$ in state $S$ for role $R$.

Definition 2. Let $S$ be a game state and $R$ a role in that game. We define $GT_{max}$ and $GT_{min}$ as functions such that $GT_{max}(S,R) = V$ and $GT_{min}(S,R) = v$, where $V$ is the maximum and $v$ the minimum ground-truth value among the moves available in state $S$ for role $R$.

Definition 3. Given an evaluation metric $E$, we define the metric score function $MS_E$ as a function mapping a game state $S$ and a role $R$ to some value $v$ according to the policy specified by $E$.

3.4.1 Game Behavior by Depth

HEF puts a lot of importance on the role of state depth when analyzing a heuristic, since the topology of a game can change significantly at different stages of gameplay, and a heuristic may not be able to cope with all stages equally well. Sometimes this is due to a deficiency of the heuristic, and at other times it is caused by the topology of the game itself. For example, a region of the game in which all moves have a similar value might make the accuracy of a heuristic appear higher, when in fact it is simply benefiting from the fact that it is harder to pick a wrong move in that region. To help us better interpret the results of the heuristic function metrics, we implemented a family of metrics which track how certain properties of the game itself change at different depth levels.

Maximum Score Difference

The maximum score difference metric measures the ground-truth score difference between the best and the worst moves available to each player in a state. Many games have certain zones in which choosing the right move can have a significant impact on the outcome of the game, and others in which a player is, game-theoretically, doomed to a loss or a draw regardless of which move is chosen. The metric is useful for identifying these "critical zones" of a game. A good heuristic function should be resistant to such areas: we do not want to see its performance plummet in the most critical parts of the game.


the game – we don’t want to see its performance plummet in the most critical areas of thegame.

Definition 4. Let $DIFFMAX$ be the maximum score difference metric. Given a state $S$ and a role $R$, we have:

$$MS_{DIFFMAX}(S,R) = GT_{max}(S,R) - GT_{min}(S,R)$$

Number of Available Moves

The available moves count metric reports the number of moves available to each player in a state. At some stages of the game, players have very few legal moves available to them, which could give us a skewed perception of a heuristic's accuracy: with only a few moves possible, the chance of picking a good one increases. Plotting this metric alongside our heuristic evaluation helps us identify regions where the heuristic's performance changes merely due to a change in the number of available moves.

Definition 5. Let $MCOUNT$ be the available moves count metric. Given a role $R$ and a state $S$ having $n$ moves available to $R$, we have:

$$MS_{MCOUNT}(S,R) = n$$

3.4.2 Best-Only Move Accuracy Metrics

Some uses of heuristics are primarily concerned with the heuristic's capability of identifying good moves. For example, some implementations of MCTS use heuristics to guide the random playout and steer the exploration towards more beneficial game regions early on by identifying the good moves. We define a few metrics that measure best-move identification accuracy, and complement them with metrics which show how a completely random heuristic function would behave on the same criteria. The expected random-heuristic scores can be used as a baseline to verify the other metric scores. A heuristic function could, for example, identify the best move 60% of the time, which could lead us to believe that we have a decent heuristic function. However, if we then check the probability that a random function would select a best move and find it to be 65% – because the topology of the game often offers many moves with the same best value in a state – we see that something is clearly wrong with our heuristic function.


K-Best Metric

The K-best metric verifies that a best move (a move with the highest ground-truth score in a state) appears among the K moves that the heuristic function scored the highest. The metric returns 1 if a best move is found in that subset, and 0 otherwise. Its random counterpart calculates the probability that a random heuristic would include a best move among its K highest-scored moves. Sometimes we might find that a heuristic has low accuracy with K = 1, but analyzing it with K = 2 might reveal that it is quite good at placing a good move among its two highest-scored moves.

Definition 6. Let $KBEST$ be the K-best metric and let $K \in \mathbb{N}$. Given a state $S$, a role $R$, a heuristic function $H$, and a set $HM_K = \{hm_1, \ldots, hm_K\}$ of the $K$ highest-scored moves for role $R$ according to $H$, we have:

$$MS_{KBEST}(S,R) = \begin{cases} 1 & \text{if } \exists\, hm \in HM_K \text{ such that } GT(S,R,hm) = GT_{max}(S,R) \\ 0 & \text{otherwise} \end{cases}$$

Strict Best Metric

The strict-best metric calculates the fraction of actual best moves in the set of moves to which the heuristic function assigns its (shared) highest score. This metric was implemented to deal with situations in which a heuristic evaluates many moves to the same value: it takes all the moves that the heuristic assigns the same highest score to and counts how many of the actual best moves are among them. It returns the ratio of actual best moves to the total number of moves identified as optimal, and is useful for identifying a heuristic function's tendency towards producing false positives.

Definition 7. Let $STRICT$ be the strict-best metric. Given a state $S$, a role $R$, a heuristic function $H$, and a set of moves $HM = \{hm_1, \ldots, hm_n\}$ for role $R$ containing all moves $hm_i$ such that $H(S,R,hm_i) = H(S,R,hm_1)$, where $hm_1$ is the highest-scored move according to $H$, we have:

$$MS_{STRICT}(S,R) = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 1 & \text{if } GT(S,R,hm_i) = GT_{max}(S,R) \\ 0 & \text{otherwise} \end{cases}$$


Best Move Probability

The best move probability metric evaluates the probability of randomly choosing a best move in a state. It corresponds to the random K-best metric with K = 1, and it is a particularly useful baseline for other metrics; we often use it in conjunction with the strict-best metric. It can also be useful to plot this metric alongside the available moves count to distinguish situations in which high accuracy stems simply from there not being many legal moves.

Definition 8. Let $BMP$ be the best move probability metric. Given a role $R$, a state $S$ having $n$ moves available to $R$, and a set of moves $M = \{m_1, \ldots, m_b\}$ such that $m_i \in M \iff GT(S,R,m_i) = GT_{max}(S,R)$, we have:

$$MS_{BMP}(S,R) = \frac{b}{n}$$

3.4.3 Move Ordering Metrics

While identifying the best move is an important capability for any heuristic, many applications require the heuristic to be accurate in evaluating the rest of the available moves as well. The move ordering metrics measure how well a heuristic performs in this respect. We propose one metric in this category, which measures the correctness of the move ordering produced by a heuristic function by identifying brackets of moves with the same ground-truth score and verifying that sorting moves according to their heuristic values places them in the correct brackets.

The metric starts by creating score categories from the ground-truth data, grouping together all moves with the same score. Since the ground-truth scores are sorted, a category is identified simply by the indices of the first and last move with that score in the list. These categories are then translated onto the list of moves ordered by their heuristic scores; for each category, the metric counts how many moves are correctly placed into it. The heuristic is assigned the ratio of correctly placed moves to the total number of moves as its score.
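To make the bracket procedure concrete, here is a compact sketch under simplifying assumptions (exact score equality, both lists holding ground-truth values, with the second list re-ordered by the heuristic's ranking); it is illustrative rather than HEF's implementation.

import java.util.List;

// Sketch of the bracket-based ordering check. gtSorted holds the ground-truth scores in
// descending order; gtInHeuristicOrder holds the ground-truth scores of the same moves,
// re-ordered so that the heuristic's highest-ranked move comes first.
class MoveOrderingSketch {
    static double score(List<Double> gtSorted, List<Double> gtInHeuristicOrder) {
        int n = gtSorted.size(), correct = 0, start = 0;
        while (start < n) {
            int end = start;
            while (end + 1 < n && gtSorted.get(end + 1).equals(gtSorted.get(start))) end++;
            // bracket [start, end] holds all moves sharing this ground-truth score
            for (int j = start; j <= end; j++) {
                if (gtInHeuristicOrder.get(j).equals(gtSorted.get(start))) correct++;
            }
            start = end + 1;
        }
        return (double) correct / n;
    }
}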

Definition 9. Let $ORD$ be the move ordering metric. Given a heuristic function $H$, a role $R$, a state $S$ having $n$ moves available to $R$, and a list $M = \{m_1, \ldots, m_n\}$ of $R$'s moves in descending order of their ground-truth values, we define brackets $B_{s_1,e_1}, B_{s_2,e_2}, \ldots, B_{s_k,e_k}$, where $B_{s_i,e_i}$ is the sub-list of $M$ starting at index $s_i$ and ending at index $e_i$ in which all moves have the same score. We define brackets $HB_{s_1,e_1}, HB_{s_2,e_2}, \ldots, HB_{s_k,e_k}$ as the corresponding sub-lists of the list $HM = \{hm_1, \ldots, hm_n\}$ of moves in descending order of their heuristic values, with the starting and ending indices of each $HB$ equal to those of the corresponding $B$. We then have:

$$MS_{ORD}(S,R) = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=s_i}^{e_i} \begin{cases} 1 & \text{if } hm_j \in B_{s_i,e_i} \\ 0 & \text{otherwise} \end{cases}$$

3.4.4 Expected Score Metric

One of the heuristic functions that we will analyze is used to guide the random playout in MCTS. Each time the algorithm has to choose a move, it has the heuristic evaluate all the possible moves and then uses a sampling function to select a move according to the scores assigned by the heuristic. This sampling function introduces another variable when evaluating the algorithm's performance: a heuristic might, for instance, do a good job of finding good moves, but the sampling function may put too much weight on the wrong moves.

We use the expected score metrics to study the performance of a heuristic in conjunction with a sampling function. Using the list of actions and scores produced by the heuristic manager and the corresponding list of ground-truth scores, the expected score is calculated as the sum, over all moves, of the product of the sampling function's probability of choosing a move and the ground-truth score of that move.

Definition 10. Let $F$ be a move sampling function that maps the heuristic value of a move to the probability of choosing that move, and let $E_F$ be the expected score metric for $F$. Given a role $R$, a state $S$, the set $M = \{m_1, \ldots, m_n\}$ of all moves available to $R$, and a heuristic function $H$, we have:

$$MS_{E_F}(S,R) = \sum_{i=1}^{n} GT(S,R,m_i) \times F(H(S,R,m_i))$$

3.4.5 Move Categorization Metrics

Another heuristic function that we will analyze is used by minimax search. When the search space is too big to search completely, the minimax algorithm has to stop prematurely and uses the heuristic to estimate the values of the leaf states. These values are then propagated up the search tree and influence the decision on which move to make.


In this case, it is not only important that the heuristic identifies good moves; its evaluation of a move should map directly to the minimax score of that move. In our Connect Four example, a heuristic should be able to produce a score mapping that categorizes a move as a win, loss, or draw. The move categorization metrics allow the definition and use of a score mapper which is applied on top of the heuristic manager's results, and measure how many moves are mapped to their correct ground-truth scores.

Definition 11. Let $C$ be a score mapping function, which attempts to map the heuristic value of a move to its ground-truth value, and let $CAT$ be the move categorization metric for $C$. Given a state $S$, a role $R$, a heuristic function $H$, and the set $M = \{m_1, \ldots, m_n\}$ of all moves available to $R$, we have:

$$MS_{CAT}(S,R) = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 1 & \text{if } C(H(S,R,m_i)) = GT(S,R,m_i) \\ 0 & \text{otherwise} \end{cases}$$

3.5 Conclusion

In this chapter we described the implementation of HEF. We discussed the design decisions that make this particular implementation of the framework best suited for GGP, and we presented the tools that the framework offers its users to facilitate the creation of benchmark datasets, the definition of metrics, and the analysis of heuristic functions. We then introduced a set of default HEF metrics, giving a formal definition of each and discussing their use cases. These metrics will be used in the rest of this thesis to evaluate the results of HEF analysis.


Chapter 4

Working With MCTS Datasets

In this thesis we propose using MCTS benchmark datasets when game-theoretic (GT) ones are too difficult to obtain. An MCTS dataset introduces some fuzziness that we do not need to deal with in GT datasets. MCTS is generally good at finding the best move, and we can often recognize the other good moves by looking at how many simulations were run on them, since the UCT policy pushes the algorithm towards exploiting the most promising moves. However, the further we move from the best moves, the less reliable the data becomes. The weakest moves will have only a tiny fraction of simulations starting with them, and the average score they achieve is likely to be very skewed. This makes it difficult to know exactly what the correct move ordering is. Moreover, since a move's score is the average over all simulations taken through it, we do not always have a clear distinction between moves which win the game and those ending in a draw, as these average scores can get quite diluted.

In this chapter we explore these and other difficulties that an MCTS dataset introduces. We evaluate the usability of such datasets for the various HEF metrics by comparing their performance on a Connect Four MCTS dataset against a purely GT one (we were able to find a highly optimized minimax solver for the game and used it to re-evaluate all the states from the MCTS dataset)[18]. Additionally, we investigate how the time we allow for MCTS simulations impacts the quality of the dataset. To do so, we compare the performance of two MCTS datasets, one generated with one hour of simulations per state and the other with three hours (we refer to them as the MCTS1H and MCTS3H datasets, respectively), against the GT dataset.


Depth        0           10           20           30            39
MCTS1H       1,481,730   23,753,994   41,445,020   50,775,790    53,985,321
MCTS3H       2,005,427   52,593,673   97,458,856   121,812,100   151,920,880

Table 4.1: Average number of simulations per state in the MCTS datasets, grouped by depth level.

4.1 Experimental Setup

The experiments presented in this chapter rely on three benchmark datasets for the game of Connect Four: GT, MCTS1H, and MCTS3H. All three datasets consist of the same 498 game states and differ only in how the action values were generated. The states were selected using the method described in Section 3.1.2. The GT dataset was obtained with the help of an optimized Connect Four solver which computes the game-theoretic value of a given state for the player whose turn it is. Since we are interested in the scores of individual actions in a state, we computed the game-theoretic values of all successor states for each state in the benchmark and assigned the inverse of their scores to the actions that lead to them.
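The score inversion step can be sketched as follows, assuming GGP-style scores in the range 0-100 and a two-player alternating game like Connect Four; the solver and state-machine interfaces below are placeholders, not the actual tools we used.

// All type names below are illustrative placeholders.
interface GameState {}
interface Move {}
interface StateMachine { GameState next(GameState state, Move move); }
interface Solver { int solve(GameState state); }  // value in [0, 100] for the player to move

class GroundTruthFromSolver {
    /** Ground-truth value of playing the move in the given state, from the mover's point of view. */
    static int actionValue(GameState state, Move move, StateMachine machine, Solver solver) {
        GameState successor = machine.next(state, move);
        int valueForOpponent = solver.solve(successor);  // it is the opponent's turn there
        return 100 - valueForOpponent;                   // invert to get the mover's value
    }
}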

MCTS1H and MCTS3H were obtained by running the Monte Carlo Tree Search simulation for one and three hours per state, respectively. We used the default HEF evaluator described in Section 3.1.3 with different simulation time parameters for the two datasets. Table 4.1 shows the average number of simulations per state at each of the sampled depth levels.

4.2 Identifying the Best Move

The first question we wanted to answer was how good MCTS is at its primary job of identifying the best available move. To measure the best-move accuracy of MCTS, we first identified the set of optimal moves per state according to the GT dataset – these are all the moves sharing the highest score – and checked whether the move with the highest number of simulations in the MCTS dataset was in this set.

Fig. 4.1 shows the accuracy of MCTS1H and MCTS3H in picking the best move. Overall, the accuracy of both datasets is about 85%, versus the 53% accuracy we would expect when choosing moves randomly. The move count plot shows that the accuracy does improve somewhat when states have fewer moves available (which is expected, as the search space shrinks), although this effect is not very strong. The accuracy is lower at the beginning of the game, which is not surprising as the search space there is the largest, but there is also an interesting plunge in the performance of MCTS1H at depths 30-35, which tells us there are some subtleties at work there that are fooling the search – although it seems that additional computation time helps overcome this difficulty.


Figure 4.1: Accuracy of MCTS datasets in identifying the best move. Both MCTS1H and MCTS3H achieve an overall accuracy of approximately 85%, while the random chance of picking a best move is 53%. The benefit of extra simulation time for MCTS3H shows at depths 30-35.


4.3 Depth Statistics

The original score differential metric computes the maximum difference between the move scores in a state at a given depth, relying on the assumption that the dataset provides an accurate score value for every action. In the case of MCTS datasets, however, the scores in the benchmark are the average scores the actions achieved over all random simulations, which makes them somewhat diluted and fuzzy. Moreover, the scores can get quite unreliable for the least explored moves, which have only a handful of simulations.

Fig. 4.2 shows what happens if we apply the score differential metric directly to an MCTS dataset's Q values (the average score over all simulations starting with a move). As expected, both datasets show a smaller difference between the maximum and minimum scores. This in itself does not worry us; we mainly use the score differential metric to identify the "areas of interest" in which the choice of a move can result in significantly different outcomes, and we find the absolute value of the score differential less important. The MCTS1H dataset does a relatively good job on this metric, identifying the main differential peak and generally keeping up with the shape of the GT curve afterwards. If we used this dataset with our metric, we would miss the important score difference at the initial state of the game, but otherwise it captures most of the behavior of the GT dataset.


Figure 4.2: Performance of the score differential metric on MCTS datasets. The score differential metric shows the maximum difference between the scores two moves can achieve in the same state, averaged over all states at a given depth. The Q value (average score achieved over all simulations starting from an action) from MCTS simulations can be used to roughly approximate the score differential behavior of the GT dataset, but comparing the MCTS1H and MCTS3H performance, we see that the Q value becomes less reliable as we increase the allowed simulation time.


The MCTS3H dataset, on the other hand, deviates significantly from the GT dataset. We can still see some of the topology past depth 20 – a peak at depth 25, another, more significant one at depth 32, and a dip at depth 35 – but we entirely miss the biggest peak at depth 18. Again, this is not surprising, as with more simulations the average scores get more diluted, but it tells us that Q is not reliable enough to be used for the score differential metric.

If we look back at Chapter 2, however, we see that MCTS really shines at solving multi-armed bandit problems – that is, at optimizing the total reward by exploiting the best action while still exploring the other available options. The N value tells us how many times the algorithm took a certain move over the course of the simulations, and it is directly related to how much reward the algorithm was getting from the subtree originating with that action. While a Q value may have shifted and become diluted over the simulations, the N values are a simple aggregation and are more reliable for deciding which actions yield higher reward.

With this in mind, we implemented a new metric which measures the logarithm of the ratio of the highest N to the lowest N in a state. As we see in Fig. 4.3, the metric performs similarly to its Q counterpart on the MCTS1H dataset. On the MCTS3H dataset, on the other hand, it improves the result significantly, revealing the peak at depth 8 and emphasizing the peaks and dips in the second half of the game. These results indicate that the MCTS datasets' N values can provide a good approximation of the GT score differential dynamics.
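Written in the notation of Section 3.4 (the label $NDIFF$ and the symbols $N_{max}$ and $N_{min}$, denoting the highest and lowest simulation counts among the moves available in $S$, are ours for this presentation), the metric is:

$$MS_{NDIFF}(S,R) = \log \frac{N_{max}(S,R)}{N_{min}(S,R)}$$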

4.4 Best-Only Metrics

The best-only metrics evaluate a heuristic's ability to identify the best moves available to a player in a state. They rely on a best move selector, which selects the set of moves from the benchmark sharing the same highest score. With MCTS datasets, however, we need to redefine what we mean by "highest score"; while the MCTS algorithm will likely identify and exploit the most profitable moves, the average scores achieved over all simulations by moves that have the same GT value can be quite different.

Before evaluating the accuracy of the best-only metrics on MCTS datasets, we checked whether it is possible to implement best move selectors that accurately identify the best move set in an MCTS dataset. The first one we investigated selects all moves whose average score differs from the highest (optimal) score by at most a constant value O. To avoid selecting possible false positives with a low N and a high Q, we first order all benchmark moves from highest to lowest N and iterate through them, selecting moves until we encounter the first one whose Q is more than O away from that of the best move.
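A compact sketch of this O-distance selection is given below; MoveInfo is a stand-in for HEF's value objects, and the input list is assumed to arrive sorted by N in descending order, as the DAO returns it.

import java.util.ArrayList;
import java.util.List;

// Sketch of O-distance best move selection (illustrative types, not HEF's API).
class ODistanceSelector {
    static class MoveInfo {
        final String move; final double q; final long n;
        MoveInfo(String move, double q, long n) { this.move = move; this.q = q; this.n = n; }
    }

    /** Selects the leading moves (by N) whose Q lies within o of the top move's Q. */
    static List<MoveInfo> selectBest(List<MoveInfo> movesSortedByN, double o) {
        List<MoveInfo> best = new ArrayList<>();
        double topQ = movesSortedByN.get(0).q;    // the most-simulated move is taken as the best
        for (MoveInfo m : movesSortedByN) {
            if (Math.abs(topQ - m.q) > o) break;  // stop at the first move more than O away
            best.add(m);
        }
        return best;
    }
}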


Figure 4.3: Performance of the score differential metric using the N values of MCTS datasets. Using the N value (total number of simulations starting from an action) gives us curves which follow the GT dataset more closely than the Q values. This metric takes the logarithm of the ratio of the highest to the lowest N in a state.


Fig. 4.4 shows the accuracy of this selector for different O values, measured against both MCTS datasets. We measure the percentage of real best moves included in the set that the MCTS selector chooses, as well as the number of false positives – moves included in the best move selection which are not actually best moves. We aim to maximize the former while limiting the increase in the latter. Unsurprisingly, a higher O increases both counts. Going from O=5 to O=10 improves the best move inclusion at lower depths by up to 40%, while increasing the rate of false positives by 20% only at the first few depths. The further increase to O=15 yields a smaller return on best move inclusion, but increases the false positive count significantly in the MCTS3H dataset.

Given these results, O=10 seems to be a good value for the game of Connect Four. However, we notice that this kind of move selection is heavily influenced by the player whose turn it is in a state; as we see in the MCTS1H dataset, there are big disparities in best move inclusion between consecutive turns later in the game. If the performance of the best move selector jitters this much within a single game, we worry about how reliable the metric will be on other games. Moreover, determining the correct O value for a game for which we have no GT dataset for comparison will be difficult, since the score distributions vary from one game to another.


a game for which we don’t have a GT dataset for comparison will be difficult, since thescore distributions vary from one game to another.

Following the idea from the score differential metric, we investigated best move selectors that rely on the N value rather than the Q value. When a state contains moves with large value differences, we often observe a sharp cutoff in the number of simulations dedicated to the best moves. We therefore implemented a second best move selector which sets the best move inclusion break point at the highest derivative value with respect to N. The results are shown in Fig. 4.5; the derivative selection does not have as high a false positive rate in the initial states, but it also falls far behind the O-distance selection in the first half of the game. We also observe that the derivative selector acts more reliably in the later stages of the game, especially on the MCTS1H dataset.

For the rest of this chapter, we will use the O-distance selector with O=10 for evaluating the best-only metrics, since it achieves the best accuracy for Connect Four. So far we have observed that the extra two hours of simulation in the MCTS3H dataset have a noticeable positive effect on the metrics' performance, so we will default to that dataset.

4.4.1 K-Best Accuracy

The K-best metrics tell us whether a heuristic is able to place at least one of the best moves among the K moves that it ranks the highest. Fig. 4.6 shows how this metric copes with the MCTS3H dataset. We chart the accuracy of the action heuristic as calculated by the metric using the O-distance best move selector with O=10 against the corresponding results from a metric using the GT move selection on the GT dataset. The results look quite good; the largest discrepancy we observe is in the initial states, caused by the number of false positives the O-distance selector introduces at those depths, but even so, we see all the accuracy dips and peaks in the right places. The second significant error of the MCTS3H dataset is at depth 11, where the MCTS metrics fail to capture the significant rise in accuracy that the GT dataset shows. As K increases, the two datasets converge, which is to be expected, as a higher K increases the probability that any move will make it into both the selected best moves and the K best moves chosen by the heuristic.

Fig. 4.6 also shows the random counterpart of the K-best metric, which tells us the chance that a random heuristic would place a best move within its K highest-ranked moves. Here the GT and MCTS datasets behave very similarly, as this metric depends only on the size of the selected best move set, regardless of which moves are actually selected.


Figure 4.4: Accuracy of O-distance best move selection in MCTS datasets. O-distance selection selects moves from a list ordered by N until a move is encountered whose Q differs by more than O from that of the best move. Using O=10 seems to be the best compromise between best move inclusion and false positive rate, although we have some concerns about the dips in performance in alternating states in MCTS1H.


Figure 4.5: Accuracy of derivative-based best move selection in MCTS datasets. Using the derivative with respect to N to find the best move break point improves the false positive rate and mitigates the jaggedness of best move inclusion for MCTS1H, at a great cost to the actual best move inclusion rate.


Figure 4.6: Performance of the K-best metric on the MCTS dataset. The accuracy of the metric improves with higher K, but keeps close to the accuracy values provided by the GT dataset. The random K-best metrics, which are often used in conjunction with the K-best metric, perform splendidly on MCTS.


Figure 4.7: Performance of the strict-best metric on the MCTS dataset. The metric performs well on MCTS, although it displays the same deficiencies found in K-best. Most of the accuracy information is preserved, especially when compared to the best move ratio metric.

4.4.2 Strict-Best Accuracy

The strict-best metric looks for false positives – it takes all the actions to which the heuristic function assigns its highest score and counts how many of them are in the actual best move set. It is useful to look at this metric in conjunction with the best move ratio, which gives the ratio of best moves to the total number of moves in a state (and coincides with the random K-best for K=1), so we plotted them together in Fig. 4.7 to see whether an MCTS dataset preserves the information from these two metrics. We see that the strict-best metric gets slightly too optimistic on the MCTS dataset, but overall stays close to the trends of the GT dataset. The distance from the random metric is exaggerated in the initial states, which could lead us to believe that the heuristic does significantly better than it actually does, but apart from this, the performance of the MCTS dataset is satisfying.

4.5 Expected Score Metrics

The expected score metrics tell us how well a heuristic works in conjunction with a sampling function. The sampling function bases its decision of which move to pick on the scores the heuristic assigns to the moves.


We calculate the expected score obtainable by the heuristic/sampling-function pair using the benchmark scores of the actions. Fig. 4.8 shows the plot of this metric using three different sampling functions on the GT and MCTS datasets. In both datasets we use the Q values of the moves to calculate the expected score, and as we can see, that does not work very well for MCTS. As our findings from the previous sections indicate, the Q scores tend to get diluted and are often incorrect for the least exploited moves, and this imprecision of the benchmark data cancels out the effect of the sampling function almost entirely. Our graph does order the three sampling functions correctly on the MCTS dataset – with the random one yielding the least return and the epsilon-greedy one the most – but they are so close to each other that we do not get a correct perception of the extent to which a sampling function improves the expected score.

We can, however, use the MCTS dataset to investigate the behavior of the heuristic limited to a few best moves, which are more likely to have values close to the correct ones. Fig. 4.8, bottom, shows the expected scores of our sampling functions limited to the three best moves. That is, we let the sampling function calculate the probability of picking each move out of the complete set, but we only apply the first three probabilities when computing the expected score. Using this metric, the MCTS dataset starts exhibiting trends similar to the GT dataset. Once again the metric's absolute values are compressed, but the ratios between the scores achieved by different sampling functions stay similar: in GT the average ratio between the epsilon-greedy and random functions is 2.15, and 1.54 between epsilon-greedy and the Boltzmann distribution, while the MCTS ratios are 1.82 and 1.4 for the respective functions. This makes us hopeful that this modification of the expected score metrics can produce reliable – if partial – information.

4.6 Move Ordering and Categorization Metrics

The move categorization metrics are used in scenarios in which the heuristic scores should map directly onto the minimax values of actions. These metrics evaluate both the heuristic and an implementation of a mapper which maps the heuristic scores. The produced mapping is then compared to the benchmark values of the actions, and we count how many values are mapped correctly. As such, the move categorization metric relies on correct score values in the benchmark more than any other metric. Unfortunately, as we have seen, the MCTS datasets produce Q values which are at best approximations of the real scores. If we were to use these metrics on an MCTS dataset, we would first need to map the MCTS scores to the GT scores. The first question to ask, however, is whether such a mapping even exists.


Figure 4.8: Performance of expected score metrics on MCTS datasets. The original implementation of the expected score metric does not work well with MCTS datasets, as it hides the difference in performance between different sampling functions. If we limit the evaluation to the three best moves, however, the MCTS dataset starts revealing the information that we would get from a GT dataset.


We decided to look for that answer with the help of Weka[19], a machine learning toolkit developed at the University of Waikato which implements a selection of machine learning algorithms. We tried using several classifiers offered by Weka (random forests, multilayer perceptrons, the Ridor rule learner, and J48 decision trees) to train a model which predicts the GT score based on the Q, N, total N, and depth values of a move in a state, using 20% of our dataset for training. Among these classifiers, we obtained the best results with the decision trees, which produced a mapping with an overall accuracy of 67%, although the performance varies across players and depths – the red player is stuck with an accuracy between 25% and 50% from depths 20 to 34. These results are not encouraging; with this much jitter in the benchmark itself, we cannot hope to get a meaningful move categorization evaluation of a heuristic.
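For reference, a minimal sketch of this kind of experiment with Weka's J48 learner is shown below; the ARFF file name and attribute layout are assumptions, and it uses cross-validation purely for illustration, whereas the experiment above trained on a fixed 20% split.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch of learning a mapping from MCTS statistics (q, n, total_n, depth) to GT scores.
public class ScoreMappingExperiment {
    public static void main(String[] args) throws Exception {
        // ARFF with numeric attributes q, n, total_n, depth and a nominal GT-score class
        // as the last attribute (file name and layout are assumptions for this sketch).
        Instances data = DataSource.read("connect4_mcts_to_gt.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();                        // decision tree learner
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 5, new Random(1));
        System.out.println(eval.toSummaryString());  // overall accuracy summary
    }
}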

We then tried applying the K-best strategy: we trained a new model using only the information on the 3 moves per state with the highest N in the MCTS dataset, and tested its ability to map the scores of the 3 best moves of the remaining states. We were hoping that, as with the expected score metrics, we could at least gain a partial insight into a heuristic's performance limited to the best K moves. Although we did see a small improvement in accuracy, which rose to 74%, we still observe significant jitter and pits in performance for the red player, which indicates that MCTS datasets are not suitable for this kind of evaluation.

The move ordering metrics are similarly affected by the MCTS datasets. These metrics identify the positions in the benchmark's sorted list of moves occupied by moves with the same score, and check that, when the moves are sorted by their heuristic scores, they still fall into the correct positions in the list. Of course, this relies on the benchmark's move ordering being correct, which, as we saw, becomes problematic beyond the first few moves. We measured the accuracy of the move ordering produced by the MCTS dataset against the GT one and found it to be 57.2%, which we consider far too low to serve as a basis for evaluating heuristics.

4.7 Conclusion

In this chapter we presented our findings on the performance of MCTS datasets on our proposed metrics, limited to the game of Connect Four. We have seen that MCTS datasets have some flaws. Since MCTS focuses on discovering and exploiting the best move and is not concerned with less optimal moves, MCTS-generated datasets are inadequate for use with any metric which evaluates a heuristic's ability to sort the moves correctly according to their predicted values, as the move ordering of the benchmark itself is simply not reliable.


ing to their predicted value, as the move ordering of the benchmark itself is simply notreliable.

Similarly, we will have trouble with any metric which strictly relies on the GT values of moves in the benchmark, since MCTS only tells us what the average score of a move is over many simulations and, as we have seen in the case of our move categorization metric, there does not seem to be a straightforward way of mapping the MCTS scores to the GT scores. This is also a problem with the expected score metrics, although we were able to circumvent it, to a degree, by designing a metric which focuses only on the three best moves. Interestingly, in the game of Connect Four this gives us results very close to calculating the expected score over all moves, since this game often has many losing moves per state with score 0. When using this modified metric, however, we need to take the results with a grain of salt; we cannot rely on the expected scores given by MCTS datasets to be accurate in their absolute values (as we have seen, they are much lower than in GT datasets), but the metric does reveal the relative performance of various sampling functions. This information, while not perfect, can still be used to tell whether a newly designed sampling function works better or worse than the previous one.

MCTS datasets perform better on metrics which are only concerned with checking how good a heuristic is at spotting the optimal moves. In the cases of the K-best, random K-best, and strict-best metrics, we see that their results on MCTS datasets closely follow their results on GT datasets. We do, however, need to keep in mind that the MCTS dataset has difficulties in determining the best move set in the initial states of the game, so the results from those regions might not be accurate.

We are also quite happy with the performance of the score differential metric, which is an important auxiliary metric that helps interpret results from the other metrics. Using the N values instead of Q for this metric gives us an accurate picture of how the score differential changes from one depth to another, although we lose the actual values of these differentials. We do not consider that to be problematic for the metric, however, as its primary intent is to signal in which region of the game the ability to select a good move is more relevant, due to the higher difference in possible outcomes.
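As a minimal sketch of the aggregation described here (with an assumed data layout, not HEF’s internal representation), the score differential of a state can be computed as the gap between its best and worst move value and then averaged per depth:

# Sketch of a score differential computation under the assumptions above.
from collections import defaultdict

def score_differential_by_depth(states):
    """states: iterable of (depth, move_values) pairs, where move_values is a
    list of benchmark scores (GT values, or N counts for an MCTS dataset)."""
    gaps = defaultdict(list)
    for depth, move_values in states:
        if move_values:
            gaps[depth].append(max(move_values) - min(move_values))
    # Average the per-state gaps for each depth.
    return {d: sum(v) / len(v) for d, v in sorted(gaps.items())}

print(score_differential_by_depth([(0, [50, 50, 0]), (1, [100, 0]), (1, [100, 100])]))
# {0: 50.0, 1: 50.0}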

Keeping our findings in mind, we believe that MCTS benchmarks do have their use, despite their limitations. A GT dataset allows for a more complete analysis of heuristics, but GT datasets can be very difficult to obtain. MCTS datasets limit us to best-move analysis, which gives us valuable, although incomplete, information on the behavior of our heuristics. When obtaining a GT dataset is not possible, an MCTS one can still be valuable.


Chapter 5

Case Studies

So far we have presented our approach to the systematic evaluation of heuristic functions with HEF – we discussed the metrics we developed, their implementation, and their performance when used with MCTS-derived datasets. In this chapter we show HEF in action.

The first example of HEF analysis is based on the study of action heuristics for GGP carried out at Reykjavik University. In this example, we use HEF to understand why this heuristic was not beneficial in some of the games tested. The second case study involves analyzing two heuristics written by Reykjavik University students as part of their Artificial Intelligence class. We show how HEF helps us understand the strengths and weaknesses of each heuristic, and explains the outcome of pitting them against each other.

5.1 Action Heuristic Analysis

5.1.1 The Action Heuristic Study

As we discussed in Chapter 2, the MCTS algorithm relies on many simulations of gameplay to decide whether a move is good and worthy of further exploitation. The problem with using random simulations, however, is that they do not recognize meaningful move sequences, and thus end up exploring a very wide region of the game tree which is not very interesting, diluting the results of playing actually significant moves. Consider, for example, a game in which a player can move in two directions and needs to reach a goal cell. Reaching the goal state can be quite straightforward (e.g. move forward 5 times), yet, if we consider all the move sequences that are legal, we will end up exploring many action sequences which take us away from the goal or end up in cycles. To deal with this problem, we can use heuristics to guide our random simulations, modifying the sampling function that chooses among the legal moves accordingly. This was done to great success in the MoGo Go player, which combines a set of offline heuristics with its simulation phase [15].

The action heuristic study done by Schiffel and Trutman looks into the possibility of generating such simulation-guiding heuristics for GGP players [20]. The heuristic function, as its name implies, evaluates an action rather than a state, to avoid the often expensive computation of the states which follow each available action. The heuristic is constructed by regression on goal states. The regression starts by listing all the goal descriptions resulting in a win for a player. Then, it looks at what conditions must be fulfilled so that the last action of a player results in the goal state. For example, if our goal state is three in a row – cells 1:1, 1:2, and 1:3 being marked by x – player x can get to it through the action "mark 1:1" if 1:1 is blank and 1:2 and 1:3 are marked with x. A formula for evaluating an action is then constructed as the disjunction of all possible states from which that action results in a win. During gameplay, this formula is fuzzily evaluated, and the heuristic assigns scores to actions depending on how closely the current state resembles the regressed states.
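To make the regression example concrete, the toy sketch below evaluates an action against a disjunction of regressed conditions. The matching rule used here (fraction of satisfied conditions, maximized over disjuncts) is only an assumption standing in for the fuzzy evaluation actually used in [20]; the formula construction itself is also simplified to hand-written sets of fluents.

# Toy illustration of the regression idea, not the actual construction or
# fuzzy semantics of the action heuristic in [20]: each disjunct lists the
# fluents that must hold for the action to complete a win, and the action's
# score is how closely the current state matches the best disjunct.
def action_score(state_fluents, disjuncts):
    best = 0.0
    for required in disjuncts:
        satisfied = sum(1 for fluent in required if fluent in state_fluents)
        best = max(best, satisfied / len(required))
    return best

# "mark 1:1" completes a win for x if 1:1 is blank and 1:2, 1:3 hold x
# (one of several disjuncts regressed from the three-in-a-row goal).
mark_1_1 = [{"blank(1,1)", "x(1,2)", "x(1,3)"},
            {"blank(1,1)", "x(2,1)", "x(3,1)"}]
state = {"blank(1,1)", "x(1,2)", "o(1,3)", "x(2,1)", "blank(3,1)"}
print(action_score(state, mark_1_1))  # ~0.67: the closest disjunct is two-thirds satisfied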

The results of the preliminary study using the action heuristic to guide MCTS search are encouraging, as shown in Table 5.1. The results were obtained by playing 400 matches per game against an unmodified MCTS player (switching which player uses the heuristic after 200 games), and limiting the search for both players to 10,000 random simulations per turn. The study evaluated the performance of the heuristic in guiding the tree exploration and the playout simulations, both separately and in conjunction with one another. As we see, the heuristic player does at least as well as, and in many cases better than, the unmodified MCTS player. The two exceptions are the connect4 and checkers-small games, where the score of the heuristic player is significantly lower.

5.1.2 Studying Action Heuristics with HEF

We decided to take a look at the performance of the action heuristics on a few of the games. We were interested in bidding-tictactoe, breakthrough, checkers-small, connect4, ghost-maze2p, nineBoardTicTacToe, pentago, and sheep_and_wolf, the games in which we saw the use of heuristics impact performance significantly, by 20% or more. We skipped nineBoardTicTacToe due to problems with the game rule sheet.


Table 5.1: Performance of GGP MCTS players using action heuristics to guide the simulation against an unmodified MCTS player.

Depth                      0           10           20           30           40
bidding-tictactoe    1,162,854   36,672,454            \            \            \
breakthrough           254,162      274,284    4,944,536   29,623,671    2,400,104
checkers-small         215,698    3,812,724    1,415,568      450,498   23,754,318
pentago                463,178      575,751      550,493    1,032,596   53,816,216

Table 5.2: Average number of simulations per state in MCTS datasets for the analyzed games, grouped by depth level.

We generated the HEF benchmark data for the remaining games. For each game, we generated between 150 and 200 benchmark states (depending on the branching factor and depth of the game); for checkers-small we went up to 1100 states, as we were interested in examining this game more closely due to the significant failure of the action heuristics on it. The HEF data for each game was obtained through an hour-long MCTS simulation per state; Table 5.2 shows the average number of simulations over all states at various game depths.

Once we examined our datasets, however, we also discarded sheep_and_wolf and ghost-maze2p from the analysis; these games suffer severely from the above-mentioned issue of having many meaningless actions available throughout the game. As discussed in Chapter 2, GGP games must terminate after a finite number of steps, and for these games that simply means that after a certain number of turns the game ends in a draw. Thus, most simulations end in a draw and we see minuscule variance in the average move scores, so we do not consider these HEF datasets reliable enough. We do speculate, however, that regardless of the quality of the heuristic on different parts of the game, the heuristic player is at an advantage against the pure MCTS one simply because the heuristic offers a one-ply lookahead, which guarantees that a winning move will be taken in simulation when available, instead of a move that might result in some pointless cycle.


This leaves us with five games to analyze. We assume that the findings about MCTS datasets from the previous chapter generalize to other games, and that the best-only metrics will provide us with reliable results. We choose the derivative best move selector, since the absolute score difference threshold that fit the Connect Four data best is more difficult to generalize to a dataset which we cannot check against GT values.
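As a reminder of the idea (one plausible reading of the selector, not necessarily the exact HEF implementation), a first-derivative selector sorts the moves by their benchmark value and cuts the best-move set at the largest drop between consecutive values:

# Sketch of a first-derivative best-move selector, as read from the text above.
def derivative_best_moves(move_values):
    """move_values: dict mapping move -> benchmark value (e.g. MCTS N count)."""
    ranked = sorted(move_values.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return {m for m, _ in ranked}
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = drops.index(max(drops)) + 1          # cut just after the biggest drop
    return {move for move, _ in ranked[:cut]}

print(derivative_best_moves({"a": 9000, "b": 8800, "c": 1200, "d": 900}))
# {'a', 'b'} (set order may vary): the largest drop separates the two heavily sampled moves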

Fig. 5.1 shows the results of the best-only analysis of the action heuristics for the five games. The first thing we notice is that the heuristic works poorly on checkers-small. The strict-best metric tells us that the probability of the best-scored moves actually being best moves is consistently lower than the probability of picking a best move at random. The K-best metrics, which tell us how often the K highest-ranked moves contain at least one best move, also show under-performance of the heuristic: while the K2 metric follows the random baseline more or less closely, if with some worrisome dips in mid-game, the K3 metric consistently stays under its baseline. This data tells us that the random simulations of the heuristic player were biased towards sub-optimal moves, which is a plausible explanation for the huge gap in performance that we see in Table 5.1.
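For reference, the per-state computations behind these two metrics can be sketched as follows; aggregation over states and the random baselines are omitted, and the data layout (dictionaries of per-move scores) is an assumption.

# Per-state sketches of the K-best and strict-best metrics described above.
def k_best_hit(heuristic_scores, best_moves, k):
    # Do the K highest-ranked moves contain at least one benchmark-best move?
    ranked = sorted(heuristic_scores, key=heuristic_scores.get, reverse=True)
    return any(move in best_moves for move in ranked[:k])

def strict_best_ratio(heuristic_scores, best_moves):
    # Ratio of benchmark-best moves among the moves sharing the highest heuristic score.
    top = max(heuristic_scores.values())
    top_moves = [m for m, s in heuristic_scores.items() if s == top]
    return sum(1 for m in top_moves if m in best_moves) / len(top_moves)

scores = {"a": 0.9, "b": 0.9, "c": 0.2}
best = {"b"}
print(k_best_hit(scores, best, k=2), strict_best_ratio(scores, best))  # True 0.5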

The results for the Connect Four game are, on the other hand, surprising. The metrics show that the action heuristic seems to be doing a better job of picking the best moves than random sampling would. The previous experiment, on the other hand, shows us that the heuristic player loses about 10% more often than a random player. This makes us wonder where the problem lies. If we look at our metrics, we see their measured accuracy drop below the baseline in two places: around depths 15 and 33. These are also the areas of the game which the score differential metric signals to be of high interest, since the choice of the right move there can have a big impact on the score. Now, we do expect the metrics to show dips in accuracy when fewer best moves are available in a state – which is what happens in high score-differential states – as the probability of randomly picking a good move decreases. Here, however, we see that in the critical states of the game, in which the choice of the right action matters most, the action heuristic leans towards choosing the wrong moves.

If we instead look at pentago2008, which is the game on which the action heuristics achieve the best score among the ones analyzed, we notice the exact opposite: in the second part of the game, where the score differential is the largest, the heuristic’s accuracy rises far above its baseline. The heuristic’s accuracy is even more relevant when we consider that a benchmark state in this game has a maximum of 37 and an average of 14 moves available to players. We can say that, in this game, the heuristic knows what it is doing.


Figure 5.1: Best-only metrics evaluation of the action heuristics, with one panel per game: (a) Checkers-small, (b) Connect Four, (c) Breakthrough, (d) Pentago, (e) Bidding tic-tac-toe. The metrics are run against the GT dataset for Connect Four and the MCTS1H dataset for the other games. The metrics working with the MCTS datasets rely on the first-derivative method for selecting the best move set. The K-best metrics tell us how often the K highest-ranked moves contain at least one best move, while the strict-best metric calculates the ratio of best moves among the moves that the heuristic evaluates to the same highest score. Score differential shows the difference between the highest and lowest benchmark value for states at a given depth.


The heuristic does not show as impressive an improvement on bidding-tictactoe. The strict-best metric consistently lags behind its baseline, although it curiously overtakes it at depth 11, where the maximum score differential occurs. The K-best metrics show a better performance, both overtaking their baselines after depth 7 and reaching 100% accuracy rather early in the game, which is probably why the heuristic was beneficial in this game.

Breakthrough gives us mixed results. Between depths 10 and 25, all metrics show that the heuristic is performing well. We then see a small dip in performance, a short recovery, and another, more worrisome, plummet in accuracy at depth 42, corresponding to the score differential maximum. Despite this, the heuristic player wins approximately 15% more games than the random one. It might be that the effectiveness of the heuristic in the early game steers the game in a more favorable direction from the start. We also need to keep in mind that breakthrough has a very large branching factor, with a maximum of 39 and an average of 28 moves per state, which may mitigate the shortcomings of the heuristic.

5.2 Analyzing Student Written Connect4 Heuristics

In our second example of HEF use cases, we analyze heuristics developed by a group of Reykjavik University students as part of their Artificial Intelligence course assignment [21]. The students were tasked with writing an AI player for Connect Four. They implemented their player – named Nilli – using iterative deepening alpha-beta search, which resorts to a heuristic function when the search hits a non-terminal state at maximum depth. When asked how they tested their heuristics for this player, the students said they simply debugged a playthrough and observed how the heuristic behaved in a few states. This approach is painfully familiar to the author, and is one of the driving reasons that HEF was developed. While direct observation is good for understanding the mechanisms of a heuristic, it is a very time-intensive kind of analysis which is hard to apply to an extended sample of game states. In this section we describe the Nilli heuristics and show how HEF could have been used to complement their development.

5.2.1 Nilli Heuristics

The students provided us with two versions of the heuristic they implemented for the game. The first one (we will refer to it as "Nilli-old") was rather simple: it starts with a base score of 500 (corresponding to the score assigned to a draw) and iterates through all cells of the board. For each cell occupied by the player’s token, it adds one point to the score for each adjacent token of the player’s color. Likewise, if the token belongs to the opponent, it subtracts one point from the score for each adjacent token of the opponent’s color. It is a very simple heuristic, following the logic that more adjacent tokens of the player’s color mean more possible chances of winning.
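A minimal sketch of this scoring, with an assumed board encoding (a dictionary of occupied cells), could look as follows; it is an illustration of the described logic rather than the students’ actual implementation.

# Sketch of the Nilli-old idea: start from the draw score of 500 and add or
# subtract one point for every adjacent same-coloured token of the player or
# the opponent, respectively. Board encoding is an assumption.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def nilli_old(board, player, opponent):
    """board: dict mapping (row, col) -> token ('x' or 'o') for occupied cells."""
    score = 500
    for (r, c), token in board.items():
        same_colour_neighbours = sum(
            1 for dr, dc in NEIGHBOURS if board.get((r + dr, c + dc)) == token)
        if token == player:
            score += same_colour_neighbours
        elif token == opponent:
            score -= same_colour_neighbours
    return score

board = {(5, 3): 'x', (5, 4): 'x', (4, 3): 'o'}
print(nilli_old(board, player='x', opponent='o'))  # 502: the two x tokens see each other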

The second iteration of the heuristic ("Nilli-new") checks every possible diagonal, vertical, and horizontal set of cells which could lead to victory. A set containing both token types is discarded; otherwise, the set is awarded increasingly more points the more tokens it contains. For example, a diagonal is awarded one point for containing one player token and three empty cells, fifteen points for two player tokens, forty points for three tokens, and a thousand points for all four tokens (the values are negative for sets containing the opponent’s tokens). Nilli-new does a better job of reflecting the game structure, and it accounts for the increasing difficulty of getting a higher number of tokens into a set. The player using this version of the heuristic is reported to have performed quite well in the class Connect Four competition.
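The following sketch illustrates the described line-based scoring on a standard 6x7 board, using the point values quoted above; the board encoding and the enumeration of winning lines are assumptions, not the students’ code.

# Sketch of the Nilli-new idea: score every four-cell winning line; lines
# containing both colours are worth nothing, otherwise they are worth more
# the more tokens they hold (1/15/40/1000, as described in the text).
LINE_VALUE = {0: 0, 1: 1, 2: 15, 3: 40, 4: 1000}
ROWS, COLS = 6, 7
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]   # horizontal, vertical, two diagonals

def winning_lines():
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in DIRECTIONS:
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS for rr, cc in cells):
                    yield cells

def nilli_new(board, player, opponent):
    score = 0
    for cells in winning_lines():
        tokens = [board.get(cell) for cell in cells]
        mine, theirs = tokens.count(player), tokens.count(opponent)
        if mine and not theirs:
            score += LINE_VALUE[mine]
        elif theirs and not mine:
            score -= LINE_VALUE[theirs]
    return score

board = {(5, 3): 'x', (5, 4): 'x', (4, 3): 'o'}
print(nilli_new(board, player='x', opponent='o'))  # 41: x's adjacent pair outweighs o's lone token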

5.2.2 Nilli Meets HEF

Unlike the action heuristics, the Nilli heuristics work on a representation of the game different from the GGP one used in the benchmark. Thus, the first step of their analysis was mapping the states from their HEF representation to Nilli’s. Once this was done, the Nilli heuristics were ready for all our metrics. Fig. 5.2 shows the performance of the two iterations of the heuristic on the strict-best, K-best, and move ordering metrics. The first thing we note is that Nilli-new dominates Nilli-old in all metrics – which is not surprising, since it uses more convincing logic. Looking at the strict-best metric, we see that both heuristics show good resistance to the score dynamics; the best move probability metric shows a drop in the chance of randomly picking a good move around depth 16; this drop, however, does not affect the heuristics, which instead exhibit a peak in accuracy around these depths of the game. This is a good thing to observe – it makes us more confident that the heuristic is doing something smart, rather than having its accuracy depend on the topology of a game region.

We do notice, however, that the old heuristic lags behind its baseline until depth 9. This is understandable – at these depths of the game the pieces are still sparse, so it is likely that most cells have similar neighborhoods, which makes it hard for Nilli-old to differentiate them. The old heuristic shows the same issue in the K-best metrics as well; there it takes even longer to intersect its K3-best baseline. This is a sign of trouble; sometimes the strict-best metric can be too restrictive, penalizing a heuristic which may be able to recognize a good move and give it a high rating while not necessarily picking it as the best move. The K3-best metric, on the other hand, is more relaxed, and as we saw in the action heuristic analysis, the gap with the K3-best metric tends to be smaller than the gap with the strict-best metric, as there is more chance the heuristic will put a good move within the best three. In our example, however, it seems that Nilli-old outright ranks bad moves above good ones. The move ordering metric also shows Nilli-old falling behind Nilli-new.

After looking at the HEF data, we evaluated the two heuristics empirically. We created a random player, which uses the same implementation of alpha-beta search as the Nilli player but assigns a random value between 1 and 999 to any non-terminal state at maximum search depth. We set up 400 matches between the Nilli player and the random player for each heuristic, with the turn time limited to 60 seconds. Nilli-new won 74.9% of matches against the random player, while Nilli-old won only 24.6% – a result that agrees well with the performance shown by the metrics, and most likely reflects Nilli-old’s difficulty in picking good moves in the early game.

5.2.3 Comparing Nilli and Action Heuristics

Having analyzed two kinds of heuristics, we thought it would be interesting to compare the action and Nilli heuristics side by side (Fig. 5.3). The first metric we look at is, once again, strict-best, and it shows that the action heuristic works better than Nilli-old, but is in turn outperformed by Nilli-new – especially in the beginning of the game and, as mentioned earlier, in the depth-16 area of interest. Keeping in mind that the action heuristic is automatically generated by regression on the game rules, it is good to see that it does better than a human-designed heuristic (however simple the design of Nilli-old may be). It is especially good to see that the heuristic copes relatively well with the initial states, which are the least similar to the goal states from which the heuristic is regressed.

The second graph shows the performance of the heuristics on the expected score metrics. We see that the epsilon-greedy function reflects well the disparity in accuracy between the two heuristics, while the Nilli heuristics do not do particularly well with the Boltzmann distribution sampling function, despite their accuracy being higher than the action heuristic’s. This is probably explained by the fact that the sampling function was developed and fine-tuned to work well with the action heuristic’s score range.
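For illustration, both sampling functions can be written as move-selection probability distributions, so that an expected score is obtained by weighting each move’s benchmark value with its selection probability. This is one way to read the expected score metric; the epsilon and temperature values below are placeholders, not the ones used in the experiments.

# Sketches of epsilon-greedy and Boltzmann sampling over heuristic scores,
# plus a probability-weighted expected score. Parameter values are placeholders.
import math

def epsilon_greedy_probs(heuristic_scores, epsilon=0.1):
    moves = list(heuristic_scores)
    best = max(moves, key=heuristic_scores.get)
    base = epsilon / len(moves)                 # exploration mass, spread uniformly
    return {m: base + (1 - epsilon if m == best else 0.0) for m in moves}

def boltzmann_probs(heuristic_scores, temperature=10.0):
    exps = {m: math.exp(s / temperature) for m, s in heuristic_scores.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}

def expected_score(probs, benchmark_values):
    return sum(p * benchmark_values[m] for m, p in probs.items())

heuristic = {"a": 40.0, "b": 25.0, "c": 5.0}
benchmark = {"a": 100, "b": 50, "c": 0}
print(expected_score(epsilon_greedy_probs(heuristic), benchmark))
print(expected_score(boltzmann_probs(heuristic), benchmark))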


Figure 5.2: HEF analysis of Nilli heuristic. HEF metrics show the improvement of Nilli-new over its older version. All metrics are run against the GT dataset.


Finally, we ported the action heuristic to be used with the Nilli player’s alpha-beta search, and set up the same experiment as with the two Nilli heuristics. Out of 400 games, the action heuristic player won only 11% of the time! This result does not reflect the accuracy analysis, which placed the action heuristic above Nilli-old, but instead arises from the fact that the action heuristic’s scoring was simply not designed to be used with alpha-beta search.

5.3 Conclusion

In this chapter we gave a few examples of what it is like to work with HEF. The two heuristics that we chose to study are quite different from each other: the action heuristic is generated automatically, is intended to be used in conjunction with MCTS, and was developed to be usable with any GGP game. The Nilli heuristic was manually designed to fit the game of Connect Four and to be used in alpha-beta search and, unlike the action heuristic, it operates on a game state description different from the one used in the HEF benchmarks. For both heuristics, however, HEF gave us a way to quickly answer the various questions we had about them.

The previous study of action heuristics left us wondering why this heuristic under-performs in some of the games. The MCTS datasets made it possible to evaluate the heuristic across five games – four of which we had no efficient way of calculating the game-theoretical values for. In some cases, our metrics gave us answers – we are confident saying that the poor performance of the heuristic in checkers-small is related to the accuracy measures which consistently lag behind their baselines. In pentago2008 we can likewise tie the good results to the high accuracy measures. In the case of Connect Four, on the other hand, we have more questions than we started with: the heuristic’s accuracy is not bad in itself, yet the heuristic player lags behind the random player. Perhaps there is a problem with our sampling function? Maybe using this heuristic in MCTS prevents us from exploring the game enough? Likewise, seeing the performance of the heuristic with alpha-beta search, we might want to analyze the heuristic with the move categorization metrics.

In other words, we do not expect that a single metric can tell us whether a heuristic is good. Each metric answers a specific question about the heuristic based on the heuristic’s behavior over hundreds of states. Asking the right questions will help us understand what our heuristic is actually doing, and can guide our troubleshooting.


Figure 5.3: Comparison of the Nilli and action heuristics on the HEF metrics. In terms of strict-best accuracy, the action heuristic does not perform as well as Nilli-new. The sampling function analysis shows that the action heuristic achieves higher expected scores with the Boltzmann distribution, which was tailored specifically for this heuristic. The higher accuracy of the Nilli heuristics translates into higher expected scores with the epsilon-greedy sampling function.


What is important with HEF is that such questions are easy to formulate – all a user has to do is define how the new metric should evaluate a single state, and the framework will take care of data management, aggregation, and visualization.
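As an illustration of this division of labour only (a hypothetical interface, not HEF’s actual API), a user-defined metric could be a plain per-state function that the framework applies to every benchmark state and aggregates by depth:

# Hypothetical sketch: the user supplies per_state_metric; the "framework"
# part here only aggregates results per depth.
from collections import defaultdict
from statistics import mean

def run_metric(benchmark_states, heuristic, per_state_metric):
    """benchmark_states: iterable of (depth, state, move_values) tuples.
    heuristic: function state -> dict of move scores.
    per_state_metric: function (heuristic_scores, move_values) -> number."""
    results = defaultdict(list)
    for depth, state, move_values in benchmark_states:
        results[depth].append(per_state_metric(heuristic(state), move_values))
    return {depth: mean(values) for depth, values in sorted(results.items())}

# Example: average heuristic score assigned to the benchmark-best move, by depth.
states = [(0, "s0", {"a": 100, "b": 0}), (1, "s1", {"a": 0, "b": 100})]
heuristic = lambda state: {"a": 0.7, "b": 0.3}        # placeholder heuristic
metric = lambda hs, mv: hs[max(mv, key=mv.get)]
print(run_metric(states, heuristic, metric))          # {0: 0.7, 1: 0.3}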


Chapter 6

Conclusions

In this thesis we proposed a paradigm for evaluating game heuristic functions and presented a framework that implements it. The main goal of this paradigm is to provide a fast and focused evaluation method for the intermediate stages of heuristic function design. This is achieved by pre-calculating a benchmark dataset containing a set of states that have a value (ideally the minimax value) assigned to each action, and using this dataset to answer questions about the heuristic’s performance – such as evaluating the accuracy with which it identifies the best move. The Heuristic Function Evaluation Framework provides all the infrastructure needed to work with this paradigm, from benchmark generation to data visualization; the users only need to formulate their questions in the form of metrics which verify the heuristic’s answers against the benchmark values.

Since generating minimax-value benchmarks can often be very difficult, we proposed using the Monte Carlo Tree Search algorithm to generate datasets which are less accurate but easily obtainable. We compared the metric results from MCTS datasets to those from a game-theoretical dataset that we were able to obtain for the game of Connect Four, looking at how accurately the MCTS dataset models the GT one. We found that MCTS datasets perform well on metrics that limit their analysis to the first few best moves of a state, while we do not recommend using them with metrics which rely on the correct ordering of all moves, or on exact minimax values in the benchmark.

We showed how the HEF metrics can be used to study heuristics. We examined the action heuristic, designed for GGP, across five games using the MCTS datasets. We found that the accuracy metrics clearly predict the best and worst performing games. Connect Four surprised us, as we saw that the heuristic achieved good scores despite the poor performance of the player using it; this is useful information, though, as it allows us to rule out the heuristic as the sole source of the performance issue. We also examined two versions of a heuristic that students wrote for Connect Four. The first version did not capture the structure of the game as well as the second one did, and this was reflected in the metric scores, in which the second version dominated the first.

We consider the results of this thesis encouraging. We have shown that the evaluation paradigm is capable of describing and predicting the behavior of heuristic functions, and the fact that MCTS datasets can be used for this kind of analysis makes HEF a versatile tool that can simplify heuristic function studies.

6.1 Future Work

We were only able to create a GT dataset for the game of Connect Four. While the results of evaluating the MCTS datasets for Connect Four look promising, it is still necessary to repeat the evaluation on several different games before we can be confident about our findings. The action heuristic case study relied heavily on our assumption that the results from Connect Four can be extended to other games. The data produced by the metrics suggests that this assumption is not wrong, but we would like to extend the MCTS dataset evaluation study to more games in the future.

We also note that currently we leave the responsibility of deciding the significance of the various metrics for a given application to the user of the paradigm. For example, the move categorization metrics are not overly relevant for heuristics intended to find the best moves for MCTS random simulations, so achieving a high score on these metrics does not mean the heuristic will perform well in the actual MCTS search. In the future, we would like to investigate methods for determining the correlation between the scores a metric assigns to a heuristic and the actual performance of the program on relevant problems.

Another thing that we would like to examine further is how to extend HEF analysis to dynamic heuristics, as mentioned in Chapter 3, since these heuristics are quite popular. A naive way of integrating them into HEF would be to simply simulate a playthrough during the heuristic’s initialization phase and use the final form of the heuristic from that simulation for the analysis. The issue with doing so is that we do not get to evaluate the intermediate forms of the heuristic. It would perhaps be better to support snapshotting the heuristic during the playthrough and matching the snapshots to benchmark states of corresponding depth. The next problem that poses itself is that dynamic heuristics adapt to what they experience during the playthrough, so different playthroughs could build the heuristic differently. HEF would, however, benefit from a study of this type of heuristics.


Bibliography

[1] T. Anantharaman, “Confidently selecting a search heuristic,” ICCA Journal, vol. 14, no. 1, pp. 3–16, 1991.

[2] R. Coulom, “Efficient selectivity and backup operators in Monte-Carlo tree search,”in Computers and games, pp. 72–83, Springer, 2007.

[3] L. Kocsis and C. Szepesvári, “Bandit based Monte-Carlo planning,” in Machine Learning: ECML 2006, pp. 282–293, Springer, 2006.

[4] M. Campbell, A. J. Hoane, and F.-h. Hsu, “Deep Blue,” Artificial intelligence,vol. 134, no. 1, pp. 57–83, 2002.

[5] J. McCarthy, “AI as sport,” Science, vol. 276, no. 5318, pp. 1518–1519, 1997.

[6] M. Genesereth, N. Love, and B. Pell, “General game playing: Overview of the AAAI competition,” AI Magazine, vol. 26, no. 2, p. 62, 2005.

[7] M. Thielscher, “A general game description language for incomplete information games,” in AAAI, vol. 10, pp. 994–999, Citeseer, 2010.

[8] N. Love, T. Hinrichs, D. Haley, E. Schkufza, and M. Genesereth, “General gameplaying: Game description language specification,” 2008.

[9] N. Metropolis, “The beginning of the Monte Carlo method,” Los Alamos Science,vol. 15, no. 584, pp. 125–130, 1987.

[10] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A survey of Monte Carlo tree search methods,” Computational Intelligence and AI in Games, IEEE Transactions on, vol. 4, no. 1, pp. 1–43, 2012.

[11] M. H. Kalos and P. A. Whitlock, Monte Carlo methods. John Wiley & Sons, 2008.


[12] B. Abramson, “Expected-outcome: A general model of static evaluation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 12, no. 2, pp. 182–193, 1990.

[13] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in applied mathematics, vol. 6, no. 1, pp. 4–22, 1985.

[14] G. Chaslot, S. Bakkes, I. Szita, and P. Spronck, “Monte-Carlo tree search: A new framework for game AI,” in AIIDE, 2008.

[15] S. Gelly and Y. Wang, “Exploration exploitation in Go: UCT for Monte-Carlo Go,” in NIPS: Neural Information Processing Systems Conference On-line trading of Exploration and Exploitation Workshop, (Canada), Dec. 2006.

[16] Y. Björnsson and H. Finnsson, “Cadiaplayer: A simulation-based general game player,” Computational Intelligence and AI in Games, IEEE Transactions on, vol. 1, no. 1, pp. 4–15, 2009.

[17] D. R. Prasanna, Dependency injection. Manning Publications Co., 2009.

[18] J. Tromp, “John’s connect four playground.” https://tromp.github.io/c4/c4.html. Accessed: 2015-11-24.

[19] G. Holmes, A. Donkin, and I. H. Witten, “Weka: A machine learning workbench,” in Intelligent Information Systems, 1994. Proceedings of the 1994 Second Australian and New Zealand Conference on, pp. 357–361, IEEE, 1994.

[20] M. Trutman and S. Schiffel, “Creating action heuristics for general game playingagents,” in The IJCAI-15 Workshop on General Game Playing, p. 39, 2015.

[21] A. F. Bjarnason, A. S. Guðmundsson, M. Jónsson, and S. B. Stefánsdóttir. PersonalCommunications, 2015.

[22] S. Schiffel and M. Thielscher, “Fluxplayer: A successful general game player,” inAAAI, vol. 7, pp. 1191–1196, 2007.

[23] H. Finnsson and Y. Björnsson, “Cadiaplayer: Search-control techniques,” KI-Künstliche Intelligenz, vol. 25, no. 1, pp. 9–16, 2011.

[24] M. Enzenberger, M. Müller, B. Arneson, and R. Segal, “Fuego—an open-sourceframework for board games and Go engine based on Monte Carlo tree search,”Computational Intelligence and AI in Games, IEEE Transactions on, vol. 2, no. 4,pp. 259–270, 2010.


[25] B. Bouzy and G. Chaslot, “Monte-Carlo Go reinforcement learning experiments,”in Computational Intelligence and Games, 2006 IEEE Symposium on, pp. 187–194,IEEE, 2006.

[26] B. Bouzy, “Associating domain-dependent knowledge and Monte Carlo approacheswithin a Go program,” Information Sciences, vol. 175, no. 4, pp. 247–257, 2005.

[27] S. Gelly and D. Silver, “Combining online and offline knowledge in UCT,” in Proceedings of the 24th international conference on Machine learning, pp. 273–280, ACM, 2007.

School of Computer Science
Reykjavík University
Menntavegi 1
101 Reykjavík, Iceland
Tel. +354 599 6200
Fax +354 599 6201
www.reykjavikuniversity.is
ISSN 1670-8539