performance and prediction: bayesian modelling of fallible choice in chess guy haworth

ACG12 Performance and Prediction, 2009-05-11

Performance and Prediction: Bayesian Modelling of Fallible Choice in Chess

Guy Haworth



Topics ….

Motivation

Reference Fallible Players E(c)

In the Endgame Table Zone (ETZ)

Prior to the Endgame Table Zone

A set of hypotheses {Hk} about engines {E(ck)}

Bayesian Inference, given a choice of hypotheses, and evidence:

Prior belief, posterior belief, Prob [Hk]

Translating the 'Reference Player' idea to the pre-ETZ

Results … differentiation, value of small samples,

Motivation

Assess decision makers when they are under pressure

Need a Utopian Decision Maker, a Reference Agent (RA)

Finite set of choices, each with some Utility Value

A 'model world' is used to define the Utility Value

RA always makes the choice with the best Utility Value

RA is then deskilled to make Reference Fallible Agents (RFAs)

RFA does not always make the best choice

{RFA} is the Space of Reference Fallible Agents (SRFA)

Now we take a human decision maker H… and associate them with some profile in SRFA

… by hypothesising that they are one of the RFAs and

… weighing the evidence to decide how likely each RFA actually is


1.Kc2, Kc1 or Ka1?


Mate in d = 23 with 1. Kc2

Mate in d = 24 with 1. Kc1

Mate in d = 29 with 1. Ka1

A Chess Engine E chooses 1. Kc2

A stochastic version of E may not

Let E(c) be a stochastic engine:

Likelihood[E(c) moves to p, depth d] = (1 + d)^(-c) ∝ Prob[E(c), p]

c = 0: all moves equally likely

c = ∞: only best moves played
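The choice rule for E(c) can be sketched in a few lines of Python. This is a minimal illustration, assuming each move gets weight (1 + d)^(-c) and that the weights are normalised over the available moves; the function name `move_probabilities` is hypothetical and not taken from the talk.

```python
def move_probabilities(depths, c):
    """Probability that E(c) picks each move, given depths to mate.

    Assumes weight (1 + d) ** -c per move, normalised to sum to 1.
    """
    weights = [(1 + d) ** -c for d in depths]
    total = sum(weights)
    return [w / total for w in weights]


# Depths to mate after 1.Kc2, 1.Kc1 and 1.Ka1 in the example position.
depths = [23, 24, 29]
for c in (0, 5, 20):
    probs = move_probabilities(depths, c)
    print(c, [round(p, 2) for p in probs])
```

With c = 0 the three moves come out equally likely, and the share of 1.Kc2 grows with c; under this reading it is about 0.47 for c = 5.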

[Figure: charts of the move-choice probabilities for Kc2 (d = 23), Kc1 (d = 24) and Ka1 (d = 29) under c = 0, c = 5 and c = 20]

Which Engine … is playing the moves?

Suppose you see a sequence of player P's moves in KQKR

You are told that they are being played by some engine E(c)

You are told that it is one of E(0), E(5) or E(20)

Which agent, A, is it: E(0), E(5) or E(20)? What would be fair odds?

If you 'know nothing' (as you do) at the beginning … Prob[A = E(i)] = 1/3

Let's suppose you see a sequence of optimal moves

You should start to lean away from E(0) and towards E(20)

But what are the probabilities now? No need to guess …

Bayes' Rule tells you exactly what the new probabilities are


Bayes' Rule

Prob[Hypothesis | Evidence] ∝ Prob[Hypothesis] × Prob[Evidence | Hypothesis]

We have a choice of three hypotheses:

H0 "E = E(0)", H5 "E = E(5)", H20 "E = E(20)"

Prob[Hi] = 1/3 = 0.33 = the prior probability, i.e. before Kc2 is seen

Prob[E(0), Kc2] = 1/3 = 0.33; Prob[E(5), Kc2] = 0.47; Prob[E(20), Kc2] = 0.70

Prob[H0 | Kc2] ∝ 0.33 × 0.33 = 0.11 … etc (0.16, 0.23 … sum 0.50)

Scaling … Prob[H0 | Kc2] = 0.22 = the posterior probability

Prob[H5 | Kc2] = 0.31 and Prob[H20 | Kc2] = 0.47
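The scaling step can be reproduced with a short Python sketch. It assumes the three likelihoods Prob[E(c), Kc2] exactly as quoted on the slide; the helper name `posterior` is hypothetical.

```python
def posterior(priors, likelihoods):
    """Bayes' Rule: posterior is proportional to prior * likelihood."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    return {h: u / total for h, u in unnormalised.items()}


# Prob[E(c), Kc2] for each hypothesis, as quoted on the slide.
likelihood_kc2 = {"H0": 1 / 3, "H5": 0.47, "H20": 0.70}
uniform_prior = {h: 1 / 3 for h in likelihood_kc2}

post = posterior(uniform_prior, likelihood_kc2)
print({h: round(p, 2) for h, p in post.items()})
```

Rounded to two places this gives 0.22, 0.31 and 0.47 for H0, H5 and H20.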


The effect of Prior Probabilities

In the example above, the posterior probability of H ∝ Prob[Ev | H]

This is because the prior probability of H was 1/3 for all H

So the application of Bayes' Rule has been somewhat obscured

Suppose the priors were H0 0.2, H5 0.3, H20 0.5

Then the posterior probabilities are proportional to:

H0 : 0.2 × 0.33 = 0.066

H5 : 0.3 × 0.47 = 0.141

H20 : 0.5 × 0.70 = 0.350 … totalling 0.557 so we scale up to …

Prob[H0] = 0.066/0.557 = 0.12

Prob[H5] = 0.141/0.557 = 0.25; Prob[H20] = 0.350/0.557 = 0.63

So the posteriors were 0.22/0.31/0.47 … and are now 0.12/0.25/0.63
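When a sequence of moves is observed, the same rule can be applied repeatedly, with each posterior serving as the prior for the next move. The sketch below assumes the non-uniform priors and the Kc2 likelihoods quoted on the slides; the second observation is hypothetical and simply reuses the same likelihoods, and the name `update` is illustrative.

```python
def update(beliefs, likelihoods):
    """One Bayes step: multiply by the likelihoods, then renormalise."""
    scaled = [b * lik for b, lik in zip(beliefs, likelihoods)]
    total = sum(scaled)
    return [s / total for s in scaled]


hypotheses = ["H0", "H5", "H20"]
beliefs = [0.2, 0.3, 0.5]             # non-uniform priors from the slide
kc2_likelihoods = [0.33, 0.47, 0.70]  # Prob[E(c), Kc2] as quoted

first = update(beliefs, kc2_likelihoods)
# Hypothetical second observation with the same likelihoods,
# e.g. another optimal move in a similar position.
second = update(first, kc2_likelihoods)
print(dict(zip(hypotheses, [round(p, 2) for p in first])))
print(dict(zip(hypotheses, [round(p, 2) for p in second])))
```

The first step reproduces 0.12/0.25/0.63; each further optimal move shifts more belief towards H20.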


Rev. Bayes, Transform, Aeolian Harp


[Figure: Bayesian Inference over a (c1, c2) parameter space, with panels labelled 'PA: P A0' and 'PA: P A']

Refine model parameters

'Model Error' ║EPP – EPA║

"Let the Wind of Evidence blow through the Aeolian Harp of your Hypotheses"

Chess Engines as Benchmarks?


Engines are improving all the time: hardware, algorithms, knowledge

There is actually a danger that they may become too good

They are not infallible: 'best moves' are not necessarily best

q.v. changes of mind from one search depth to the next

However, greater depth of search ⇒ better engine [Beal]

Benchmark fallibility contributes statistical uncertainty to findings

Independence is also required: engine E cannot vote itself 'best'!

Using the idea on pre-EGT Chess

Idea is to use chess-engine evaluations {vi} rather than depths

Announced in 'Chess Endgame News', ICGA J. 28-4, 243 (2005)

However, this brings some complications:

Some evaluations, unlike Depths to Mate, are negative

The evaluations vi are computed using heuristics

Chess-engines' preferences are not infallible

Engines' preferences may vary engine-to-engine, depth-to-depth

Some intuitive observations:

A panel of engines is better than one engine as a benchmark

The better the engine and the greater the depth [Beal], the better

Uncertainty is halved by using four times the data


Performance v Skill Rating


Player  move 1  move 2  …  move n1    outcome
White   e2-e4   …       …  Qf5#       1
Black   e7-e5   …       …  -          0

Player  move 1  move 2  …  move n2    outcome
White   d2-d4   …       …  g2-g3      0
Black   f7-f5   …       …  Bb5#       1

Player  move 1  move 2  …  move nk    outcome
White   e2-e4   …       …  c7-c8 Qc8  1
Black   e7-e6   …       …  -          0

…

performance rating (Elo) vs. skill rating

Player    Elo
Kasparov  2851
Karpov    2795
…         …

Stochastic choice, given position evaluations

At position p, moves mi to positions pi have evaluations vi

Can we say Likelihood[E(c), mi] = (1 + vi)^c ?

No, because some vi may be negative

Need a mapping v → w, s.t. ∀i, wi ≥ 0 and v1 > v2 ⇒ w1 < w2

Some functions w = C(v) are better than others!

The intuitively obvious wi = 1 + |v1| + (v1 – vi) is not ideal

The wi are analogous to the di taken from an Endgame Table

Currently using wi = κ + (v1 – vi) with κ > 0 … in fact κ = 0.1

Model choices, yet to be tested as to effect:

Choice of specific engine and search-depth

Choice of Mapping to, e.g., r1.E(∞) + r2.E(c) rather than to one engine E(c)
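The mapping from evaluations to depth-like values can be sketched as follows. This is only an illustration under stated assumptions: the offset (0.1 on the slide) is named `kappa` here, v1 is taken to be the best evaluation among the moves, and the choice rule `w ** -c` is assumed by analogy with the endgame-table case rather than quoted from the talk. The evaluations are hypothetical.

```python
KAPPA = 0.1  # offset value quoted on the slide; the name is assumed


def eval_to_w(evals, kappa=KAPPA):
    """Map engine evaluations v_i to positive values w_i.

    w_i = kappa + (v_1 - v_i), where v_1 is the best evaluation, so the
    best move gets the smallest w and every w_i is strictly positive.
    """
    best = max(evals)
    return [kappa + (best - v) for v in evals]


def move_probabilities(evals, c, kappa=KAPPA):
    """Assumed choice rule: Prob[m_i] proportional to w_i ** -c."""
    weights = [w ** -c for w in eval_to_w(evals, kappa)]
    total = sum(weights)
    return [x / total for x in weights]


# Hypothetical evaluations in pawns, including a negative one.
evals = [0.30, 0.10, -0.50]
print(eval_to_w(evals))
print(move_probabilities(evals, c=1.2))
```

Because only differences from the best evaluation enter, negative evaluations cause no difficulty, and the ordering requirement (better v, smaller w) holds by construction.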


Results … based on TOGA II (depth 10)

Measured:

Performance against Kaissa rather than opponent

An absolute rather than relative measure, given the benchmark

A measure not affected by the opponent's performance

m data points per game, rather than one (the game result)

Spectroscope!

Virtual players at different ELO can be differentiated

Higher ELO ⇒ higher apparent competence c

Winners visibly relax and play for the result when closing out

Best performance-indicators are drawn games between like players

Performance, pre- and post-ELO, assessed in same terms

Epochs of performance compared, pre- and post- cheating accusations

Games tracked in 2D not 1D (net advantage); games compared

Absolute performance of ELO 2400 players tracked across time


Virtual ELO Players: Data


#  Player    Elo_min  Elo_max  Period     Games  Pos.    c_min  c_max  μ_c     σ_c     σ_c × Pos^½
1  Elo_2100  2090     2110     1994-1998  217    12,751  1.04   1.10   1.0660  .00997  1.126
2  Elo_2200  2190     2210     1971-1998  569    29,611  1.11   1.15   1.1285  .00678  1.167
3  Elo_2300  2290     2310     1971-2005  568    30,070  1.14   1.18   1.1605  .00694  1.203
4  Elo_2400  2390     2410     1971-2006  603    31,077  1.21   1.25   1.2277  .00711  1.253
5  Elo_2500  2490     2510     1995-2006  636    30,168  1.25   1.29   1.2722  .00747  1.297
6  Elo_2600  2590     2610     1995-2006  615    30,084  1.27   1.33   1.2971  .00770  1.336
7  Elo_2700  2690     2710     1991-2006  225    13,796  1.29   1.35   1.3233  .01142  1.341
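The last column rescales the standard deviation on c by the square root of the number of positions, which puts samples of different size on a common footing: if the standard deviation shrinks like 1/√Pos, four times the data halves it. A minimal check of that column, using the position counts and standard deviations from the table (the variable names are illustrative):

```python
from math import sqrt

# (label, positions, sigma_c, tabulated sigma_c * sqrt(positions))
rows = [
    ("Elo_2100", 12751, 0.00997, 1.126),
    ("Elo_2200", 29611, 0.00678, 1.167),
    ("Elo_2300", 30070, 0.00694, 1.203),
    ("Elo_2400", 31077, 0.00711, 1.253),
    ("Elo_2500", 30168, 0.00747, 1.297),
    ("Elo_2600", 30084, 0.00770, 1.336),
    ("Elo_2700", 13796, 0.01142, 1.341),
]

for label, positions, sigma_c, tabulated in rows:
    scaled = sigma_c * sqrt(positions)
    print(f"{label}: {scaled:.3f} (table: {tabulated})")
```

The recomputed values agree with the tabulated column to within rounding.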

Profile of Virtual ELO Players in c-space


[Figure: Prob[c] for different Elo ranges; horizontal axis c from 1 to 1.4, vertical axis Prob from 0 to 1; one curve each for E2100, E2200, E2300, E2400, E2500, E2600 and E2700]
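A profile in c-space of this kind can be obtained by treating each candidate value of c as a hypothesis and updating after every observed move. The sketch below is an assumption-laden illustration, not the talk's implementation: it uses the depth-based rule (1 + d)^(-c) for simplicity, whereas the results shown rest on engine evaluations; the grid, the observations and all function names are hypothetical.

```python
def move_probabilities(depths, c):
    """Choice probabilities for E(c): weight (1 + d) ** -c, normalised."""
    weights = [(1 + d) ** -c for d in depths]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_over_c(observations, c_grid):
    """Posterior Prob[c] on a grid, starting from a uniform prior.

    observations: list of (depths, chosen_index) pairs, one per move.
    """
    beliefs = [1.0 / len(c_grid)] * len(c_grid)
    for depths, chosen in observations:
        beliefs = [
            b * move_probabilities(depths, c)[chosen]
            for b, c in zip(beliefs, c_grid)
        ]
        total = sum(beliefs)
        beliefs = [b / total for b in beliefs]
    return beliefs


def mean_and_sd(c_grid, beliefs):
    """Posterior mean and standard deviation of c."""
    mean = sum(c * b for c, b in zip(c_grid, beliefs))
    var = sum(b * (c - mean) ** 2 for c, b in zip(c_grid, beliefs))
    return mean, var ** 0.5


# Hypothetical data: two optimal choices and one second-best choice.
observations = [([23, 24, 29], 0), ([10, 12, 15], 0), ([7, 9], 1)]
c_grid = [0.0, 5.0, 10.0, 15.0, 20.0]
beliefs = posterior_over_c(observations, c_grid)
print(mean_and_sd(c_grid, beliefs))
```

The mean and standard deviation returned here play the role of a summary of the profile; more observations narrow the distribution.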

Keres -v- The Rest (1948); D.P.Singh (2006)

WCC (1948): Keres 0 Botvinnik 4


D.P.Singh v Opponents

D.P.Singh – 'before' and 'after'


Two 6-month periods

Not as conclusive as it appears

'c'-tracking across the whole period

Allwermann-Kalinitschew (1998)

Variation in c for both players;

Track locus of game in 2D


Standard 1-dimensional charting of the game

Summary

'Contextual Analysis' (CA) of the individual player's decisions

'Decision Matching' (DM) uses less information and is cruder

'Average Differencing' (AD) uses less information: ditto

CA successfully differentiates players of different ELOs

the standard deviation on c gives an idea of differentiator-power

expect CA to be a better differentiator than AD

… and expect AD to be better than DM

CA, using Bayesian Analysis, applied to:

Career and epoch analysis, tournament and game analysis

Future directions:

Evolving the method, including deeper statistical treatment

Applying it to other chess and non-chess scenarios



