
Zipf’s Law in the Popularity Distribution of Chess Openings

Bernd Blasius¹ and Ralf Tönjes²

¹ICBM, University Oldenburg, 26111 Oldenburg, Germany
²Institut für Physik, Universität Potsdam, 14415 Potsdam, Germany, and Ochadai Academic Production, Ochanomizu University, Tokyo 112-8610, Japan

(Received 28 February 2008; revised manuscript received 4 August 2009; published 16 November 2009)

We perform a quantitative analysis of extensive chess databases and show that the frequencies of opening moves are distributed according to a power law with an exponent that increases linearly with the game depth, whereas the pooled distribution of all opening weights follows Zipf’s law with universal exponent. We propose a simple stochastic process that is able to capture the observed playing statistics and show that the Zipf law arises from the self-similar nature of the game tree of chess. Thus, in the case of hierarchical fragmentation the scaling is truly universal and independent of a particular generating mechanism. Our findings are of relevance in general processes with composite decisions.

DOI: 10.1103/PhysRevLett.103.218701

PACS numbers: 89.20.-a, 05.40.-a, 89.75.Da

Decision making refers to situations where individuals have to select a course of action among multiple alternatives [1]. Such processes are ubiquitous, ranging from one’s personal life to business, management, and politics, and have a large part in shaping our life and society. Decision making is an immensely complex process and, given the number of factors that influence each choice, a quantitative understanding in terms of statistical laws remains a difficult and often elusive goal. Investigations are complicated by the shortage of reliable data sets, since information about human behavior is often difficult to quantify and not easily available in large numbers, whereas decision processes typically involve a huge space of possible courses of action. Board games, such as chess, provide a well-documented case where the players in turn select their next move among a set of possible game continuations that are determined by the rules of the game.

Human fascination with the game of chess is long-standing and pervasive [2], not least due to the nearly infinite richness of the game. The total number of different games that can be played, i.e., the game-tree complexity of chess, has roughly been estimated as the average number of legal moves in a chess position to the power of the length of a typical game, yielding the Shannon number $30^{80} \approx 10^{120}$ [3]. Obviously only a small fraction of all possible games can be realized in actual play. But even during the first moves of a game, when the game complexity is still manageable, not all possibilities are explored equally often. While the history of successful initial moves has been classified in opening theory [4], not much is known about the mechanisms underlying the formation of fashionable openings [5]. With the recent appearance of extensive databases, playing habits have become accessible to quantitative analysis, making chess an ideal platform for analyzing human decision processes.

The set of all possible games can be represented by a directed graph whose nodes are game situations and whose edges correspond to legal moves from each position (Fig. 1). Every opening is represented by its move sequence as a directed path starting from the initial node. We will differentiate between two game situations if they are reached by different move sequences. This way the graph becomes a game tree, and each node $\sigma$ is uniquely associated with an opening sequence.

Using a chess database [6] we can measure the popularity $n_\sigma$, or weight, of every opening sequence as the number of occurrences in the database. We find that the weighted game tree of chess is self-similar and the frequencies $S(n)$ of weights follow a Zipf law [7]

$$S(n) \sim n^{-\alpha} \tag{1}$$

with universal exponent $\alpha = 2$. Note the precise scaling in the histogram of weight frequencies $S(n)$ and in the cumulative distribution $C(n)$ over the entire observable range [Fig. 2(a)]. Similar power law distributions with universal exponent have been identified in a large number of natural, economic, and social systems [7–15], a fact which has come to be known as the Zipf or Pareto law [7,8]. If we count only the frequencies $S_d(n)$ of opening weights $n_\sigma$ after the first $d$ moves we still find broad distributions consistent with power law behavior $S_d(n) \sim n^{-\alpha_d}$ [Fig. 2(b)]. The exponents $\alpha_d$ are not universal, however, but increase linearly with $d$ [Fig. 2(b), inset]. The results are robust: similar power laws could be observed in different databases and other board games, regardless of the considered game depth, constraints on player levels, or the decade when the games were played. Stretching over 6 orders of magnitude, the distributions reported here are among the most precise examples of power laws known today in social data sets.

PRL 103, 218701 (2009). Selected for a Viewpoint in Physics. PHYSICAL REVIEW LETTERS, week ending 20 November 2009. 0031-9007/09/103(21)/218701(4). © 2009 The American Physical Society.

As seen in Fig. 1, for each node $\sigma$ the weights of its subtrees define a partition of the integers $(1 \ldots n_\sigma)$. The assumption of self-similarity implies a statistical equivalence of the branching in the nodes of the tree. We can thus define the branching ratio distribution over the real interval $r \in [0, 1]$ by the probability $Q(r|n)$ that a random pick from the numbers $1 \ldots n$ is in a subset of size smaller than or equal to $rn$. Taking $n$ to infinity, $Q(r|n)$ may have a continuous limit $Q(r)$ for which we find the probability density function (PDF) $q(r) = Q'(r)$. If the limit distribution $q(r)$ of branching ratios exists it carries the fingerprint of the generating process. For instance, the continuum limit of the branching ratio distribution for a Yule-Simon preferential growth process [13] in each node of the tree would be $q(r) \sim r^{\beta}$, where $\beta < 0$ is a model specific parameter. On the other hand, in a $k$-ary tree where each game continuation has a uniformly distributed random a priori probability the continuum limit corresponds to a random stick breaking process in each node, yielding $q(r) \sim (1-r)^{k-2}$. For the weighted game tree of chess $q(r)$ can directly be measured from the database [Fig. 3(a)]. We find that $q(r)$ is remarkably constant over most of the interval but diverges with exponent 0.5 as $r \to 1$, and is very well fitted by the parameterless arcsine distribution

$$q(r) = \frac{2}{\pi\sqrt{1-r^2}}. \tag{2}$$
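The bookkeeping behind the weights $n_\sigma$, the frequencies $S(n)$, and the measured branching ratios can be illustrated with a minimal Python sketch. It assumes that each game is available as a list of move strings; the function names and the toy games are hypothetical and are not taken from the original analysis, which worked on sorted game strings from ScidBase [6].

```python
from collections import Counter

def opening_weights(games, max_depth):
    """Count the weight n_sigma of every move-sequence prefix up to max_depth."""
    weights = Counter()
    for moves in games:
        for d in range(1, min(max_depth, len(moves)) + 1):
            weights[tuple(moves[:d])] += 1
    return weights

def weight_frequencies(weights):
    """S(n): number of openings (tree nodes) that occur exactly n times."""
    return Counter(weights.values())

def branching_ratios(weights, n_games):
    """Yield r = n_d / n_{d-1} for every edge of the weighted game tree."""
    for prefix, n in weights.items():
        # The root node (empty prefix) carries all n_games games.
        parent = n_games if len(prefix) == 1 else weights[prefix[:-1]]
        yield n / parent

# Toy data (hypothetical), four games truncated after a few half moves.
games = [
    ["d4", "Nf6", "c4"],
    ["d4", "Nf6", "c4"],
    ["d4", "d5"],
    ["e4", "e5"],
]
weights = opening_weights(games, max_depth=3)
frequencies = weight_frequencies(weights)
ratios = sorted(branching_ratios(weights, len(games)))
```

To estimate $q(r)$ as in Fig. 3(a), each ratio would additionally be binned with weight $r$, and small clusters would be discarded; both steps are omitted here for brevity.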

The form of the branching ratio distribution suggests that in the case of chess there is no preferential growth process involved, but something entirely different which must be rooted in the decision process during the opening moves of a chess game [5].

FIG. 1 (color online). (a) Schematic representation of the weighted game tree of chess based on ScidBase [6] for the first three half moves. Each node indicates a state of the game. Possible game continuations are shown as solid lines together with the branching ratios $r_d$. Dotted lines symbolize other game continuations, which are not shown. (b) Alternative representation emphasizing the successive segmentation of the set of games, here indicated for games following a 1.d4 opening until the fourth half move $d = 4$. Each node $\sigma$ is represented by a box of a size proportional to its frequency $n_\sigma$. In the subsequent half move these games split into subsets (indicated vertically below) according to the possible game continuations. Highlighted in (a) and (b) is a popular opening sequence 1.d4 Nf6 2.c4 e6 (Indian defense).

[Figure 2. Panel (a): frequency $S(n)$ versus popularity $n$ on log-log axes (histogram, logbin, Zipf distribution), with inset $C(n)$ versus $n$. Panel (b): frequency $S_d(n)$ versus popularity $n$ on log-log axes ($d = 4$ logbin, $d = 16$ histogram, $d = 16$ logbin, $d = 22$ logbin), with inset $\alpha$ versus $d$.]

FIG. 2 (color online). (a) Histogram of weight frequencies $S(n)$ of openings up to $d = 40$ in the Scid database and with logarithmic binning. A straight line fit (not shown) yields an exponent of $\alpha = 2.05$ with a goodness of fit $R^2 > 0.9992$. For comparison, the Zipf distribution Eq. (8) with $\mu = 1$ is indicated as a solid line. Inset: number $C(n) = \sum_{m=n+1}^{N} S(m)$ of openings with a popularity $m > n$. $C(n)$ follows a power law with exponent $\alpha = 1.04$ ($R^2 = 0.994$). (b) Number $S_d(n)$ of openings of depth $d$ with a given popularity $n$ for $d = 16$ and histograms with logarithmic binning for $d = 4$, $d = 16$, and $d = 22$. Solid lines are regression lines to the logarithmically binned data ($R^2 > 0.99$ for $d < 35$). Inset: slope $\alpha_d$ of the regression line as a function of $d$ and the analytical estimation Eq. (6) using $N = 1.4 \times 10^6$ and $\beta = 0$ (solid line).

In the following we show that the asymptotic Zipf law in the weight frequencies arises independently of the specific form of the distribution $q(r)$ and, hence, of the microscopic rules of the underlying branching process. Consider $N$ realizations of a general self-similar random segmentation process of $N$ integers, with paths $(\sigma_0, \sigma_1, \ldots)$ in the corresponding weighted tree. In the context of chess each realization of this process corresponds to a random game from the database of $N$ games (e.g., dark shading in Fig. 1). The weights $n_d = n_{\sigma_d}$ describe a multiplicative random process

$$n_d = N \prod_{i=1}^{d} r_i, \qquad n_0 = N, \tag{3}$$

where the branching ratios $r_d = n_d / n_{d-1}$ for sufficiently large $n_d$ are distributed according to $q(r)$ independent of $d$. For lower values of $n_d$ the continuous branching ratio distribution is no longer a valid approximation and a node of weight one has at most one subtree; i.e., the state $n_d = 1$ is absorbing.
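The multiplicative process of Eq. (3) with the absorbing state $n_d = 1$ can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulation code: the branching ratios are drawn from the arcsine distribution Eq. (2) by inverse-CDF sampling, and rounding the product down to an integer cluster size (never below 1) is an assumption made here to keep the weights discrete. All function names are hypothetical.

```python
import math
import random

def sample_arcsine_ratio(rng):
    """Draw r from q(r) = 2 / (pi * sqrt(1 - r^2)) on [0, 1].

    The CDF is Q(r) = (2 / pi) * asin(r), so r = sin(pi * u / 2) for uniform u.
    """
    return math.sin(0.5 * math.pi * rng.random())

def simulate_path(n_games, depth, rng):
    """Follow one random game: n_0 = N and n_d = n_{d-1} * r_d, with n = 1 absorbing."""
    n = n_games
    weights = [n]
    for _ in range(depth):
        if n > 1:
            # Assumption: round down to an integer cluster size, never below 1.
            n = max(1, int(n * sample_arcsine_ratio(rng)))
        weights.append(n)
    return weights

def estimate_S_d(n_games, depth, n_paths, seed=0):
    """Estimate S_d(n) = N * p_d(n) / n from n_paths sampled paths."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_paths):
        n_d = simulate_path(n_games, depth, rng)[-1]
        counts[n_d] = counts.get(n_d, 0) + 1
    return {n: n_games * (c / n_paths) / n for n, c in sorted(counts.items())}
```

Because each path is size-biased (a random game lands in a cluster with probability proportional to its size), the estimated $p_d(n)$ is divided by $n$ to obtain the number of nodes, in line with $S_d(n) = N p_d(n)/n$.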

To calculate the PDF $p_d(n)$ of the random variable $n_d$ after $d$ steps it is convenient to consider the log-transformed variables $\nu = \log(N/n)$ and $\rho = -\log r$. The corresponding process $\{\nu_d\}$ is a random walk $\nu_d = \sum_{i=1}^{d} \rho_i$ with non-negative increments $\rho_i$ and its PDF $\pi_d(\nu)$ transforms as $n\,p_d(n) = \pi_d(\nu)$. An analytic solution can be obtained for the class

$$q(r) = (1+\beta)\,r^{\beta}, \qquad 0 \le r \le 1, \tag{4}$$

of power law distributions, which typically arise in preferential attachment schemes. In this case the jump process $\nu_d$ is Poissonian and distributed according to a gamma distribution $\pi_d(\nu) = \frac{(1+\beta)^d}{(d-1)!}\,\nu^{d-1} e^{-(1+\beta)\nu}$. After retransformation to the original variables, and noting that from the probability $p_d(n)$ for a single node at distance $d$ to the root to have the weight $n$ one obtains the expected number $S_d(n)$ of these nodes in $N$ realizations of the random process as $S_d(n) = N p_d(n)/n$, we find in particular

$$S_d(n) = \frac{(1+\beta)^d}{N\,(d-1)!} \left(\log\frac{N}{n}\right)^{d-1} \left(\frac{N}{n}\right)^{1-\beta}. \tag{5}$$

The functions $S_d(n)$ are strongly skewed and can exhibit power-law-like scaling over several decades. A logarithmic expansion for $1 < n \ll N$ shows that they approximately follow a scaling law $S_d(n) \sim n^{-\alpha_d}$ with exponent

$$\alpha_d = (1-\beta) + \frac{1}{\log N}\,(d-1). \tag{6}$$
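The logarithmic expansion behind Eq. (6) can be spelled out as follows. This is a sketch reconstructed from Eq. (5); the auxiliary variable $x = \log n$ is introduced here for convenience and does not appear in the original text.

```latex
% Auxiliary variable (an assumption of this sketch): x = \log n, so \log(N/n) = \log N - x.
\begin{align*}
\log S_d(n) &= \mathrm{const} + (d-1)\,\log(\log N - x) - (1-\beta)\,x, \\
\frac{\mathrm{d}\log S_d}{\mathrm{d}x}
  &= -(1-\beta) - \frac{d-1}{\log N - x} \\
  &= -(1-\beta) - \frac{d-1}{\log N}\left(1 + \frac{x}{\log N} + \dots\right)
   \;\approx\; -\alpha_d \quad \text{for } x \ll \log N .
\end{align*}
```

To leading order in $\log n / \log N$ the local slope in the log-log plot is therefore $-\alpha_d$, and the neglected terms make the curve bend downward as $n$ approaches $N$.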

The exponent $\alpha_d$ increases linearly with the game depth $d$, with a logarithmic finite-size correction, which is in excellent agreement with the chess database [Fig. 2(b), inset]. Power laws in the stationary distribution of random segmentation and multiplicative processes have been reported before [9] and can be obtained by introducing slight modifications, such as reflecting boundaries, frozen segments, merging, or reset events [16–18]. In contrast, the approximate scaling of $S_d(n)$ in Eq. (5) is fundamentally different, as our process does not admit a stationary distribution. The exponents $\alpha_d$ increase due to the finite size of the database.

As shown in Fig. 3(b) we find excellent agreement between the weight frequencies $S_d(n)$ in the chess database and direct simulations of the multiplicative process, Eq. (3), using the arcsine distribution Eq. (2). If the branching ratios are approximated by a uniform distribution $q(r) = 1$ the predicted values of $S_d(n)$ are systematically too small, since a uniform distribution yields a larger flow into the absorbing state $n_\sigma = 1$ than observed in the database. Still, due to the asymptotic behavior of $q(r)$ for $r \to 0$, this approximation yields the correct slope in the log-log plot so that the exponent $\alpha_d$ can be estimated quite well based on Eq. (6) with $\beta = 0$.

By observing that $S_d(n)$ in Eq. (5) is the $d$th term in a series expansion of an exponential function, we find the weight distribution in the whole game tree as $S(n) = \sum_d S_d(n)$ to be an exact Zipf law. For branching ratio distributions $q(r)$ different from Eq. (4) the weight frequencies are difficult to obtain analytically. But using renewal theory [19] the scaling can be shown to hold asymptotically for $n \ll N$ and a large class of distributions $q(r)$. For this, note that the random variable $\tau(\nu) = \max(d : \nu_d < \nu)$ is a renewal process in $\nu$. The expectation $E[\tau(\nu)]$ is the corresponding renewal function, related to the distributions of the $\nu_d$ as $\sum_{d=1}^{\infty} \mathrm{Prob}(\nu_d < \nu) = E[\tau(\nu)]$. If the expected value $\mu = E[\rho] = E[-\log r]$ is finite and positive [e.g., for the distribution (4) $\mu = 1/(1+\beta)$], the renewal theorem provides

$$\lim_{\nu\to\infty} \frac{d}{d\nu} E[\tau(\nu)] = \frac{1}{\mu}. \tag{7}$$

Thus, we obtain $\lim_{\nu\to\infty} \sum_{d=1}^{\infty} \pi_d(\nu) = 1/\mu$ and finally

$$\lim_{(n/N)\to 0} S(n) = \frac{N}{\mu n^2}. \tag{8}$$
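For the power law class Eq. (4), the statement that the depth-resolved frequencies of Eq. (5) sum to a Zipf law can be checked numerically. The following Python sketch evaluates Eq. (5) in log space and compares the sum over $d$ with $N/(\mu n^2)$, using $\mu = 1/(1+\beta)$. The function names and the truncation depth `max_depth` are choices made for this illustration, and the check requires $n < N$.

```python
import math

def S_d(n, d, n_games, beta=0.0):
    """Eq. (5), evaluated in log space to avoid overflow of (d - 1)!. Requires n < N."""
    nu = math.log(n_games / n)
    log_s = (d * math.log(1.0 + beta) - math.log(n_games) - math.lgamma(d)
             + (d - 1) * math.log(nu) + (1.0 - beta) * nu)
    return math.exp(log_s)

def S_total(n, n_games, beta=0.0, max_depth=400):
    """Sum of S_d(n) over depths d = 1 .. max_depth (truncated series)."""
    return sum(S_d(n, d, n_games, beta) for d in range(1, max_depth + 1))

def zipf(n, n_games, beta=0.0):
    """Right-hand side of Eq. (8) with mu = 1 / (1 + beta)."""
    mu = 1.0 / (1.0 + beta)
    return n_games / (mu * n ** 2)
```

The series converges quickly because the terms peak near $d \approx (1+\beta)\log(N/n)$, so a few hundred terms are ample for database sizes of the order considered here.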

Thus, the multiplicative random process [Eq. (3)] with any well-behaved branching ratio distribution $q(r)$ on the interval [0, 1] always leads to an asymptotically universal scaling for $n \ll N$ [compare also the excellent fit of Eq. (8) to the chess data in Fig. 2(a)]. In [15] the same Zipf-law scaling was found for the sizes of the directory trees in a computer cluster. The authors propose a growing mechanism based on linear preferential attachment. Here we have shown that the exponent $\alpha = 2$ for the weight distribution of subtrees in a self-similar tree is truly universal in the sense that it is the same for a much larger class of generating processes and not restricted to preferential attachment or growing.

[Figure 3. Panel (a): probability density $q(r)$ versus branching ratio $r$. Panel (b): frequency versus popularity $n$ on log-log axes (database $d = 22$, simulation arcsine, simulation uniform, theory uniform).]

FIG. 3 (color online). (a) Probability density $q(r)$ of branching ratios $r$ sampled from all games in the Scid database with a bin size of $\Delta r = 0.01$ and arcsine distribution Eq. (2) (black solid line). Every edge of the weighted game tree, from nodes of size $n_{d-1}$ to $n_d$, contributes to the bin corresponding to $r = n_d / n_{d-1}$ with weight $r$. We disregarded clusters with $n_d < 100$ so that, in principle, a cluster could contribute to any of the bins. We found $q(r)$ to be depth independent. (b) Distribution of opening popularities $S_d(n)$ for $d = 22$ obtained from the Scid database (black) and from a direct simulation of the multiplicative process Eq. (3), with branching ratios taken from a uniform or arcsine distribution $q(r)$. Further indicated is the theoretical result Eq. (5) (dashed line). Similar results are obtained for other values of $d$.

There are direct implications of our theory for general composite decision processes, where each action is assembled from a sequence of $d$ mutually exclusive choices. What in chess corresponds to an opening sequence may be a multivariate strategy or a customized ordering in other situations. The question of how such strategies are distributed is important for management and marketing [20]. One consequence of our theory is that in a process of $d$ composite decisions the distribution $S_d(n) \sim n^{-\alpha_d}$ of decision sequences, or strategies, which occur $n$ times shows a transition from low exponents $\alpha_d \le 2$, where a few strategies are very common, to higher exponents $\alpha_d > 2$, where individual strategies are dominating. This is due to the divergence of the first moment in power laws with exponents smaller than 2 [11]. From Eq. (6), the critical number $d_{cr}$ of decisions at which this transition occurs depends logarithmically on the sample size $N$ and on the leading order $\beta$ of $q(r)$ near zero as

$$d_{cr} = 1 + (1+\beta)\log N. \tag{9}$$

Applied to the chess database with $N = 1.4 \times 10^6$ we obtain $d_{cr} \approx 15$ [see also Figs. 2(b), inset, and 4]. This separates the database into two very different regimes: in their initial phase ($d < d_{cr}$) the majority of chess games are distributed among a small number of fashionable openings (for $d = 12$, for example, 80% of all games in the database are concentrated in about 23% of the most popular openings), whereas beyond the critical game depth rarely used move sequences are dominating such that in aggregate they comprise the majority of all games (Fig. 4). Note that this result arises from the statistics of iterated decisions and does not indicate a crossover of playing behavior with increasing game depth.
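The numbers quoted above follow directly from Eqs. (6) and (9) with the natural logarithm. A short Python sketch, with hypothetical function names, reproduces the estimate for the database size $N = 1.4 \times 10^6$ and $\beta = 0$:

```python
import math

def alpha_d(d, n_games, beta=0.0):
    """Exponent of S_d(n) from Eq. (6), using the natural logarithm."""
    return (1.0 - beta) + (d - 1) / math.log(n_games)

def critical_depth(n_games, beta=0.0):
    """Depth d_cr from Eq. (9), at which alpha_d reaches 2."""
    return 1.0 + (1.0 + beta) * math.log(n_games)

N = 1.4e6
d_cr = critical_depth(N)          # about 15.15, i.e. d_cr ~ 15
alpha_at_12 = alpha_d(12, N)      # below 2: a few openings dominate
alpha_at_20 = alpha_d(20, N)      # above 2: rare sequences dominate
```

Because $d_{cr}$ grows only logarithmically with $N$, a tenfold larger database would shift the transition by roughly $\log 10 \approx 2.3$ half moves for $\beta = 0$.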

Our study suggests the analysis of board games as a promising new perspective for statistical physics. The enormous amount of information contained in game databases, with its evolution resolved in time and in relation to an evolving network of players, provides a rich environment to study the formation of fashions and collective behavior in social systems.

We are indebted to Andriy Bandrivskyy for invaluable help with the data analysis.

[1] H. Simon, Administrative Behaviour (Macmillan, New York, 1947); I. L. Janis and L. Mann, Decision Making: A Psychological Analysis of Conflict, Choice, and Commitment (Free Press, New York, 1977).
[2] H. J. R. Murray, A History of Chess (Oxford University Press, London, 2002), reprint.
[3] C. E. Shannon, Philos. Mag. 41, 256 (1950).
[4] The Encyclopedia of Chess Openings A–E, edited by A. Matanovic (Chess Informant, Beograd, Serbia, 2001), 4th ed.
[5] M. Levene and J. Bar-Ilan, Computer Journal (UK) 50, 567 (2007).
[6] Here we present results based on ScidBase (http://scid.sourceforge.net) with $N = 1.4 \times 10^6$ recorded games. Each game was uniquely coded by a string (up to $d = 40$ half moves) and the game strings were sorted alphabetically, so that all games following the same move sequence up to a given game depth $d$ were grouped together in clusters. Popularities were obtained by counting the cluster sizes $n_d$.
[7] G. K. Zipf, Human Behaviour and the Principle of Least-Effort (Addison-Wesley, Cambridge, 1949).
[8] V. Pareto, Cours d’Economie Politique (Droz, Geneva, 1896).
[9] D. Sornette, Critical Phenomena in Natural Sciences (Springer, Heidelberg, 2003), 2nd ed.
[10] M. Mitzenmacher, Internet Math. 1, 226 (2004).
[11] M. E. J. Newman, Contemp. Phys. 46, 323 (2005).
[12] J. Willis and G. Yule, Nature (London) 109, 177 (1922).
[13] H. A. Simon, Biometrika 42, 425 (1955).
[14] R. A. K. Cox et al., J. Cult. Econ. 19, 333 (1995); S. Redner, Eur. Phys. J. B 4, 131 (1998); A. L. Barabasi and R. Albert, Science 286, 509 (1999); R. L. Axtell, Science 293, 1818 (2001); X. Gabaix et al., Nature (London) 423, 267 (2003).
[15] K. Klemm et al., Phys. Rev. Lett. 95, 128701 (2005).
[16] H. Kesten, Acta Math. 131, 207 (1973); D. Sornette, Phys. Rev. E 57, 4811 (1998); D. Sornette and R. Cont, J. Phys. I 7, 431 (1997); S. C. Manrubia and D. H. Zanette, Phys. Rev. E 59, 4945 (1999).
[17] P. L. Krapivsky et al., Phys. Rev. E 61, R993 (2000).
[18] J. R. Banavar et al., Phys. Rev. E 69, 036123 (2004).
[19] W. Feller, An Introduction to Probability Theory and Its Applications (John Wiley & Sons, New York, 1971), Vol. 2.
[20] C. Anderson, The Long Tail: Why the Future of Business Is Selling Less of More (Hyperion, New York, 2006).
[21] J. L. Gastwirth, Rev. Econ. Stat. 54, 306 (1972).

[Figure 4. Panel (a): fraction of openings $Q$ versus fraction of games $W$ for $d = 5, 10, 15, 20, 25$. Panel (b): fraction of openings $Q$ versus game depth $d$ for $W = 0.5$, $W = 0.8$, and $W = 0.9$, together with the Gini coefficient $G$.]

FIG. 4. Inequality [11,21] of the distribution $S_d(n)$. (a) Proportion $W$ of games that is concentrated in the fraction $Q$ of the most popular openings, for several levels of the game depth $d$. (b) $Q$ as a function of $d$ for three different values of $W$ (solid lines) and Gini coefficient $G = 1 - 2\int_0^1 Q(W)\,dW$ as a function of game depth (dotted line).
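The inequality measures of Fig. 4 can be computed from a list of opening weights at fixed depth. The Python sketch below is an illustration under stated assumptions, not the authors' procedure: it builds the curve $Q(W)$ by adding openings in order of decreasing popularity and evaluates $G = 1 - 2\int_0^1 Q(W)\,dW$ with the trapezoidal rule. The function names and the example weights are hypothetical.

```python
def concentration_curve(weights):
    """Points (W, Q): fraction W of games covered by the fraction Q of most popular openings."""
    ordered = sorted(weights, reverse=True)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    covered = 0
    for k, n in enumerate(ordered, start=1):
        covered += n
        points.append((covered / total, k / len(ordered)))
    return points

def gini(weights):
    """G = 1 - 2 * integral of Q(W) dW, via the trapezoidal rule on the curve points."""
    pts = concentration_curve(weights)
    area = sum((w1 - w0) * (q0 + q1) / 2.0
               for (w0, q0), (w1, q1) in zip(pts, pts[1:]))
    return 1.0 - 2.0 * area

# Hypothetical weights: one dominant opening and three rare ones.
example_weights = [97, 1, 1, 1]
example_gini = gini(example_weights)
```

Equal weights give $Q(W) = W$ and hence $G = 0$, while strong concentration of games in a few openings pushes $G$ toward 1.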