

Judging the Judges: Evaluating the Performance of International Gymnastics Judges

Hugues Mercier
Université de Neuchâtel, Switzerland
[email protected]

Sandro Heiniger
Universität St.Gallen, Switzerland
[email protected]

Abstract—Judging a gymnastics routine is a noisy process, and the performance of judges varies widely. In this work, we design, describe and implement a statistical engine to analyze the performance of gymnastics judges during and after major competitions like the Olympic Games and the World Championships. The engine, called the Judge Evaluation Program (JEP), has three objectives: (1) provide constructive feedback to judges, executive committees and national federations; (2) assign the best judges to the most important competitions; (3) detect bias and outright cheating.

Using data from international gymnastics competitions held during the 2013–2016 Olympic cycle, we first develop a marking score evaluating the accuracy of the marks given by gymnastics judges. Judging a gymnastics routine is a random process, and we can model this process very accurately using heteroscedastic random variables. The marking score scales the difference between the mark of a judge and the theoretical performance of a gymnast as a function of the intrinsic judging error variability estimated from data for each apparatus. This dependence between judging variability and performance quality has never been properly studied. We then study ranking scores assessing to what extent judges rate gymnasts in the correct order, and explain why we ultimately chose not to implement them. We also study outlier detection to pinpoint gymnasts who were poorly evaluated by judges. Finally, we discuss interesting observations and discoveries that led to recommendations and rule changes at the Fédération Internationale de Gymnastique (FIG).

Keywords: Sports judges, quantifying accuracy, intrinsic judging error variability, heteroscedasticity, outlier detection, gymnastics.

I. INTRODUCTION

Gymnastic judges and judges from similar sports are susceptible to well-studied biases¹. Ansorge and Scheer [1] detected a national bias of artistic gymnastics judges at the 1984 Olympic Games: judges tend to give better marks to athletes from their home country while penalizing close competitors from other countries. National bias was subsequently detected in rhythmic gymnastics at the 2000 Olympic Games [26], and in numerous other sports such as figure skating [7], [35], Muay Thai boxing [20], ski jumping [35], diving [10] and dressage [28].

Plessner [23] observed a serial position bias in gymnastics experiments: a competitor performing and evaluated last gets better marks than when performing first. Boen, Hoye, Auweele, et al. [4] found a conformity bias in gymnastics: open feedback causes judges to adapt their marks to those of the other judges of the panel.

¹ Consult Landers [19] for an initial comprehensive survey until 1970, and Bar-Eli, Plessner, and Raab [3] for a recent survey.

Damisch, Mussweiler, and Plessner [8] found a sequential bias in artistic gymnastics at the 2004 Olympic Games: the evaluation of a gymnast is likely more generous than expected if the preceding gymnast performed well. Plessner and Schallies [24] showed in an experiment that still rings judges can make systematic errors based on their viewpoint. Biases observed in other sports might occur in gymnastics as well. Findlay and Ste-Marie [12] found a reputation bias in figure skating: judges overestimate the performance of athletes with a good reputation. Price and Wolfers [27] quantified the racial bias of NBA officials against players of the opposite race, which was large enough to affect the outcome of basketball games. Interestingly, the racial bias of NBA officials subsequently disappeared, most probably due to the public awareness of the bias from the first study [25].

The aforementioned biases are often unconscious and cannot always be entirely eliminated in practice. However, rule changes and monitoring from the Fédération Internationale de Gymnastique (FIG), as well as increased scrutiny induced by the media exposure of major gymnastics competitions, make these biases reasonably small and tempered by mark aggregation. In fact, judging is much more about skill and training than bias: it is difficult to evaluate every single aspect of the complex movements that are part of a gymnastics routine, and unsurprisingly nearly all international judges are former gymnasts. This challenge has been known since at least the 1930s [36], and there is a large number of studies on the ability of judges to detect execution mistakes in gymnastic routines [13], [21], [22], [33], [34]². In a nutshell, novice judges consult their scoring sheet much more often than experienced international judges, thus missing execution errors. Furthermore, international judges have superior perceptual anticipation, are better at detecting errors in their peripheral vision and, when they are former gymnasts, leverage their own sensorimotor experiences.

Even among well-trained judges at the international level, there are significant differences: some judges are simply better than others. For this reason, the FIG has developed and used the Judge Evaluation Program (JEP) to assess the performance of judges during and after international competitions.

² Consult Landers [19] for an initial comprehensive survey until 1970, and Bar-Eli, Plessner, and Raab [3] for a recent survey.



The work on JEP was started in 2006 and the tool has grown iteratively since then. Despite its usefulness, JEP was partly designed with unsound and inaccurate mathematical tools, and was not always evaluating what it ought to evaluate.

A. Our contributions

In this article, we design and describe a toolbox to assess, as objectively as possible, the accuracy of international gymnastics judges using simple yet rigorous tools. This toolbox is now the core statistical engine of the new iteration of JEP³, providing feedback to judges, executive committees and national federations. It is used to reward the best judges by selecting them for the most important competitions such as the Olympic Games. It finds judges performing below expectations so that corrective measures can be undertaken. It provides hints about inconsistencies and confusing items in the Codes of Points detailing how to evaluate each apparatus, as well as weaknesses in training and accreditation processes. In uncommon but important circumstances, it can uncover biased and cheating judges.

The main tool we develop is a marking score evaluating the accuracy of the marks given by a judge. We design the marking score such that it is unbiased with respect to the apparatus/discipline under evaluation, and unbiased with respect to the skill level of the gymnasts. In other words, the main difficulty we overcome is as follows: a parallel bars judge giving 5.3 to a gymnast deserving 5.0 must be evaluated more generously than a vault judge giving 9.9 to a gymnast deserving 9.6, but how much more? To quantify this, we model the behavior of judges as heteroscedastic random variables using data from international and continental gymnastics competitions held during the 2013–2016 Olympic cycle. The standard deviation of these random variables, describing the intrinsic judging error variability of each discipline, decreases as the performance of the gymnasts improves, which allows us to quantify precisely how judges compare to their peers. To the best of our knowledge, this dependence between judging variability and performance quality has never been properly studied in any setting (sport or other).

Besides allowing us to distinguish between accurate and erratic judges, we also use the marking score as the basic tool to detect outlier evaluations. The more accurate a judge is, the lower his/her outlier detection threshold.

We then study ranking scores quantifying to what extent judges rank gymnasts in the correct order. We analyzed different metrics to compare distances between rankings, such as the generalized version of Kendall's τ distance [18]. Depending on how these ranking scores are parametrized, they are either unfair by penalizing unlucky judges who blink at the wrong time, or correlated with our marking score and thus unnecessary.

³ The new iteration of JEP was developed in collaboration with the FIG and the Longines watchmaker. It is a full software stack that handles all the interactions between the databases, our statistical engine, and a user-friendly front-end to generate statistics, recommendations and judging reports.

Since no approach was satisfactory, the FIG no longer uses ranks to monitor its judges⁴.

We made other interesting observations that led to recommendations and changes at the FIG during the course of this work. We show that so-called reference judges, hand-picked by the FIG and imparted with more power than regular panel judges, are not better than regular panel judges in the aggregate. We thus recommended that the FIG stop granting more power to reference judges. We also show that women judges are significantly more accurate than men judges in artistic gymnastics and in trampoline, which has training and evaluation implications.

This is the first of a series of three articles on sports judging. In the second article [16], we refine national bias studies in gymnastics using the heteroscedastic behavior of the judging error of gymnastics judges. In the third article [15], we show that this heteroscedastic judging error appears with a similar shape in other sports where panels of judges evaluate athletes objectively within a finite marking range.

The remainder of this article is organized as follows. We present our dataset and describe the gymnastics judging system in Section II. We then discuss true performance quality and control scores in gymnastics in Section III. We derive the marking score in Section IV. In Section V, we use the marking score to detect outliers. Section VI discusses ranking scores and why we ultimately left them aside. We present interesting observations and discoveries in Section VII and conclude in Section VIII by discussing the strengths and limitations of our approach.

II. DATA AND JUDGING IN GYMNASTICS

Gymnasts at the international level are evaluated by panels of judges for the difficulty, execution, and artistry components of their performances. The marks given by the judges are aggregated to generate the final scores and rankings of the gymnasts. The number of judges for each component and the aggregation method are specific to each discipline. In this article, we analyze the execution component of all the gymnastics disciplines: artistic gymnastics, acrobatic gymnastics, aerobic gymnastics, rhythmic gymnastics, and trampoline. We also evaluate artistry judges in acrobatic and aerobic gymnastics, but exclude difficulty judges from our analysis. Our dataset encompasses 21 international and continental competitions held during the 2013–2016 Olympic cycle culminating with the 2016 Rio Olympic Games.

The execution of a gymnastics routine is evaluated by a panel of judges. Table I summarizes the composition of the typical execution panel for each discipline⁵. With the exception of trampoline, these panels include execution and reference judges. Execution and reference judges have different power and are selected differently, but they all judge the execution of the routines under the same conditions and using the same criteria.

⁴ The previous iteration of JEP used a rudimentary ranking score.
⁵ The execution panels do not always follow this typical composition: the qualifying phases in artistic and rhythmic gymnastics may include four execution judges instead of five, World Cup events and continental championships do not always feature reference judges, and aerobic and acrobatic gymnastics competitions can have larger execution panels.


Discipline              Typical panel    Number of       Number
                        composition      performances    of marks
Acrobatic gymnastics    4 E + 2 R        756             4'870
Aerobic gymnastics      4 E + 2 R        938             6'072
Artistic gymnastics     5 E + 2 R        11'940          78'696
Rhythmic gymnastics     5 E + 2 R        2'841           19'052
Trampoline              5 E              1'986           9'654

Table I: Standard composition of the execution panel, number of performances and number of marks per discipline. E = Execution judges; R = Reference judges.


After the completion of a routine, each execution panel judge evaluates the performance by giving it a score between 0 and 10. Table I includes the number of performances and judging marks per discipline in our dataset. The number of performances in an event is not always equal to the number of gymnasts. For instance, gymnasts who wish to qualify for the vault apparatus finals jump twice, each jump counting as a distinct performance in our analysis. The number of judging marks depends on the number of performances and the size of the judging panels.

III. TRUE PERFORMANCE QUALITY AND CONTROL SCORES IN GYMNASTICS

The execution evaluation of a gymnastics routine is based on deductions precisely defined in the Code of Points of each apparatus⁶. The score of each judge can thus be compared to the theoretical true performance of the gymnast.

In practice the true performance level is unknown, and the FIG typically derives control scores with outside judging panels and video reviews post-competition. Unfortunately, the FIG does not provide accurate control scores for every performance: the number of control scores and how they are obtained depend on the discipline and competition. Besides, even when a control score is available, the Codes of Points might be ambiguous or the quality of a performance element may land between two discrete values. This still results in an approximation of the true performance, albeit a very good one. Control scores derived post-competition can also be biased, for instance if the people deriving them know who the panel judges are and what marks they initially gave. For all these reasons, in our analysis, we train our model using the median judging mark of each performance as the control score. Whenever marks by reference judges, superior juries and post-competition reviews are available, we include them with the execution panel judges and take the median mark over this enlarged panel, thus increasing the accuracy of our proxy of the true performance quality. We discuss the implications of training our data with the median, and control scores in general, in Section VIII.
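For concreteness, this is how the control score and the judging discrepancies of a single performance can be computed; a minimal sketch in Python, assuming the marks of the enlarged panel are given as a flat list of floats (the function names are ours, not part of JEP):

import numpy as np

def control_score(marks):
    """Median of all the available marks for one performance: execution
    panel, reference judges, superior jury and post-competition reviews."""
    return float(np.median(marks))

def judging_discrepancies(marks):
    """e_{p,j} = s_{p,j} - c_p for every mark on the enlarged panel."""
    c_p = control_score(marks)
    return [s - c_p for s in marks]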

⁶ The 2017–2020 Codes of Points, their appendices and other documents related to rules for all the gymnastics disciplines are publicly available at https://www.gymnastics.sport/site/rules/rules.php. Competitions in our dataset were ruled by the 2013–2016 Codes of Points.

p                Performance p
λ_p              True quality level of Performance p
c_p              Control score of Performance p
j                Judge j
s_{p,j}          Mark of Judge j for Performance p
e_{p,j}          Judging discrepancy s_{p,j} − c_p (approximates the judging error of Judge j for Performance p)
m_{p,j}          Marking score for Performance p by Judge j
M_j              Marking score of Judge j
d                Apparatus / Discipline d
σ_d(c_p)         Intrinsic judging error variability of Discipline d
α_d, β_d, γ_d    Parameters of Discipline d
n                Number of performances in an event

Table II: Notation.

IV. MARKING SCORE

We now derive a marking score to evaluate the performance of gymnastics judges. We first describe our general approach using artistic gymnastics data in Section IV-A and present results for the other gymnastics disciplines in Section IV-B. Table II summarizes the notation we use in this section.

A. General approach applied to artistic gymnastics

The marking score must have the following properties. First, it must not depend on the skill level of the gymnasts evaluated: a judge should not be penalized nor advantaged if he judges an Olympic final with the world's best 8 gymnasts as opposed to a preliminary round with 200 gymnasts. Second, it must allow comparisons of judges across apparatus, disciplines, and competitions. The marking score of a judge is thus based on three parameters:

1) the control scores of the performances;
2) the marks given by the judge;
3) the apparatus / discipline.

Let s_{p,j} be the mark of Judge j for Performance p, and let e_{p,j} ≜ s_{p,j} − c_p be the judging discrepancy of Judge j for Performance p. Since we use the median of the enlarged judging panel as the control score (c_p ≜ med_j(s_{p,j})), thus as a proxy of the true performance level λ_p, it follows that e_{p,j} is a proxy of the judging error of Judge j for Performance p. We emphasize once more that we discuss the advantages and drawbacks of using the median as control score in Section VIII.

Figure 1 shows the distribution of e_{p,j} for artistic gymnastics. Our first observation is that judges are too severe as often as they are too generous, which is trivially true because we use the median as control score. The second observation is that the judging error is highly heteroscedastic. Judges are much more accurate for the best performances, and simply using e_{p,j} underweights errors made for the best gymnasts.



[Figure 1: Distribution of the judging errors (judging error vs. control score) in artistic gymnastics. To improve the visibility, we aggregate the points on a 0.1 × 0.1 grid.]

[Figure 2: Variance of judging error versus control score in artistic gymnastics.]

Figures 2 and 3 respectively show the sample variance and the sample standard deviation of the judging error e_{p,j} as a function of the control score c_p for artistic gymnastics. In Figures 2, 3 and all similar figures that follow, the frequency is the number of performances with a given control score, and the fitted curves are exponential weighted least-squares regressions of the data. In Figure 2, we observe that the sample variance decreases almost linearly with the control score, except for the best performances for which it does not converge to zero. By inspection, the fitted standard deviation in Figure 3 is an outstanding fit. The outliers correspond to the rare gymnasts who aborted or catastrophically missed their routine. The weighted root-mean-square deviation (RMSD) of the regression is 0.015, which is almost one order of magnitude smaller than the smallest deduction allowed by a judge. We use this exponential equation for our estimator of the standard deviation of the judging error σ_d(c_p), which we call the intrinsic judging error variability.

[Figure 3: Standard deviation of judging error versus control score in artistic gymnastics, corresponding to the intrinsic judging error variability for this discipline. Fitted curve: σ_d(c_p) = 0.596 − 0.071·e^{0.205·c_p}, RMSD = 0.015.]


We can do the same analysis at the apparatus level. For example, Figures 4, 5, 6 and 7 respectively show the intrinsic judging error variability (the weighted least-squares regression of the standard deviation of the judging error) for still rings, uneven bars, women's floor exercise and men's floor exercise.

More generally, the estimator for σ_d(c_p) depends on the discipline (or apparatus) d under evaluation and the control score c_p of the performance, and is given by

σ_d(c_p) ≜ max(α_d + β_d · e^{γ_d · c_p}, 0.05).   (1)

For some apparatus like men's floor exercise in Figure 7, the intrinsic judging error variability is linear within the data range. Since there is no mark close to 10 in our dataset, and since σ_d(c_p) becomes small for the best recorded performances, we can omit the mathematical ramifications of the bounded marking range. However, for apparatus such as women's floor exercise in Figure 6, the best fitted curves go to zero before 10. Since athletes might get higher marks than in our original dataset in future competitions, we use max(·, 0.05) as a fail-safe mechanism to avoid comparing judges' marks to a very low or even negative extrapolated intrinsic error variability in the future.

We emphasize that all the disciplines and apparatus we analyzed have highly accurate regressions. Besides acrobatic gymnastics, for which we do not have as much data, the worst weighted root-mean-square deviation is RMSD ≈ 0.04.
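The estimation just described can be sketched as follows in Python, assuming per-mark control scores and judging errors as inputs. The 0.1 binning mirrors the grid used in the figures, the starting values come from the artistic gymnastics fit in Figure 3, and the function names are ours:

import numpy as np
from scipy.optimize import curve_fit

def fit_intrinsic_variability(control_scores, judging_errors):
    """Weighted least-squares fit of sigma_d(c) = alpha + beta * exp(gamma * c)
    to the sample standard deviation of the judging error, weighted by the
    number of marks observed at each control score."""
    c = np.round(np.asarray(control_scores), 1)           # bin on a 0.1 grid
    e = np.asarray(judging_errors)
    bins, counts = np.unique(c, return_counts=True)
    bins, counts = bins[counts > 1], counts[counts > 1]   # need >= 2 marks per bin
    sd = np.array([e[c == b].std(ddof=1) for b in bins])  # sample std dev per bin
    # curve_fit weights residuals by 1/sigma, so sigma = 1/sqrt(count)
    # gives each bin a weight proportional to its frequency.
    popt, _ = curve_fit(lambda x, a, b, g: a + b * np.exp(g * x),
                        bins, sd, p0=(0.6, -0.07, 0.2),
                        sigma=1.0 / np.sqrt(counts))
    return popt                                           # (alpha_d, beta_d, gamma_d)

def sigma_d(c_p, alpha, beta, gamma):
    """Eq. (1): intrinsic judging error variability, floored at 0.05."""
    return np.maximum(alpha + beta * np.exp(gamma * c_p), 0.05)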


[Figure 4: Standard deviation of judging error versus control score for still rings. Fitted curve: σ_d(c_p) = 0.517 − 0.026·e^{0.294·c_p}, RMSD = 0.022.]

[Figure 5: Standard deviation of judging error versus control score for uneven bars. Fitted curve: σ_d(c_p) = 0.624 − 0.08·e^{0.197·c_p}, RMSD = 0.037.]

[Figure 6: Standard deviation of judging error versus control score for women's floor exercise. Fitted curve: σ_d(c_p) = 0.341 − 0.001·e^{0.566·c_p}, RMSD = 0.026.]

[Figure 7: Standard deviation of judging error versus control score for men's floor exercise. Fitted curve: σ_d(c_p) = 3.011 − 2.24·e^{0.027·c_p}, RMSD = 0.033.]

The marking score of Performance p by Judge j is

m_{p,j} ≜ e_{p,j} / σ_d(c_p) = (s_{p,j} − c_p) / σ_d(c_p).   (2)

It expresses the judging error as a function of the standard deviation for a specific discipline and control score. The overall marking score for Judge j is given by

M_j ≜ √(E[m_{p,j}²]) = √((1/n) · Σ_{p=1}^{n} m_{p,j}²).   (3)

The marking score of a perfect judge is 0, and a judge whose judging error is always equal to the intrinsic judging error variability σ_d(c_p) has a marking score of 1.0. The mean squared error weights outliers heavily, which is desirable for evaluating judges.
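Eqs. (2) and (3) translate directly into code; a minimal sketch, assuming the intrinsic variabilities σ_d(c_p) of the judged performances have already been evaluated (the function name is ours):

import numpy as np

def marking_scores(marks, controls, sigmas):
    """Per-performance marking scores m_{p,j} (Eq. (2)) and the overall
    marking score M_j (Eq. (3)) of one judge over n performances."""
    m = (np.asarray(marks) - np.asarray(controls)) / np.asarray(sigmas)
    return m, float(np.sqrt(np.mean(m ** 2)))

Applied to the whole dataset, this yields one M_j per judge, or one per judge and apparatus when the per-apparatus fits are used.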

Figure 8 shows the boxplots of the marking scores of all the judges for each apparatus in artistic gymnastics using the regression from Figure 3. The acronyms are defined in Table III. The first observation is that there are significant differences between apparatus. Pommel horse, for instance, is intrinsically more difficult to judge accurately than vault and floor exercise. The FIG confirms that the alternative, i.e., that judges in pommel horse are less competent than judges in men's vault or men's floor exercise, is highly unlikely. The differences between floor and vault on one side and pommel horse on the other side were previously observed in specific competitions [2], [5], [6]. Note that the better accuracy of vault judges does not make it easier to rank the gymnasts, since many gymnasts execute the same jumps at a similar performance level.


[Figure 8: Distribution of the overall marking scores per artistic gymnastics apparatus using one overall formula. The acronyms are defined in Table III; the number of judges per apparatus in the dataset: M_PH (153), M_SR (149), W_UB (119), M_PB (158), M_HB (153), M_FX (156), W_BB (117), W_FX (115), M_VT (153), W_VT (119).]

Acronym   Apparatus
BB        Balance beam (women)
FX        Floor exercise (men and women)
HB        Horizontal bar (men)
PB        Parallel bars (men)
PH        Pommel horse (men)
SR        Still rings (men)
UB        Uneven bars (women)
VT        Vault (men and women)

Table III: The artistic gymnastics apparatus and their acronyms.

Acronym   Apparatus
DMT       Double mini-trampoline (men and women)
IND       Individual trampoline (men and women)
TUM       Tumbling (men and women)

Table IV: The trampoline apparatus and their acronyms.

A highly desirable feature for the marking score is to be comparable between apparatus and disciplines, which proves difficult with one overall formula. The differences between apparatus make it challenging for the FIG to qualitatively assess how good the judges are and to convey this information unambiguously to the interested parties. We thus estimated the intrinsic judging error variability σ_d(c_p) for each apparatus (instead of grouping them together) and used the resulting regressions to recalculate the marking scores. The results, presented in Figure 9, now show a good uniformity and make it simpler to compare judges from different apparatus with each other. A pommel horse judge with a marking score of 1.0 is average, and so is a vault judge with the same marking score. This has allowed us to define a single set of quantitative-to-qualitative thresholds applicable across all the gymnastics apparatus and disciplines.

[Figure 10: Standard deviation of judging error versus control score in rhythmic gymnastics. Fitted curve: σ_d(c_p) = 4.182 − 3.335·e^{0.022·c_p}, RMSD = 0.027.]

[Figure 9: Distribution of the overall marking scores per artistic gymnastics apparatus using an individual formula per apparatus. The acronyms are defined in Table III, and the number of judges per apparatus is as in Figure 8.]

B. Other gymnastic disciplines

We use the same approach for the other gymnastics disciplines. Figures 10, 11 and 12 respectively show the weighted least-squares regressions for rhythmic gymnastics, acrobatic gymnastics and aerobic gymnastics. We do not discuss the results at the apparatus level, although we found notable differences: group routines in rhythmic gymnastics are more difficult to judge than individual ones, and groups in acrobatic gymnastics are more difficult to judge than pairs. We also analyzed the artistry judges in acrobatic and aerobic gymnastics, and were surprised to observe that the heteroscedasticity of their judging error was almost the same as for execution judges.

Trampoline, shown in Figure 13, was the most puzzling discipline to tackle. The behavior on the left side of the plot is due to gymnasts who aborted their routine before completing all their jumps, for instance by losing balance and landing a jump outside the center of the trampoline. We solved the problem by fitting the curves based on the completed routines.


[Figure 11: Standard deviation of judging error versus control score in acrobatic gymnastics. Fitted curve: σ_d(c_p) = 1.366 − 0.226·e^{0.183·c_p}, RMSD = 0.102.]

Figure 11: Standard deviation of judging error versus controlscore in acrobatic gymnastics.

[Figure 12: Standard deviation of judging error versus control score in aerobic gymnastics. Fitted curve: σ_d(c_p) = 3.953 − 3.1·e^{0.023·c_p}, RMSD = 0.037.]

The result is shown in Figure 14, with aborted routines represented by rings instead of filled circles. Again, the weighted RMSD is excellent.

When calculating the marking score for trampoline judges, the marks of gymnasts who did not complete their exercise may be omitted. If they are accounted for, the estimator generously evaluates judges when gymnasts do not complete their routine, which results in a slightly improved overall marking score.

The behavior observed in trampoline appears in other sports with aborted routines or low scores [15] and can be modeled with a concave parabola. This, however, decreases the accuracy of the regression for the best performances, which is undesirable.

Trampoline exhibits the largest differences between apparatus: tumbling is much more difficult to judge than individual trampoline, which in turn is much more difficult to judge than double mini-trampoline. The boxplots per trampoline apparatus in Figure 15 clearly illustrate this (the acronyms are defined in Table IV). We thus use a different regression equation per apparatus.

[Figure 13: Standard deviation of judging error versus control score in trampoline.]

[Figure 14: Standard deviation of judging error versus control score in trampoline. The rings indicate aborted routines. Data from synchronized trampoline is removed. Fitted curve (completed routines): σ_d(c_p) = 0.225 − 6·10⁻⁷·e^{1.301·c_p}, RMSD = 0.011.]

Finally, note that Figure 14 excludes data from synchronized trampoline because its judging panels are partitioned in two halves, each monitoring a different gymnast. The subpanels (two judges each) are too small to derive accurate control scores.

V. OUTLIER DETECTION

We can use the marking score to signal judging marks that are improbably high or low, with an increased emphasis on outliers involving a judge and a gymnast of the same nationality. Figure 16, like Figure 1, shows the judging errors for artistic gymnastics judges. Differences of more than two standard deviations (2·σ_d(c_p)) away from the control score are marked in red⁷. The problem with this approach is that a bad judge has a lot of outliers, and a great judge none. This is not what the FIG wants, because an erratic judge can be unbiased and a precise judge can be dishonest.

⁷ We use a different equation for σ_d(c_p) per apparatus.


[Figure 15: Distribution of the overall marking scores per trampoline apparatus using one overall formula. The acronyms are defined in Table IV; the number of judges per apparatus: TUM (11), IND (20), DMT (11).]

Instead of using the same standard deviation for all the judges, we scale the standard deviation by the overall marking score of each judge, and flag the judging scores that satisfy

|e_{p,j}| > max(2 · σ_d(c_p) · M_j, 0.1).   (4)

We use max(·, 0.1) to ensure that a difference of 0.1 from the control score is never an outlier. The results are shown in Figure 17. Eq. (4) flags ≈ 5% of the marks, which is slightly more than what would be expected for a normal distribution. The advantage of the chosen approach is that it compares each judge to herself/himself, that is, it is more stringent for precise judges than for erratic judges. The disadvantage of the chosen approach is that one might think that a judge without outliers is good, which is false. The marking score and outlier detection work in tandem: a judge with a bad marking score is erratic, thus bad no matter how many outliers he/she has.
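Eq. (4) then becomes a one-line test per mark; a sketch, assuming the judging discrepancies and intrinsic variabilities are precomputed (the function name is ours):

import numpy as np

def flag_outliers(errors, sigmas, M_j):
    """Eq. (4): flag |e_{p,j}| > max(2 * sigma_d(c_p) * M_j, 0.1).
    The 0.1 floor ensures a 0.1-point deviation is never flagged."""
    errors, sigmas = np.asarray(errors), np.asarray(sigmas)
    return np.abs(errors) > np.maximum(2.0 * sigmas * M_j, 0.1)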

It is important to note that we cannot infer conscious bias, chicanery or cheating from an outlier mark. A flagged evaluation can be a bad but honest mistake, caused by external factors, or even indicate that a judge is out of consensus with the other judges, who might be wrong at the same time. Nevertheless, this information is useful for the FIG: performances with large discrepancies among panel judges systematically lead to careful video reviews post-competition. In egregious but very rare circumstances they may even result in sanctions by the FIG Disciplinary Commission. We present a comprehensive analysis of national bias in gymnastics in the second article of this series [16].

VI. RANKING SCORE

The ranking of the gymnasts is determined by their scores, which are themselves aggregated from the marks given by the judges. The old iteration of JEP used a rudimentary ranking score to evaluate to what extent judges ranked the best athletes in the right order. In a vacuum this makes sense: the FIG wants to select the most deserving gymnasts for the finals, and award the medals in the correct order.

[Figure 16: Distribution of the judging errors in artistic gymnastics. Dots in red are more than two standard deviations (2·σ_d(c_p)) away from the control score. To improve the visibility, we aggregate the points on a 0.1 × 0.1 grid and shift the outliers (red dots) by 0.05 on both axes.]

[Figure 17: Distribution of the judging errors in artistic gymnastics. Dots in red are more than 2·σ_d(c_p)·M_j away from the control score. To improve the visibility, we aggregate the points on a 0.1 × 0.1 grid and shift the outliers (red dots) by 0.05 on both axes.]

In this section we show that providing an objective assessment of the judges based on the order in which they rank the best athletes is problematic, and we recommended that the FIG stop using this approach.

Definition 6.1: Let G = {g_1, g_2, ..., g_n} be a set of n gymnasts. A ranking on G is a sequence r = a_1 a_2 a_3 ... a_n, a_i ≠ a_j ∀ i, j ∈ {1, ..., n}, of all the elements of G that defines a weak order on G. Alternatively, a ranking can be noted as r = (r_{g_1}, r_{g_2}, r_{g_3}, ...), where r_{g_1} is the rank of Gymnast g_1, r_{g_2} is the rank of Gymnast g_2, and so on.

The mathematical comparison of rankings is closely related to the analysis of voting systems and has a long and rich history dating back to the work of Ramon Llull in the 13th century.


Parameter set   w_i   δ_i   D_ij
1               1     1     1
2               1     1     |c_i − c_j|
3               1     1/i   |c_i − c_j|

Table V: Parameters of the ranking scores for our three series of simulations.

Two popular metrics on the set of weak orders are Kendall's τ distance [17] and Spearman's footrule [31], both of which are within a constant fraction of each other [9]. In recent years, Kumar and Vassilvitskii [18] generalized these two metrics by taking into account element weights, position weights, and element similarities. Their motivation was to find the ranking minimizing the distance to a set of search results from different search engines.

Definition 6.2: Let r be a ranking of n competitors. Let w = (w_1, ..., w_n) be a vector of element weights. Let δ = (δ_1, ..., δ_n) be a vector of position swap costs, where δ_1 ≜ 1 and δ_i is the cost of swapping elements at positions i−1 and i for i ∈ {2, 3, ..., n}. Let p_i = Σ_{j=1}^{i} δ_j for i ∈ {1, 2, ..., n}. We define the mean cost of interchanging positions i and r_i by p̄(i) = (p_i − p_{r_i}) / (i − r_i). Finally, let D : {1, ..., n} × {1, ..., n} be a non-empty metric and interpret D(i, j) = D_ij as the cost of swapping elements i and j. The generalized Kendall's τ distance [18] is

K′_* = K′_{w,δ,D}(r) = Σ_{s>t} w_s · w_t · p̄(s) · p̄(t) · D_st · [r_s < r_t].   (5)

Note that K′_* is the distance between r and the identity ranking id = (1, 2, 3, ...). To calculate the distance between two rankings r_1 and r_2, we calculate K′(r_1, r_2) = K′_{w,δ,D}(r_1 ∘ (r_2)⁻¹), where (r_2)⁻¹ is the right inverse of r_2.
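A direct transcription of Definition 6.2 and Eq. (5), assuming 0-indexed positions and that r[i] is the rank of element i; with w, δ and D all equal to one, this sketch reduces to the original Kendall's τ:

import numpy as np

def generalized_kendall_tau(r, w, delta, D):
    """Generalized Kendall's tau distance to the identity ranking (Eq. (5)).
    r[i]: rank of element i (0-indexed); w: element weights;
    delta: position swap costs; D[i][j]: cost of swapping elements i and j."""
    n = len(r)
    p = np.cumsum(delta)                       # p_i = sum of the first i swap costs
    def p_bar(i):                              # mean cost of moving i to position r_i
        return 1.0 if r[i] == i else (p[i] - p[r[i]]) / (i - r[i])
    K = 0.0
    for s in range(n):
        for t in range(s):                     # all pairs with s > t
            if r[s] < r[t]:                    # pair inverted relative to identity
                K += w[s] * w[t] * p_bar(s) * p_bar(t) * D[s][t]
    return K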

These generalizations are natural for evaluating gymnastics judges: swapping the gold and silver medalists should be evaluated more harshly than inverting the ninth and tenth best gymnasts, but swapping the gold and silver medalists when their marks are 9.7 and 9.6 should be evaluated more leniently than if their marks are 9.7 and 8.7.

To test the relevance of ranking scores as a measurement of judging accuracy, we ran several simulations to compare them to our marking score. As an example for this article, we use the men's floor exercise finals at the 2016 Rio Olympic Games. We first calculate the control scores c_1, c_2, ..., c_8 of the eight finalists from the marks given by the seven execution judges (five panel judges and two reference judges). We then simulate the performance of 1000 average judges j ∈ {1, 2, ..., 1000} by randomly creating, for each of them, eight marks s_{1,j}, s_{2,j}, ..., s_{8,j} for the eight finalists using a normal distribution with mean c_p and standard deviation σ_d(c_p) for p ∈ {1, 2, ..., 8}. We then calculate, for each judge, the marking score as well as three ranking scores based on Eq. (5) with the three different sets of parameters from Table V.
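The simulation can be sketched as follows, reusing sigma_d and generalized_kendall_tau from the sketches above. The control scores below are illustrative placeholders rather than the actual Rio 2016 values, and the σ_d parameters are those of the artistic gymnastics fit in Figure 3:

import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical control scores for the eight finalists, best first
# (illustrative values only, not the actual Rio 2016 scores).
c = np.array([9.2, 9.0, 8.9, 8.8, 8.7, 8.6, 8.4, 8.2])
sigma = sigma_d(c, 0.596, -0.071, 0.205)    # artistic gymnastics fit (Figure 3)

# 1000 synthetic average judges: one mark per finalist, drawn from
# a normal distribution with mean c_p and standard deviation sigma_d(c_p).
marks = rng.normal(loc=c, scale=sigma, size=(1000, c.size))

# Marking score of each synthetic judge (Eq. (3)) ...
M = np.sqrt(np.mean(((marks - c) / sigma) ** 2, axis=1))

# ... and a ranking score per judge against the control-score order, which is
# the identity since c is sorted (parameter set 1 of Table V: plain Kendall's tau).
n = c.size
w, delta, D = np.ones(n), np.ones(n), np.ones((n, n))
ranks = np.argsort(np.argsort(-marks, axis=1), axis=1)   # 0 = ranked best
K = [generalized_kendall_tau(r, w, delta, D) for r in ranks]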

Figures 18, 19 and 20 show the ranking score with respect to the marking score of the 1000 judges for the three parameter sets.

[Figure 18: Ranking score vs marking score for 1000 synthetic average judges and the first set of ranking score parameters from Table V. We aggregate the points on the x-axis to improve visibility.]

[Figure 19: Ranking score vs marking score for 1000 synthetic average judges and the second set of ranking score parameters from Table V. We aggregate the points to improve visibility.]

The figures illustrate that the correlation between the ranking score and the marking score varies widely depending on the chosen parameters.

The parameters used in Figure 18 are those of the original version of Kendall's τ distance [17]. This simply counts the number of bubble sort swaps required to transform one ranking into the other; swapping the first and second gymnasts separated by 0.1 point is equivalent to swapping the seventh and eighth gymnasts separated by 1.0 point. In Figure 19, the element swap costs vary (D_ij = |c_i − c_j|). This decreases the penalty of swaps as the marks get closer to each other; in particular, swapping two gymnasts with the same control score c_i = c_j incurs no penalty. This increases the correlation between the marking score and the ranking score, and both, to some extent, measure the same thing. In Figure 20, we also vary the position swap costs (δ_i = 1/i). This increases the importance of having the correct order as we move towards the gold medalist.


[Figure 20: Ranking score vs marking score for 1000 synthetic average judges and the third set of ranking score parameters from Table V. We aggregate the points to improve visibility.]

The correlation between the marking score and the ranking score decreases; thus we penalize good judges who unluckily make mistakes at the wrong place, and reward erratic judges who somehow get the podium in the correct order.

It is unclear how to parametrize the ranking score; it is either redundant with the marking score, or too uncorrelated to be of any practical value. The marking score already achieves our objectives. It is based on the theoretical performances of the gymnasts over hundreds of performances for each judge and reflects bias and cheating, as these involve changing the marks up or down for some of the performances. Furthermore, the FIG is adamant that a theoretical judge who ranks all the gymnasts in the correct order but is either always too generous or too strict is not a good judge, because he/she does not apply the Codes of Points properly. From these observations, the FIG stopped using ranking scores to monitor the accuracy of its judges.

VII. OBSERVATIONS, DISCOVERIES AND RECOMMENDATIONS

During the course of this work we made interesting and sometimes surprising observations and discoveries that led to recommendations to the FIG. We summarize our observations about reference judges in Section VII-A and judging gender discrepancies in Section VII-B.

A. Reference judges

In addition to the regular panel of execution judges, all the gymnastics disciplines except trampoline also have so-called reference judges. In artistic and rhythmic gymnastics, there are two reference judges, and the aggregation process is as follows⁸. The execution panel score is the trimmed mean of the middle three out of five execution panel judges, and the reference score is the arithmetic mean of the two reference judges' marks. If the gap between the execution panel score and the reference score exceeds a predefined tolerance threshold, and if the difference between the marks of both reference judges is below a second threshold, then the final execution score of the gymnast is the mean of the execution panel and reference scores. This makes reference judges dangerously powerful.

⁸ Acrobatic and aerobic gymnastics have a similar process for execution and artistry judges.
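For concreteness, the aggregation just described can be sketched as follows; the tolerance thresholds gap_tol and ref_gap_tol are hypothetical placeholders, since the article does not give the FIG's actual values:

import numpy as np

def final_execution_score(panel, reference, gap_tol=0.3, ref_gap_tol=0.2):
    """Aggregation sketch: the execution panel score is the trimmed mean of
    the middle three of five panel marks; the reference score is the mean of
    the two reference marks. gap_tol and ref_gap_tol are hypothetical
    placeholders, not the FIG's actual tolerance thresholds."""
    panel_score = np.mean(np.sort(panel)[1:-1])      # drop lowest and highest mark
    reference_score = np.mean(reference)
    if (abs(panel_score - reference_score) > gap_tol
            and abs(reference[0] - reference[1]) < ref_gap_tol):
        return (panel_score + reference_score) / 2   # reference judges intervene
    return panel_score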

[Figure 21: Distribution of marking scores for artistic gymnastics execution panel and reference judges: Panel (1102), Reference (406).]

At each competition, execution judges are randomly selected from a set of accredited judges submitted by the national federations. In contrast, reference judges are hand-picked by the FIG, and the additional power granted to them is based on the assumption that execution judges are sometimes incompetent or biased. To test this assumption, we compared the marking scores of the execution panel and reference judges. The results for artistic gymnastics are shown in Figure 21⁹. Although this is obvious by inspection, a two-sided Welch's t-test returned a p-value of 0.18 and we could not reject the null-hypothesis that both means are equal.
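The comparison boils down to a Welch's t-test on the two groups of per-judge marking scores; a sketch with illustrative numbers (the real inputs are the scores summarized in Figure 21):

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical per-judge marking scores (illustrative values only).
panel_scores = np.array([0.86, 0.95, 1.02, 0.91, 1.10, 0.88, 0.97])
reference_scores = np.array([0.92, 1.05, 0.99, 1.08])

# Two-sided Welch's t-test (unequal variances): H0: both means are equal.
t_stat, p_value = ttest_ind(panel_scores, reference_scores, equal_var=False)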

We ran similar tests for the other gymnastics disciplines, and in all instances reference judges are either statistically indistinguishable from the execution panel judges, or worse. Having additional judges selected by the FIG is an excellent idea because it increases the size of the panels, thus making them more robust. However, we strongly recommended that the FIG not grant more power to reference judges. They are not better in aggregate, and the small size of the reference panels further increases the likelihood that the errors they make have greater consequences. The FIG Technical Coordinator has recently proposed the adoption of our recommendation.

B. Gender discrepancies: women are more accurate judges than men

In artistic gymnastics, men apparatus are almost exclusively evaluated by men judges and women apparatus are almost exclusively evaluated by women judges.

⁹ In Figure 21, judges have at least one marking score per apparatus for which they evaluated gymnasts. A judge has two marking scores on a single apparatus when appearing on the regular execution panel and on the reference panel for different events.


[Figure 22: Distribution of marking scores per gender in artistic gymnastics: Men (922), Women (470).]

Figure 8, besides showing the differences between apparatus, also shows that the marking scores for women apparatus are lower than those of men apparatus. Figure 22 formalizes this observation by directly comparing the marking scores of men and women judges in artistic gymnastics¹⁰.

The average woman evaluation is ≈ 15% better than the average man evaluation. More formally, we ran a one-sided Welch's t-test with the null-hypothesis that the mean of the marking scores of men is smaller than or equal to the mean marking score of women. We obtained a p-value of 10⁻¹⁵, leading to the rejection of the null-hypothesis.
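The one-sided variant of the previous test can be sketched as follows, again with illustrative numbers; the alternative= keyword requires SciPy 1.6 or later:

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical per-judge marking scores (illustrative values only).
men_scores = np.array([1.05, 0.98, 1.12, 1.03, 0.95, 1.08])
women_scores = np.array([0.88, 0.92, 0.85, 0.97, 0.90])

# One-sided Welch's t-test: H0: mean(men) <= mean(women). A small p-value
# rejects H0, i.e. men's marking scores are higher (less accurate judging).
t_stat, p_value = ttest_ind(men_scores, women_scores,
                            equal_var=False, alternative='greater')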

A first hypothesis that can explain this difference is that in artistic gymnastics, men routines include ten elements, whereas women routines include eight elements. Furthermore, the training and accreditation process is different for men and women judges. Men, who must judge six apparatus, receive less training than women, who must only judge four. Some men judges also have a (maybe unjustified) reputation of laissez-faire, which contrasts with the precision required from women judges.

To obtain more insight, we compared women and men judges in trampoline, which has mixed judging panels as well as the same accreditation process and apparatus per gender. In other words, men and women judges in trampoline receive the same training and execute the same judging tasks. The results are shown in Figure 23. The difference between genders observed in artistic gymnastics is less pronounced but remains in trampoline: women judge more accurately than men. We suspect that an important contributor to this judging gender discrepancy in gymnastics is the larger pool of women practicing the sport, which increases the likelihood of having more good women judges at the top of the pyramid, since nearly all judges are former gymnasts from different levels. As an illustration, a 2007 survey from USA Gymnastics reported four times more women gymnasts than men gymnasts in the USA [14].

¹⁰ In Figure 22, judges have one marking score per apparatus for which they evaluated gymnasts.

[Figure 23: Distribution of marking scores per gender in trampoline: Men (19), Women (20).]

A 2004 report from the ministère de la Jeunesse, des Sports et de la Vie Associative reported a similar ratio in France [32]. Accurate information on participation per gender is difficult to come by, but fragmentary results indicate a similar participation gender imbalance in trampoline [29].

On a different note, we did not observe any mixed-gender bias in trampoline, i.e., judges are not biased in favor of same-gender athletes. This contrasts with other sports such as handball, where gender bias by referees led to transgressive behaviors [30].

In light of our gender analysis, we recommended that the FIG and its technical committees thoroughly review their processes to select, train and evaluate men judges in artistic gymnastics and trampoline. The marking score we developed provides valuable help for this task.

VIII. CONCLUSIONS AND LIMITATIONS

We put the evaluation of international gymnastics judges on a strong mathematical footing using robust yet simple tools. This has led to a better assessment of current judges, and will improve judging in the future. It is clear that there are significant differences between the best and the worst judges; this in itself is not surprising, but we can now quantify this much more precisely than in the past.

Our main contribution is a marking score that evaluates the accuracy of the marks given by judges. The marking score can be used across disciplines, apparatus and competitions. Its calculation is based on the intrinsic judging error variability estimated from prior data. Since athletes improve, and since Codes of Points are revised every four years, this intrinsic variability can and should be calibrated at the beginning of every Olympic cycle with data from the previous cycle. We calibrated our model using 2013–2016 data, and it should be recalibrated after the 2020 Tokyo Summer Olympics.

The FIG can use the marking score to assign the best judges to the most important competitions. The marking score is also the central piece of our outlier detection technique highlighting evaluations far above or below what is expected from each judge.


The marking score and outlier detection work in tandem: the more accurate a judge is in the long term, the harder it is for that judge to cheat without being caught, due to a low outlier detection threshold.

The FIG classifies international gymnastics judges in four categories: Category 1, 2, 3 and 4. Only judges with a Category 1 brevet can be assigned to major international competitions. The classification is based on theoretical and practical examinations, with increasingly stringent thresholds for the higher categories. As an example, in men's artistic gymnastics [11] the theoretical examination for the execution component consists of the evaluation of 30 routines, 5 per apparatus. Our statistical engine is much more precise than the FIG examinations because it tracks judges longitudinally in real conditions over thousands of evaluations. Our dataset is dominated by Category 1 judges, and even at this highest level it shows significant differences among judges.

A. Limitations of our approach: relative evaluations and control scores

The first limitation of our approach is that judges are compared with each other and not based on their objective performance. An apparatus with only outstanding judges will trivially have half of them with a marking score below the median, and the same is true of an apparatus with only atrocious judges. From discussions with the FIG, no apparatus or discipline has the luxury of having only outstanding judges. We therefore proposed qualitative thresholds based on the fact that most judges are good, and a reward-based approach for the very best ones.

The second limitation of our approach is its dependence on accurate control scores. Even though the Codes of Points are very objective in theory, in practice we must work with approximations of the true performance level, which remains unknown. This has implications for evaluating judges live during competitions and for training our model.

During competitions, quick feedback is necessary, and we approximate the control score with the median judging mark. Relying on the median for a single performance or a small event such as an Olympic final can be misleading. A high marking score for a specific performance is not necessarily an indicator of a judging error but can also mean that the judge is accurate but out of consensus with the other, inaccurate judges. The FIG typically relies on observers like outside panels and superior juries to obtain quick feedback during competitions. We do not report detailed results here, but our analysis shows that, as for reference judges, these observers are in the aggregate equal to or worse than regular panel judges, and giving them additional power is dangerous. Discrepancies between this outside panel and the regular panel should be viewed with circumspection. The best the FIG can do in this circumstance is to add these outside marks to the regular panel to increase its robustness until a more accurate control score is available post-competition.

The Technical Committee (TC) of each discipline calculates control scores post-competition using video reviews. Each TC uses a different number of members, ranging from two to seven, to evaluate each performance. Furthermore, each TC uses a different aggregation technique: sometimes members verbally agree on a score, and other times they take the average. Even with video review, the FIG cannot guarantee the accuracy and unbiasedness of the TC members: some of them might be friends with judges they are evaluating and know what marks they initially gave. We therefore suggested clear guidelines for the calculation of the control scores post-competition to make them as robust as possible in the future. This is paramount to guarantee the accuracy of our approach on a per-routine and per-competition basis.

When we trained our model using 2013–2016 data, the FIG did not have control scores for every performance, and could not tell us under what conditions the available control scores had been derived. For this reason, we trained our model using the median of all the marks at our disposal. Considering the size of our dataset, this provides an excellent approximation of the intrinsic judging error variability. However, as for live evaluations during competitions, retrospective judge evaluations during the 2013–2016 Olympic cycle must be interpreted cautiously. While a longitudinal evaluation provides a very accurate view of a judge's performance, a bad evaluation for a specific event might indicate that the judge was accurate but out of consensus with other, inaccurate judges. More precise control scores obtained by video review must once again be provided to settle the matter.
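
To illustrate how such a calibration might look, the sketch below estimates the intrinsic variability as a function of performance quality by binning performances by their median-based control score. The binning is our simplification; the model described in this paper fits a smooth heteroscedastic curve rather than a step function.

    import numpy as np

    def estimate_sigma_by_bin(all_marks, all_controls, bin_width=0.5):
        # Illustrative calibration: group performances into bins of
        # control scores and compute the standard deviation of judging
        # errors within each bin, yielding sigma as a step function of
        # performance quality. Bins with fewer than two observations
        # are dropped.
        marks = np.asarray(all_marks, dtype=float)
        controls = np.asarray(all_controls, dtype=float)
        errors = marks - controls
        bins = np.floor(controls / bin_width) * bin_width
        return {b: errors[bins == b].std(ddof=1)
                for b in np.unique(bins) if (bins == b).sum() > 1}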

ACKNOWLEDGMENTS

This work is the result of fruitful interactions and discussions with the other project partners. We would like to thank Nicolas Buompane, Steve Butcher, Les Fairbrother, André Gueisbuhler, Sylvie Martinet and Rui Vinagre from the FIG, Benoit Cosandier, Jose Morato, Christophe Pittet, Pascal Rossier and Fabien Voumard from Longines, and Pascal Felber, Christopher Klahn, Rolf Klappert and Claudia Nash from the Université de Neuchâtel. This work was partly funded by Longines. A preliminary version of this work was presented at the 2017 MIT Sloan Sports Analytics Conference.

REFERENCES

[1] C. J. Ansorge and J. K. Scheer, “International bias detected in judging gymnastic competition at the 1984 Olympic Games”, Research Quarterly for Exercise and Sport, vol. 59, no. 2, pp. 103–107, 1988.

[2] A. Atikovic, S. Kalinski, S. Bijelic, and N. Avdibašic Vukadinovic, “Analysis results judging world championships in men’s artistic gymnastics in the London 2009 year”, Sportlogia, vol. 7, Dec. 2011.

[3] M. Bar-Eli, H. Plessner, and M. Raab, Judgement, decision making and success in sport. Wiley-Blackwell, 2011.

[4] F. Boen, K. van Hoye, Y. V. Auweele, J. Feys, and T. Smits, “Open feedback in gymnastic judging causes conformity bias based on informational influencing”, Journal of Sports Sciences, vol. 26, no. 6, pp. 621–628, 2008.

[5] M. Bucar Pajek, I. Cuk, J. Pajek, M. Kovac, and B. Leskošek, “Is the quality of judging in women artistic gymnastics equivalent at major competitions of different levels?”, Journal of Human Kinetics, vol. 37, no. 1, pp. 173–181, 2013.

[6] M. Bucar, I. Cuk, J. Pajek, I. Karacsony, and B. Leskošek, “Reliability and validity of judging in women’s artistic gymnastics at University Games 2009”, European Journal of Sport Science, vol. 12, no. 3, pp. 207–215, 2012.

[7] B. Campbell and J. W. Galbraith, “Nonparametric tests of the unbiasedness of Olympic figure-skating judgments”, The Statistician, pp. 521–526, 1996.

[8] L. Damisch, T. Mussweiler, and H. Plessner, “Olympic medals as fruits of comparison? Assimilation and contrast in sequential performance judgments”, Journal of Experimental Psychology: Applied, vol. 12, no. 3, pp. 166–178, 2006.

[9] P. Diaconis and R. L. Graham, “Spearman’s footrule as a measure of disarray”, Journal of the Royal Statistical Society, vol. 39, no. 2, pp. 262–268, 1977.

[10] J. W. Emerson, M. Seltzer, and D. Lin, “Assessing judging bias: An example from the 2000 Olympic Games”, The American Statistician, vol. 63, no. 2, pp. 124–131, 2009.

[11] Fédération Internationale de Gymnastique (FIG), 2017–2020 FIG judges’ rules. Specific rules for men’s artistic gymnastics. [Online]. Available: https://www.gymnastics.sport/site/rules/rules.php#2 (visited on 08/01/2019).

[12] L. C. Findlay and D. M. Ste-Marie, “A reputation bias in figure skating judging”, Journal of Sport & Exercise Psychology, vol. 26, pp. 154–166, 2004.

[13] K. Flessas, D. Mylonas, G. Panagiotaropoulou, D. Tsopani, A. Korda, C. Siettos, A. D. Cagno, I. Evdokimidis, and N. Smyrnis, “Judging the judges’ performance in rhythmic gymnastics”, Medicine & Science in Sports & Exercise, vol. 47, no. 3, pp. 640–648, 2015.

[14] Gymnastics Member Club, Diversity study. USA Gymnastics member club online newsletter, https://usagym.org/pages/memclub/news/winter07/diversity.pdf, version Winter 2008, 2008 (visited on 08/01/2019).

[15] S. Heiniger and H. Mercier, “Judging the judges: A general framework for evaluating the performance of international sports judges”, ArXiv e-prints, Jul. 2018. arXiv: 1807.10055 [stat.AP]. [Online]. Available: https://arxiv.org/abs/1807.10055.

[16] ——, “National bias of international gymnastics judges during the 2013–2016 Olympic cycle”, ArXiv e-prints, Jul. 2018. arXiv: 1807.10033 [stat.AP]. [Online]. Available: https://arxiv.org/abs/1807.10033.

[17] M. G. Kendall, “A new measure of rank correlation”, Biometrika, vol. 30, no. 1/2, pp. 81–93, 1938.

[18] R. Kumar and S. Vassilvitskii, “Generalized distances between rankings”, in Proceedings of the 19th International Conference on World Wide Web, ser. WWW ’10, Raleigh, North Carolina, USA, 2010, pp. 571–580, ISBN: 978-1-60558-799-8.

[19] D. M. Landers, “A review of research on gymnastic judging”, Journal of Health, Physical Education, Recreation, vol. 41, no. 7, pp. 85–88, 1970.

[20] T. D. Myers, N. J. Balmer, A. M. Nevill, and Y. Al-Nakeeb, “Evidence of nationalistic bias in Muay Thai”, Journal of Sports Science & Medicine, vol. 5, no. CSSI, pp. 21–27, 2006.

[21] A. Pizzera, “Gymnastic judges benefit from their own motor experience as gymnasts”, Research Quarterly for Exercise and Sport, vol. 83, no. 4, pp. 603–607, 2012.

[22] A. Pizzera, C. Möller, and H. Plessner, “Gaze behavior of gymnastics judges: Where do experienced judges and gymnasts look while judging?”, Research Quarterly for Exercise and Sport, pp. 1–8, 2018.

[23] H. Plessner, “Expectation biases in gymnastics judging”, Journal of Sport and Exercise Psychology, vol. 21, pp. 131–144, 1999.

[24] H. Plessner and E. Schallies, “Judging the cross on rings: A matter of achieving shape constancy”, Applied Cognitive Psychology, vol. 19, pp. 1145–1156, 2005.

[25] D. G. Pope, J. Price, and J. Wolfers, “Awareness reduces racial bias”, Management Science, 2018. DOI: 10.1287/mnsc.2017.2901. [Online]. Available: https://doi.org/10.1287/mnsc.2017.2901.

[26] R. Popovic, “International bias detected in judging rhythmic gymnastics competition at Sydney-2000 Olympic Games”, Facta universitatis-series: Physical Education and Sport, vol. 1, no. 7, pp. 1–13, 2000.

[27] J. Price and J. Wolfers, “Racial discrimination among NBA referees”, The Quarterly Journal of Economics, pp. 1859–1887, 2010.

[28] A. Sandberg, “Competing identities: A field study of in-group bias among professional evaluators”, The Economic Journal, vol. 128, no. 613, pp. 2131–2159, 2018. DOI: 10.1111/ecoj.12513. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/ecoj.12513.

[29] M.-R. G. Silva, R. Santos-Rocha, P. Barata, and F. Saavedra, “Gender inequalities in Portuguese gymnasts between 2012 and 2016”, Science of Gymnastics Journal, vol. 9, no. 2, pp. 191–200, 2017.

[30] N. Souchon, G. Coulomb-Cabagno, A. Traclet, and O. Rascle, “Referees’ decision making in handball and transgressive behaviors: Influence of stereotypes about gender of players?”, Sex Roles, vol. 51, no. 7/8, pp. 445–453, 2004.

[31] C. Spearman, “The proof and measurement of association between two things”, The American Journal of Psychology, vol. 15, no. 1, pp. 72–101, 1904.

[32] STAT Info, Bulletin de la Mission statistique du ministère de la Jeunesse, des Sports et de la Vie Associative, http://www.sports.gouv.fr/IMG/archives/pdf/statinfo04-07.pdf, version November 2004, 2004 (visited on 08/01/2019).

[33] D. M. Ste-Marie, “Expert–novice differences in gymnastic judging: An information-processing perspective”, Applied Cognitive Psychology, vol. 13, no. 3, pp. 269–281, 1999.

[34] ——, “Expertise in women’s gymnastic judging: An observational approach”, Perceptual and Motor Skills, vol. 90, pp. 543–546, 2000.

[35] E. Zitzewitz, “Nationalism in winter sports judging and its lessons for organizational decision making”, Journal of Economics & Management Strategy, vol. 15, no. 1, pp. 67–99, 2006.

[36] L. F. Zwarg, “Judging and evaluation of competitive apparatus or gymnastic exercises”, The Journal of Health and Physical Education, vol. 6, no. 1, pp. 23–49, 1935.
