An Examination of Pairwise Comparisons of College Open Ultimate Frisbee Teams
Michael Silger
11/28/2012
Under the direction of Dr. Christopher Wikle, Department of Statistics, University of Missouri
146 Middlebush Hall, Columbia, MO 65211, United States
Contents

Abstract
Acknowledgements
1 Introduction
1.1 The importance of an accurate ranking system
1.2 The current ranking method
1.3 Scrutiny of the current ranking method
1.4 An alternative approach to ranking frisbee teams
2 Data
3 Methodology
3.1 Explanation of Elo Algorithm
3.2 Changing the Update Parameter to a Function
3.3 Explanation of Assumptions for Proposed Algorithm
4 Results
5 Analysis
6 Discussion
References
Abstract

The merit of pairwise comparisons in sports is a subject of much debate today,
particularly in the ultimate frisbee community. The goal of this project is to study the
current methods employed by USA Ultimate (USAU), the governing body of college
ultimate frisbee, to see if a more robust and efficient approach to pairwise comparisons
can be discerned. In the search for a more effective ranking scheme, I applied an adaptation of Arpad Elo's chess rating algorithm, which yielded results similar to
USAU’s official rankings with a Spearman rank correlation of 0.974. My method
provides a more statistically sound approach for ranking teams based on conventional
principles established in the sport of ultimate frisbee.
Acknowledgements

I would first and foremost like to thank Dr. Christopher Wikle for assisting me on my
endeavor to explore ranking algorithms. He has been pivotal to the success of this project
as well as several statistical interests in my undergraduate and graduate career at the
University of Missouri. I would like to thank Dr. Larry Ries and Dr. Lori Thombs for
giving their time to serve on my committee. I would also like to thank Adam Gold for
assisting me in data extraction, as well as Chelsea Tossing for relentlessly editing my
paper. Lastly, I would like to thank the faculty members who have played a part in my
education and all of my friends who have helped and supported me through this project.
1. Introduction
With 450 teams participating in open division conference qualifiers last year, the
increasing popularity of ultimate frisbee has also increased the need for a reliable ranking
system. Efforts to rank ultimate frisbee teams, however, face problems compounded by both the high dimensionality of the set of teams and the nonstandard structure of the season.
A regular season does not consist of a game or two per week against conference opponents, but rather entire tournaments played over a weekend against competition from across the country. Once the regular season ends, the national qualifying system begins. Qualifying for nationals involves first competing at a conference tournament against local opponents and then advancing to a regional tournament that determines national attendance.
1.1 The importance of an accurate ranking system
The regular season encompasses all USAU sanctioned tournaments up until a deadline,
usually at the beginning of April. The goal of USAU is to be able to use the sanctioned
tournament results to accurately rank teams from across the nation. In order to be
considered in the ranking system, a team must compete in 10 games prior to sectionals
and fill out the necessary paperwork to ensure their eligibility. Prior to the beginning of
qualifying rounds, rankings are used to allocate bids to regionals and nationals. Teams
are awarded these bids based on conference and regional tournament placement.
Specifically, bids to nationals are allocated to match the number of teams from each
region ranking in the top 20 at the end of the regular season. The accuracy of the rankings
is paramount for teams ranked from 10 to 30, as those teams have the greatest potential to
benefit from an extra bid allocated to their region.
1.2 The current ranking method
The current algorithm designed to rank frisbee teams was created by Sholom Simon
(“USA Ultimate College Rankings Algorithm,” 2012). Simply put, each team receives
rating points for every game played, and those points are averaged. To reflect strength of schedule, the algorithm swaps the ratings of the two competing teams prior to the contest: the points a team receives from a game are computed from its opponent's rating. We define Rn to be the team's new rating and Ropp to be the opponent's current rating, such that

Rn = Ropp ± x, (1)

where the winner takes the positive sign and the loser the negative.
Once a game is finished, a team can gain or lose rating points based on the margin of
victory or loss. From equation 1, x is a factor that takes into account the score of a given
game as defined in equation 2:
x = max(0.66, 2.5 · LosingScore/WinningScore). (2)
A cap is in place so that once the loser's score has been doubled (that is, the winner has at least twice the loser's points), the maximum number of points gained or lost is attained.
There is a weighted decay function in place to give recent games precedence when
calculating the new rating, whereas games in the first week of competition receive the
lowest weight. The weighting is doubled for games at regionals and tripled for games at
nationals under the assumption that teams are at full strength for these contests. The
algorithm is run every Tuesday, and the new ratings from the weekend are reinitialized as
the week one starting values and run through to the current week. The reiteration is done
20 times with convergence attained around the tenth reiteration (“USA Ultimate College
Rankings Algorithm,” 2012). The reiterative process is expected to account for
underrated or overrated opponents.
While the bids to nationals are determined in early April, tournament play continues and
weekly ratings will still reflect current performances. Rating calculation ends when the
national tournament has concluded, resulting in a final ranking for the season, which is
important when considering tournaments to attend in the upcoming season.
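To make the mechanics described above concrete, the following is a minimal sketch, in Python, of the averaging-and-reiteration scheme as I understand it from the public description. It is not USAU's actual code: the per-game score factor x, the starting value, and the example games and weights are placeholders.

```python
# A minimal sketch (not USAU's actual code) of the averaging scheme described
# above. Every game contributes a game rating of Ropp + x to the winner and
# Ropp - x to the loser; a team's rating is the weighted mean of its game
# ratings, and the whole computation is reiterated so that strength of
# schedule propagates through the schedule graph.

def rate_season(games, n_iter=20, start=1000.0):
    """games: (winner, loser, x, weight) tuples, where x is the score factor
    of equation (2) expressed on the rating-point scale and weight carries
    the decay/tournament weighting (e.g., doubled at regionals)."""
    teams = {t for g in games for t in (g[0], g[1])}
    ratings = {t: start for t in teams}
    for _ in range(n_iter):
        sums = {t: 0.0 for t in teams}
        wsum = {t: 0.0 for t in teams}
        for winner, loser, x, w in games:
            sums[winner] += w * (ratings[loser] + x)   # winner's game rating
            wsum[winner] += w
            sums[loser] += w * (ratings[winner] - x)   # loser's game rating
            wsum[loser] += w
        ratings = {t: sums[t] / wsum[t] for t in teams}
    return ratings

# Hypothetical example: three teams, with the last game weighted as regionals.
games = [("A", "B", 300.0, 1.0), ("B", "C", 150.0, 1.0), ("A", "C", 250.0, 2.0)]
print(rate_season(games))
```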
1.3 Scrutiny of the current ranking method
There are a few concerns over USAU’s current methodology for ranking frisbee teams.
The first stems from the algorithm's approach to reflecting strength of schedule. While effective, this approach allows the winning team to lose rating points under two circumstances, both arising from a difference in ratings between the two competitors. An exceedingly large gap between the competitors' ratings can result in the winning team losing rating points regardless of score. Alternatively, the winner may lose rating points if the opponent scores more points than expected.
The second concern is the mercy threshold that initiates once the losing team’s score has
been doubled. If a game is played to 15, winning 15-0 is considered equal to winning 15-
7. Considering the loss of information that occurs for teams with shorter seasons, this is
an oversight; the fewer games a team plays, the more vital it becomes to include every
piece of information available in their rating. The mercy threshold may also dilute a
team’s true rating by undervaluing large margins of victory.
A third concern is that the algorithm disregards forfeits, and ratings are computed as if
the game had never been played. This allows teams to abuse the system by protecting
themselves if they believe their rating will suffer from a loss. If a team on the cusp of the
top twenty is having an off weekend, they are able to forfeit without any repercussions.
Conversely, teams playing well that are forfeited against lose the chance to gain rating
points. This basic flaw in the rating system can undermine the competitive nature of the
sport by discouraging game play.1

1 The third concern discussed here is a specific problem referred to as “gaming the rankings” in the ultimate frisbee community.
The final concern is the method by which the rating points gained for the margin of victory are calculated. The calculation forms a concave curve that disagrees with the fundamental
nature of ultimate frisbee. A close game should indicate that the teams are similar in
strength, and each additional point scored by the loser becomes more valuable. However,
under the USAU model, the marginal value of the opponent’s points scored on the winner
increases as the number of points scored decreases. Figure 1 demonstrates the concave
curve for the USAU point allocation system. There are a finite number of scores that exist
in a frisbee game and Table 1 captures the marginal value for each score possible. As
seen in Table 1, the less competitive the game, the higher the marginal value of each
point becomes until the mercy threshold takes effect. A more intuitive model would have
a convex function in which the marginal value for each point scored increases as the
losing team scores more points. The USAU rating points calculator also has marginal
values that carry a large weight when compared to the surrounding values. There is no
logical reason for the large jump in values found around the mercy threshold seen in
Table 1.
Figure 1: A graph giving the USAU point allocation scheme assuming the winning score is 15.
Table 1: A table showing the marginal difference in the USAU rating points gained. The left column gives the two scores being compared; the column heading gives the winner's score.
Scores      3    4    5    6    7    8    9   10   11   12   13   14   15   16   17
16-15                                                                            25
15-14                                                                       27
14-13                                                                  29
13-12                                                             32   37
12-11                                                        36   41   48
11-10                                                   40   47   54   62
10-9                                               45   54   63   74   84
9-8                                           52   63   76   89  103  118
8-7                                      62   77   93  110  129  116   44
7-6                                 75   96  118  143  136   54    0    0
6-5                            96  125  158  162   68    0    0    0    0
5-4                      130  176  196   88    0    0    0    0    0    0
4-3                 194  246  116    0    0    0    0    0    0    0    0
3-2            322  162    0    0    0    0    0    0    0    0    0    0
2-1       246    0    0    0    0    0    0    0    0    0    0    0    0
1-0         0    0    0    0    0    0    0    0    0    0    0    0    0
1.4 An alternative approach to ranking frisbee teams
A different approach to ranking frisbee teams should be considered using a model that
minimizes the concerns raised by the USAU algorithm. The model should reflect the
same principles exhibited by participants in ultimate frisbee, such as the variability of a
team’s performance. Variability can be captured using statistical models; one of the first
models developed for pairwise comparisons was created by Arpad Elo for the game of
chess. Elo's (1978) method measures pairwise comparisons of high dimensionality across a given time period and calculates a rating of relative strength for each competitor.
Elo model and its extensions can be seen in more complex models such as Glickman’s
(1993) Glicko rating system. The biggest difference between the two models is that the
Glicko system includes an estimate of reliability in the rating of a contender. I chose to
use the Elo method for its simplicity in implementation and programming.
For the game of chess, Elo (1978) starts off each competitor with a provisional rating
period of about 30 games. Once the provisional period ends, a player is given a rating reflective of their performance over that span of time. That rating follows a distribution
where the mean is equal to the rating and the variance is a measure of reliability; Elo
(1965) and McClintock (1977) show that many performances of an individual will be
normally distributed on an appropriate scale. The player rating is updated after a given
time period on a continuous or periodic basis, and will eventually converge to its true
strength rating. An unexpected result could be the consequence of a statistical fluctuation
or an actual change in the player’s ability; thus, new performances are weighted based on
how much importance is given to past performances. The weight is a measure of
reliability for a team’s distribution. I use the Elo algorithm as the basis for my rating
algorithm with some minor modifications to the assumptions in the model. The specifics
of the Elo algorithm and my adjustments to its assumptions will be discussed in detail in
the methodology section.
2. Data
The data used in this analysis were collected from www.usaultimate.org in the month of
June 2012 and represent the 2012 USAU open sanctioned tournaments. The data were not
readily accessible and had to be harvested using AutoHotKey, a program that allows the
user to automate keystrokes. Using AutoHotKey, I was able to copy all of the information
on the page for each competitor and paste it into a text file. I then converted the text file
to an Excel document and parsed the data to obtain the information necessary to create two datasets. The first dataset corresponded to all of the teams listed on the RRI webpage (450 teams), while the second dataset included only USAU sanctioned teams (371 teams). It was necessary to evaluate two separate datasets to obtain a strict
comparison between the proposed algorithm and the USAU algorithm. A comparison is
also done between the USAU sanctioned teams and teams that competed in USAU
sanctioned tournaments to determine if the exclusion of additional teams affects the final
rankings. Similar results are expected when comparing the two datasets. Irrelevant
information was omitted; for instance, several outcomes were listed as F-F or F-L, neither of which carries any information about the winner of a game. Game outcomes with only a single score reported (shown, for example, as 9-_) were also excluded. A sample of the data set is
included in Figure 2.
Figure 2: A sample of the data set. The column labeled RatePer represents the rating period in which the contest took place; a RatePer of 1 corresponds to the first week of competition.

WinningTeam       LosingTeam         WinnerScore  LoserScore  RatePer  Tournament
MississippiState  MississippiStateB  11           1           1        Cowbell Classic 2012
MississippiState  FloridaStateB      11           3           1        Cowbell Classic 2012
MississippiState  Mississippi        11           8           1        Cowbell Classic 2012
MississippiState  Auburn             11           6           1        Cowbell Classic 2012
MississippiState  Mississippi        13           10          1        Cowbell Classic 2012
Mississippi       FloridaStateB      11           3           1        Cowbell Classic 2012
Mississippi       MississippiStateB  11           1           1        Cowbell Classic 2012
Mississippi       Rhodes             11           1           1        Cowbell Classic 2012
Mississippi       Auburn             10           7           1        Cowbell Classic 2012
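As an illustration of this cleaning step, the sketch below filters hypothetical rows laid out as in Figure 2; the "F" and "_" codes are assumptions standing in for however the forfeit and single-score outcomes were encoded.

```python
import pandas as pd

# Hypothetical rows in the Figure 2 layout; forfeits are assumed to be coded
# with "F" scores and missing results with "_", per the 9-_ example above.
raw = pd.DataFrame(
    [
        ("MississippiState", "Mississippi", "11", "8", "1", "Cowbell Classic 2012"),
        ("MississippiState", "Auburn", "F", "F", "1", "Cowbell Classic 2012"),
        ("Mississippi", "Rhodes", "9", "_", "1", "Cowbell Classic 2012"),
    ],
    columns=["WinningTeam", "LosingTeam", "WinnerScore", "LoserScore", "RatePer", "Tournament"],
)

# Drop F-F / F-L outcomes and games with only one reported score.
bad = raw["WinnerScore"].isin(["F", "_"]) | raw["LoserScore"].isin(["F", "_"])
clean = raw[~bad].copy()
clean[["WinnerScore", "LoserScore", "RatePer"]] = clean[
    ["WinnerScore", "LoserScore", "RatePer"]
].astype(int)
print(clean)
```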
3. Methodology
3.1 Explanation of Elo Algorithm
Elo (1978) defines a proper rating system as one that can effectively rank teams and
provide a measure of the relative strength of competitors, however strength may be
defined. The initial assumption in the model is that each competitor’s rating follows a
normal distribution. There are different approaches proposed by Elo (1978) for instituting
pairwise comparisons. I chose to use his method of continuous rating updates, as I will be
calculating the rating of each competitor on a weekly basis. The continuous rating
method is given by the formula:
Rn = Ro + K(W − We), (3)

where Rn is the new rating and the mean of the player's distribution, Ro is the old rating, K is the weighting function, and W is a binary variable taking the value 0 for a loss or 1 for a win. In addition, We for a particular team is the expected value of winning a game against some rated opponent, with rating Ropp, given by:

We = 10^(Ro/400) / (10^(Ro/400) + 10^(Ropp/400)). (4)
From (3), the new rating is obtained by adding points to or subtracting points from the old rating, depending on the game outcome. Because W is binary, the winner of the game will always gain
points and the loser of the game will always lose points. This is important because,
regardless of the score of a game, the winning team should not lose rating points. Another
central aspect of the Elo model is that part of the rating update is dependent on the
expected outcome of a game. If a team is rated far higher than its opponent, it is reflected
in the expected value calculation. Therefore, the winner of the contest will gain points,
but a high expected value for winning will result in a low point gain. This accurately
portrays the construct of an ultimate frisbee tournament, as highly rated teams will play
lower rated competition in pool play2. This can also deter teams from playing only low
rated competition to build a ranking that may not accurately represent their strength.

2 Pool play is a round robin style tournament of a small group of teams, usually four or five, often taking place on the first day of competition in a 2-day tournament.
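Equations (3) and (4) translate directly into code. The sketch below is mine, not part of the original algorithm description; K is held at the constant baseline value of 24 discussed later in Section 3.2.

```python
# Sketch of equations (3) and (4); the function names are mine, and K is held
# at a constant baseline value of 24.

def expected_win(r_old, r_opp):
    """Equation (4): expected score for a team rated r_old against r_opp."""
    return 10 ** (r_old / 400) / (10 ** (r_old / 400) + 10 ** (r_opp / 400))

def elo_update(r_old, r_opp, won, k=24):
    """Equation (3): won is 1 for a win and 0 for a loss."""
    return r_old + k * (won - expected_win(r_old, r_opp))

# A 1600-rated team beating a 1500-rated opponent gains only about 8.6
# points, because the win was already expected.
print(elo_update(1600, 1500, won=1))  # ~1608.6
```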
During the season, Rn will move toward the mean of its normal distribution. The variance for this normal distribution is needed to model the consistency of each team and, under Elo's methodology, reflects the weight of a performance for a time period. In this case, K is defined as a weight for blending the new result with the previous rating. Under the Elo model, this weight is between 10 and 32, with greater values giving more weight to recent results.
As a modification to the standard algorithm, I wanted to incorporate the score differential
into the model because it contains a significant amount of information. In an attempt to
include the game score in the algorithm, I looked at the model from a Bayesian
perspective with a normal likelihood and a normal prior. Each team’s rating is assumed to
be the result of a normal distribution, so the justification for a normal likelihood is
obvious. The prior distribution quantifies our a priori understanding of the unobservable quantities of interest (Wikle and Berliner 2007) and should also be considered normal. The posterior
mean can then be written as:
E[X | y] = μ + (τ² / (τ² + σ²/n)) (ȳ − μ) = μ + K(ȳ − μ). (5)

The prior mean (μ) is adjusted toward the sample mean estimate (ȳ), where τ² is the variance of the prior, σ² is the variance of the sample estimate, and n is the number in
the sample. The K function in (5) is a ratio of the variances from the likelihood function
and the prior information. This is very similar to our Elo algorithm and serves as the
motivation for a different update function for K.
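For instance, if τ² = σ²/n, then K = 1/2 and the updated mean falls exactly halfway between the prior mean and the new observation; as the prior becomes more reliable relative to the data (τ² small), K shrinks toward zero and a single surprising result moves the rating very little.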
The ratio of variances in the context of rating teams is hard to define. The variance is understood as the under- and over-performance of a team, but there is no mathematical measure showing that a team has over- or under-performed aside from differences between ratings and opinion. Essentially, the ratio of variances weights the reliability of the new information using a priori knowledge. The score of the game is included in that a priori knowledge, leading to the decision to incorporate the score differential into the weighting update function.
3.2 Changing the Update Parameter to a Function
When considering the updated weighting function, I first tried to avoid some of the errors
I believe to be present in the USAU algorithm. The primary issue with their updating
scheme is that the curve of the points gained is concave; the value of scoring a point on
the eventual winner decreases as the game becomes closer. In order to correct this, I
looked at a function created by Murray (2012):
K = UnvPtWin + (TotalPt − UnvPtWin) · (1 − p^(diff−1)) / (1 − p^(WinScore−1)), (6)

where UnvPtWin = the points awarded for winning on universe point3,
TotalPt = the total points possible awarded,
p = the percentage of points awarded,
diff = the score differential of the game,
WinScore = the points the winner scored.
The proposed allocation scheme corrects the concavity of the K weighting function as
seen in Figure 3. Figure 3 also illustrates that there is no cutoff value for beating a team,
instead taking into account all information indicated by the final score of the contest.
Table 2 shows that the marginal point allocation increases as the game becomes closer, and there are no unexpected outlying values. Forfeits are also included in the model and treated as the maximum number of points gained for the winner. I allotted 200 total points, with 50 points awarded for a universe point win and a p of 0.80. These values were chosen from a subjective assessment of point allocation in order to clearly discern differences in the score; they also scale the point allocation down relative to the USAU algorithm, avoiding over-inflated K weights and helping to retain the normality assumption. As a baseline against which to check the change from a weighting parameter to a weighting function, results were also calculated with K fixed at 24.
3 Universe point occurs when two teams are tied and the next point will win the game for either team.
Figure 3: The rating point allocation scheme for the proposed algorithm, assuming the winning score is 15.
Table 2: A table showing the marginal difference in the proposed rating points gained. The left column gives the two scores being compared; the column heading gives the winner's score.

Scores      3    4    5    6    7    8    9   10   11   12   13   14   15   16   17
16-15                                                                            31
15-14                                                                       31
14-13                                                                  31
13-12                                                             32   25
12-11                                                        32   25   20
11-10                                                   33   26   20   16
10-9                                               34   26   21   16   13
9-8                                           35   27   21   16   13   10
8-7                                      36   28   22   17   13   10    8
7-6                                 38   29   22   17   13   11    8    7
6-5                            41   30   23   18   14   11    8    7    5
5-4                       45   33   24   18   14   11    9    7    5    4
4-3                  51   36   26   19   15   11    9    7    5    4    3
3-2             61   41   29   21   16   12    9    7    6    4    3    3
2-1        83   49   33   23   17   12    9    7    6    4    3    3    2
1-0        67   39   26   18   13   10    8    6    5    4    3    2    2
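Under the reconstruction of equation (6) above, with the stated choices TotalPt = 200, UnvPtWin = 50, and p = 0.80, the weighting function can be sketched as follows; the sketch is illustrative, and forfeits would bypass the formula and receive the full total_pt directly.

```python
# Sketch of equation (6) as reconstructed above, with TotalPt = 200,
# UnvPtWin = 50, and p = 0.80. Forfeits bypass the formula and are assigned
# the full total_pt directly.

def k_weight(win_score, lose_score, total_pt=200, unv_pt_win=50, p=0.80):
    """A universe-point win (diff = 1) earns exactly unv_pt_win; the weight
    grows toward total_pt as the margin widens, and a shutout earns it all."""
    diff = win_score - lose_score
    frac = (1 - p ** (diff - 1)) / (1 - p ** (win_score - 1))
    return unv_pt_win + (total_pt - unv_pt_win) * frac

print(k_weight(15, 14))  # universe point: 50.0
print(k_weight(15, 10))  # moderate margin: ~142.6
print(k_weight(15, 0))   # shutout: 200.0
```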
3.3 Explanation of Assumptions for Proposed Algorithm
A continuous rating system updates a new rating after a single game; subsequently, each
new rating is blended into the old rating. Elo (1978) proposes different methods of
blending, but I discuss an alternative approach below. Elo (1978) suggests that long
events should be divided into ratable segments for each application of (3). My
methodology considers the entire weekend tournament to be a segment, treating games
played on Saturday as equivalent to those taking place on Sunday. This assumption has
some weaknesses, but I believe that the merit outweighs the drawbacks. A typical 2-day
tournament seeds each team attending according to their perceived strength before either
pool play or immediate bracket play commences. The results of pool play are then
used to reseed the teams and put them into bracket play for varying places on Sunday.
It is intuitive that pool play and bracket play are dependent on one another, but I believe
that teams performing to expectations will be playing similarly rated competition for
placement games. If a team over- or under-performs, it is reflected in the results of individual games as opposed to the final placement from the weekend. Therefore, each game is treated as independent of the next.
Another consideration when examining the structure of ratable segments is the order of
the games played and how sequence can affect a team’s rating. For instance, if a team is
overrated and loses to a low rated opponent in the first game, the low rated opponent
reaps the reward of playing the over-ranked opponent first. In an effort to combat rating
points gained as a result of the order of the games, I randomly sample matches at each
tournament without replacement. I believe this will nullify any order-imposed gains a
team may receive. I sample without replacement 100 times, which I believe is sufficient, and then blend the 100 ratings by taking the mean rating for each team, which yields the new rating (Rn). The integrity of the normality assumption of the rating is still preserved, and any rating points gained from the order of games are foregone.
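A sketch of this order-randomization step, reusing the elo_update function from the earlier sketch (a constant k stands in for the K weighting function of Section 3.2):

```python
import random

# Sketch of the order-randomization step, reusing elo_update from the earlier
# sketch; a constant k stands in for the K weighting function of Section 3.2.

def tournament_rating(pre_ratings, games, n_samples=100, k=24):
    """Replay the weekend's games in n_samples random orders and average each
    team's resulting rating; the mean is the blended new rating Rn."""
    sums = {t: 0.0 for t in pre_ratings}
    for _ in range(n_samples):
        r = dict(pre_ratings)                    # fresh pre-tournament copy
        for winner, loser in random.sample(games, len(games)):
            new_w = elo_update(r[winner], r[loser], won=1, k=k)
            new_l = elo_update(r[loser], r[winner], won=0, k=k)
            r[winner], r[loser] = new_w, new_l   # both updated from pre-game values
        for t in sums:
            sums[t] += r[t]
    return {t: s / n_samples for t, s in sums.items()}
```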
The Elo algorithm requires a provisional rating period of 30 games before a participant is
given a rating. Many teams cannot satisfy this condition, as USAU only requires a team
to participate in ten games to be allowed in the ranking system. Therefore, in lieu of using
a provisional rating period, a reiterative process similar to USAU’s algorithm is used.
Reiteration helps take into account the opponent’s strength at the time of the match. The
process helps to avoid over- and under-ranked teams by rerating the previous encounters with the inclusion of new information provided by the new rating (Rn). Once a ratings
period is completed (say R1), that week will be used as the initial values for the first
week. The new initial values will then be computed up through the current ratings period
(say R1new). The rank orders of the teams under R1 and R1new are then compared using the Spearman rank correlation. If the correlation is above 0.99, the process continues on to the next week, where the reiteration repeats. The Spearman rank
correlation is believed to help reduce the number of reiterations required. If the rank
order of teams becomes stable, then there is no reason to reiterate and unnecessarily
spread the ratings of teams or waste computing time. The Spearman rank correlation is
the most intuitive correlation measure because we are concerned with the ranked order of
the teams and therefore it provides a measure of correlation between iterations. In order
to check the robustness of the reiteration process, Kendall's tau could be considered.
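The stopping rule can be sketched with SciPy's Spearman correlation; the 0.99 threshold follows the text, while the cap of 20 passes is an assumption borrowed from the USAU scheme, and rerate is a placeholder for re-running the season from re-initialized values (R1 to R1new).

```python
from scipy.stats import spearmanr

# Sketch of the stopping rule. The 0.99 threshold follows the text; the cap
# of 20 passes is an assumption borrowed from the USAU scheme. `rerate` is a
# placeholder mapping a ratings dict to the re-run ratings dict (R1 -> R1new).

def reiterate_week(initial, rerate, threshold=0.99, max_passes=20):
    teams = sorted(initial)
    current = initial
    for _ in range(max_passes):
        new = rerate(current)
        rho, _ = spearmanr([current[t] for t in teams],
                           [new[t] for t in teams])
        current = new
        if rho > threshold:   # rank order has stabilized; move to next week
            break
    return current
```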
4. Results
In this section, a sample of only 30 teams is included due to the high dimensionality of
the data set and the belief that these teams have the best chance to attend nationals.
Table 3: The "Parameter K" headings reference K held constant under the original logic of Elo (1978), while the "Function K" headings reference the score differential weighting function created by Murray (2012). The dataset used for each algorithm is denoted in parentheses. Lastly, the PR column is the final rating calculation produced by each algorithm and dataset.

Rank | USAU Algorithm               | Parameter K (RRI Dataset)    | Parameter K (USAU Dataset)   | Function K (RRI Dataset)     | Function K (USAU Dataset)
1    | Pittsburgh 1852              | Oregon 2528                  | Oregon 2366                  | Pittsburgh 3122              | Pittsburgh 3001
2    | Oregon 1829                  | Pittsburgh 2453              | Pittsburgh 2290              | Wisconsin 2986               | Oregon 2864
3    | Wisconsin 1786               | Wisconsin 2396               | Wisconsin 2227               | CarletonCollege 2984         | CarletonCollege 2860
4    | CarletonCollege 1769         | CarletonCollege 2298         | CarletonCollege 2129         | Oregon 2984                  | Wisconsin 2859
5    | Tufts 1740                   | Minnesota 2264               | Tufts 2105                   | Tufts 2797                   | Tufts 2674
6    | Iowa 1692                    | Tufts 2257                   | Minnesota 2098               | Minnesota 2762               | Minnesota 2638
7    | Minnesota 1674               | CentralFlorida 2227          | CentralFlorida 2077          | CentralFlorida 2737          | CentralFlorida 2609
8    | Colorado 1655                | Texas 2116                   | Texas 1964                   | Luther 2651                  | Luther 2527
9    | CentralFlorida 1648          | California 2091              | NorthCarolina 1941           | California 2601              | California 2477
10   | Michigan 1616                | NorthCarolina 2091           | California 1933              | TexasAM 2594                 | TexasAM 2473
11   | California 1612              | Luther 2087                  | TexasAM 1929                 | Colorado 2587                | Colorado 2463
12   | NorthCarolina 1600           | Stanford 2077                | Luther 1928                  | Stanford 2578                | Stanford 2453
13   | Luther 1594                  | Colorado 2075                | Stanford 1923                | Texas 2553                   | Texas 2431
14   | Texas 1588                   | TexasAM 2075                 | Ohio 1916                    | Iowa 2550                    | Iowa 2423
15   | Washington 1578              | Ohio 2056                    | Colorado 1915                | NorthCarolina 2544           | NorthCarolina 2417
16   | Stanford 1566                | Iowa 2037                    | Iowa 1884                    | Michigan 2507                | Michigan 2378
17   | Whitman 1550                 | MichiganState 2008           | MichiganState 1860           | Ohio 2494                    | Ohio 2367
18   | Connecticut 1542             | Michigan 1992                | Connecticut 1848             | MichiganState 2430           | MichiganState 2301
19   | Illinois 1530                | Connecticut 1983             | Michigan 1845                | OhioState 2414               | OhioState 2288
20   | MichiganState 1527           | Washington 1982              | GeorgiaTech 1835             | GeorgiaTech 2402             | GeorgiaTech 2280
21   | Ohio 1527                    | GeorgiaTech 1981             | NorthCarolinaWilmington 1826 | NorthCarolinaWilmington 2400 | NorthCarolinaWilmington 2277
22   | Vermont 1521                 | NorthCarolinaWilmington 1969 | SouthCarolina 1824           | Washington 2397              | Washington 2273
23   | OhioState 1516               | SouthCarolina 1967           | Washington 1824              | Connecticut 2374             | Connecticut 2248
24   | TexasAM 1512                 | Florida 1964                 | OhioState 1822               | Florida 2364                 | Florida 2243
25   | NorthCarolinaWilmington 1505 | OhioState 1959               | Florida 1810                 | Illinois 2350                | Illinois 2229
26   | Georgia 1499                 | Illinois 1939                | Illinois 1787                | SouthCarolina 2339           | SouthCarolina 2217
27   | Kansas 1497                  | Georgia 1906                 | Georgia 1758                 | Georgia 2311                 | Georgia 2184
28   | Dartmouth 1488               | Vermont 1894                 | Vermont 1756                 | Vermont 2264                 | Vermont 2143
5. Analysis
Due to the inherent subjectivity of rating schemes, there is no single best or most efficient
strategy to creating a rating system. The first way to recognize whether a rating scheme is
viable is to evaluate whether or not it agrees with the common belief held by the group
that is being rated. The second evaluation would use the rating scheme as a predictive
tool to see if the model produces results similar to those that occur in the real world. The
second evaluative technique was not the focus of this paper and is a measure that can be
explored at a later time. It should be noted that direct comparison of the PR measure
across datasets and rating schemes is not accurate because of the difference in
dimensionality and the rating update method. Also, when comparing the USAU
Algorithm to “Function K” and “Parameter K” values, only the USAU dataset can be
used. To understand how closely related all of the results are, the correlations between them can be seen in Table 4.
Results Compared                             Correlation
USAU Algorithm vs. Function K                0.974
USAU Algorithm vs. Parameter K               0.868
Parameter K vs. Function K (USAU dataset)    0.986
Parameter K vs. Function K (RRI dataset)     0.991

Table 4: The Spearman rank correlation is used to compare the datasets and the different weight update methodologies.
The RRI dataset was included to assess whether a difference existed between the final
rankings in the top 30 teams of the competitors in USAU sanctioned tournaments and
USAU sanctioned teams. In the “Parameter K” top 30 results, minor changes in the order
of teams can be observed. In the case of the “Function K” results, only two teams
(Oregon and Wisconsin) differ in rank between the two datasets. This would suggest that
the excluded teams are likely lower level teams that have little impact on the teams vying
for a spot at nationals. Next, the “Parameter K” model was included to provide evidence
that a change in weighting scheme is appropriate by giving baseline results for ranking
teams. From Table 4, a correlation of 0.986 and 0.991 between the “Parameter K” and
“Function K” results is high enough to suggest that changing the weight function to
reflect score is appropriate. When comparing the overall outcomes from the USAU
algorithm with my proposed algorithm of “Function K”, the resulting Spearman rank
correlation stands at 0.974; this high level of correlation suggests that the results are very
similar. In conclusion, based on a subjective assessment of my final results and the high
correlation measure observed, I believe I have developed a sound rating method.
6. Discussion
In this section, I will be discussing specific details of the USAU algorithm and my
“Function K” algorithm for the USAU dataset, along with ideas for future work. I believe
I have established a quicker method to rate teams when compared to the USAU
algorithm. The USAU algorithm uses 400 iterations while my algorithm used only 122
iterations. The high correlation threshold allows my program to continue to the next week
if there is insignificant change to the order of the teams. This is different from the USAU
approach because their rating values have the ability to converge to a number while mine
do not. Convergence is attained because winning teams may still lose rating points,
whereas I adjusted the assumption so that winners consistently gain rating points. I
cannot comment on the statistical approach for the USAU algorithm because there is no
accessible formal paper detailing Sholom Simon's approach.
Burruss (2012) explains that for a team to attain a high rating under the USAU algorithm,
all it needs to do is win. Simply put, he explains that strength of schedule has little
bearing on a team’s ability to rank in the top twenty, although the team cannot solely play
weak competition as explained in Section 1.3. This idea is conveyed by Table 3 and will
be specifically discussed in the cases of Whitman, Iowa, and Texas A&M. Whitman is a
textbook example of what was described earlier as “gaming the rankings.” They are
within the top 20 in the USAU algorithm, yet outside the top 30 in my algorithm at the
end of the year; I believe this is largely due to the inclusion of forfeits in my model.
Whitman was able to play on par with several of the elite college teams, but at two
tournaments where their rating was in jeopardy, they decided to forfeit. As previously
discussed, I included forfeits in my calculations because neglecting them discourages
game play. Iowa and Texas A&M are similar when discussing the effects of winning on a
team’s rating for the USAU algorithm. They earned high ratings by simply winning
games, many by large margins, and had regular season records of 23-5 and 29-2
respectively. The final ratings of Iowa and Texas A&M fell within my top twenty, and
reflect that my algorithm also values winning as a determinant of higher ratings.
The last part of the discussion will focus on future analyses of this methodology and
other approaches of interest. Ideally, I would establish a less arbitrary means of choosing
the parameters for my proposed function. While it would be possible to estimate these
values, there is still a measure of subjectivity because the true rank of each team is
unknown. Using values between 10 and 30, I could model the K function based on the
volatility of a team and the score of a game. While this would preserve Elo’s normality
assumption, I believe that the results would be similar to my findings, albeit rescaled. I
would like to be able to check the accuracy of my method by predicting game outcomes
at nationals. The only way I can currently accomplish this is through the “Parameter K”
method, as I do not have priors to predict the score of each team. I believe it would be
possible to create such a model, and my current research reflects a good starting point for
this. Next, I would like to do some research in the field of network analysis. I believe that
network analysis would be very promising when discussing the bid allocation process for
regional qualifiers and the national tournament. However, I think it would be difficult to
utilize in rating teams due to the structure of the regular season. Network analysis relies
on using clusters to rank its components, but teams rarely play within their entire
conference or region prior to qualifying tournaments. This concern is purely speculative,
and necessitates further research into different methods of pairwise comparisons.
References
Burruss, L. (2012). Rankings Under the Hood. Skyd Magazine. Retrieved from http://skydmagazine.com/2012/03/rankings-under-the-hood/

Elo, A. E. (1965). Age Changes in Master Chess Performances. Journal of Gerontology, 20, 289-299.

Elo, A. E. (1978). The Rating of Chess Players, Past and Present. New York: Arco Publishing.

McClintock, W. (1977). Statistical Studies of the Elo Rating System, 1974-77. Report to USCF Policy Board, privately produced.

Murray, T. (2012, September 17). An Interpretation and Critique of the USAU Ranking Algorithm. Skyd Magazine. Retrieved from http://skydmagazine.com/2012/09/an-interpretation-and-critique-of-the-usau-ranking-algorithm/

USA Ultimate. (2012). USA Ultimate College Rankings. Retrieved from http://www.usaultimate.org/competition/college_division/college_season/college_rankings.aspx

Wikle, C., & Berliner, M. (2007). A Bayesian tutorial for data assimilation. Physica D, 230, 1-16.