MATH390 Group Report - Statistical Modelling of English Premier League Position

Jack Waudby, Jack O’Reilly, Ben Sichna, Ben Asmah, Sophie Hobley and Henry Sloan

November 7, 2014

Abstract

In the 2012/13 season the English Premier League recorded a total revenue of $2.9 billion, making the UK’s top football league the most lucrative in the world. This project investigates how to devise a logistic regression model based on three explanatory variables (match attendance, transfer spending, and total wage bill) in order to estimate the probability of each Premier League team finishing in the top six next season. Ultimately the goal is to predict the results table of the next season. By collecting data from previous seasons and analysing it thoroughly, this model can be created and its effectiveness tested by calculating a preliminary Premier League table for the 2013/14 season and comparing it with the actual table. In summary, many people would love to be able to predict the Premier League standings; this project explores some of the statistical methods that may make this a possibility.


Contents

1 Introduction

2 Literature Review

3 Methods
    3.1 Data Collection
    3.2 Linear Regression
    3.3 Logistic Regression
    3.4 Model Selection
    3.5 Model Diagnostics

4 Results
    4.1 Model for predicting Premier League position
    4.2 Competitive Balance: How Competitive is the English Premier League?
        4.2.1 How is Competitiveness measured?
        4.2.2 Analysis of the Competitive Balance Ratio
        4.2.3 Further Points on Competitiveness
    4.3 Liverpool and Chelsea Case Study: Does transfer spending equal success?
        4.3.1 Effects of Transfer Spending
        4.3.2 Effects of Champions League involvement
        4.3.3 Analysis using Logistic regression
        4.3.4 Summary of findings
    4.4 Analysis of the Manchester United Dynasty
        4.4.1 Managerial Stability
        4.4.2 Ferguson Era
        4.4.3 2013/14 Season
    4.5 Diagnostics

5 Conclusion
        5.0.1 How the model can be improved
        5.0.2 Concluding Remarks

A Bibliography

B Data

C R-code


1 Introduction

The aim of the project was to investigate how different factors within a Premier League team affect its overall performance in a season. Establishing a relationship between, say, wages and overall performance could then provide a resource for predicting where in the league different teams would finish. Initially the project involved gathering large amounts of data on past transfer spending, average attendance and player wages, which were factors that seemed likely to have a significant impact on a team’s performance, and then data on league positions, which was how team performance was measured. Firstly, data from before the 2013/14 season was gathered and used to generate a prediction, which could be compared with the actual league table for 2013/14 to test how accurate the prediction was. Whilst it seemed likely that the three factors in our initial model are important for team performance, this does not mean that there are no other factors to consider, and these would be relevant to explore before the model could be used to make predictions about future seasons.

As it was an important part of the project that other influences were investigated and considered, minor case studies were completed, investigating parts of the prediction which were inaccurate or seemed counter-intuitive. We put a specific focus on transfer spending, which was shown to be not significant in the initial model even though there did seem to be some correlation, so a case study comparing Chelsea and Liverpool was done to investigate whether transfer spending had any significant impact which was simply not demonstrated in the initial model. A second case study looked at Manchester United, investigating the effect of a change in manager, as well as the effect of a manager who led to significant improvement in a team’s performance. This was done because the model predicted that Manchester United had a very high chance of ending up in the top six and then they did not, so as an outlier it was interesting to investigate the cause. A final section was devoted to competitiveness within the league, looking at how much upward and downward movement there was within the league tables, and therefore how much past performance alone could be used to predict future performance. Throughout the report, logistic regression was primarily used to estimate coefficients showing how much impact each of the factors had on league positioning.

2 Literature Review

Football is one of the world’s most popular sports, if not the most popular. FIFA’s flagship event, the World Cup, broke several viewing records in 2014, and Deloitte’s Sports Business Group (2014) estimates the value of the top twenty European football clubs to be in excess of €5.4 billion. Dixon and Coles (1996) discussed the surprising lack of statistical modelling for Association Football final league positions, especially compared to the USA National Football League. Lee (1997) demonstrated that determining the final league position of a football team using likelihood inference is NP-hard; this may explain why some statisticians are discouraged. Football is clearly a game of both skill and chance, therefore analysis would involve intensive calculations of both deterministic and stochastic elements, as highlighted by Mode and Sleeman (2009). Furthermore, Hill (1973) highlighted the complexities involved in modelling Association Football and debated the merits of the opposing schools of inference, likelihood versus Bayesian.

Both likelihood and Bayesian approaches have produced useful research. Fahrmeir & Tutz (1994) and Brillinger (2009) studied similar ordinal-valued time series data, looking at the German Bundesliga and the Chinese Super League respectively, using advanced extensions to the generalised linear model, whilst Karlis and Ntzoufras (2009) adopted the Bayesian approach and used data from the English Premier League.

In recent years the popularity of the game and the finance it generates, particularly in the gambling industry, have fuelled greater amounts of research. However, this trend has moved research away from final league standings and towards in-game elements which have a more stochastic nature, such as Titman et al. (2014), who developed a multivariate counting process to predict goals and bookings as a football match progressed.

Also, a final thank you to Dr Robert Simmons of the Lancaster University Economics department for providing the data for wages and transfer fees used throughout this report.


3 Methods

3.1 Data Collection

To collect the data required for this project, data was scraped from websites using the XML package for R. XML (Extensible Markup Language) is a markup language designed to describe data. The information in an XML document is wrapped in tags, and other software must be used to display or send it. The XML package allows you to “parse” an XML document and navigate it to find the information required.
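The parse-and-navigate step can be illustrated with a short sketch. The report used the XML package for R; the example below instead uses Python's standard-library `xml.etree.ElementTree`, and the tag and attribute names in the sample document are invented purely for illustration (they are not the structure of any site the authors scraped).

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment: the element and attribute names are assumptions
# made for this sketch, not the layout of a real data source.
SAMPLE = """
<table season="2013/14">
  <team position="1"><name>Manchester City</name><points>86</points></team>
  <team position="2"><name>Liverpool</name><points>84</points></team>
</table>
"""


def parse_table(xml_text):
    """Parse the XML text and return one dictionary per <team> element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for team in root.findall("team"):
        rows.append({
            "position": int(team.get("position")),
            "name": team.findtext("name"),
            "points": int(team.findtext("points")),
        })
    return rows


rows = parse_table(SAMPLE)
for row in rows:
    print(row["position"], row["name"], row["points"])
```

Each tag is reached by name once the document has been parsed into a tree, which is the same idea as navigating the parsed document in R.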

3.2 Linear Regression

A linear regression model is a statistical model for linear relationships between variables, given by

$$Y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} + \varepsilon_i \quad \text{for } i = 1, \dots, n. \tag{1}$$

As shown in equation (1) a linear regression model has four components.

• Response variables, $Y_i$. These are random variables, which depend on their corresponding explanatory variables $x_{i,1}, \dots, x_{i,p}$.

• Explanatory variables, $x_{i,1}, \dots, x_{i,p}$. These are non-random and influence the outcome of the response variables. Each can be either a covariate (quantitative) or a factor (qualitative). The possible values of a factor are known as levels and are represented by indicator variables.

• Regression coefficients, $\beta_1, \beta_2, \dots, \beta_p$. $\beta_j$ describes the effect of the $j$th explanatory variable on the expected value of the response variable.

• Regression residuals, $\varepsilon_i$. These are independent and identically distributed and follow a normal distribution with mean zero.

Hence, a linear regression model shows how the expected value of a response variable changes linearly with the value of some explanatory variables. Furthermore, it is useful for the description and prediction of a physical process, as will be seen in this report. It is also worth noting that, due to the assumption that the residuals are normally distributed, it is the normal linear regression model which has been used in this project.

3.3 Logistic Regression

A logistic regression model is a statistical model very similar to a linear regression model. It differs in that a linear regression model predicts the change in a dependent (response) variable for a unit change in some independent (explanatory) variable, whereas a logistic regression model estimates the probability of an event happening (1) or not happening (0). As the relationship between the response and the explanatory variables is non-linear, the logistic regression function is used to establish the relationship and give a probability. The regression coefficients, which indicate how the probability changes with a unit change in an explanatory variable, no longer have a straightforward interpretation.

$$\operatorname{logit}(\pi_i) = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} \quad \text{for } i = 1, \dots, n \tag{2}$$

Logistic Regression Function

$$P = \frac{1}{1 + e^{-\operatorname{logit}(\pi_i)}} \tag{3}$$

As you can see in equation (2) a logistic regression model has three components.

• Response variables, $Y_i$. These take the value 0 or 1, with $\pi_i = \Pr(Y_i = 1)$, and depend on the corresponding explanatory variables $x_{i,1}, \dots, x_{i,p}$.


• Explanatory variables, $x_{i,1}, \dots, x_{i,p}$. These are non-random and influence the outcome of the logistic regression function. Each can be either a covariate or a factor, whose possible values are called ‘levels’ and are represented by indicator variables.

• Regression coefficients, $\beta_0, \beta_1, \dots, \beta_p$. $\beta_0$ is the intercept, and $\beta_j$ is the change in the log-odds, $\operatorname{logit}(\pi_i)$, for every unit increase in the $j$th explanatory variable. Its effect on the probability itself no longer has a straightforward interpretation as in linear regression.

To model using logistic regression, the generalised linear model (GLM) function glm() in R will be used, with a binomial family and logit link.
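Equations (2) and (3) can be made concrete with a small sketch. It is written in Python rather than R, and the coefficients and inputs are hypothetical values chosen only to show the arithmetic: the linear predictor is formed as in equation (2), converted to a probability as in equation (3), and a one-unit increase in one explanatory variable multiplies the odds by the exponential of its coefficient.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Logistic function of equation (3): maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def linear_predictor(beta0, betas, xs):
    """Right-hand side of equation (2)."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))


# Hypothetical coefficients and inputs, for illustration only.
beta0, betas = -1.0, [0.5, 0.2]
p_before = inverse_logit(linear_predictor(beta0, betas, [2.0, 1.0]))
p_after = inverse_logit(linear_predictor(beta0, betas, [3.0, 1.0]))

# Increasing the first variable by one unit multiplies the odds by exp(0.5).
odds_ratio = (p_after / (1.0 - p_after)) / (p_before / (1.0 - p_before))
print(p_before, p_after, odds_ratio)
```

This is why the coefficients are easier to read on the log-odds scale than on the probability scale: the change in probability depends on where on the curve a team starts, whereas the change in log-odds does not.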

3.4 Model Selection

To select the most suitable model, Akaike’s Information Criterion (AIC) will be used. The AIC scores a model in the following way:

$$\mathrm{AIC} = 2\{-\ell(\hat{\theta}) + p\}, \tag{4}$$

where $p$ is the number of parameters in the model, and $\ell(\hat{\theta})$ is the maximized value of the log-likelihood function for the model.

From a range of models, the model with the lowest AIC is the chosen model, as this is the one that minimizes the Kullback-Leibler distance between the model and the truth. AIC is a good method of selection as it not only rewards a better fit, but also includes a penalty that is an increasing function of the number of estimated parameters. This penalty discourages overfitting, as the aim is to seek a parsimonious model: the simplest model possible that gives an adequate description of the data we have available to us. Furthermore, stepwise AIC regression with backward elimination will be used via the “step()” function in R to provide the best model.
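The selection procedure can be sketched as follows. This is a Python illustration of the idea, not the R code used in the report: `aic` uses the standard definition (twice the number of parameters minus twice the maximized log-likelihood), and `backward_eliminate` mimics backward elimination by repeatedly dropping the single variable whose removal gives the lowest AIC, stopping when no removal improves it. The variable names and AIC values in `toy_aic` are hypothetical; in practice each value would come from refitting the model.

```python
import math


def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y under fitted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def aic(loglik, n_params):
    """AIC = 2 * (-loglik + n_params); lower is better."""
    return 2.0 * (-loglik + n_params)


def backward_eliminate(variables, aic_of):
    """Backward elimination by AIC.

    aic_of is assumed to map a tuple of variable names to the AIC of the
    model fitted with exactly those variables.
    """
    current = tuple(variables)
    current_aic = aic_of(current)
    while current:
        # Every model obtained by dropping exactly one variable.
        candidates = [tuple(v for v in current if v != drop) for drop in current]
        best = min(candidates, key=aic_of)
        if aic_of(best) >= current_aic:
            break  # no single removal lowers the AIC
        current, current_aic = best, aic_of(best)
    return current, current_aic


# Hypothetical AIC values for illustration only.
toy_aic = {
    ("a", "b", "c"): 100.0,
    ("b", "c"): 101.5,
    ("a", "c"): 104.0,
    ("a", "b"): 98.0,
    ("a",): 99.0,
    ("b",): 110.0,
}
chosen, chosen_aic = backward_eliminate(["a", "b", "c"], toy_aic.__getitem__)
print(chosen, chosen_aic)
```

In the toy example variable "c" is dropped because its removal lowers the AIC, and the search then stops because removing either remaining variable would raise it.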

3.5 Model Diagnostics

To analyse the appropriateness of the model, Pearson residuals will be calculated after fitting the model and plotted against the corresponding linear predictor. If the model is good there will be a horizontal band of residuals on the graph within the ±3 range. If there is visible curvature in the residuals on the graph, the model is mis-specified and one or more of the explanatory variables may need transforming. The Pearson residual formula is

$$r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{V}(y_i)}} = \frac{y_i - n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1 - \hat{\pi}_i)}}, \tag{5}$$

where the mean of the dependent variable $y_i$ is $\mu_i = n_i\pi_i$ and its variance is $V(y_i) = n_i\pi_i(1 - \pi_i)$.
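As a quick check of equation (5), the residual can be computed directly. The sketch below is in Python and uses hypothetical fitted probabilities; for a single season outcome (top six or not) the number of trials is 1.

```python
import math


def pearson_residual(y, n, pi_hat):
    """Pearson residual of equation (5) for y successes in n trials."""
    mu_hat = n * pi_hat
    var_hat = n * pi_hat * (1.0 - pi_hat)
    return (y - mu_hat) / math.sqrt(var_hat)


# Hypothetical fitted probabilities, for illustration only.
r_success = pearson_residual(y=1, n=1, pi_hat=0.8)   # finished top six, model said 0.8
r_failure = pearson_residual(y=0, n=1, pi_hat=0.2)   # missed top six, model said 0.2
r_surprise = pearson_residual(y=0, n=1, pi_hat=0.99)  # missed top six, model said 0.99

print(r_success, r_failure, r_surprise)
print(abs(r_surprise) > 3)  # falls outside the +/-3 band described above
```

A team that misses the top six despite a fitted probability near 1 produces a large negative residual, which is the kind of point the ±3 band is meant to flag.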

4 Results

4.1 Model for predicting Premier League position

The model gives the probability of a team finishing in the top six of the Premier League given their total wage bill, average attendance, and total spending on transfers. So, to predict a table for the 2013/14 season, the data on the above variables was collected and placed into the model. The teams were then ranked, with the team with the highest probability of a top six finish coming first, the second highest coming second in the league, and so on. Throughout the project it is assumed that the team with the highest probability of finishing in the top six is the team most likely to finish first. The data was collected for all Premier League seasons up until the 2013/14 season. The wages and spending data was adjusted for inflation using the Retail Price Index (RPI).


Firstly the three explanatory variables were analysed.

Figure 1: Average attendance per position since the 1992/93 season

Figure 1 shows that there is a strong negative correlation between position and average attendance per position (correlation of −0.878). This would suggest that teams who have higher match-day attendance enjoy a loftier position. Therefore, these teams have a greater probability of a top six finish. This is likely to be because more successful clubs attract more fans, and therefore have larger match attendances.

Figure 2: Average total wages per position since the 1992/93 season

Figure 2 again shows a negative relationship between position and average total wage bill per position (also a correlation of −0.878). This would imply that teams paying higher wages have a greater chance of a top six finish. Looking further at Figure 2, you can see that teams who finish in the top four pay higher wages than the line of best fit suggests. This is also true for teams finishing in the bottom four, whereas teams with mid-table finishes spend less than expected. This could be down to teams in the top four trying to attract the best talent from across the world and having to pay higher wages to beat teams from across Europe’s top divisions to their targets. Teams that finish in the bottom four have often only just come up into the Premier League or have come up in recent years; these teams, buoyed by the new income that the Premier League brings, often spend more money to try to build a squad capable of staying up. Teams in mid-table are normally well-established teams and have more control over their finances and wage structure. They also tend to be “bigger teams” with a greater reputation in football than newly promoted teams from the Championship, so they do not have to pay as much in wages to attract players as the newly promoted “smaller teams” do.

Figure 3: Average transfer spending per position since the 1992/93 season


Figure 3 is a very unusual graph: the line of best fit incorporates hardly any of the data, with almost all of it being below the line. From Figure 3 we can see that average spending on transfers is quite constant for teams finishing 7th to 20th, at around £20 million. Then there is a significant increase to around £30 million for 3rd to 6th. This could possibly be explained by these teams participating in the Europa League and needing to spend more money to buy a higher calibre of player. There is another increase visible for teams finishing 1st to 3rd. Again this could be down to competing in the Champions League, Europe’s elite competition: to acquire players that are capable of making a team European Champions you have to pay higher wages. Moreover, the difference in spending could be down to teams in the top six aiming to compete in several different competitions, not only the league but the Capital One Cup and the FA Cup as well as the European competitions. To improve a team’s chance of being successful in multiple competitions, a lot of money will need to be spent on constructing larger squads, which inevitably means a higher wage bill.

Performing logistic regression on our data provided the results in Table 1.

Table 1: Model 1

                     Estimate    Standard Error    z Value    Pr(>|z|)
Intercept            -5.0168     0.5309            -9.449     0.0001***
Attendance           0.1077      0.0173            6.228      0.0001***
Wages                0.0079      0.638             1.612      0.107
Transfer Spending    0.0053      0.0061            0.868      0.385

Significance codes: ∗∗∗ = 0, ∗∗ = 0.001, ∗ = 0.01, . = 0.1, Nothing = 1

This leads to the logistic regression function:

$$\Pr(\text{Top six finish}) = \frac{1}{1 + e^{-(-5.0168 + 0.1077x_1 + 0.0079x_2 + 0.0053x_3)}}, \tag{6}$$

where attendance is denoted by $x_1$, wages by $x_2$, and transfer spending by $x_3$.

After inputting the wages, attendance, and transfer spending for each team in the 2013/14 season into the logistic regression function and ranking them from highest to lowest probability of a top six placing, the standings seen in column 2 of Table 3 are achieved.
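The evaluate-and-rank step just described can be sketched in Python using the Model 1 coefficients from Table 1, with the linear predictor of equation (2) passed through the logistic function of equation (3). The club names and input values below are hypothetical, and the units of each variable are an assumption, since the report does not state the scale on which attendance, wages and transfer spending entered the model.

```python
import math

# Model 1 estimates as reported in Table 1.
COEF = {
    "intercept": -5.0168,
    "attendance": 0.1077,
    "wages": 0.0079,
    "transfers": 0.0053,
}


def top_six_probability(attendance, wages, transfers):
    """Probability of a top six finish under Model 1."""
    eta = (COEF["intercept"]
           + COEF["attendance"] * attendance
           + COEF["wages"] * wages
           + COEF["transfers"] * transfers)
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical clubs and inputs (attendance, wages, transfer spending);
# the values and their units are illustrative assumptions only.
clubs = {
    "Club A": (70.0, 150.0, 60.0),
    "Club B": (40.0, 80.0, 30.0),
    "Club C": (25.0, 45.0, 15.0),
}

probabilities = {name: top_six_probability(*x) for name, x in clubs.items()}

# Rank from highest to lowest probability to obtain the predicted table.
predicted_table = sorted(probabilities, key=probabilities.get, reverse=True)
for position, name in enumerate(predicted_table, start=1):
    print(position, name, round(probabilities[name], 3))
```

Because all three slope estimates are positive, a club with larger values on every variable always ranks higher; the ordering only becomes interesting when clubs trade off one variable against another.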

Stepwise AIC regression with backward elimination is now applied. This gives an AIC value of 283 for Model 1 and then eliminates transfer spending from Model 1. Model 2, comprising wages and attendance, gives an AIC value of 281; this value is smaller than that of Model 1 and hence Model 2 is our chosen model. The coefficients of Model 2 are shown in Table 2.

Table 2: Model 2

              Estimate    Standard Error    z Value    Pr(>|z|)
Intercept     -5.0570     0.5325            -9.496     0.0001***
Attendance    0.1097      0.0172            6.367      0.0001***
Wages         0.0104      0.0042            2.486      0.0129*

Significance codes: ∗∗∗ = 0, ∗∗ = 0.001, ∗ = 0.01, . = 0.1, Nothing = 1

The probabilities and league standings from Model 2 can be seen in Column 3 of Table 3 below.

In analysing our results we begin by comparing each model to reality. Model 1 is actually quite similar to what happened in reality, with teams experiencing a 3.8 position deviation on average. This is reasonable given the simplicity of the model, which has only three variables. After applying AIC regression, the transfer spending variable was removed, and Model 2 was produced. The relevance of transfer spending to position is discussed later on using the examples of Liverpool and Chelsea. Model 2 is only a minor improvement on Model 1, with probabilities of a top six finish only marginally different, leading to an extremely similar prediction of the table. This implies that transfer spending has little or no impact on position.

Table 3: Prediction of 2013/14 Season (Probability of a top six finish)

Position  Actual 2013/14 Table    Model 1 Prediction             Model 2 Prediction
1         Manchester City* (C)    Manchester United (0.992)      Manchester United (0.994)
2         Liverpool*              Arsenal (0.951)                Arsenal (0.961)
3         Chelsea*                Manchester City (0.925)        Manchester City (0.932)
4         Arsenal*                Chelsea (0.828)                Chelsea (0.812)
5         Everton**               Liverpool (0.764)              Liverpool (0.785)
6         Tottenham Hotspur**     Newcastle United (0.725)       Newcastle United (0.764)
7         Manchester United       Tottenham Hotspur (0.574)      Tottenham Hotspur (0.508)
8         Southampton             Sunderland (0.501)             Sunderland (0.505)
9         Stoke City              Everton (0.430)                Everton (0.442)
10        Newcastle United        Aston Villa (0.389)            Aston Villa (0.418)
11        Crystal Palace          West Ham United (0.325)        West Ham United (0.340)
12        Swansea City            Southampton (0.238)            Southampton (0.231)
13        West Ham United         Norwich City (0.190)           Norwich City (0.196)
14        Sunderland              Cardiff City (0.167)           Fulham (0.169)
15        Aston Villa             Fulham (0.159)                 Stoke City (0.162)
16        Hull City**             Stoke City (0.148)             Cardiff City (0.153)
17        West Bromwich Albion    West Bromwich Albion (0.136)   West Bromwich Albion (0.142)
18        Norwich City (R)        Hull City (0.120)              Hull City (0.114)
19        Fulham (R)              Crystal Palace (0.111)         Crystal Palace (0.102)
20        Cardiff City (R)        Swansea City (0.094)           Swansea City (0.096)

(C) = Champion; (R) = Relegated; * = Champions League Qualification; ** = Europa League Qualification

The next point which can be drawn from these results is that of competitiveness: how competitive is the Premier League? This is analysed in much greater depth later on in the report, but from looking at our prediction you can see leagues within a league arising, with nine teams having below a 25% chance of a top six finish, another five teams with between a 30% and 60% chance of a top six finish, and six teams with a probability of 70% or higher.
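The report does not state exactly how the average position deviation quoted for Model 1 was calculated. One natural reading, sketched below in Python under that assumption, is the mean absolute difference between each team's actual and predicted position. The four-team league is hypothetical and is used only to show the calculation.

```python
def mean_position_deviation(actual, predicted):
    """Mean absolute difference between actual and predicted positions.

    Both arguments are lists of team names ordered from first to last place.
    """
    actual_pos = {team: i + 1 for i, team in enumerate(actual)}
    predicted_pos = {team: i + 1 for i, team in enumerate(predicted)}
    total = sum(abs(actual_pos[t] - predicted_pos[t]) for t in actual)
    return total / len(actual)


# Hypothetical four-team league, for illustration only.
actual = ["A", "B", "C", "D"]
predicted = ["B", "A", "C", "D"]
deviation = mean_position_deviation(actual, predicted)
print(deviation)
```

Other definitions, such as a root-mean-square deviation, would penalise large misses like Stoke City or Manchester United more heavily and would give a different figure.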

We now look at outliers. One major case in the results is Stoke City, who came 9th in the actual 2013/14 season, whereas Model 1 predicted a 16th place finish, with a top six finish having only a 14% chance of occurring. Moreover, Model 2 predicted a 15th placing with a top six probability of 0.162. It must be investigated why the model results are so different from reality. In the 2013/14 season, Stoke City’s manager Tony Pulis was named manager of the year, and this is one potential explanation for the difference between the model and reality. This shows that the model is in no way perfect and does not encapsulate all factors, such as the effect a good manager can have on a team’s chances of a top six finish.

The next visible difference between what actually happened and what was predicted is Liverpool. In the 2013/14 season they finished 2nd, but our two models predicted they would come 5th, with probabilities of a top six position in the high 70%s. This difference could largely be attributed to Luis Suarez, who finished the season as the Premier League’s top scorer with 31 goals; the next highest scorer was extremely far behind Suarez but was also a Liverpool player, Daniel Sturridge, with 21 goals. The effect that a high quality goal scorer can have on a team’s probability of a top six finish is not captured by our model and could be the reason why Liverpool finished higher than anticipated. These two possible variables, the transfer of key players and injuries to key players, highlight two possible ways this model could be improved, and this will be discussed in more depth later on in the report.

The last standout disparity between the model’s prediction and reality is Manchester United’s low finish of 7th. Both models said Manchester United had a 99% chance of finishing in the top six; in other words they should have won the league, or at the very least finished in the top six. This difference could be attributed to the “Moyes effect”. The 2013/14 season was the first time Manchester United had changed their manager in over 20 years, and the club lost a massive amount of stability. It should also be noted that chief executive David Gill left the club at the beginning of the season, meaning the club was not as successful at bringing in higher quality players, as his replacement was not as skilled in negotiating deals for players. This will be looked at in more detail in the “Analysis of the Manchester United Dynasty” section of the report.

Finally, Model 1 and Model 2 were both changed so that they gave teams probabilities of top four and top eight finishes. Again the transfer spending variable was omitted after applying AIC model selection. The resulting tables, where teams are ranked in order of their respective probabilities, differ only marginally from the table produced by the models giving probabilities of top six finishes. However, the actual probabilities of each team changed greatly: the top four model has nine teams with only a 10% chance of a top four placing and just three teams with a 70% or higher probability of a top four league standing, compared to one team with 10% or less and six teams with a 70% or higher probability of a top six finish in the original model. This implies that normally a top four spot can only be achieved by a small handful of clubs each season, giving substance to the theory of a “Big Four”, a selection of clubs that regularly occupy the higher echelons of the table, whereas the top six would have more variation, with different teams in addition to the regular “Big Four” clubs featuring from season to season. The model producing probabilities of a top eight finish has a more even spread of probabilities, and no team has less than a 10% chance of finishing in the top eight, which is reasonable, as there are more positions available than in the top six and top four models. (The tables can be seen in the appendix.)

These results show that there are more factors in a team’s position than the three variables which have been included in this model. Ways of improving the model and results will be suggested later on.

4.2 Competitive Balance: How Competitive is the English Premier League?

4.2.1 How is Competitiveness measured?

At face value it can be said that the English Premier League is highly uncompetitive, particularly when compared to other leagues across European football. Just 5 of the 46 teams to have played Premier League football have won the competition as of the 2013/14 season, with Manchester United topping the table thirteen times since the Premier League’s creation in 1992. Whilst this may be the bluntest instrument for deciding how competitive a league is, there are other methods that can be employed, such as those used by Professors Bjorn Bloching of Roland Berger and Tim Pawlowski of the University of Tübingen in their 2013 paper titled “How Exciting are the Major European Football Leagues?” The pair used a system known as the Competitive Balance Ratio (CBR). This ratio comprises two elements, the Standard Deviation of League Points (SDLP) and the Standard Deviation of Team Points (SDTP). As the CBR increases, the level of competitiveness in the league goes up as well. The CBR is defined as the standard deviation of team points divided by the standard deviation of league points or, when expressed mathematically:

$$\mathrm{CBR} = \frac{\mathrm{SDTP}}{\mathrm{SDLP}} \tag{7}$$

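Where

$$\mathrm{SDLP}_t = \sqrt{\frac{\sum_{i} (TP_{t,i} - \overline{TP}_t)^2}{N_t}} \tag{8}$$

And

$$\mathrm{SDTP}_i = \sqrt{\frac{\sum_{t} (TP_{t,i} - \overline{TP}_i)^2}{T}} \tag{9}$$

Here, the SDTP is calculated from an individual team’s points $TP_{t,i}$ in each season $t$ and the average points $\overline{TP}_i$ of that team within a period of $T$ seasons, and the SDLP is calculated from each team’s points in a season $TP_{t,i}$ and the average points $\overline{TP}_t$ in a league of $N_t$ teams.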
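A minimal sketch of the CBR calculation is given below in Python. It makes two assumptions that the text does not spell out: the per-team standard deviations are averaged over teams and the per-season standard deviations are averaged over seasons before taking the ratio, and the same set of teams plays in every season (so promotion and relegation are ignored). The points tables are hypothetical.

```python
import math


def population_sd(values):
    """Standard deviation with divisor len(values)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def competitive_balance_ratio(points):
    """CBR for points[t][i] = points of team i in season t.

    Assumes the same teams appear in every season (an illustrative
    simplification).
    """
    seasons = len(points)
    teams = len(points[0])
    # Within-team variation over time, averaged over teams.
    sdtp = sum(population_sd([points[t][i] for t in range(seasons)])
               for i in range(teams)) / teams
    # Within-season spread across the league, averaged over seasons.
    sdlp = sum(population_sd(points[t]) for t in range(seasons)) / seasons
    return sdtp / sdlp


# Hypothetical three-team league over two seasons.
static_league = [[90, 60, 30], [90, 60, 30]]    # same finishing order each year
churning_league = [[90, 60, 30], [30, 60, 90]]  # top and bottom swap places

cbr_static = competitive_balance_ratio(static_league)
cbr_churning = competitive_balance_ratio(churning_league)
print(cbr_static, cbr_churning)
```

A league in which every team repeats its points total each season has no within-team variation, so its CBR is zero; the more teams move up and down between seasons relative to the spread within a season, the higher the ratio.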

4.2.2 Analysis of the Competitive Balance Ratio

Over the course of the Premier League seasons, it is fairly reasonable to say that the competitiveness of the league has fallen dramatically. The Premier League’s CBR has fallen from 0.44 to 0.33 between the periods 1991-2001 and 2002-2011. Financial regulations such as Financial Fair Play may alter the course of this in the future, but it is hard to see a change from the current pattern at the moment.

To say that the only reason to compete in the Premier League is to win the competition would be crude, however. Whilst the chief ambition of all competing clubs is to win the league, a team would not be dissatisfied with finishing inside the top six due to the prospect of playing in Europe-wide competitions such as the Europa League or the lucrative Champions League tournament. Since the 1992/93 Champions League season, nine teams have made the group stages of the top European footballing competition. This gives a new dimension to the argument over how competitive the Premier League actually is. Lately, the top four Premier League sides enter the Champions League in the next season. The reward that comes with this is considerable, with €8.6 million coming from making the group stages (as of the 2012/13 season) and an added €1 million for a group stage win and €500 thousand for a group stage draw, as well as the added sponsorship capabilities and gate sales that may come with the added status. Clearly the Premier League, rather than being aimed solely at winning the title, can also serve as a platform for the Champions League and can become a competition for the rewards from Europe rather than drifting towards an occasionally obvious finish in the Premier League.

4.2.3 Further Points on Competitiveness

As many viewers know, the Premier League is made up of small leagues within the league itself. There is a group of teams who have the potential to win the league, such as the “Big 4” (Manchester United, Chelsea, Arsenal and Liverpool) or more recently Manchester City as well. Then there are teams who are aiming for Champions League spots, such as Tottenham Hotspur. For the past few seasons, Liverpool have been looking to make Champions League football, a more realistic target than winning the league. After this group, there are the mid-table teams who recur over and over again, such as Stoke or West Bromwich Albion. Finally, there is a group of teams who appear to be in a constant relegation battle, such as Sunderland or Wigan. In short, the Premier League is extremely uncompetitive when it comes to winning the title, and becomes only slightly more competitive when looking at the number of teams playing European football.

4.3 Liverpool and Chelsea Case Study: Does transfer spending equal success?

As intuition would suggest that spending more money on transfers would have a significantly positive effect on a team’s performance, the fact that the model eliminated this factor seemed worth further investigation. During the Premier League era both Chelsea and Liverpool have invested heavily in transfer windows and have had varied league positions throughout, and therefore these two teams have been used for a case study investigating whether transfer spending had any significant impact on performance, or whether ultimately it came down to other factors. Table 4 and Table 5 below show transfer spending, net spending, campaign points, league position and Champions League involvement for the past 16 seasons of Chelsea and Liverpool respectively. This was the data used to analyse whether transfer spending was significant in the positioning of these two teams.

4.3.1 Effects of Transfer Spending

Fitting a normal linear regression of season points on transfer spending, using the data in Tables 4 and 5, gives:

Chelsea F.C.: $Y_i = 73.99 + 0.047 x_i$

Liverpool F.C.: $Y_i = 66.44 + 0.0312 x_i$

where points in season $i$ is denoted $Y_i$ and transfer spending in season $i$ is denoted $x_i$, for $i = 1, 2, \ldots, 16$.
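As a check on the Chelsea figures, the regression can be reproduced in R from the Spending and Points columns of Table 4. This is a minimal sketch rather than the group’s original script; the data frame name `chelsea` and the object names `fit_spend`, `coef_spend` and `r_spend` are our own labels.

```r
# Chelsea F.C., seasons 13/14 back to 98/99
# (Spending and Points columns of Table 4)
chelsea <- data.frame(
  spending = c(105.9, 92, 87.7, 94.6, 23.5, 24.2, 40.5, 12,
               111.9, 59.85, 153.45, 0.5, 15, 26.7, 45, 0.33),
  points   = c(82, 75, 64, 71, 86, 83, 85, 83,
               91, 95, 79, 67, 64, 61, 65, 75)
)

# Normal linear regression of season points on transfer spending
fit_spend  <- lm(points ~ spending, data = chelsea)
coef_spend <- coef(fit_spend)

# Pearson correlation between points and transfer spending
r_spend <- cor(chelsea$points, chelsea$spending)

print(round(coef_spend, 3))
print(round(r_spend, 3))
```

Run on the Table 4 data, the intercept and slope should come out close to the 73.99 and 0.047 quoted for Chelsea, and the correlation close to 0.208.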

The correlation between points and spending is 0.208 for Chelsea and 0.058 for Liverpool. Such a weak correlation between transfer spending and season performance agrees with the result of the AIC regression earlier, which eliminated transfer spending from the model, suggesting that transfer spending had little impact on overall results. There is a positive correlation for both teams,


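Table 4: Chelsea F.C

| Season | Spending | Points | Net Spending | Position | Champions League |
|---|---|---|---|---|---|
| 13/14 | 105.9 | 82 | 49.3 | 3 | 0 |
| 12/13 | 92 | 75 | 72 | 3 | 1 |
| 11/12 | 87.7 | 64 | 63.2 | 6 | 1 |
| 10/11 | 94.6 | 71 | 82.6 | 2 | 1 |
| 09/10 | 23.5 | 86 | 17.5 | 1 | 1 |
| 08/09 | 24.2 | 83 | -10.8 | 3 | 1 |
| 07/08 | 40.5 | 85 | 7.5 | 2 | 1 |
| 06/07 | 12 | 83 | -6.8 | 2 | 1 |
| 05/06 | 111.9 | 91 | 91.1 | 1 | 1 |
| 04/05 | 59.85 | 95 | 47.15 | 1 | 1 |
| 03/04 | 153.45 | 79 | 153.35 | 2 | 0 |
| 02/03 | 0.5 | 67 | -0.5 | 4 | 0 |
| 01/02 | 15 | 64 | 8.38 | 6 | 0 |
| 00/01 | 26.7 | 61 | -2.6 | 6 | 1 |
| 99/00 | 45 | 65 | 37.05 | 5 | 0 |
| 98/99 | 0.33 | 75 | -1.67 | 3 | 0 |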

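Table 5: Liverpool F.C

| Season | Spending | Points | Net Spending | Position | Champions League |
|---|---|---|---|---|---|
| 13/14 | 49.8 | 84 | 20.3 | 2 | 0 |
| 12/13 | 49.3 | 61 | 41.3 | 7 | 0 |
| 11/12 | 56.4 | 52 | 35.35 | 8 | 0 |
| 10/11 | 80.45 | 58 | -5.15 | 6 | 1 |
| 09/10 | 36 | 63 | -8.65 | 7 | 1 |
| 08/09 | 39 | 86 | 6.25 | 2 | 1 |
| 07/08 | 69.75 | 76 | 39.85 | 4 | 1 |
| 06/07 | 28.04 | 68 | 15.66 | 3 | 1 |
| 05/06 | 35.14 | 82 | 25.64 | 3 | 1 |
| 04/05 | 39.8 | 58 | 25.3 | 5 | 0 |
| 03/04 | 8.5 | 60 | 2.25 | 4 | 0 |
| 02/03 | 13.7 | 64 | 7.95 | 5 | 1 |
| 01/02 | 30.85 | 80 | 12.36 | 2 | 0 |
| 00/01 | 18.5 | 69 | 6.2 | 3 | 0 |
| 99/00 | 35.85 | 67 | 26.6 | 4 | 0 |
| 98/99 | 12.05 | 54 | 7 | 7 | 0 |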

although, as this is much stronger for Chelsea, it is very possible that there are factors which may increase the effect of transfer spending (such as past spending, sudden spends on particular players, injuries, etc.), and this may explain the intuitive feeling that transfer spending would be significant in the outcome. Evidently league positioning cannot be predicted from transfer spending alone, and this is reiterated by the fact that the intercept terms differ: if transfer spending were the primary influence, then with spending equal to zero the positioning of both teams would be almost (or exactly) equal.

Figure 4: Transfer spending vs Chelsea’s season points

Figure 5: Transfer spending vs Liverpool’s season points

4.3.2 Effects of Champions League involvement

Next we measure the effect of involvement in the Champions League, again using normal linear regression. Let $x_{i,1}$ be an indicator variable such that

\[
x_{i,1} =
\begin{cases}
0 & \text{if there is no Champions League involvement in season } i, \\
1 & \text{otherwise,}
\end{cases}
\]

for $i = 1, 2, \ldots, 16$. The fitted models are:

Chelsea F.C.: $Y_i = 72 + 7.4 x_{i,1}$

Liverpool F.C.: $Y_i = 65 + 6 x_{i,1}$
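With a single 0/1 explanatory variable, the fitted intercept is simply the mean points in seasons coded 0 and the slope is the difference between the two group means. The sketch below illustrates this for Liverpool using the Points and Champions League columns of Table 5; the names `liverpool`, `fit_cl`, `coef_cl` and `group_means` are ours and are not taken from the report’s code.

```r
# Liverpool F.C., seasons 13/14 back to 98/99
# (Points and Champions League columns of Table 5)
liverpool <- data.frame(
  points = c(84, 61, 52, 58, 63, 86, 76, 68,
             82, 58, 60, 64, 80, 69, 67, 54),
  cl     = c(0, 0, 0, 1, 1, 1, 1, 1,
             1, 0, 0, 1, 0, 0, 0, 0)
)

# Normal linear regression of points on the Champions League indicator
fit_cl  <- lm(points ~ cl, data = liverpool)
coef_cl <- coef(fit_cl)

# Mean points within each indicator group, for comparison
group_means <- tapply(liverpool$points, liverpool$cl, mean)

print(coef_cl)      # intercept and slope
print(group_means)  # mean points when cl = 0 and when cl = 1
```

The group means are 65 and 71, which is where the Liverpool intercept of 65 and slope of 6 come from.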

The correlation between points and Champions League involvement is 0.355 for Chelsea and 0.28 for Liverpool. From this we can see that Champions League involvement has a positive link with points, so a team being in the Champions League would be expected to gain more league points. This contradicts the popular belief that too many games negatively affect league performance. However, teams which finish high and reach the Champions League can be expected to continue their form into the next season. The result may therefore be interpreted not as the Champions League positively affecting league points, but as the form of the team that reached the Champions League carrying over into the next season. This is where a Markov model would be able to introduce form by including previous seasons.

4.3.3 Analysis using Logistic regression

Although the normal linear regression above shows the correlations between variables well, its predicted values are continuous rather than discrete (for example, predicted points could be 82.569). Instead, by using logistic regression, the probability of a team finishing within the top four can be calculated as a reasonable measure.

Table 6: Chelsea F.C

| | Estimate | Standard Error | z Value | Pr(>\|z\|) |
|---|---|---|---|---|
| Intercept | 0.309381 | 1.064537 | 0.291 | 0.771 |
| Champions League | 0.646937 | 1.193573 | 0.542 | 0.588 |
| Transfer Spending | 0.007894 | 0.013629 | 0.579 | 0.562 |

Table 7: Liverpool

| | Estimate | Standard Error | z Value | Pr(>\|z\|) |
|---|---|---|---|---|
| Intercept | 0.82808 | 1.14846 | 0.721 | 0.471 |
| Champions League | 0.24421 | 1.06744 | 0.229 | 0.819 |
| Transfer Spending | -0.01799 | 0.02732 | -0.658 | 0.510 |

This leads to the logistic regression functions:

Chelsea F.C.:

\[
\Pr(\text{Top four finish}) = \frac{1}{1 + e^{-(0.309381 + 0.646937 x_1 + 0.007894 x_2)}} \tag{10}
\]

Liverpool F.C.:

\[
\Pr(\text{Top four finish}) = \frac{1}{1 + e^{-(0.82808 + 0.24421 x_1 - 0.01799 x_2)}} \tag{11}
\]

where Champions League involvement is denoted by $x_1$ and transfer spending by $x_2$.

From these equations, and using the data from the 2013/14 season, the probabilities of a top four finish can be calculated.

The probability of Chelsea finishing in the top four in the 2014/15 season is 0.8572081, and the probability of Liverpool finishing in the top four in the 2014/15 season is 0.5484421.
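The Chelsea figure can be reproduced from the Table 6 estimates by applying the inverse-logit transform to the linear predictor, in the same form as the `h<-1/(1+exp(-g))` line in Appendix C. The sketch below is illustrative only: `top_four_prob` is a hypothetical helper of ours, and taking $x_1 = 1$ is an assumption (Table 4 lists 0 for 13/14, but $x_1 = 1$ together with the 13/14 spending figure of 105.9 is the combination that recovers the reported value).

```r
# Fitted coefficients for Chelsea (Table 6)
beta_chelsea <- c(intercept = 0.309381,
                  champions_league = 0.646937,
                  transfer_spending = 0.007894)

# Hypothetical helper: inverse-logit of the linear predictor
top_four_prob <- function(beta, x1, x2) {
  eta <- beta[["intercept"]] +
    beta[["champions_league"]] * x1 +
    beta[["transfer_spending"]] * x2
  1 / (1 + exp(-eta))
}

# Assumed inputs: x1 = 1 (Champions League involvement),
# x2 = 105.9 (Chelsea's 13/14 transfer spending from Table 4)
p_chelsea <- top_four_prob(beta_chelsea, x1 = 1, x2 = 105.9)
print(round(p_chelsea, 4))
```

Under these assumptions the result agrees with the 0.8572 reported for Chelsea to four decimal places.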

This appears broadly reasonable, as the current odds of finishing in the top four for Chelsea and Liverpool are 1/500 and 7/4 respectively, which correspond to probabilities of 0.998004 and 0.363636.
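The conversion from fractional odds to probability used here can be written out explicitly: odds of a/b correspond to an implied probability of b/(a + b). The short sketch below shows the arithmetic; `implied_prob` is a name of our own choosing, and the calculation ignores any bookmaker margin built into the quoted odds.

```r
# Fractional odds a/b pay a units of profit per b units staked,
# so the implied probability of the event is b / (a + b).
implied_prob <- function(a, b) {
  b / (a + b)
}

p_odds_chelsea   <- implied_prob(1, 500)  # odds of 1/500
p_odds_liverpool <- implied_prob(7, 4)    # odds of 7/4

print(round(c(chelsea = p_odds_chelsea, liverpool = p_odds_liverpool), 6))
```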

4.3.4 Summary of findings

To conclude, transfer spending appears to have little effect on the performance of teams, and more variables need to be considered for the final model. Although there is a positive correlation between points and inclusion in the Champions League, it is more likely a result of continued form. This is where a Markov model could be used to incorporate past seasons into the modelling of this data. These findings cannot be generalised completely, as only Liverpool and Chelsea have been studied. Moreover, it would be hard to predict a team’s league position from its own data alone, due to the random nature of the game. One way to create a more accurate model would be to collect data based on league position rather than on individual teams. This would take into account the increasingly competitive nature of the league, and would model teams relative to each other.

4.4 Analysis of the Manchester United Dynasty

Manchester United are, without a doubt, one of the most widely recognised, supported and successful clubs in English Premier League history. The key to a dominant football club is sought after by many, although none have been able to replicate the triumphs of Manchester United. This section will analyse their history in order to gain an insight into why they have been such a success.

4.4.1 Managerial Stability

Over the past 70 years Manchester United have had a number of managers, some of whom were very successful and won many trophies, others less so. The success of a club can be measured by the number of titles it has won. Adding up the titles Manchester United won in each decade from the 1940s to the 2000s, and plotting these against the number of managers at the club during that decade, gives an idea of whether a long-term manager is linked to success. See Figure 6 below.

Figure 6: Manchester United’s number of managers against titles per decade

Correlation between the number of managers and titles is −0.58.


4.4.2 Ferguson Era

Manchester United’s peak of achievement came between 1986 and 2013, a period during which only one manager was in charge: Sir Alex Ferguson. A total of 38 trophies were accumulated across a 23 year period, although none were won in his first three years in charge. Throughout this time, routine and stability would have helped the club hugely to achieve what it did, but attendance at Old Trafford is also one of the biggest things that gave United the edge over other Premier League teams. Figure 7 shows the average attendance of all the current Premier League clubs over the Ferguson era, compared with the match attendance of Manchester United; from 2001 to 2013 United had over 210% of the average match attendance of other teams. This generates huge amounts of income for the club, not only through direct ticket sales but also by attracting larger sponsorship deals. In the 2011/12 season alone the average Premier League club revenue was £118 million, whereas Manchester United recorded a revenue of £320 million. This in turn enables United to sign top players and staff by offering high wages, stabilising the club structure even further.

Figure 7: Average club attendance vs average Manchester United attendance

4.4.3 2013/14 Season

Soon after Sir Alex Ferguson retired, David Moyes became the new manager of Manchester United. According to all our data, our model predicted that United should have finished top of the league in the 2013/14 season, when in fact they finished in 7th place. One possible reason is that the model does not take into consideration the politics of the transition period as one manager takes over from another, especially when the previous manager was one of the most successful in Premier League history. Arguably, David Moyes’ biggest error was getting rid of Sir Alex Ferguson’s staff, who already knew the club well and were part of the club’s solid structure. Essentially, he tried fixing something that wasn’t broken, and many other problems stemmed from this error. Finally, part of the blame for the poor season lies with expectations that were too high for any manager succeeding Sir Alex Ferguson, as realistically not many would have been able to continue where he left off.

4.5 Diagnostics

Figures 8 and 9 show the Pearson residuals plotted against the linear predictors for Models 1 and 2. The conclusion that can be drawn from these graphs is that both models are poor, as neither fits the criteria of an appropriate model. There is a visible curvature in the residuals, implying that a transformation may be needed, and several residuals fall outside the ±3 range. Furthermore, there is almost no difference between the plots for Models 1 and 2, re-confirming what was found earlier: transfer spending has hardly any impact on the probability of a top six finish. Therefore, omitting it only improves the model marginally.
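For readers who want to see how such a diagnostic is assembled, the sketch below builds a Pearson residual plot in base R. It uses simulated data, not the report’s data set: the covariate ranges, the coefficients in `eta_true` and the object names are all assumptions made purely for illustration.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for the report's covariates (NOT real data)
attendance <- runif(n, 15, 75)    # average attendance in thousands
wages      <- runif(n, 20, 200)   # wage bill in millions
eta_true   <- -5 + 0.05 * attendance + 0.02 * wages
top6       <- rbinom(n, size = 1, prob = plogis(eta_true))

# Logistic regression of the top-six indicator on the two covariates
fit <- glm(top6 ~ attendance + wages, family = binomial)

# Pearson residuals and the linear predictor they are plotted against
pearson <- residuals(fit, type = "pearson")
linpred <- predict(fit, type = "link")

plot(linpred, pearson,
     xlab = "Linear predictor", ylab = "Pearson residual")
abline(h = c(-3, 0, 3), lty = c(2, 1, 2))

# Number of residuals falling outside the +/- 3 band
n_outside <- sum(abs(pearson) > 3)
```

For a 0/1 response each Pearson residual is the observed value minus the fitted probability, divided by the square root of p(1 - p), so the points fall on two curved bands; this is worth bearing in mind when reading curvature in plots of this kind.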

Figure 8: Pearson residuals vs Fitted values for Model 1

Figure 9: Pearson residuals vs Fitted values for Model 2

5 Conclusion

The analysis has shown varying degrees of significance for the three variables chosen: wages and attendance appear to have a significant impact on final Premier League position, whereas transfer spending does not. This ties in with the view of Simon Kuper and Stefan Szymanski’s “Soccernomics” (2012), which also argues that transfer fees are not significant compared to the wages paid to players. Having said all of this, the model does not fit the league’s actual positions perfectly. This is likely because the model does not have enough variables to narrow the fluctuations, and because individual clubs experience things which a model will find impossible to account for. An example is Manchester United: a team who had spent years at the top end of the Premier League and who were predicted to finish top by both models, yet ended up finishing in seventh place. Statistically, Manchester United had a very poor season, yet the change in manager from Sir Alex Ferguson to David Moyes is not accounted for. Accounting for it in model form is almost impossible, as quantifying the standard of a manager is difficult.

5.0.1 How the model can be improved

There are several ways in which this model could be improved, given the opportunity to take this project further. Adding more variables would be one way to add realism to the model’s output. Accounting for the number of injuries a team’s squad suffers, the average age of the squad, the total revenue of the club, cup involvement and the quality of the head coach might improve the accuracy of the model’s predictions.

Another way this study could be improved would be to introduce a Markov model into the analysis. A Markov model allows for repeated observations of the teams over a number of years. One of the weaknesses of the current model is that it takes no account of time. This means that a team such as West Bromwich Albion would be predicted to finish high up the league table if their owner suddenly spent large sums on players. History and intuition tell us that this is extremely unlikely, given that a team needs time to acclimatise to new players. When observations are made on the same individual, it is likely that the response variables will be correlated (Rahman and Islam, 2007). The study of the Premier League could be furthered by considering this dependence between years. Using a logistic regression model similar to that employed by Rahman and Islam would incorporate time into the analysis given so far. The Markov model would be combined with a logistic regression function, as used previously in the report, with the binary random variable being:

\[
Y_t =
\begin{cases}
0 & \text{if the team fails to finish in the top six,} \\
1 & \text{if the team succeeds in making the top six.}
\end{cases}
\]

Here, $t$ is the year of observation, where $t = 1, 2, 3, \ldots, n$.
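One simple way to give a logistic regression this kind of memory is to include the previous season’s outcome as an explanatory variable, which amounts to a first-order Markov structure. The sketch below does this for Liverpool’s league positions from Table 5. It is an illustration under our own assumptions (a single lag and no other covariates) and is not necessarily the formulation used by Rahman and Islam; all object names are ours.

```r
# Liverpool league positions, 98/99 to 13/14
# (Position column of Table 5, in chronological order)
position <- c(7, 4, 3, 2, 5, 4, 5, 3, 3, 4, 2, 7, 6, 8, 7, 2)
y <- as.integer(position <= 6)   # Y_t = 1 if top six in season t

# Pair each season's outcome with the previous season's outcome
markov <- data.frame(y_t = y[-1], y_prev = y[-length(y)])

# Logistic regression of Y_t on Y_{t-1}
fit_markov <- glm(y_t ~ y_prev, family = binomial, data = markov)

# Estimated transition probabilities P(Y_t = 1 | Y_{t-1})
p_given_prev <- predict(fit_markov,
                        newdata = data.frame(y_prev = c(0, 1)),
                        type = "response")
names(p_given_prev) <- c("prev_outside_top6", "prev_in_top6")
print(round(p_given_prev, 3))
```

With only 15 season-to-season transitions the estimates are very rough. In a fuller model, covariates such as wages and attendance would enter alongside `y_prev`.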

The idea of incorporating a Markov model into the group’s analysis leads nicely towards further study of Premier League position, but due to space limitations and the scope of study, this has been omitted for now.

5.0.2 Concluding Remarks

In conclusion, the salaries paid to the team and match attendance have a significant impact on team performance in this league, whereas transfer spending has little significance. The league itself is fairly uncompetitive, with only a handful of clubs able to compete for the top spots, as indicated by largely the same teams qualifying for the Champions League and only five sides winning the league since the Premier League began in the 1992/93 season.


A Bibliography

References

Brillinger, D. R. (2009). An analysis of Chinese Super League partial results. Science in China Series A: Mathematics, 52(6):1139–1151.

Consultants, R. B. S. (2013). How exciting are the major European football leagues? Publications, Roland Berger.

Dixon, M. J. and Coles, S. G. (1997). Modelling association football scores and inefficiencies in the football betting market. Journal of the Royal Statistical Society: Series C (Applied Statistics), 46(2):265–280.

East Midlands Football (2014). League tables. http://www.emfootball.co.uk/leaguetables.html [Date accessed: 30-10-2014].

Fahrmeir, L. and Tutz, G. (1994). Dynamic stochastic models for time-dependent ordered paired comparison systems. Journal of the American Statistical Association, 89(428):1438–1449.

Football Association Premier League Limited (2014). Premier League tables. http://www.premierleague.com/en-gb.html [Date accessed: 30-10-2014].

Hill, I. (1974). Association football and statistical inference. Applied Statistics, pages 203–208.

Hothorn, T. and Everitt, B. S. (2009). A handbook of statistical analyses using R, volume 12. CRC Press. Chapter 6 - Logistic Regression and Generalised Linear Models: Blood Screening, Women’s Role in Society, and Colonic Polyps.

Kuper, S. and Szymanski, S. (2012). Soccernomics. HarperCollins UK.

Lee, A. J. (1997). Modeling scores in the Premier League: is Manchester United really the best? Chance, 10(1):15–19.

Mode, C. J. and Sleeman, C. K. (2002). An algorithmic synthesis of the deterministic and stochastic paradigms via computer intensive methods. Mathematical Biosciences, 180(1):115–126.

Rahman, M. S. and Islam, M. A. (2007). Markov structure based logistic regression for repeated measures: An application to diabetes mellitus data. Statistical Methodology, 4(4):448–460.

Rodríguez, G. (1992). Introducing R section 5: Generalized linear models. http://data.princeton.edu/R/glms.html [Date accessed: 30-10-2014].

Sports Business Group (2013). Annual review of football finance 2013 highlights. Technical report, Deloitte LLP.

Szymanski, S. (1998). Why is Manchester United so successful? Business Strategy Review, 9(4):47–54.

Titman, A., Costain, D., Ridall, P., and Gregory, K. (2014). Joint modelling of goals and bookings in association football. Journal of the Royal Statistical Society: Series A (Statistics in Society).

Transfer League (2014). The transfer league. http://www.transferleague.co.uk [Date accessed: 30-10-2014].

Transfermarkt (2013). Clubs Premier League - transfers. http://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1 [Date accessed: 30-10-2014].

Union des Associations Européennes de Football (2014). UEFA Champions League. http://www.uefa.com/uefachampionsleague/history/index.html [Date accessed: 30-10-2014].


B Data

Table 8: Prediction of 2013/14 Season (Probability of a top four Finish)

| Position | Actual 2013/14 Table | Model 1 Prediction | Model 2 Prediction |
|---|---|---|---|
| 1 | Manchester City* (C) | Manchester United (0.990) | Manchester United (0.991) |
| 2 | Liverpool* | Arsenal (0.923) | Arsenal (0.928) |
| 3 | Chelsea* | Manchester City (0.847) | Manchester City (0.848) |
| 4 | Arsenal* | Chelsea (0.637) | Chelsea (0.634) |
| 5 | Everton** | Liverpool (0.619) | Newcastle United (0.624) |
| 6 | Tottenham Hotspur** | Newcastle United (0.614) | Liverpool (0.617) |
| 7 | Manchester United | Sunderland (0.320) | Sunderland (0.320) |
| 8 | Southampton | Tottenham Hotspur (0.301) | Tottenham Hotspur (0.296) |
| 9 | Stoke City | Everton (0.256) | Everton (0.257) |
| 10 | Newcastle United | Aston Villa (0.231) | Aston Villa (0.233) |
| 11 | Crystal Palace | West Ham United (0.177) | West Ham United (0.178) |
| 12 | Swansea City | Southampton (0.108) | Southampton (0.108) |
| 13 | West Ham United | Norwich City (0.084) | Norwich City (0.085) |
| 14 | Sunderland | Fulham (0.070) | Fulham (0.070) |
| 15 | Aston Villa | Stoke City (0.068) | Stoke City (0.069) |
| 16 | Hull City** | Cardiff City (0.068) | Cardiff City (0.067) |
| 17 | West Bromwich Albion | West Bromwich Albion (0.059) | West Bromwich Albion (0.059) |
| 18 | Norwich City (R) | Hull City (0.047) | Hull City (0.047) |
| 19 | Fulham (R) | Crystal Palace (0.042) | Crystal Palace (0.042) |
| 20 | Cardiff City (R) | Swansea City (0.036) | Swansea City (0.036) |

(C) = Champion; (R) = Relegated; * = Champions League Qualification; ** = Europa League Qualification

Table 9: Prediction of 2013/14 Season (Probability of a top 8 Finish)

| Position | Actual 2013/14 Table | Model 1 Prediction | Model 2 Prediction |
|---|---|---|---|
| 1 | Manchester City* (C) | Manchester United (0.994) | Manchester United (0.994) |
| 2 | Liverpool* | Arsenal (0.969) | Arsenal (0.969) |
| 3 | Chelsea* | Manchester City (0.950) | Manchester City (0.950) |
| 4 | Arsenal* | Chelsea (0.870) | Chelsea (0.870) |
| 5 | Everton** | Liverpool (0.849) | Newcastle United (0.850) |
| 6 | Tottenham Hotspur** | Newcastle United (0.831) | Liverpool (0.832) |
| 7 | Manchester United | Tottenham Hotspur (0.646) | Sunderland (0.644) |
| 8 | Southampton | Sunderland (0.637) | Tottenham Hotspur (0.637) |
| 9 | Stoke City | Everton (0.584) | Everton (0.584) |
| 10 | Newcastle United | Aston Villa (0.562) | Aston Villa (0.563) |
| 11 | Crystal Palace | West Ham United (0.488) | West Ham United (0.489) |
| 12 | Swansea City | Southampton (0.370) | Southampton (0.370) |
| 13 | West Ham United | Norwich City (0.329) | Norwich City (0.329) |
| 14 | Sunderland | Fulham (0.294) | Fulham (0.295) |
| 15 | Aston Villa | Stoke City (0.284) | Stoke City (0.284) |
| 16 | Hull City** | Cardiff City (0.272) | Cardiff City (0.271) |
| 17 | West Bromwich Albion | West Bromwich Albion (0.256) | West Bromwich Albion (0.257) |
| 18 | Norwich City (R) | Hull City (0.216) | Hull City (0.215) |
| 19 | Fulham (R) | Crystal Palace (0.197) | Crystal Palace (0.196) |
| 20 | Cardiff City (R) | Swansea City (0.188) | Swansea City (0.188) |

(C) = Champion; (R) = Relegated; * = Champions League Qualification; ** = Europa League Qualification


C R-code

    # residualPlot() below comes from the car package
    library(car)

    # Response indicators: 20 league positions repeated over 21 seasons
    top6 <- matrix(rep(c(rep(1, 6), rep(0, 14)), 21))
    top4 <- matrix(rep(c(rep(1, 4), rep(0, 16)), 21))
    x <- matrix(1:20)

    # Exploratory analysis
    # (attendance is not defined in this listing; it is assumed to hold
    # average attendance by league position)
    plot(x, attendance, ylim = c(0, 70), ylab = "Average attendance ('000s)",
         xlab = "League Position", pch = 19)
    abline(lm(attendance ~ x), col = 2, lty = 1)

    # Reading data into R
    a <- read.csv(file = "aaplp v3.0.csv")
    attendance2 <- matrix(a[, 4])
    attendance2
    b <- read.csv(file = "tspp v2.1.csv")
    transpend2 <- matrix(b[, 1])
    transpend2
    c <- read.csv(file = "wppia v2.0.csv")
    wages2 <- matrix(c[, 1])
    wages2
    wages3 <- wages2 * 0.000001   # rescale wages by a factor of one million
    wages3
    d <- read.csv(file = "2013 14 data.csv")
    # Columns of d: d[, 1] "Transfer Spending", d[, 2] "Wages", d[, 3] "Attendance"

    # Fitting model
    lrfit <- glm(formula = top6 ~ attendance2 + wages3 + transpend2, family = binomial)
    lrfit
    summary(lrfit)
    lrfit1 <- glm(formula = top6 ~ attendance2 + wages3, family = binomial)
    lrfit1
    summary(lrfit1)

    # Applying logistic regression function
    g <- -5.05700 + (0.10971 * d[, 3]) + (0.01035 * d[, 2])
    h <- 1 / (1 + exp(-g))
    h

    # Ordering probabilities
    sort(h, decreasing = TRUE)

    # Creating Pearson plots
    residualPlot(lrfit, variable = "fitted", type = "pearson")
    residualPlot(lrfit1, variable = "fitted", type = "pearson")

    # Applying AIC model selection
    # (top62 is used as given; it is not defined in this listing)
    AIC(glm(formula = top62 ~ attendance2 + wages3, family = binomial))
    lm1 <- glm(formula = top62 ~ attendance2 + wages3 + transpend2, family = binomial)
    step(lm1)
