
The use of Data Mining for Predicting

Injuries in Professional Football Players

Garth Theron

Thesis submitted for the degree of

Master in Programming and System Architecture

60 credits

Department of Informatics

Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO

Spring 2020

The use of Data Mining for Predicting Injuries in Professional Football Players

Garth Theron

© 2020 Garth Theron

The use of Data Mining for Predicting Injuries in Professional Football Players

http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

Abstract

Injuries in professional football represent a financial burden to clubs and have a negative impact on team performance. This has inspired a wide range of studies which attempt to gain insight into the risk factors associated with injury. Until recently, the majority of research has been limited to univariate studies. However, the emergence of machine learning has given rise to numerous multivariate studies. The work covered in this thesis forms part of a partnership between the University of Oslo (UIO) and the Norwegian School of Sports Science (NIH), which aims to research relationships between athlete workloads, injury and illness. With this thesis serving as the first work undertaken in the partnership, two goals are identified. The first is to build a data warehouse to store all available training, competition, injury and illness data generated by a professional Norwegian football team. The data warehouse serves as a unified representation of the club's data, providing data for both the second goal of this thesis as well as future research. The second goal is to conduct a data mining study using player workload and injury data to predict future injury.

In the first phase of the thesis, a data warehouse is constructed using a traditional four-phase modelling approach. The choice of a data warehouse is primarily motivated by three factors, namely, user needs, the granularity of the data available, and the update frequency of the data store. Combining data from three source systems and several internal sources, a star schema consisting of a single fact table and six dimensions is created. The fact table includes six workload measures, namely, total distance, acceleration load, player load, V4 distance, V5 distance, and HSR distance. These are aggregated using the SUM operation, and provide users with the ability to extract data at granularities ranging from one-second intervals up to the level of an entire season. The six dimensions included in this thesis are date, session, injury, illness, player and training log. As the final step in the design phase, a materialised view is created which includes workload and injury data aggregated at the session level. The data included in the view is explicitly selected for the analysis needs of the data mining completed in this thesis and does not take into consideration the analysis needs of future work.

In the second phase of this thesis, data mining is conducted using the CRISP-DM framework. With the aim of predicting future injury from workload data, five phases from this framework are carried out, namely business understanding, data understanding, data preparation, modelling, and evaluation. Four objectives are identified, all of which aim to predict future injury using workload and injury data gathered from a single season of training and competition. Two definitions of injury are provided to investigate whether models perform better for specific injury types. Four models of varying complexity and interpretability are used in the modelling phase: decision trees (DT), random forests (RF), logistic regression (LG), and support vector machines (SVM). Model evaluation makes use of four class-independent evaluation metrics, namely precision, recall, F1 score and area under the curve (AUC), to assess a model's predictive performance. Models of the relationship between workloads, a previous injury feature, and injury show limited ability to predict future injury among players from a professional Norwegian football team. Mean AUC scores are below 0.52 for all modelling approaches, indicating that injury predictions are no better than those expected by random chance. Precision scores are higher than recall scores for all modelling approaches, with a highest score of 0.65±0.17 being achieved. The inability to achieve recall scores higher than 0.02 means that all modelling approaches generate a large number of false-negative predictions, indicating that models are unable to identify the injury class.

Contents

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.4 Scope of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.5 Structure of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background 5

2.1 Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Motivation for a Data Warehouse . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.2 Overview of Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2.1 The CRISP-DM Process Model . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 Machine Learning in Football . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3.1 Workloads in Predictive Modelling . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3.2 A Machine Learning Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Source Data 13

3.1 GPS Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2 Injury Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.3 Training Log Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.4 First Impressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4 Data Warehouse Modelling 19

4.1 Requirements Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.1.1 Analysis Driven Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.1.2 Source-Driven Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.2 Conceptual Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


4.2.1 The Initial Analysis-Driven Conceptual Schema . . . . . . . . . . . . . . . . . . 28

4.2.2 The Initial Source-Driven Conceptual Schema . . . . . . . . . . . . . . . . . . 30

4.2.3 Conceptual Schema Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2.4 The Final Conceptual Schema and Mappings . . . . . . . . . . . . . . . . . . . 31

4.3 Logical Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.3.1 The Logical Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.3.2 Definition of the ETL Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.4 Physical Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.4.1 Materialised Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.4.2 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.5 The ETL Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.5.1 Overview of the ETL Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.5.2 Cleaning of GPS Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.5.3 Cleaning of Injury and Illness Data . . . . . . . . . . . . . . . . . . . . . . . . 44

4.5.4 Cleaning of Log Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.5.5 Load Date Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.5.6 Load Session Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.5.7 Load Player Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.5.8 Load Injury and Illness Dimensions . . . . . . . . . . . . . . . . . . . . . . . . 49

4.5.9 Load Log Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.5.10 Load GPS Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.5.11 Insert Special Members . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5 Data Mining 55

5.1 Business Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.1.2 Injury Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.1.3 Data Granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.2 Data Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.2.1 Summary Statistics for the Workload Features . . . . . . . . . . . . . . . . . . . 57

5.2.2 Summary Statistics for the Injury Data . . . . . . . . . . . . . . . . . . . . . . . 59

5.2.3 Comparison of Workload Features for Injured and Non-Injured Players . . . . . 59

5.3 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

5.3.1 The Exponentially Weighted Moving Average (EWMA) . . . . . . . . . . . . . 62

5.3.2 The Mean Standard Deviation Ratio (MSWR) . . . . . . . . . . . . . . . . . . . 62

5.3.3 Injury Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.3.4 Creation of the Final Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . 64


5.3.5 Correlation Matrix for the Features in the Final Data Set . . . . . . . . . . . . . 65

5.4 Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.4.1 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.4.2 Adaptive Synthetic Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.4.3 Stratified K-Fold Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.4.4 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.4.5 Features From Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.4.6 Modelling Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.5.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.5.3 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

6 Conclusion 81

6.1 Data Warehouse Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.2 Data Mining Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Appendices 91

A Source Code 93

B MultiDim Model for Data Warehouses 95

C BPMN Notation 98


List of Figures

1.1 Two phase approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 An illustration of a three-dimensional data cube . . . . . . . . . . . . . . . . . . . . . . 7

2.2 The CRISP-DM data mining life cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

4.1 Phases in the data warehouse design process . . . . . . . . . . . . . . . . . . . . . . . . 19

4.2 Steps in the requirements specification phase using an analysis/source-driven approach . 20

4.3 Steps in the conceptual modelling phase using an analysis/source-driven approach . . . . 26

4.4 Illustration of the many-to-many relationship between a fact and a dimension . . . . . . 26

4.5 A decomposition of the injury/illness dimension . . . . . . . . . . . . . . . . . . . . . . 27

4.6 Unbalanced hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.7 The initial conceptual model using an analysis-driven approach . . . . . . . . . . . . . . 29

4.8 The initial conceptual model using an source-driven approach . . . . . . . . . . . . . . 30

4.9 The final conceptual model generated from the analysis-driven and the source-driven models 32

4.10 Steps in the logical modelling process . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.11 A logical data warehouse schema for football data . . . . . . . . . . . . . . . . . . . . . 35

4.12 The use of placeholders to transform unbalanced hierarchies into balanced hierarchies . . 36

4.13 SQL code for generating a materialised view . . . . . . . . . . . . . . . . . . . . . . . . 39

4.14 A query for selecting data from a specific session . . . . . . . . . . . . . . . . . . . . . 40

4.15 Overview of the ETL process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.16 Tasks involved in the cleaning of GPS training data . . . . . . . . . . . . . . . . . . . . 45

4.17 Loading of the Date dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.18 Loading of the Session dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.19 Loading of the Player dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.20 Loading of the Injury and Illness dimensions . . . . . . . . . . . . . . . . . . . . . . . . 50

4.21 Loading of the fact table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.1 Visual overview of injury occurrence for the 2019 season . . . . . . . . . . . . . . . . . 60


5.2 Boxplots comparing the workload features of injured and non-injured players with respect to NC injury . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.3 Boxplots comparing the distance features of injured and non-injured players with respect to NC injury . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.4 Boxplots comparing the distance features of injured and non-injured players with respect to NCTL injury . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.5 Representation of the final data set constructed in the data preparation phase . . . . . . . 64

5.6 Correlation matrix of the injury and workload features in the final data set . . . . . . . . 65

5.7 An illustration of a test/train split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.8 An illustration of 5-Fold Cross Validation with ADASYN . . . . . . . . . . . . . . . . . 70

5.9 An illustration of Feature Elimination combined with K-Fold Cross Validation and ADASYN 72

5.10 Boxplots comparing the area under the curve for models using ADASYN and Stratified K-Fold Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.11 Boxplots comparing the area under the curve for models using the previously identified injury features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.12 Comparison of AUC scores for models using RFECV in combination with ADASYN and Stratified K-Fold CV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.13 Comparison of feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

B.1 Dimension level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

B.2 Fact table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

B.3 Cardinalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

B.4 Dimension types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

B.5 Balanced hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

B.6 Ragged hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

C.1 General notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

C.2 More general notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

C.3 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

C.4 Gateways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100


List of Tables

3.1 Features extracted from Catapult OptimEye X4 . . . . . . . . . . . . . . . . . . . . . . 14

3.2 Features recorded for player injuries and player illnesses . . . . . . . . . . . . . . . . . 15

3.3 The features recorded by a player after the completion of each training session . . . . . . 15

4.1 Fact summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.2 Dimension summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.3 Summary of the multidimensional elements using a source-driven approach . . . . . . . 25

4.4 An example of double-counting in a many-to-many dimension. . . . . . . . . . . . . . . 27

5.1 Summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.2 Summary statistics after smoothing by bin means . . . . . . . . . . . . . . . . . . . . . 58

5.3 Injury statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5.4 Comparison of ML data set and data from data warehouse . . . . . . . . . . . . . . . . 64

5.5 A confusion matrix for binary classification . . . . . . . . . . . . . . . . . . . . . . . . 73

5.6 Summary results of all models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.7 Confusion matrix for SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.8 Confusion matrix for DT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.9 A comparison of confusion matrices for SVM and DT models using NC data . . . . . . 75


Acknowledgements

I would like to thank my supervisors, Vera Hermine Goebel, Thomas Peter Plagemann, Torstein Dalen-Lorentsen and Thor Einar Gjerstad Andersen, for their guidance and support during the writing of this thesis. I would also like to express my gratitude to Anders Larsen for retrieving the GPS data required for this project.


Chapter 1

Introduction

1.1 Motivation

Injuries in professional football represent a major concern, as they negatively impact team performance [15], and the process associated with player rehabilitation is often both costly and time-consuming. Significant associations between low injury rates and increased performance have been shown in both domestic and professional football teams [15]. This is particularly concerning as injuries among football players are prevalent, and players are expected to incur around two injuries per season [9]. Prevention of injuries is also of enormous importance for the development of individual players, as frequent injuries may prevent players from achieving their maximum skill potential due to an absence from training and competition [36].

Another point of concern is the economic impact of injuries, resulting from medical costs as well as the missed earnings derived from a player's popularity [30]. This represents a financial burden upon professional teams, with the average cost of an injury to a Premier League player estimated at £290,000.¹

With this in mind, it is not surprising that there is a growing interest in the field of injury forecasting among researchers, managers and coaches. Despite numerous efforts, previous work on the evaluation of statistical models for the prediction of injury has had little success. This may be attributed to a combination of two factors, namely the limited availability of data describing the activity of players, and the fact that existing studies have been limited to single-dimension estimations [39].

Advances in technology have provided researchers with enhanced player tracking, giving them access to accurate workload data for players [41], as well as a wealth of open machine learning libraries. This has subsequently led to a shift in focus towards multi-parameter statistics [43].

¹ https://www.forbes.com/sites/bobbymcmahon/2019/08/22/report-shows-that-an-injury-to-a-premier-league-player-costs-on-average-350000/#5f22640410e2

1.2 Problem Statement

This thesis is inspired by recent research which has shown promising results in the prediction of injury among football players using GPS data [39]. The intention behind this work is to conduct a similar study using training and competition data from a professional football club in Norway. We propose a data mining approach with a twofold purpose.

The first of these is to create a data warehouse which will serve as a unified representation of all the training and competition data generated by the club. The purpose of the data warehouse is to support both the data mining study presented in this thesis as well as future research in the area of football-related injury and illness. The requirements specification for designing the data warehouse thus takes into consideration not only the research needs of this thesis but also the user needs of researchers at NIH. The aim of the data warehouse is, therefore, to provide researchers with a unified multidimensional data store which contains data relevant for the analysis of football players' workloads, injuries and illnesses, and which supports a variety of online analytical processing (OLAP) operations.

The second purpose of this thesis is to conduct a data mining study which aims to predict the onset of player injury. Following the data mining cycle outlined in Chapter 2, a data set is generated from summarised workload and injury data. This data set is then used to create prediction models using classification techniques such as decision trees, random forests, support vector machines and logistic regression.

For the purpose of clarity, the user needs relating to the data warehouse cover a broader scope than the data mining requirements of this thesis. They will be referred to throughout the thesis as "project goals", and are specified in Chapter 4, which covers data warehouse modelling. The "thesis goals" specifically refer to the data mining goals for this thesis and are covered in Chapter 5.

1.3 Approach

Given the project goals presented in Section 1.2, and data collected from three source systems during the 2019 season, the problem statement for this thesis is approached in two parts:

1. The creation of a data warehouse.

2. Data mining to predict future injury.

An overview of the approach taken in this thesis is presented in Figure 1.1. The first phase involves the creation of a data warehouse to aid the analysis goals of the project. Starting with the source systems, data is first explored to gain an understanding of the data available. This process involves tasks such as identifying data formats, describing data metrics, and identifying potential issues which exist within the data. Information gathered from the source data is then used in combination with user requirements to construct a data warehouse. Following a traditional four-phase design approach, a warehouse is first modelled before data is extracted, transformed and then loaded into the data warehouse. The final task in this phase involves the creation of a materialised view, which provides summarised data necessary for the next phase of the thesis.

On completion of the first phase, data mining is carried out following the first five phases of the framework described in Chapter 2. With a primary goal of predicting future injury, clear objectives are first established in the business understanding phase, before the data understanding phase is used to gain initial insights into the data set. Based on these insights, a final data set is constructed in the data preparation phase, and then used for the development of prediction models in the modelling phase. The second phase ends with an evaluation of the prediction models' performance and suggestions for improving the results of future work.


Figure 1.1: Two phase approach

1.4 Scope of this Thesis

Due to a combination of the broad subject area covered in this thesis and the limited time frame available, it is essential to provide a clear definition of the scope of this work. Section 1.2 presents two objectives for this thesis, namely, the construction of a data warehouse and the prediction of future injury using data made available from several source systems. The scope of the data warehouse is limited to creating a simple data warehouse which can address the analysis needs of the data mining goals presented in this thesis, as well as several additional analysis goals specified by researchers at NIH. This involves modelling a functional data warehouse which provides access to the data necessary for achieving these goals. Additionally, basic data extraction, transformation and loading tasks are required to ensure that data is successfully loaded into the warehouse in a format which is consistent with the warehouse modelled. As a result of the 2019 season terminating in December, and the generation of source files being a time-consuming endeavour, the vast majority of the data used in this thesis was only made available at a late stage of this work. Little time was thus available for extensive testing and performance enhancement, and for this reason these topics are not covered in great detail.

Furthermore, the data used in this thesis is limited to workload, injury and illness data gathered during a single season. As much of the related work in this field is conducted using multiple seasons of data, the work presented in this thesis serves as the groundwork for future studies. Lastly, this thesis makes use of all data made available by the football club. Since the commencement of this project, several studies have produced findings using data which was unavailable at the time this work was carried out. This is noted and discussed in the section concerning future work.

1.5 Structure of this Thesis

The structure for this thesis is as follows:

• Chapter 2 provides a background of the three subject areas covered in this thesis. Starting with data warehousing, Section 2.1 provides a definition of what a data warehouse is and how it differs from a traditional database, before providing an overview of typical data warehouse functions. Next, the concept of data mining is defined in Section 2.2, and the data mining framework used in this thesis is presented. Lastly, a summary of related work in the field of injury prediction is given.

• Chapter 3 provides a description of the three data sources, before commenting on their quality.

• Chapter 4 covers the four stages of data warehouse modelling. Using an approach which takes into consideration both user requirements and the source data available, this chapter provides an account of the processes involved in designing, implementing and loading the data warehouse for this project.

• In Chapter 5, the first five phases of the data mining framework used in this thesis are presented. Using data extracted from the data warehouse, a feature set is created and used in a series of four modelling approaches to assess the ability of different machine learning models to predict future injury. A discussion of the results is then presented, highlighting several of the limitations encountered in this work.

• Finally, Chapter 6 summarises the work presented in this thesis and provides direction for the future work of this project.


Chapter 2

Background

2.1 Data Warehousing

As discussed in the previous chapter, it is the goal of both the present study and future studies to gain insights into the relationship between training workloads and injury, illness and performance. As a means of achieving this goal, a data warehouse is proposed as an architecture for integrating the organisation's data and supporting structured queries, analysis and decision making. The motivation for such an architecture arises from the organisation's need for a semantically consistent data store which can serve a variety of research studies. Furthermore, a data warehouse provides online analytical processing (OLAP) tools for the analysis of multidimensional data, which facilitates effective data generalisation and data mining [17].

This section begins with a presentation of the key motivating factors for choosing a data warehouse, before a brief overview of several core data warehouse concepts is provided. As Chapter 4 gives a more detailed account of many of these concepts, only a brief overview is presented in this section.

2.1.1 Motivation for a Data Warehouse

The choice of a data warehouse instead of a traditional database is motivated by three factors. The first of these is associated with the project's user needs, which primarily focus on analysis tasks. Operational databases, or online transaction processing (OLTP) systems, are tuned to support the daily operations of an organisation. Their primary concern is to ensure fast, concurrent access to data, and they are transaction-oriented. Because OLTP systems have to support heavy transactional loads, their design is focused on preventing update anomalies, resulting in databases which are highly normalised. They therefore perform poorly when aggregating large volumes of data or executing complex queries which join many relational tables [46]. Data warehouses, however, are designed to support online analytical processing (OLAP), which focuses on analytical queries, and are designed to support heavy query loads.

The second motivating factor is that of data granularity and data aggregation. As discussed in Chapter 3, much of the project's data is extracted at a very fine level of detail. This provides users with the ability to analyse data at multiple levels of granularity. Although OLTP and OLAP systems both provide this functionality, OLAP systems are far better suited to aggregating large volumes of data.

The final motivating factor is the update frequency of the database. OLTP systems store and process data in real time, involving operations which regularly insert, update and delete data. The research conducted in this project, however, involves analysing data collected over an entire season and is not dependent upon the regular update operations which are typical of OLTP systems. For this reason, the project's database can be updated relatively infrequently without affecting the success of the project. Lower update frequencies are common in OLAP systems, which focus primarily on read operations and typically make use of batch updates.

Given the analytical needs of users, the fine granularity of the source data, and the infrequency at which the database needs to be updated, a data warehouse is considered an ideal architecture for integrating the organisation's data and supporting the project's research.

2.1.2 Overview of Data Warehousing

A data warehouse can be defined as a repository of integrated data obtained from several sources for the specific purpose of multi-dimensional data analysis [46]. The data collected in data warehouses share several characteristics which are worth mentioning. The first of these is that the data are subject-oriented, meaning that the warehouse focuses on the analytical needs of the organisation. In the case of this project, the analysis focuses on the workload, injury and illness features of the source data. Another important characteristic is that of non-volatility, meaning that data are not modified or removed, thus ensuring their durability. A final characteristic of importance is the historical nature of the data. As the long-term goal of this project is to record and analyse data over a period of multiple seasons, the data warehouse must have the ability to store historical workload and injury records.

Data warehouses use a multi-dimensional model to view data in an n-dimensional space, often referred to as a cube or a hypercube. Cubes are comprised of two primary components: facts and dimensions. A fact is the central component in the structure and is associated with a numeric value, known as a measure. A measure is given perspective with the help of dimensions. Dimensions thus represent the granularity of the measure for each dimension of the cube. The level of granularity specified for each dimension is known as the dimension level [46].

To illustrate this concept, an example of a three-dimensional cube is provided in Figure 2.1. The cube has three dimensions: Time, Player and Injury. Using the example of the Time dimension, data can be aggregated at several levels. Examples of these are date, week, month, and season, and they are referred to as members. The members represented in the figure are thus date, player and injury type. The cells in the figure represent facts and are associated with numerical values called measures. In the example used here, the measure represents the distance covered by a player in kilometres. A fact typically includes several measures, meaning that multiple workload features can be included in this thesis. Measures are used for the task of quantitative analysis. From the figure, it can thus be seen that player 2 covered nine kilometres on the sixth of June 2019 while the player was not injured. However, if the data were aggregated at the season level, it might be seen that player 2 completed 1200 kilometres while not injured, 80 kilometres while carrying an overuse injury, and 0 kilometres while being acutely injured.


Figure 2.1: An illustration of a three-dimensional data cube

The previous example of aggregating data at a higher level of granularity illustrates one of the several operations available in online analytical processing (OLAP). Listed below are brief descriptions of the primary operations used in this project; a small pandas sketch of the roll-up operation follows the list.

1. Drill down disaggregates measures along an axis of the cube to obtain data at a finer granularity.

2. Roll up aggregates measures along an axis of the cube to obtain data at a coarser granularity.

3. Slice removes a dimension from the cube.

4. Dice filters cells according to a Boolean condition.

5. Add measure adds new measures to a cube. The new measures are calculated from other measures or dimensions.
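To make the roll-up operation concrete, the following sketch aggregates a hypothetical one-second fact table to the session level using pandas. The column names are illustrative assumptions only and do not correspond to the schema designed in Chapter 4.

import pandas as pd

# A toy fact table at one-second granularity: one row per player per second.
fact = pd.DataFrame({
    "player":   ["p1", "p1", "p2", "p2"],
    "session":  ["s1", "s1", "s1", "s1"],
    "distance": [4.2, 3.9, 5.1, 4.8],  # metres covered in each one-second interval
})

# Roll up: aggregate the distance measure with SUM up to the session level.
per_session = fact.groupby(["player", "session"], as_index=False)["distance"].sum()

# Drilling down is the inverse operation: returning to the finer one-second rows.
print(per_session)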

As the data warehouse modelled in this project does not represent a collection of data from the entire organisation, it is technically classified as a data mart. For the purpose of this study, however, it is referred to as a data warehouse.

2.2 Data Mining

Data mining is an interdisciplinary subject that has several definitions. In industry and the research milieu, data mining is often seen as being synonymous with the process of knowledge discovery from data, or KDD. This process is comprised of a sequence of steps which aim to extract interesting patterns representing knowledge from data. In other cases, data mining is defined as just one of the multiple steps involved in this process of knowledge discovery. This thesis adopts a broad view of the data mining term and defines it as the process of discovering interesting patterns and knowledge from large quantities of data [17].


2.2.1 The CRISP-DM Process Model

The Cross-Industry Standard Process for Data Mining was first introduced in 2000 and has since become a popular framework for a wide variety of data mining projects. The CRISP-DM reference model is shown in Figure 2.2 and provides an overview of the life cycle of a data mining project. The life cycle is divided into six phases. These are connected by arrows which highlight the respective dependencies between them. The order in which the phases are carried out is not strict, and the outcome of a particular phase will determine which phase should be carried out next [50].

Figure 2.2: The CRISP-DM data mining life cycle

Each phase of the data mining cycle is outlined as follows:

1. Business Understanding

The initial phase of the cycle involves understanding the project objectives and requirements. This knowledge is used to define the data mining problem [50].

2. Data Understanding

This phase involves the collection of data from its sources as well as the process of becoming familiar with the data. The latter usually includes activities such as the identification of data quality problems, the discovery of first insights into the data, and the detection of interesting subsets of data [50].

3. Data preparation

Data preparation encompasses all the activities which are required to construct the final data set that will be used by the modelling tools. The process involves tasks such as attribute selection, data cleaning, attribute construction, and data transformation [50].

4. Modeling

This phase involves the selection of modelling techniques, the generation of a test design, model building, and model assessment [50].

5. Evaluation

In the evaluation phase, the constructed model(s) are evaluated to ascertain whether they achieve the business objectives [50].

6. Deployment

In the final phase, the knowledge gained from the data mining is organised and presented in a way which can be used by the organisation [50].

2.3 Machine Learning in Football

As a result of injuries representing a financial burden to football clubs and having a negative impact on team performance [15], there has been a great deal of research in the field of injury prediction. Moreover, much of this research focuses on the relationship between athlete workloads and injury risk [7]. Athlete workloads are seen as being potentially useful in injury prediction as they fall under the classification of modifiable risk factors. The multifactorial nature of injury prediction poses a significant challenge, as many of the risk factors associated with injury are non-modifiable [7]. These include factors such as age, sex, sport, injury history, and level of competition [34; 21]. One of the drawbacks of non-modifiable factors is that they do not provide coaches and medical staff with tools to prevent injury. For this reason, modifiable factors which are associated with injuries are more useful for predictive modelling, as they provide coaches and medical staff with intervention points which can be used to reduce the risk of injury [7].

2.3.1 Workloads in Predictive Modelling

Athlete workloads are considered potentially useful in predictive injury modelling for the reason that they are modifiable and that they are associated with a risk of injury. Workloads are defined as the cumulative amount of stress placed on an individual from one or more sessions over a period of time, and can be measured along several dimensions [7]. These include external/internal, subjective/objective, and absolute/relative loads. External load is defined as the work completed by an athlete, measured independently of the athlete's internal characteristics [48]. Measuring the external load typically involves quantifying the load of an athlete. It may include measures such as hours of training, distance run, distance run above a specified speed, or number of games played [42]. An advantage of using external loads in predictive modelling is that they can be easily measured using technological devices which are relatively inexpensive and of little inconvenience to athletes. The internal load is a means of quantifying an athlete's response to external loads and is a measure of the relative physiological or psychological stress imposed on an athlete [16]. Internal loads give rise to the second dimension of measurement, which classifies a load as being either subjective or objective. Subjective measures are self-reported and may include measures such as rating of perceived exertion (RPE) or questionnaires on well-being [7]. Objective measures include a wide variety of physiological responses to workloads and range from measuring an athlete's heart rate or blood lactate concentration to conducting biochemical, hormonal and immunological assessments. There is, however, limited research on the use of biochemical, hormonal and immunological measures due to their collection being a costly and time-consuming process [16]. The third dimension of measurement takes into consideration the absolute and relative workloads of athletes. Absolute workloads are simply a summation of an athlete's workload over a specified period [7], whereas relative workloads take into account the history of loading or the fitness of an athlete [42]. Relative workloads include measures which account for variations in workloads, such as the mean standard deviation ratio (MSWR). They also include measures which compare workloads over two time periods, such as the acute:chronic workload ratio (ACWR), and measures which consider the moving average of an athlete's workloads, such as the exponentially weighted moving average (EWMA) [35; 49; 5; 39].
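To make the three relative measures concrete, the sketch below computes them for a hypothetical daily workload series with pandas. The window lengths (7-day acute, 28-day chronic) and the EWMA span are illustrative assumptions for this sketch, not the parameter values used later in this thesis.

import pandas as pd

def relative_workloads(daily_load: pd.Series) -> pd.DataFrame:
    """daily_load: one absolute workload (e.g. total distance) per day for one player."""
    # Exponentially weighted moving average (EWMA) of the daily load.
    ewma = daily_load.ewm(span=7, adjust=False).mean()

    # Acute:chronic workload ratio (ACWR): short-term rolling mean divided
    # by a longer-term rolling mean of the same load.
    acute = daily_load.rolling(window=7, min_periods=7).mean()
    chronic = daily_load.rolling(window=28, min_periods=28).mean()
    acwr = acute / chronic

    # Mean standard deviation ratio (MSWR): rolling mean divided by rolling
    # standard deviation, capturing the monotony of the workload.
    mswr = acute / daily_load.rolling(window=7, min_periods=7).std()

    return pd.DataFrame({"ewma": ewma, "acwr": acwr, "mswr": mswr})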

2.3.2 A Machine Learning Approach

In recent years, there has been a dramatic increase in the use of machine learning to gain insight into the nature of injury and performance in professional sports [6]. This trend aims to exploit complex patterns underlying the available data and has been sparked by several factors. Firstly, traditional statistical techniques used in the past fail to account for the complex non-linear relationships which exist between predictor variables [39]. Secondly, the use of Electronic Performance and Tracking Systems (EPTS) provides researchers with a wealth of data depicting multiple aspects of an athlete's movements. Lastly, data science is a rapidly emerging area that is providing more evidence-based decision-making across many industries [24].

Predictive modelling in football aims to provide coaches and trainers with practical, usable and interpretable models to aid decision making [28]. With this in mind, the task of predicting injury presents numerous challenges. One of these is the trade-off between a model's accuracy and its interpretability [39]. Predictive models should be interpretable because researchers and coaching staff need to know why an athlete may be at risk. This enables them to change training strategies to prevent high-risk situations from occurring. For this reason, "black box" approaches such as neural networks are considered impractical [39]. On the other hand, it is of vital importance that predictive models are highly accurate, as "false alarms" can negatively impact training strategies and athlete performance. Recent research thus makes use of a variety of modelling techniques, with neural networks, decision trees and support vector machines being the most popular in the past five years [6].

Another challenge in predicting injury is the severe class imbalance between the injury and non-injury classes [5]. Strategies such as oversampling, undersampling, and synthetic data generation have been used to correct these imbalances [5; 39]. However, little is known about the volume of data needed to build accurate predictive models [5].


2.4 Related Work

Despite the recent move towards using machine learning for injury and performance prediction, there is still limited work which directly focuses on multivariate relationships between athlete workloads and injury risk [6]. This is surprising given the extensive body of univariate research that has been conducted in this area. Gabbett et al. have found that rugby players are at a higher injury risk when their workloads are above certain thresholds [13; 12; 11]. Ehrmann et al. found that soccer players cover a greater distance per minute in the weeks preceding an injury compared to their seasonal averages [8]. Anderson et al. found a strong correlation between the monotony of basketball players' workloads and the risk of injury [1]. Similarly, Brink et al. observed that injured young soccer players had higher values of monotony in the weeks preceding an injury than non-injured players [4].

Two machine learning studies, however, are identified at the time of writing this thesis and are considered particularly relevant to the work presented here. In a three-year study which uses GPS training loads and perceived exertion ratings to predict injury in elite Australian football players, Carey et al. [5] conclude that prediction models show poor ability to predict injuries. In another, one-year study which uses GPS workloads and a previous injury feature to predict future injuries in professional Italian football players, Rossi et al. [39] conclude that their prediction model shows a good trade-off between accuracy and interpretability. Given the high degree of similarity between the approaches taken in these studies, it is surprising that they produce conflicting results. As each of the studies is conducted using data gathered from a single football team, it should be expected that the study with the larger data set would produce models with a superior ability to predict incidents of injury. This is, however, not the case, and it leads to several questions concerning the models' ability to generalise. The discrepancy in results highlighted here is one of the motivating factors for the work conducted in this thesis, which aims to predict future injury in football players from a single season of workload and injury data. Moreover, this study aims to conclude whether it is possible to predict injury using workload and injury data collected from a single season.

In an effort to reproduce much of the work conducted in the two studies above, a brief account of their similarities and differences is presented. Starting with the data sets, both studies create models using data which include a combination of absolute and relative workloads. The absolute workloads common to both studies are distance and high speed running distance. Similarly, both studies make use of the relative workload features EWMA, ACWR, and MSWR for many of the absolute features. Additionally, Carey et al. include the relative features strain and rolling averages, whereas Rossi et al. introduce a multivariate previous-injury feature. Further similarities between the studies include the use of logistic regression and random forest prediction algorithms. Rossi et al. additionally make use of decision trees, whereas Carey et al. include generalised estimating equations and support vector machines. Both studies make efforts to compensate for class imbalance: the Rossi et al. approach makes use of synthetic data generation using ADASYN, while Carey et al. use both undersampling and oversampling techniques. Finally, dimensionality reduction is also used in both studies to compensate for the highly correlated feature sets. Rossi et al. use recursive feature elimination to eliminate irrelevant features and enhance interpretability. Carey et al., however, use principal component analysis to reduce the dimensionality of the data set.

One additional difference between the two studies is worth mentioning. Rossi et al. maintain that only a subset of three features is needed to predict injury successfully. These include the EWMA of a player's high speed running distance, the MSWR of a player's distance, and the EWMA of the previous-injury feature. The first two features are included in the data sets of both studies, whereas the last feature is only included in the Rossi et al. study. Given that this is shown to be the most significant prediction variable, it may account for the difference in results achieved by the two studies.


Chapter 3

Source Data

The data for this project is collected from a professional Norwegian football team during the 2019 season, spanning from January to December. It is comprised of workload, injury, illness, and log data from three sources, namely, a GPS tracker, a training log, and an injury log. Players included in the study underwent training and competition without any interference from the research group. A total of 3005 GPS sessions are recorded from 38 players. Additionally, 4726 log entries, 57 incidents of injury, and 37 cases of illness are recorded during the season.

This chapter provides a detailed account of the three data sources mentioned above. Starting with the GPS tracker, a description of each of the recorded features is given. A brief account of the injury, illness, and log data is then provided before commenting on potential issues that need to be considered before processing the data.

3.1 GPS Data

Players are monitored during both training and competition using the Catapult OptimEye X4 athlete tracker, a device placed between the shoulder blades in a specially designed vest. The device is equipped with a 10 Hz GPS tracker, 3-axis 100 Hz accelerometers, 3-axis gyroscopes, and 3-axis magnetometers, which together generate a set of training features describing different aspects of a player's workload. Of the numerous features which can be extracted from the tracking system, six are chosen for this project. Furthermore, the time interval for which features are extracted is specified as one second, meaning that all training features are aggregated over one-second time intervals. Files are generated as comma-separated values (CSV) files, and one file is generated for each session. A single file thus contains workload data for all players that participated in the given session. Each line represents a single second of training and contains the workload data of a given player for the specified time interval.

Table 3.1 presents an overview of the features generated by the GPS unit. In addition to a player's name, each line of a CSV file also contains six workload recordings, a timestamp, and the interval number for the session.


Interval: Interval number for the current session

Time: Interval start time

First Name: Player's first name

Last Name: Player's last name

Acceleration Load: Sum of acceleration values for a specified period

Player Load: Sum of the accelerations across all axes of the accelerometer

Total Distance: Distance covered in metres for the specified interval

V4 Distance: Distance covered at a speed of 20-25 km/h

V5 Distance: Distance covered at a speed in excess of 25 km/h

HSR Distance: The sum of V4 Distance and V5 Distance

Table 3.1: Features extracted from Catapult OptimEye X4

Two of the features presented above require further explanation:

Player Load
Player load is defined as the sum of the accelerations across all axes of the tri-axial accelerometer during movement [26]. It serves as an indication of the amount of work done by a player over a specified time interval, and can be expressed as follows:

PlayerLoad = \sum_{t=0}^{n} \sqrt{(fwd_{t+1} - fwd_{t})^2 + (side_{t+1} - side_{t})^2 + (up_{t+1} - up_{t})^2}    (3.1)

where fwd, side and up are the acceleration values in the forwards, sideways and upwards directions respectively, and t is time.
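A direct, minimal translation of Equation 3.1 into Python is sketched below. The function name and the plain-list inputs are illustrative only, and the device vendor may apply additional filtering or scaling that is not reproduced here.

import math
from typing import Sequence

def player_load(fwd: Sequence[float], side: Sequence[float], up: Sequence[float]) -> float:
    """Sum of the magnitude of the change in acceleration across the three axes (Eq. 3.1)."""
    total = 0.0
    for t in range(len(fwd) - 1):
        total += math.sqrt((fwd[t + 1] - fwd[t]) ** 2
                           + (side[t + 1] - side[t]) ** 2
                           + (up[t + 1] - up[t]) ** 2)
    return total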

Acceleration Load
Acceleration load is defined as the sum of accelerations over a specified period [25]. It serves as an indication of acceleration volume, or the total amount of speed change during a specified interval of time. Acceleration load is calculated from smoothed velocity data at 0.2-second intervals and treats both accelerations and decelerations as positive values. By deriving its value from smoothed velocity data, acceleration load values are less susceptible to noise than values produced from raw positional data, thus providing a more realistic representation of changes in speed.
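The vendor's exact algorithm is not documented in the source data description; the sketch below is an assumption based on the definition above, summing absolute speed changes from velocity sampled every 0.2 seconds, and is not Catapult's implementation.

from typing import Sequence

def acceleration_load(velocity: Sequence[float], dt: float = 0.2) -> float:
    """Sum of |acceleration|; accelerations and decelerations both contribute positively."""
    return sum(abs(velocity[i + 1] - velocity[i]) / dt
               for i in range(len(velocity) - 1))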

3.2 Injury Data

Player injuries and illnesses are recorded throughout the study using software called AthleteMonitoring. Incidents are reported by the team's medical staff, and each entry contains information describing the nature and duration of an injury or an illness. Table 3.2 summarises the key variables recorded by the application.


Id: Player ID

Team Name: Name of the sports team

Type: Either injury or illness. An injury is either acute or overuse

Legacy Diagnosis: Diagnosis of the injury/illness

Diagnosis: Diagnosis of the injury/illness

Body Area: Body area affected by the injury/illness

Mechanism: Mechanism by which the injury occurred

Classification: Either new or pre-existing

Date of Injury: Date of occurrence

Recovery Date: Date of recovery

Missed days: Number of missed training/competition days

Severity: Classified as slight, minimal, mild, moderate or severe

Activity: Activity during which the injury occurred

Participation: Level of training/match participation during the injury period

Table 3.2: Features recorded for player injuries and player illnesses

3.3 Training Log Data

The third source of data is also captured using the AthleteMonitoring software and involves a self-evaluation that is completed by players after each session. Evaluations serve as an indication of how closely players have followed their planned training sessions, as well as an indication of their perceived level of enjoyment. The features extracted from the evaluation are summarised in Table 3.3.

Activity: Type of training performed

Planned duration: The planned duration of the training

Duration: The actual duration of the training

Planned difficulty: The planned difficulty of the training

Difficulty: The perceived difficulty of the training

Planned load: The planned load of the training

Load: The calculated load of the training

Planned enjoyment: The planned enjoyment of the training

Enjoyment: The player's perceived enjoyment

Table 3.3: The features recorded by a player after the completion of each training session

3.4 First Impressions

Previous work has identified three training/injury features associated with the risk of injury, namely, the exponentially weighted moving average of a player's high speed running distance, the mean standard deviation ratio of a player's total distance, and the exponentially weighted moving average of a constructed injury feature [39]. The raw data necessary for calculating the features mentioned above are total distance, high speed running distance, the number of injuries incurred by a player, and the number of days since returning to training after a previous injury. All of these can be calculated from the available source data. Both distance and high speed running distance are provided by the GPS unit and can be aggregated to the level of granularity specified. The number of injuries received by a given player can be calculated by summing the player's injury records in Table 3.2, and the number of days since a player's return from injury can be obtained by calculating the difference between a player's recovery date and the date of the current session.

Additional data may be required for research beyond the scope of this thesis. This includes data such as player age, weight, height, and position, as well as possible performance indicators. The topic of additional data is covered in further detail in the section on data warehouse modelling.

All of the data sources mentioned above are generated as CSV files, each of which has several issues that need to be considered during the data cleaning process. These issues are briefly highlighted below; an illustrative cleaning sketch for the GPS files follows the first list.

GPS data:

• Duplication of data. Files may include duplicate lines of data. Multiple instances of duplication exist in the provided source data and must be resolved before loading the data into the data warehouse.

• Missing and wrongly ordered parameters. The generation of the CSV files is a manual task which is carried out for each session and involves specifying which workload features are to be included. Human errors result from the mundane nature of the job. Errors such as missing parameters and wrongly ordered parameters are present among the files used in this thesis.

• Non-training data. Another human error encountered in the data set is that which results from players forgetting to turn off their GPS tracking units. In these instances, multiple lines of data are generated outside of a session, resulting in lines of non-training data, which may easily outnumber the lines of actual training data.

• Players registered with multiple profiles. As a result of difficulties encountered with the training software, multiple profiles have been created for specific players. A player's name is used to match training data with data gathered from the other two sources. For this reason, it is essential to resolve naming conflicts to guarantee that training data maps to the correct log and injury/illness data.

• Players without data. Due to a shortage of GPS trackers, two players have no recorded training data. Data from the other two sources does, however, exist for these players, a fact which needs to be taken into consideration when loading data into the data warehouse.

Log data:

• Data spread over multiple lines. Numerous data entries are spread over multiple lines. This needs to be taken into consideration when reading data from a file.

• Players registered with multiple profiles. Some players have been registered with multiple profiles and therefore have multiple profile-ids. Resolving such conflicts is of vital importance when mapping log data to the correct data from other sources.


• Players without log data. A third point of consideration for the log data is that several players do not have any training log records.

Injury/Illness data:

• Incorrect dates. There are multiple cases in which an injury's recovery date is recorded as occurring before the date of injury.

• Missing injury type. The type of injury incurred by a player is classified either as acute or overuse. There is one entry in the source data which is not classified.


Chapter 4

Data Warehouse Modelling

There is no consensus on the phases that should be followed during the data warehouse design process, and for this reason, there exists no standard methodological framework. In this thesis, which implements a relatively simple data warehouse, the design process adopts a methodology proposed by A. Vaisman and E. Zimanyi, which is both well documented and easy to understand [46]. The methodology is based on the assumption that data warehouses are a specialised form of traditional databases that focus primarily on analytical tasks. Their design thus follows the same phases of traditional database design, namely, requirements specification, conceptual design, logical design, and physical design. An overview of the design process is presented in Figure 4.1.

Figure 4.1: Phases in the data warehouse design process

Furthermore, the design process in this thesis is carried out using an analysis/source-driven approach, which takes into consideration both the analysis needs of the users as well as the data which are available from the underlying source systems. This is considered an optimal approach for the current project as it ensures that users' needs for future studies are met while ensuring that all available data are taken into consideration. Users are thus provided with the choice of including data which may not have been considered during the initial goal-setting phase.

A final point of clarification is that of the general method of design, for which there exist two alternatives, namely, top-down and bottom-up design. Top-down design focuses on merging user interests from an entire organisation in order to create a data warehouse which is cross-functional in scope. In the case of NIH, this would involve modelling a schema that would include training data from the organisation's entire scope of sports research. Such a schema may subsequently be tailored into several separate data marts, each focusing on a specific area of research. This project, however, adopts a bottom-up design method, focusing only on a subset of the organisation's data. The resulting product is a data mart that is specifically tailored to aid analysis of the organisation's football data.


4.1 Requirements Specification

As the earliest step in the design process, the requirements specification phase is responsible for determining which data is needed and how it is organised. It is a crucial step in identifying the essential elements of a multidimensional schema and has a significant impact on the data warehouse's ability to perform its required tasks. When using an analysis/source-driven approach, tasks from the analysis-driven approach and the source-driven approach are combined and used in parallel. The combination of these two approaches ensures an optimal design by taking into account both the analysis needs of the user(s) as well as the data which is available for creating a multidimensional schema [46].

From Figure 4.2, it can be seen that the requirements specification phase begins with the identification of the organisation's users as well as the data sources available. As it is the intention of this thesis to serve a variety of research needs, all users must be identified to ensure that these needs are met. The next step involves defining the project's goals and applying a derivation process to the source data. This step results in the initial identification of facts, measures, and dimensions. Finally, the information gathered from the previous two steps is used to produce a separate requirements specification for each of the two approaches.

Figure 4.2: Steps in the requirements specification phase using an analysis/source-driven approach

Starting with the analysis-driven approach, the steps of both the analysis-driven and the source-driven approaches are described in detail below:

4.1.1 Analysis Driven Approach

Identify users

Two primary user groups are identified for this project. The first of these are researchers at NIH who are interested in gaining insights into the relationships between training workloads, injury, illness, and performance. The second group includes computer scientists at UIO who are interested in creating injury prediction models based on player workload and injury data.

Analysis needs

This step begins with the identification of project goals, which plays an essential role in converting user needs into data elements. From the two user groups identified above, three project goals are formulated and listed as follows:

• To identify associations between workload features and player injury.

• To predict the onset of injury using workload features identified in previous studies.

• To identify associations between player biometrics, workload features and player injury/illness.

The goals established above are broken down into a set of sub-goals to identify both common and unique elements among the primary goals. The sub-goals generated for this project are listed below:

1. To be able to extract a player’s GPS training features at multiple granularities.

2. To be able to extract injury and illness features for each player.

3. To be able to extract a player’s biometrics.

The next task in defining user analysis needs is to operationalise the above sub-goals by defining several representative natural language queries. These aim to capture the functional requirements of the data warehouse, and are listed below with respect to the goal they are derived from:

1. To be able to extract a player’s GPS training features at multiple granularities.

(a) The total distance, player load, acceleration load, V4 distance, V5 distance and HSR distance covered by a player during a given session, week, month, or season.

(b) Compare the workload features for a given time period (e.g. one minute) with the average of that feature over another time period (e.g. a session).

2. To be able to extract injury and illness features for each player.

(a) The duration of an injury/illness.

(b) The number of days since recovering from an injury/illness.

(c) The number of injuries/illnesses for a given player.

(d) The number of days missed as a result of an injury or an illness.

(e) To be able to distinguish between acute and overuse injuries.

3. To be able to extract player biometrics.

(a) The age, weight, and height of a player.

The final task in determining the analysis needs involves breaking down the natural language queries in order to define the facts, measures, and dimensions of the warehouse. As this is a manual process, a brief discussion of each set of queries is given below. This includes information about which elements need to be aggregated as well as the aggregation functions used.


1. (a) The total distance, player load, acceleration load, V4 distance, V5 distance, and HSR distance covered by a player during a given session, week, month, or season.

From the above query, a fact and two dimensions are identified. The player element is categorised as a dimension. In contrast, the elements total distance, player load, acceleration load, V4 distance, V5 distance, and HSR distance are identified as measures, which together represent a fact. The query also highlights four levels of aggregation, namely, session, week, month, and season, which naturally form the levels of a time hierarchy. Hence a second dimension for representing time is identified. Finally, the totals for each of the measures can be calculated using addition, and the aggregation function is specified as Sum. A SQL sketch of this kind of aggregation is given after this list.

(b) Compare the workload features for a given time period (e.g. one minute) with the average of that feature over another time period (e.g. a session).

No additional facts or dimensions are identified from the above query. However, the need for finer levels of granularity is highlighted. These levels are identified as second and minute, and form a natural extension of the time hierarchy identified above.

2. (a) The duration of an injury/illness.

(b) The number of days since recovering from an injury/illness.

(c) The number of injuries/illnesses for a given player.

(d) The number of days missed as a result of an injury or an illness.

(e) To be able to distinguish between acute and overuse injuries.

Two additional dimensions, namely, injury and illness, can be identified from the list of queries presented above. In the case of injury, a hierarchy, which includes the level injury-type, is also identified. The type-level enables users to distinguish between incidents of acute and overuse injuries. Furthermore, an additional level for the time dimension is identified. The level identified is the date-level, and is considered essential for calculating the number of days since a player's last injury. No hierarchy is identified for the illness dimension. This is largely a result of the relative simplicity of the research being conducted, which does not require a detailed classification hierarchy for medical illness.

3. (a) The age, weight, and height of a player.

The above query does not aid in the identification of any additional facts or dimensions. The query is, however, important in the sense that the information required to answer it is not directly available from the source data. This topic is dealt with later in the modelling process.
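As a concrete illustration of query 1(a), the SQL sketch below aggregates the six measures per player and per week. It is included for clarity only and anticipates the relational names introduced later in the logical design (a fact table training_data joined to a date dimension); the season and week attributes of the date dimension are assumptions made for the purpose of the example.

SELECT t.player_id,
       d.season,
       d.week,
       SUM(t.total_distance)    as total_distance,
       SUM(t.player_load)       as player_load,
       SUM(t.acceleration_load) as acceleration_load,
       SUM(t.v4_distance)       as v4_distance,
       SUM(t.v5_distance)       as v5_distance,
       SUM(t.hsr_distance)      as hsr_distance
FROM training_data t
JOIN date_dimension d ON t.datekey = d.datekey
GROUP BY t.player_id, d.season, d.week;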


Document requirements specification

The requirements specification of the analysis-driven approach ends with the documentation of information gathered during the previous two steps. Information is summarised in the form of two tables, one table for facts and measures, and one table for dimensions and hierarchies.

Table 4.1 provides a summary of the identified fact and its measures. Measures identified during the previous step, as well as their aggregation functions, are included. Additionally, an indication of which queries apply to each measure is provided. It can be seen that all measures apply to queries 1a and 1b, which focus on the granularity of the data extracted. Similarly, it is seen that queries 2a to 3a do not concern any of the measures in the table; this is illustrated by marking the relevant cells in the table with a cross. The reason these queries are not applicable is that they focus on extracting information specific to the player, injury, and illness dimensions.

Fact           Measure          Aggregation   Analysis scenarios
                                function      1a  1b  2a  2b  2c  2d  2e  3a
Training data  Total Distance   Sum           ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗
Training data  Player Load      Sum           ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗
Training data  Acc Load         Sum           ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗
Training data  V4 Distance      Sum           ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗
Training data  V5 Distance      Sum           ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗
Training data  HSR Distance     Sum           ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗

Table 4.1: Fact summary

Table 4.2 provides an overview of the dimensions. From the table, it can be seen that four dimensions are identified, namely, the Date-Time, Injury, Illness, and Player dimensions. Both the Date-Time and Injury dimensions include hierarchies, for which seven and two levels are seen respectively. The Time hierarchy provides users with the ability to extract data at multiple levels of aggregation, ranging from one-second intervals at the finest level to an entire season's data at the coarsest level. The Type level seen in the Classification hierarchy allows users to differentiate between acute and overuse injuries. No hierarchies are seen for the Illness and Player dimensions. As seen in Table 4.1, Table 4.2 also provides an indication of which queries apply to each dimension.

Dimension   Hierarchies and levels                                    Analysis scenarios
                                                                      1a  1b  2a  2b  2c  2d  2e  3a
Date-Time   Time: Second → Minute → Session → Date →                 ✓   ✓   ✓   ✓   ✗   ✓   ✗   ✗
            Week → Month → Season
Injury      Classification: Injury → Type                             ✗   ✗   ✓   ✓   ✓   ✓   ✓   ✗
Illness     -                                                         ✗   ✗   ✓   ✓   ✓   ✓   ✗   ✗
Player      -                                                         ✓   ✓   ✓   ✓   ✓   ✓   ✗   ✓

Table 4.2: Dimension summary


4.1.2 Source-Driven Approach

Once the requirements specification for the analysis-driven approach has been completed, the process is then conducted using a source-driven approach. The steps involved are discussed in detail below.

Identify source systems

This step involves identifying the project's data sources and determining their respective reliability, availability, and update frequencies. As this process is covered in detail in the section on source data, it is not repeated here.

Apply derivation process

The derivation process involves deriving facts, measures, and dimensions from the source data. Facts and measures are usually associated with elements that are frequently updated. This makes the data gathered from the GPS tracker a prime candidate to be a fact, and each numerical data feature a potential measure. The training features total distance, acceleration load, player load, V4 distance, V5 distance, and HSR distance are thus classified as measures belonging to the fact "Training data".

Each GPS entry is also associated with a time and a player, both of which are potential dimensions. Training data are associated with both a player and a time through a many-to-one relationship. The former results from one player being able to have many training entries, but each training entry only being associated with one player. The latter is the result of many training sessions being able to take place at the same time, whereas each recorded GPS entry may only occur at a single given time.

Injury data is another candidate dimension and is also associated with the GPS data through a many-to-one relationship. This relationship arises from a non-injury being associated with many GPS entries. The final set of source data available is that of the player's training log. It provides another candidate dimension and is associated with the training data in a one-to-many relationship, for the reason that one training log is used to summarise multiple GPS entries.

The next step in the derivation process involves the analysis of potential dimensions and hierarchies. The first task involves specifying the levels of granularity that need to be included in the Time dimension. The GPS source data are summarised over one-second intervals and provide the finest level of granularity for the data warehouse. Coarser levels are defined according to user needs, and for this thesis, it is essential to be able to summarise data over both the level of a session and a season. One of the primary interests of this thesis is to be able to analyse data at several different granularities. Logical candidates for representing fine levels of granularity would thus include the levels minute and hour, whereas coarser levels would include session, week, month, and season. The Time dimension resulting from the source-driven approach thus gives rise to the following levels:

Second → Minute → Hour → Session → Date → Week → Month → Season

Although the scope of this project is initially limited to a single football team, the data warehouse can easily be modelled to include additional teams. For this reason, a hierarchy is included in the study's Player dimension. The following levels are included:

Player → Team → Sport

On closer inspection of the injury source data, it becomes clear that several levels can be identified. The hierarchy identified in this dimension falls under the classification of an unbalanced hierarchy and is discussed in greater detail in the conceptual design phase. Starting with the coarsest level, an injury entry can be classified as either being an injury or a non-injury. An injury can then be further classified as an illness or a physical injury, and a physical injury is categorised as being either an overuse or an acute injury. This information thus identifies a hierarchy consisting of the following levels:

Injury → Sub-type → Type → Category

Finally, no hierarchies are identified in the Training Log dimension.

Document requirements specification

As with the analysis-driven approach, the requirements specification of the source-driven approach ends with the documentation of information gathered during the previous two steps. Table 4.3 provides a summary of the information discussed above. One fact containing six measures is identified and is associated with four dimensions through one-to-many relationships. Hierarchies are presented for the dimensions Date-Time, Injury, and Player, whereas no hierarchies are identified for the Training Log dimension.

Facts          Measures            Dimension     Cardinality   Hierarchies and levels
Training data  Total Distance      Date-Time     1:n           Time: Second → Minute → Hour → Session →
               Acceleration Load                                Date → Week → Month → Season
               Player Load         Injury        1:n           Classification: Injury → Sub-type → Type →
               V4 Distance                                      Category
               V5 Distance         Player        1:n           Membership: Player → Club → Sport
               HSR Distance        Training Log  1:n           -

Table 4.3: Summary of the multidimensional elements using a source-driven approach

4.2 Conceptual Design

The purpose of the conceptual design schema is to present data requirements in a clear and concise manner that can be understood by users. By avoiding implementation details, the conceptual schema facilitates communication between users and designers, as well as the maintenance and evolution of a database. Due to the lack of a universally adopted conceptual model for multidimensional data, data warehouse design frequently omits this phase and usually carries out design directly at the logical level [46]. As mentioned above, the conceptual modelling phase is incorporated into the design process of this project, making use of the MultiDim model, which is able to represent all elements in the data warehouse. A graphical notation is provided in the appendices.


Figure 4.3: Steps in the conceptual modelling phase using an analysis/source-driven approach

Following an analysis/source-driven approach, it can be seen from Figure 4.3 that the conceptual modelling phase begins with the creation of two conceptual schemas, one from the analysis-driven approach and the other from the source-driven approach. These schemas are then matched before the final schema is defined, and mappings between the source data and the data warehouse are specified. Before addressing the steps just mentioned, it is necessary to look at two of the conceptual design issues encountered in this thesis. The first of these is classified as a many-to-many dimension and arises during the source-driven approach. The second is known as an unbalanced hierarchy and is encountered during both the source-driven and analysis-driven approaches.

Many-to-Many Dimensions and Nonstrict Hierarchies

One challenge which arises from the project's data set is that of modelling the relationship between a player's workload data and incidents of injury and illness. The issue encountered here is known as a many-to-many dimension and results from the fact that a player is able to be both injured and ill at the same time. This concept is illustrated in Figure 4.4 and may lead to a problem known as double counting. Double counting occurs when a roll-up operation reaches a many-to-many relationship and results in a single measure being included in multiple aggregation operations, thus breaking the condition of disjointness and resulting in a hierarchy not being able to ensure summarisability. Double counting may arise as a result of a schema not meeting the requirements of the first multidimensional normal form (1MNF). This rule requires that each measure be uniquely identified by its set of associated leaf levels, and serves as a basis for correct schema design.

Figure 4.4: Illustration of the many-to-many relationship between a fact and a dimension


As an example of the issue described above, Table 4.4 is included to highlight the problem of double counting. The data represented in the table may result from selecting the distance covered during sessions in which a player was registered as either injured or ill. Here it can be seen that player 1 covered a total distance of 5000 meters during training session T1. As the player is registered as both injured and ill at the time of training, two tuples are generated for a single session. If the data set were to be aggregated over the time dimension, a total distance of 5000 + 5000 = 10 000 meters would be generated instead of the expected value of 5000 meters.

Time Player Distance Injury/Illness

T1 1 5000 Injury 1

T1 1 5000 Illness 2

Table 4.4: An example of double-counting in a many-to-many dimension.

One method for resolving this issue is to decompose the Injury/Illness dimension into two separate dimensions, one for injury and another for illness. This concept is illustrated in Figure 4.5, where it can be seen that training data is associated with both an Injury dimension and an Illness dimension through a one-to-many relationship. More specifically, this relationship stipulates that a training tuple is related to precisely one injury and precisely one illness. In contrast, an injury or an illness may relate to one or more training tuples. This model thus satisfies the case in which a player may be both injured and ill at the same time, while also meeting the requirements of 1MNF. As the schema specifies that each training entry relates to precisely one injury and one illness, a non-injury and a non-illness tuple are required to satisfy this condition for training sessions in which a player is not ill or injured. Similarly, a non-training tuple is required for every incident of injury or illness where a player was absent from training.

Figure 4.5: A decomposition of the injury/illness dimension
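To make the effect of this decomposition concrete, the following SQL sketch contrasts the two designs. The table and column names are illustrative only and do not correspond to the final warehouse schema; the point is that the combined design produces one joined tuple per injury and per illness, whereas the decomposed design produces exactly one tuple per training entry.

-- Combined injury/illness dimension: a row that matches both an injury and an
-- illness is joined twice, so the 5000 m from Table 4.4 is counted twice.
SELECT t.player_id, SUM(t.distance) as total_distance
FROM training t
JOIN injury_illness ii ON ii.player_id = t.player_id AND ii.date = t.date
GROUP BY t.player_id;   -- returns 10 000 m

-- Decomposed dimensions: each training row carries exactly one injury_id and
-- one illness_id (placeholders for "none"), so each measure is counted once.
SELECT t.player_id, SUM(t.distance) as total_distance
FROM training t
JOIN injury  inj ON inj.injury_id  = t.injury_id
JOIN illness ill ON ill.illness_id = t.illness_id
GROUP BY t.player_id;   -- returns 5000 m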

Unbalanced Hierarchies

Another issue that is encountered during the modelling of the Injury and Illness dimensions is that of unbalanced hierarchies, which arises when at least one level in a dimension's hierarchy is not mandatory. This results in parent members within the hierarchy not having any child members, as is seen for the hierarchy of the Injury dimension in Figure 4.6. As described above, the Injury dimension is required to include non-injuries in order to satisfy the one-to-many modelling constraint between itself and training entries. This results in a non-injury member with no child members, and a hierarchy which does not satisfy summarisability conditions. Using the injury hierarchy in Figure 4.6 as an example, it becomes clear that when all measures in the fact table are associated with the "Type" level, aggregation of these measures into the "Category" level is only possible for the injury member and not for the non-injury member. A similar problem is encountered for the Illness dimension, and its solution is discussed in the section on logical design.

Figure 4.6: Unbalanced hierarchy

Having presented some of the initial challenges associated with the modelling of the conceptual schema, a brief account of each of the steps involved in generating the final conceptual schema is given.

4.2.1 The Initial Analysis-Driven Conceptual Schema

An initial conceptual schema is generated from the analysis-driven requirements specified in the requirements phase and is presented in Figure 4.7. Given that the analysis for this project is centred around training/competition data, a fact table called "Training data" is created and contains six measures, namely, total distance, player load, acceleration load, V4 distance, V5 distance, and HSR distance. All of these can be directly extracted from the GPS source data and are aggregated using the SUM operation. Each entry in the fact table is associated with four dimensions through a one-to-many relationship, meaning that each fact entry is associated with only one entry in each of the four dimensions. In contrast, a given entry in any of the dimensions may be associated with multiple entries in the fact table. For example, each workload data entry in the fact table is only associated with one player, whereas a single player is expected to have multiple lines of recorded workload data.

The Date-Time dimension contains five levels of aggregation. There are a few points that are worth noting with regard to this dimension. Firstly, the levels Second, Minute, and Date are represented in a single level despite being identified as separate levels during the requirements specification phase. Secondly, the Day level may include one or more sessions as a result of multiple sessions being able to take place on the same day. Thirdly, the levels Week and Month are modelled as a parallel hierarchy for the reason that a week may extend over two months, and thus cannot be aggregated directly into the Month level. Finally, a Season is considered the highest level of aggregation for the Time hierarchy and is equivalent to a calendar year running from 1 January to 31 December.

The Date-Time dimension is of particular interest in this study as data is extracted at one-second intervals and provides a user with the opportunity to conduct analysis at a relatively high degree of precision.


Figure 4.7: The initial conceptual model using an analysis-driven approach

Including both dates and times in a data warehouse schema provides users with a vast range of granularities at which data can be extracted. It also provides designers with several choices regarding how date and time are to be modelled. These options are discussed in further detail in the logical design phase. The most straightforward alternative is taken at the conceptual level as a means of facilitating communication between users and designers.

The Injury and Illness dimensions are represented as separate dimensions, each including two levels of aggregation, namely, Category and Type. Despite no hierarchy being identified during the requirements specification phase, the conceptual model includes a hierarchy of two levels in the Illness dimension. The Category level is similar for both injury and illness and is introduced to satisfy the one-to-many relationship between the Illness and Injury dimensions and the fact table. Thus, each entry in the fact table is classified as either a non-injury/illness or as an injury/illness. This relationship is expressed using the exclusive relationship symbol seen in the figure. In the case of the Injury dimension, the Type level is used to distinguish between overuse and acute injuries. The Type level in the Illness dimension is included for the purpose of potential future classification of illnesses.

The final dimension included in the schema is the Player dimension, which includes biometric data such as the age, weight, and height of a player. No hierarchy is included for this dimension.

4.2.2 The Initial Source-Driven Conceptual Schema

The initial conceptual schema generated from the project's source data is presented in Figure 4.8. It is very similar to the schema generated from user requirements, and therefore only a brief discussion of their differences is presented here.

Figure 4.8: The initial conceptual model using a source-driven approach


A Training Log dimension is included in the source-driven schema and contains features from the log completed by the players after each session. It is related to the fact table through a one-to-many relationship, meaning that each log entry may be associated with one or more fact entries. In contrast, each training entry is only related to one log entry. This dimension does not include a hierarchy.

The Player dimension introduces a hierarchy that has two levels of aggregation, namely Club and Sport. Furthermore, the player table includes player name features, whereas no biometric features for a player are available here.

4.2.3 Conceptual Schema Matching

The process of matching the two initial conceptual schemas presented above is relatively straightforward due to their similarity. The schemas' respective fact tables, as well as their Date-Time, Injury, and Illness dimensions, are identical. For this reason, they can be added to the final conceptual schema without any changes being made. The Training Log dimension from the source-driven schema provides additional information that may prove useful for several reasons, the most important being that it includes the rating of perceived exertion (RPE) reported by a player for each session. This dimension can also be included in the final conceptual schema without needing to make any changes. Finally, the Player dimensions from each of the two initial schemas have several differences that must be resolved. The player biometrics represented in the analysis-driven schema are not available from the source systems provided and are therefore not included in the source-driven schema. Biometric data can, however, be obtained from external sources, and for this reason, are included in the Player dimension of the final conceptual schema. The hierarchy presented in the source-driven schema is not needed for this thesis. However, it may prove useful for future research if the project should include data from additional clubs and different sports. As it is of little cost to include the two additional levels in the given hierarchy, they too are included in the final schema.

4.2.4 The Final Conceptual Schema and Mappings

A final conceptual schema is generated from the analysis-driven and source-driven schemas and is presented in Figure 4.9. It consists of a single fact table that comprises six measures and is associated with five dimensions, namely, Date-Time, Injury, Illness, Player, and Training Log.

The Injury, Illness, and Date-Time dimensions are identical to those modelled in the analysis-driven and source-driven schemas and are discussed in detail above. The most important points from this discussion were that the date and time of a training entry are modelled in a single dimension to facilitate communication between technical and non-technical users, and that the Injury and Illness dimensions both contain an unbalanced hierarchy, resulting in the existence of exclusive relationships between levels. The dates and times of each training entry can be extracted from the GPS source data. It can be seen from Table 3.1 that GPS files include a timestamp for each entry, while the date for each training session can be extracted from the file header. Similarly, injury and illness data can be extracted from the injury/illness file produced by the AthleteMonitoring application. Injury and illness data need to be separated during the extraction, transformation and load (ETL) process and loaded into their respective dimensions.

The Training Log dimension from the source-driven schema is included in the final schema. Despite this data not being directly related to the current analysis needs of the project, it is included for the reason that it may prove relevant for future work. Log data can be extracted from the training log file generated by the AthleteMonitoring application. This file contains data for all log entries submitted during the study.


Figure 4.9: The final conceptual model generated from the analysis-driven and the source-driven models

Finally, the Player dimension is created as a combination of elements from the source-driven and analysis-driven schemas and includes a hierarchy with two levels of aggregation. The first of these is the Club level, which is limited to a single club-id feature that serves as a unique identifier for a given football club. The decision to include this level is motivated by the organisation's desire to expand the current study and include training data from other football clubs. Similarly, the Sport level is included in the hierarchy and contains a single attribute, namely sport-id. The player's name is excluded from the Player dimension for privacy reasons, and a unique player-id is adopted as a means of differentiating between players. The player biometrics age, weight, and height have been included as they are necessary for achieving one of the project goals defined in the requirements specification. The biometric data required is not available among the data sources and needs to be obtained from a source within the organisation.


4.3 Logical Design

The logical modelling process is illustrated in Figure 4.10 and is made up of two steps. The first of these involves the generation of a logical schema from the previously defined conceptual schema. The second involves the specification of an ETL process, which takes into consideration mappings and transformations defined during the conceptual design phase.

Figure 4.10: Steps in the logical modelling process

There are several approaches to implementing the multidimensional model. These include relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP). This thesis adopts a ROLAP approach, meaning that data is stored in relational databases and can be queried using SQL. A strong motivating factor for this approach is the relative ease of implementation, as well as it being the preferred approach for large amounts of data.

There are several ways in which data warehouses can be relationally implemented. A brief discussion of the implementations relevant for this project is given before the logical schema is presented.

The first possible implementation is known as a star schema and consists of a central fact table as well as a single dimension table for each dimension. Dimension tables are typically not normalised and may contain a lot of redundant data, which is consequently one of the drawbacks of this implementation. The star schema, however, offers two advantages that are of particular interest for this project. These are its ease of understanding and its query performance. The relatively simple design of a star schema facilitates easier navigation and fewer join paths, resulting in less complex queries and improved performance. The second implementation of relevance for this study is known as a snowflake schema and avoids data redundancy by normalising dimension tables. Snowflake schemas have the advantage of easier data maintenance and require less storage. However, as a result of the increased number of join paths, performance is affected, and queries can be complex. The final implementation of consideration is that of the starflake schema, which is a combination of the star and snowflake schemas, where some dimensions are normalised and others are not.

4.3.1 The Logical Schema

The relational implementation used in this thesis is that of a star schema consisting of one fact table and six dimensions and is presented in Figure 4.11. In many cases, a logical schema can be generated using a set of general mapping rules which directly transform a conceptual schema into a logical one. As these rules do not capture the specific semantics of some of the hierarchies identified in the conceptual design phase, the transformation process is carried out manually and discussed below.

The Date-Time dimension presents an interesting challenge due to the fine granularity of the data being modelled. This project aims to support precise calculations down to the finest granularity, while at the same time supporting a classic Date dimension. As a classic Date dimension may include attributes that cannot be generated using SQL, there arise several alternatives with regard to the modelling of the Date-Time dimension. One possibility is to include all levels of granularity in a single Date-Time dimension and provide access to all of the date-time attributes through one join path. The problem which arises with this approach is that the Date-Time dimension quickly becomes large and requires a lot of storage. Assuming that a football team partakes in 200 days of training/competition a year and that the Date-Time dimension excludes days of non-training, this would result in the dimension containing 17 280 000 entries for each year of training. This number could be reduced by excluding all date-time entries which are not associated with an entry in the fact table. By doing this, and assuming that each training session lasts one hour, the Date-Time dimension can be drastically reduced to approximately 720 000 entries per team per year of training. A second modelling approach is to split the Date-Time dimension into separate Date and Time dimensions, and in so doing further reduce the number of entries to 365 for the Date dimension and 86 400 for the Time dimension (assuming a 24-hour day is modelled). This approach, however, leads to complex queries when calculating time-spans between dates. A third approach involves the introduction of a date-time stamp attribute in the fact table in addition to the implementation of separate Date and Time dimensions [27]. This approach has the advantage of reducing the number of date and time entries, as well as avoiding the issue of complex queries when performing precise calculations at fine granularities.

It can be seen in Figure 4.11 that this thesis adopts the approach of separating the Date-Time dimension into two dimensions (Date and Session) and including a date-time stamp in the fact table. The Session dimension has a session-id as its primary key and includes the attributes session number for the current season, and session type. Furthermore, both the Date and Session dimensions are denormalised. All attributes are included in a single table to improve performance and to simplify querying data at different granularities.

The issue of unbalanced hierarchies in the Injury and Illness dimensions is presented in the conceptual design phase. As these hierarchies do not satisfy summarisability conditions, a strategy must be implemented to correct this issue. This thesis adopts the use of placeholders to transform unbalanced hierarchies into balanced ones. It can be seen in Figure 4.12 that a placeholder is introduced at the type level for non-injuries in the Injury dimension. One possible disadvantage of this solution is that the introduction of meaningless values requires additional storage space.

This is, however, of little concern in this thesis as all non-injury and non-illness measures in the fact table reference a single non-injury and a single non-illness tuple respectively. The additional space required for the introduction of the "non-injury" and "non-illness" attributes is thus equivalent to the space required for storing an extra 21 characters, which translates to an additional 23 bytes when using PostgreSQL's text datatype. The hierarchies in both the Injury and Illness dimensions are denormalised, resulting in a single table for the Injury dimension and a single table for the Illness dimension. The motivation for this decision is to increase performance. By including all attributes in a single table, the number of join operations required to access the leaf level in the Injury or Illness dimension is reduced by two. This serves to increase performance as well as simplify queries. Furthermore, due to the relative simplicity of the medical records, little additional information is available for the levels Category and Type. For this reason, the data can be represented in a single table without much duplication occurring.

The Player dimension is also denormalised, and all attributes are included in a single table. As the Team and Sport levels are incorporated into this thesis for possible future work, only a single unique identifier is used for them. The fact that the data for this thesis is gathered from just 38 football players means that these levels can be represented in a single table without requiring much extra storage capacity. The removal of two join paths has the apparent benefit of improved performance as well as the advantage of simpler querying.

Figure 4.11: A logical data warehouse schema for football data

The Training Log dimension is included in the logical schema for the reason that it may prove useful to the future work of the research project. All attributes from the conceptual schema are included and represented in a single table, with a unique log-id serving as the primary key for the table. Finally, the fact table contains all measures presented in the conceptual schema as well as a foreign key for each of the dimension tables, a date-time stamp, and a unique id. The inclusion of the date-time stamp serves to simplify time-span calculations. The unique id is used as the primary key for the table.
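To make the logical schema more concrete, the following DDL sketch outlines the fact table and three of its dimension tables, using the table and column names that also appear in Figures 4.13 and 4.14 (training_data, date_dimension, injury); the name session_dimension, the data types, and the abbreviated attribute lists are assumptions. The Illness, Player, and Training Log tables follow the same pattern.

CREATE TABLE date_dimension (
    datekey       integer PRIMARY KEY,   -- smart key in DDMMYYYY form
    training_date date NOT NULL,
    week          integer,
    month         integer,
    season        integer,
    matchday      boolean
);

CREATE TABLE session_dimension (
    session_id            integer PRIMARY KEY,
    session_number_season integer,
    session_type          text
);

CREATE TABLE injury (
    injury_id     integer PRIMARY KEY,   -- 0 is reserved for the non-injury placeholder
    category      text,
    type          text,
    injury_date   date,
    recovery_date date,
    days_missed   integer
);

CREATE TABLE training_data (
    id                bigserial PRIMARY KEY,              -- unique id
    datekey           integer REFERENCES date_dimension,
    session_id        integer REFERENCES session_dimension,
    injury_id         integer REFERENCES injury,
    player_id         integer,                            -- references the Player dimension (omitted here)
    -- illness_id and log_id reference the remaining dimensions (omitted here)
    dt                timestamp NOT NULL,                 -- date-time stamp for time-span calculations
    total_distance    numeric,
    player_load       numeric,
    acceleration_load numeric,
    v4_distance       numeric,
    v5_distance       numeric,
    hsr_distance      numeric
);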

Figure 4.12: The use of placeholders to transform unbalanced hierarchies into balanced hierarchies

4.3.2 Definition of the ETL Process

The second step of the logical design phase is to describe the transformations required before source data can be loaded into the data warehouse. This process builds upon the mappings identified in the conceptual design phase.

Starting with the Date dimension, a script is used to generate the majority of attributes found in this table. The exception to this rule is the "matchday indicator" attribute, for which dates are gathered from club personnel. As there are fewer than 365 days of recorded sessions per year, one consideration when developing the Date dimension is whether to include an entire calendar year. For the reason that it does not cost much in terms of storage, the Date dimension includes an entire calendar year for each year of training/competition. Furthermore, this approach simplifies matters if additional football teams are included and eliminates the process of identifying non-training days for multiple teams. Another consideration is that of the table's primary key, for which several possibilities exist. Two common techniques involve the use of a meaningless surrogate key or a readable number such as 08031983 to express the date 8 March 1983. The latter option is chosen for this thesis.

The Session dimension is generated from an ordered text file containing the dates and times of all sessions. The table's primary key is a number corresponding to the session's position in the text file. The value of the primary key for the first session is thus one and is incremented by a value of one for every additional session added. As a result of the text file being ordered, the primary key of an entry in the Session table corresponds to the number of that particular session relative to the first session entry ever loaded. The "Session Number Season" attribute is calculated during loading with the use of a counter which resets every time the year is incremented. Lastly, the data needed to load the session "Type" attribute is gathered from club personnel.

Both the Injury and Illness dimensions are loaded from a CSV file generated by the AthleteMonitoring application. Their loading process requires that entries for injuries and illnesses be separated and loaded into their respective tables. Additionally, text attributes must be converted to the corresponding data type of each relational attribute. No additional data is required for these dimensions, but a non-injury and a non-illness entry must be generated for the respective tables.

Data required for the Player dimension is collected from several sources. Firstly, the player-id attribute is extracted from the AthleteMonitoring application, which randomly assigns player-ids on creation of an athlete profile. As several players are not registered on this system, additional player-ids must be created for these players. Similarly, ids for both the team-id and sport-id attributes must be created for this dimension. The remaining attributes age, height and weight are gathered from team personnel.

Data for the Training Log dimension is loaded from a text file generated by the AthleteMonitoring application. Text attributes need to be converted to the corresponding data type of each attribute in the Training Log table. A non-training log entry must also be generated for measures in the fact table which do not have an associated log entry.

Finally, measures in the fact table are loaded from the GPS-generated CSV files, and each attribute must be converted to the data type of the corresponding measure. Additionally, a date-time stamp and a foreign key for each of the dimensions need to be added. The foreign key for the Date dimension can be extracted from the header of each of the GPS files. This data needs to be parsed and transformed to the smart key format described above. The generated smart key may then be used for each entry loaded from the given GPS file. Similarly, the fact table's date-time stamp may be created from a combination of the date in the GPS file's header and the timestamp attribute which is present in each line of a file. This is achieved by parsing both attributes and transforming them into a single date-time data type. The foreign key player-id requires a lookup table; the first and last name attributes in the GPS file may then be used to access the player-id for each name in a file. The fact that some players are registered with more than one profile needs to be taken into consideration when creating the lookup table. In other words, some ids need to be mapped to by multiple names.

In order to guarantee that an entry in the fact table correctly links to entries in the Injury and Illness tables, a temporary table is created. It stores entries for both injuries and illnesses as well as their respective primary keys. If the date of a workload entry for a given player matches the date of an injury or illness entry of the same player, the id is extracted from the temporary table and used as a foreign key in the fact table. Numerous cases of injury and illness result in a player's absence from training/competition, and for this reason, there are no training entries which link to these particular cases. To ensure that all entries of injury and illness are referenced by an entry in the fact table, non-training facts are created for each entry of injury and illness which has no corresponding training data.

A similar process is required for the Training Log dimension, and the loading of the foreign key for this dimension also makes use of a temporary table for accessing the id of a given training entry. Non-training entries are also created in the fact table to ensure linkage to all training log entries. Finally, the foreign key for the Session table is created in the same way as its primary key and is discussed in detail above.
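As an illustration of the Date dimension loading step described above, the sketch below generates one row per calendar day with a DDMMYYYY smart key. The column names follow the DDL sketch in Section 4.3.1 and are assumptions, the year 2019 is an arbitrary example, and the matchday flag is assumed to be updated afterwards from the dates supplied by club personnel.

INSERT INTO date_dimension (datekey, training_date, week, month, season, matchday)
SELECT CAST(to_char(d, 'DDMMYYYY') AS integer) as datekey,  -- e.g. 8 March 2019 -> 8032019
       d::date                        as training_date,
       EXTRACT(week  FROM d)::integer as week,
       EXTRACT(month FROM d)::integer as month,
       EXTRACT(year  FROM d)::integer as season,
       false                          as matchday
FROM generate_series('2019-01-01'::date, '2019-12-31'::date, interval '1 day') AS d;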


4.4 Physical Design

The physical design of a data warehouse aims to enhance query performance. Three techniques are commonly used to maximise performance, namely the use of materialised views, indexing, and partitioning. As an in-depth look at these techniques falls outside of the scope of this thesis, only a brief discussion of materialised views and indexing is presented. Partitioning, or fragmentation, is particularly effective for very large databases and divides the contents of a relation into several files. Given the size of the current data set, partitioning is considered irrelevant for this thesis and is reserved for future work.

4.4.1 Materialised Views

In ROLAP, a materialised view is a view in which results are persisted in table-like form and can be queried like a regular relational table. Materialised views serve to enhance query performance by pre-calculating costly operations such as aggregations and relational joins. As the enhanced performance is achieved at the expense of additional storage costs, views are typically created for queries which are re-used. This highlights one of the drawbacks of materialised views, namely that designers must be able to anticipate frequently used queries.

One such query is identified in this thesis and involves aggregating workload data at the session level and joining the results with information from the Date and Injury dimensions. The SQL query for creating the materialised view is presented in Figure 4.13. Here it is seen that a view is created with all workload data aggregated at the session level. In addition to workload data, player and session ids are also extracted from the fact table. Workload data is joined with the Date dimension to extract the date of each session, and with the Injury dimension to extract injury data associated with each session. Furthermore, the duration of each session and the current session count for each player are calculated.

Without the creation of a materialised view, the planning and execution times for generating the summary table are 12.5 and 30519.9 milliseconds respectively. Extracting the same information from a materialised view, however, yields planning and execution times of 0.3 and 0.8 milliseconds. The use of a materialised view thus results in an execution time that is roughly 38 000 times faster than querying the data warehouse directly. As the information extracted from the query presented in Figure 4.13 contains all the necessary information required for the data mining presented in the next chapter of this thesis, a materialised view is considered an optimal choice.


CREATE MATERIALIZED VIEW summary_table AS
SELECT
    d.training_date, t.player_id, t.session_id, i.injury_id,
    i.type, i.injury_date, i.recovery_date, i.days_missed,
    SUM(t.total_distance) as total_distance,
    SUM(t.player_load) as player_load,
    SUM(t.acceleration_load) as acceleration_load,
    SUM(t.v4_distance) as v4_distance,
    SUM(t.v5_distance) as v5_distance,
    SUM(t.hsr_distance) as hsr_distance,
    AGE(MAX(t.dt), MIN(t.dt)) as duration,
    COUNT(t.player_id) OVER (PARTITION BY t.player_id) as sessions_played
FROM training_data t
JOIN date_dimension d ON t.datekey = d.datekey
JOIN injury i ON t.injury_id = i.injury_id
GROUP BY
    d.training_date, t.player_id, t.session_id,
    i.injury_id, i.type, i.injury_date, i.recovery_date;

Figure 4.13: SQL code for generating a materialised view
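One practical point worth noting is that a materialised view in PostgreSQL is not updated automatically when the underlying fact table changes. After each new load of training data, the summary is therefore brought up to date with a single statement:

REFRESH MATERIALIZED VIEW summary_table;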

4.4.2 Indexing

A key difference between OLTP and OLAP systems has to do with how they are designed to handle queries. OLTP systems typically experience frequent transactions which only access a small number of tuples, and often use indexing techniques such as B-trees, which are well suited to these types of transactions. Typical OLAP transactions are, however, more complex in nature and often access a very large number of tuples. To support these types of transactions, alternative indexing techniques such as join indexes and bitmap indexes are commonly used.

As the name suggests, join indexes materialise a relational join between two tables by precomputing join pairs. These typically involve foreign key relations between the fact table and the dimension tables and are naturally suited to star schema designs. As PostgreSQL does not offer support for join indexes, B-tree indexes are used instead. B-tree indexes are created for all primary keys. Additionally, indexes are considered for foreign keys with the intent of further improving join performance. However, the creation of indexes for foreign keys in the fact table does not significantly improve performance, and for this reason, such indexes are not implemented in this thesis.

Indexing columns in the fact table which are likely to be used in combination with selection operators does, however, have a significant impact on performance. Two columns, namely session-id and player-id, are identified as prime candidates for such operations. These columns are expected to be used in queries which aim to filter information for specific players or specific training sessions. An example of a typical query involving summary data for a particular session is provided in Figure 4.14. The query represented in the figure is used to select the total distance covered by players during a specified session. Similar queries are often used to check whether a player's recorded distance is similar to the rest of the team's for a particular session. Such queries are particularly helpful in identifying GPS recording errors, which are relatively prevalent in this thesis. The creation of a B-tree index on session-id in the fact table results in the execution time of the query seen in Figure 4.14 dropping from 2871 to 28 milliseconds, equating to roughly a hundred-fold improvement.

A similar query may be used to select workload data for a specific player and aids users in gaining an understanding of workload variation throughout the season. As the identified columns are expected to be used frequently in queries involving the general understanding of workload data, indexes are created for both player-id and session-id in the fact table.

SELECT
    player_id, SUM(total_distance) as total_distance
FROM training_data
WHERE session_id = 25
GROUP BY
    player_id, session_id;

Figure 4.14: A query for selecting data from a specific session
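For completeness, the two fact-table indexes described above can be created as follows; the index names are illustrative.

CREATE INDEX idx_training_data_session ON training_data (session_id);
CREATE INDEX idx_training_data_player  ON training_data (player_id);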

As bitmap indexes are particularly useful for improving query performance for columns with a low number of distinct values, they may prove useful for the injury dimension, which only records two categories and two injury types. Although PostgreSQL does not provide support for bitmap indexes, it does offer an alternative indexing technique known as BRIN indexes. This thesis, however, does not include the indexing technique mentioned above for several reasons. Firstly, the injury category may be selected using the injury-id column instead of the category column. This is possible because cases of injury are created with an id-value greater than zero, whereas cases of non-injury are created with an id-value of zero. As there already exists a B-tree index on the injury-id column, no further index is considered necessary for the category column. Secondly, an index is not created for the injury-type column as no queries are identified which directly make use of this column. Lastly, due to the low number of injury tuples, additional indexes may not provide significant increases in performance.

Finally, physical modelling is well studied within the field of data warehousing, and there are many techniques which may further improve the performance of the data warehouse developed in this thesis. Due to the relatively small size of the data set, as well as the limited scope of this thesis, measures beyond those discussed above are reserved for future work.


4.5 The ETL Process

Extraction, transformation and load (ETL) are processes which take place in the back-end tier of a data warehouse and are responsible for extracting data from an organisation's sources, transforming them, and then loading them into the data warehouse. In a high-level description of the ETL process, three distinct phases are identified. The first of these involves the extraction of data from source data stores, which are often heterogeneous and may include operational databases, files, web pages, documents and even stream data. In the second phase, data is transformed from the format of the sources to the format specified by the data warehouse schema. This usually requires the propagation of data to a special-purpose area of the warehouse called the Data Staging Area (DSA). The phase includes tasks such as data cleaning, which removes errors and inconsistencies in the data; integration, which reconciles data from different data sources; and aggregation, which summarises data to the level of detail of the data warehouse. The final phase involves the loading of the transformed data into the data warehouse. The ETL process typically refreshes the data warehouse periodically, a process which involves propagating updates from the source data to the data warehouse. The frequency at which a data warehouse is refreshed is specified by organisational policies [47; 46].

The ETL process for this thesis is presented at the conceptual level using the Business Process Model and Notation (BPMN), as proposed by A. Vaisman and E. Zimanyi [46]. As the notation does not specify the implementation details of ETL processes, it allows users to focus on the characteristics of these processes without having to understand technical details.

4.5.1 Overview of the ETL Process

The source data for this thesis comes in the form of CSV files, and a detailed description of their content is provided in the section on source data. Furthermore, additional data sources are required for loading the Player dimension, and relational tables must be created to ensure correct linking between entries in the fact table and entries in the dimension tables. The sections which follow provide a step-by-step account of each of the processes involved, as well as a description of the data required to ensure the correct loading of the data warehouse implemented in this thesis.

Figure 4.15 presents an overview of the entire ETL process for this thesis. Here it can be seen that directly after the process begins, a parallel gateway giving rise to two branches is entered. The gateway symbolises that the order in which the processes that follow are carried out is of no importance. However, all processes between the parallel gateway and the merging gateway must be completed before proceeding from the merging gateway. The left branch leads to the process responsible for loading the Date dimension. The right branch enters another parallel gateway which in turn leads to three processes responsible for data cleaning. Upon completion of all cleaning tasks, four dimensional loading processes are carried out. Finally, after all of the dimensions have been successfully loaded, the GPS training data and special members are loaded. The plus symbols in the figure illustrate that each of the processes involves several sub-processes. These are explained in further detail in the sections which follow. Once all processes have been completed, the end symbol is reached, symbolising the termination of the ETL process.


Figure 4.15: Overview of the ETL process

4.5.2 Cleaning of GPS Data

The cleaning process begins with all files from the three data sources and, through a series of tasks, detects and corrects both corrupt and inaccurate records. Cleaning is performed using a combination of interactivity, scripting and batch processing techniques, which together ensure that the generated output files are correctly formatted for the loading processes which follow. In this and the following two sections, the cleaning tasks presented in Figure 4.15 are described. Together they address all of the issues presented in the section on source data.


The first of these processes involves the cleaning of GPS files and is illustrated in Figure 4.16. Using a combination of Java programs and interactivity, the source files are used to generate two sets of output files. In one set, each file contains the training data recorded for a single player during a single session, whereas the other set contains a file with the times and dates of each team session. For the purpose of clarification, a team session includes the time interval during which at least one player was recording, and for this reason, begins when the first player starts recording and ends when the last player stops recording. Another point of importance is that a single source file contains data for all the players that were present during a session, whereas the output files are split into separate files which only contain data for a single player. The motivation for this is discussed in further detail below. Another potential output of the cleaning process occurs when a source file does not meet the specified requirements, resulting in the file's removal and the request of a new file with the correct specifications.

It can be seen in Figure 4.16 that the cleaning of GPS files begins with an iterative sub-process which is responsible for performing parameter and interval checks on each of the input files. Interval checking ensures that a file's data meets the granularity requirement of one-second intervals. The task of parameter checking ensures that the specified data parameters are present and that their ordering is consistent. The termination condition is symbolised by the orange circle in the figure and is met once all input files have been checked. The completion of the initial checks may result in one of three possible outcomes, a situation which is illustrated with the use of an exclusive gateway shown in red. The first of these is that a file does not meet the specified requirements (missing parameters or incorrect granularity), in which case a new file with the correct specifications is requested. A second possibility is that the ordering of parameters is inconsistent. In such cases, a re-ordering task is carried out in which a new file with the correctly ordered parameters is produced. The final possibility involves files which meet all requirements. These are written to the same location as the re-ordered files and used as input files for the next task in the cleaning process.

As mentioned in Section 3.3, multiple cases of duplicate data exist as a result of a possible software issue encountered when generating CSV files. For this reason, the next task in the cleaning process is responsible for detecting duplicate entries. The iterative task compares all sessions which take place on the same day and checks them for overlaps. It can be seen in Figure 4.16 that the detection of duplicates requires user interaction before file discrepancies can be resolved. Duplicate files require careful inspection before taking action, to ensure that all data is preserved while avoiding data being duplicated. Two examples of duplication are encountered in this project. The first occurs when both duplicates contain the same data. The second occurs when one file contains more data than the other. In both cases, one of the two files is discarded.

Files without any duplication issues proceed to the next step in the cleaning process, which involves a sub-process responsible for the removal of "non-training data".
The term non-training data in the context of this project refers to lines of data before or after a session in which all training parameters have the value of zero and which are of no interest with regards to analysis. It is important to clarify that these lines are of no interest only if they occur before a session has begun or after it has ended. Lines of data with zero values found within the bounds of a session are, however, very relevant and are not considered "non-training" data. As a result of players forgetting to turn off GPS tracking devices, thousands of lines of irrelevant data exist within the source files. To reduce storage demands as well as remove data that has no relevance to the research of the project, lines of non-training data are removed in a process consisting of two tasks. The first of these splits a file into separate files for each player. The resulting files thus contain data for a single player from a single session. In addition to splitting files, file-headers are also reduced.


The split files are then trimmed, a task which removes all non-training entries before a session's commencement and all non-training entries after its termination. The start and end of training are defined as the first and last lines in a file which contain non-zero values.

Two types of files are generated from these tasks. The first of these is a file which does not contain any training data and is the result of the Catapult software generating data for players that did not participate during a session. As can be seen from the figure, these files are discarded. The other type of file generated contains one or more lines of data for a single player recorded during a single session. The first line of data in these files represents a player's first line of recorded training activity, which may occur after the registered start time of a team session. This is a result of the fine granularity of the data, which identifies the exact second a training activity began for a given individual. It is this second file type which is used for loading of the data warehouse and represents the first file output of the cleaning process.

The second output of the cleaning process is a file containing the dates and times of all team sessions and is generated from the first set of output files. The task involves comparing individual sessions for a given date and determining whether they can be grouped as a single team session, or whether they belong to separate team sessions taking place on the same day. Individual sessions which are considered members of the same team session are compared so that the start and end times for the team session may be adjusted to incorporate all individual sessions. In doing so, the start time for the team session is represented by the earliest individual start time in the group, and the end time for the team session is represented by the latest individual end time in the group. The output file for the above task thus contains a list of dates with the start and end times of each team session and may contain multiple entries with the same date. In the case of multiple entries with the same date, each entry represents a non-overlapping team session taking place on the given date.

The cleaning of GPS data thus resolves three of the five training data issues identified in Section 3.3. The issues addressed here are the duplication of data, missing and wrongly ordered parameters, and entries of non-training data. The remaining two issues, namely players registered with multiple profiles and players without training data, are discussed below.
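As a rough illustration of the trimming task, the sketch below assumes each per-player file has been read into a pandas DataFrame with one row per one-second interval; the workload column names are assumptions and may differ from the actual CSV headers.

import pandas as pd

# assumed names for the six workload columns
WORKLOAD_COLS = ["total_distance", "player_load", "acceleration_load",
                 "v4_distance", "v5_distance", "hsr_distance"]

def trim_non_training(df: pd.DataFrame) -> pd.DataFrame:
    """Drop zero-valued rows before the first and after the last recorded activity."""
    active = df[WORKLOAD_COLS].ne(0).any(axis=1)
    if not active.any():
        return df.iloc[0:0]  # file contains no training data and is discarded by the caller
    first, last = active.idxmax(), active[::-1].idxmax()
    return df.loc[first:last]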

4.5.3 Cleaning of Injury and Illness Data

Two issues are identified in the injury and illness data, namely incorrect dates of injury and recovery, and a missing injury type. The first of these involves recovery dates preceding the dates of injury, for which three cases are found in the source file. These issues are attributed to human error and are resolved by switching the dates. The second of the identified issues is resolved by consulting the team's medical personnel and manually assigning a type to the injury entry.


Figure 4.16: Tasks involved in the cleaning of GPS training data


4.5.4 Cleaning of Log Data

The three issues identified among the log data are resolved using scripting and interactivity techniques. The first of these involves data entries which are spread over multiple lines and is caused by newline characters added by players or team personnel when writing comments to data entries. To facilitate the loading process, entries are represented on a single line, and for this reason, newline characters are removed. The second issue encountered in the log data is that of players being registered with multiple profiles. This issue is addressed by creating a list of all registered profiles. The list is used by team personnel to identify all profiles which are related to the same player. A single profile is then decided upon, and its name and id are used to replace the names and ids of all other profiles registered for the same player. The final issue involving the log data concerns players who have log entries but no training data. This problem is resolved by creating a single non-training tuple in the fact table and is dealt with in the section on special members.

4.5.5 Load Date Dimension

Loading of the Date dimension is presented in Figure 4.17 and includes three tasks. The first of these involves the creation of a match-day lookup. This is achieved using a Python script, which parses a text file containing the dates of all matches and creates a list of competition dates. The remaining two tasks are grouped into a sub-process which repeats until a termination condition is reached. The first of these two tasks also makes use of a Python script, which parses a text file containing the years for which date entries are to be created. For each day of a given year, a date tuple is created and then inserted into the Date table. The value of the "Matchday Indicator" attribute in each date tuple is set using the list of competition days generated in the first task. This process continues until the entire text file has been parsed, and a tuple for every date has been inserted into the database.

Figure 4.17: Loading of the Date dimension
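A minimal sketch of the date-generation step is shown below, assuming the Date dimension stores the day, month, year and matchday indicator attributes mentioned above; the real table may hold further attributes, and the database insert itself is omitted.

from datetime import date, timedelta

def date_rows(year, match_days):
    """Create one tuple per calendar day of the given year, flagging competition dates."""
    rows, d = [], date(year, 1, 1)
    while d.year == year:
        rows.append((d, d.day, d.month, d.year, d in match_days))  # matchday indicator set from the lookup
        d += timedelta(days=1)
    return rows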


4.5.6 Load Session Dimension

The Session dimension is loaded from a file produced during the cleaning of GPS data. This file includes the dates and times of each session. Before session data is loaded, a non-session tuple is first generated and inserted into the Session table as a single entry which can be used for fact entries that do not reference a specific training session. An example of this is when a player is absent from training as a result of either an injury or an illness. In this case, a fact entry is required to bind an injury/illness tuple to a player-id in the Player dimension as well as a date entry in the Date dimension. Without such a fact entry, important details would be unavailable to users. A fact entry is thus created, and as a result of the table's foreign key constraints, non-entry values must be available for the appropriate dimensions. This is the case for all dimensions in the data warehouse with the exception of the Player and Date dimensions. This step is only applicable for the first loading of the warehouse, after which the same entry is kept for subsequent loading updates.

Figure 4.18: Loading of the Session dimension


Once the loading of the non-session entry has been completed, the loading process enters an iterative sub-process which parses the session file mentioned above, creates a session tuple, and loads it into the warehouse. One additional task in this sub-process involves the loading of session data into a temporary session table which includes date-time stamps for the start and end of each session. These attributes are required during the loading of the fact table to link fact entries to the correct session entry. This process is explained in detail in the section on the loading of the fact table. The process of loading the Session dimension terminates once all sessions in the file have successfully been loaded into the data warehouse.

4.5.7 Load Player Dimension

Loading of the Player dimension is a user-intensive task and requires multiple input files to ensure that all players are accounted for in the data warehouse. An overview of the process is illustrated in Figure 4.19. In the first step of the process, a list of all player names and ids is generated from the three source files. The initial list produced in this step contains issues such as duplicate profiles as well as profiles without player ids. These issues are resolved in the next task, which results in two new lists being produced. The first of these is a set consisting of all unique profiles present among the three data sources and can be thought of as a union of all player profiles.

Player profiles = Profiles_LogData ∪ Profiles_InjuryData ∪ Profiles_GPSData        (4.1)

Profiles in the above list are represented by unique profile-ids, which are used to identify players in the data warehouse. The second list produced is a mapping from player names in the GPS files to profile-ids in the warehouse and is used during the loading of the fact table to map training entries to the correct player-id. Existing ids are used for players who are registered in the AthleteMonitoring software, whereas new ids must be created for players who do not have registered profiles. The third task in the process is responsible for adding biometric features to the list of player profiles. It serves as the input for the sub-process responsible for loading player data into the Player dimension. This sub-process iterates over every profile in the list, creating a player tuple for each player before loading it into the warehouse. The loading of the Player dimension thus terminates once all profiles have been loaded.


Figure 4.19: Loading of the Player dimension

4.5.8 Load Injury and Illness Dimensions

The Illness and Injury dimensions are loaded from a CSV file generated by the AthleteMonitoring software. The entire process is illustrated in Figure 4.20. From the figure, it can be seen that the loading process begins with a single task followed by an iterative sub-process involving four tasks. On completion of the loading process, all entries have been loaded into their respective tables. Additionally, tuples representing the events of a non-injury and a non-illness are also loaded.

Starting from the green circle at the top of the figure, it can be seen that the first task involves the loading of a non-injury and a non-illness tuple into their respective tables. The purpose of such tuples is to ensure that foreign key constraints in the fact table are satisfied. These constraints require that every entry in the fact table references a single entry in both the injury and illness dimensions. For this reason, non-injury and non-illness tuples are created for fact entries which are not associated with an injury or an illness. Completion of this task leads to the commencement of the sub-process responsible for loading file entries into the data warehouse. This is an iterative process which begins with the reading of a single entry from the CSV file. An injury/illness tuple is created and inserted into a temporary table with an auto-incrementing primary key. The primary key generated in the temporary table is used as the primary key when the tuple is loaded into the injury or illness table. The temporary table thus enables mapping of training entries to the correct injury and illness tuples in the data warehouse.


This is done by looking up the date and player-id attributes of an entry in the temporary table and using its primary key to locate the correct entry in the Injury or Illness dimension.

Once the temporary entry has been added, the next task involves extracting the most recently added entry's primary key, as mentioned above. The final task in the iterative sub-process is dependent upon whether the current entry represents an injury or an illness. This is illustrated in the figure by an exclusive gateway which symbolises that only one of the two remaining tasks can be executed for each iteration of the process. Depending on whether the current entry being processed is an illness or an injury, the appropriate attributes are selected and inserted as a tuple into the data warehouse. The sub-process continues until every entry in the CSV file has been parsed and a tuple for each entry has been inserted into the corresponding table of the data warehouse. This is expressed by the conditional event highlighted in orange in the figure, which symbolises the loop's termination condition. Once this has been fulfilled, the entire process terminates, as illustrated by the red circle.

Figure 4.20: Loading of the Injury and Illness dimensions


4.5.9 Load Log Dimension

The loading process for the Log dimension is very similar to that of the Injury and Illness dimensions presented in Figure 4.20, and tuples in the log table are generated from a CSV file produced by the AthleteMonitoring software. The process begins with the generation of a log tuple from an entry in the CSV file. The tuple is then inserted into a temporary log table which has an auto-incrementing primary key. As with the process above, the temporary table is created to ensure correct mapping between fact and log entries. Once a tuple has been added to the temporary log table, its primary key is extracted and used as the primary key for the current entry. The iterative process terminates once all entries in the CSV file have been loaded into the data warehouse.

4.5.10 Load GPS Data

Loading of the fact table may only begin after the loading processes for the dimension tables have completed. This is a result of foreign key constraints in the fact table, which require the existence of a primary key in a referenced dimension table before tuples may be added to the fact table. The loading processes described above thus guarantee that all foreign key values exist as primary key values in the relevant dimension tables before the loading of GPS data begins.

An overview of the loading process is presented in Figure 4.21 and begins with a single task before an iterative sub-process consisting of eight tasks is carried out. The process continues until all of the specified GPS files have been parsed and loaded into the data warehouse.

The first task in the process is responsible for establishing a player-id lookup, which is created from the mapping file produced during the loading of the Player dimension. The lookup maps player names in GPS files to player ids in the data warehouse and is used for each file in the GPS loading process.

Once the id-lookup has been created, an iterative sub-process responsible for creating and loading fact tuples is carried out. The input for each iteration is a GPS file produced during the cleaning of GPS data. During parsing, the player name is used to obtain the player-id from the id-lookup created above. The next two tasks in the iterative process involve extracting the illness and injury ids from the temporary injury-illness table and require the date and the player-id from the file being parsed. The session-id is then extracted from the temporary session table. The relevant session-id is located using a GPS entry's date-time stamp as a means of identifying which session the entry belongs to. This is done by comparing an entry's date-time stamp with the start and end date-time stamps of individual session entries. The final foreign key lookup in the sub-process is that for the log-id, which is extracted from the temporary log table using the current GPS entry's player-id and date to locate the relevant entry. In the case of the injury, illness and log ids, if no entry is located, a value of zero is used, which references the special entries (non-injury, non-illness and non-log) in each of the respective dimensions. This is not applicable for the extraction of session-ids, as each GPS entry is expected to belong to exactly one session entry. Once all foreign key values have been extracted, a tuple for each entry in the GPS file is created and inserted into the database.

An important point of clarification is that the same foreign key values are used for every entry in a given GPS file. The reason for this is that each GPS file represents a single session for a single player on a given date. Hence all entries in a given file reference the same entries in each of the dimension tables. It is for this reason that a combination of foreign keys cannot be used as a primary key for the fact table, despite this being common practice in data warehousing. The process of loading GPS data terminates once all the specified files have been loaded into the fact table.
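The session lookup at the centre of this process can be sketched as follows; representing the temporary session table as a list of (session_id, start, end) tuples is an assumption based on the description in the section on the Session dimension.

def lookup_session_id(timestamp, sessions):
    """Return the id of the team session whose start/end interval contains the timestamp."""
    for session_id, start, end in sessions:
        if start <= timestamp <= end:
            return session_id
    raise ValueError("GPS entry does not fall within any registered session")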


Figure 4.21: Loading of the fact table


4.5.11 Insert Special Members

The final task in the ETL process involves inserting special members into the fact table to ensure that every tuple in the Injury, Illness and Log dimensions is referenced by at least one tuple in the fact table. Foreign key constraints in the fact table ensure that every entry in the fact table references an entry in the corresponding dimension table. There is, however, no constraint in place which ensures that an entry in a dimension table is referenced by an entry in the fact table. Due to the nature of the data being modelled, this may result in issues for users when trying to access information from the data warehouse. An example of such an issue occurs when a player is absent from training as a result of an injury or an illness. This would result in an entry in either the Injury or Illness dimension with no entry in the fact table to reference it. Without making changes to the data warehouse, a user would be able to access the injury or illness entry without being able to access the corresponding player-id. Similarly, if there were no entry in the fact table for a given log entry, users would be unable to obtain the corresponding player-id or date for the given entry.

To avoid such issues, special members are inserted into the fact table to ensure that every entry in the above-mentioned dimensions is referenced by at least one tuple in the fact table. The process for achieving this is similar for all three dimensions and involves iterating over every entry in a given dimension and checking whether it is referenced by an entry in the fact table. If no reference exists, a fact tuple is created by retrieving the relevant data through a series of lookups in the temporary tables to extract corresponding dates and player ids. Additional lookups are also carried out to extract possible ids for corresponding entries in the other dimension tables.

An example of this is when no fact entry is found for a given injury entry. The player-id corresponding to the injury entry is then retrieved from the temporary injury-illness table. Further lookups are carried out to retrieve entry ids for entries that may exist in the illness and log tables for the same player on the same date.

Once the necessary data has been retrieved, a tuple is created and inserted into the fact table. This tuple contains the relevant foreign keys for the given player on a given date. A point of importance is that the foreign key for the Session dimension is always set to the value that references the non-session entry in this dimension. Furthermore, all measures are set to a value of zero.


Chapter 5

Data Mining

This chapter presents an account of the data mining approach taken in this thesis. Data mining is used in an effort to achieve the second goal of the thesis, which aims to predict the onset of player injury. Following the CRISP-DM process model, a five-phase approach is taken. Starting with the business understanding phase, data mining goals are clearly specified, before beginning with the data understanding phase, in which exploratory data analysis is performed on data extracted from the data warehouse. A final data set is then prepared in the data preparation phase and is used for building and testing several prediction models in the modelling phase. Finally, models are evaluated using several performance measures, and an explanation of their performance is presented.

5.1 Business Understanding

Business understanding in the context of this study is synonymous with the analysis goal of this thesis, which is defined as the assessment of a model's ability to predict future injuries from workload and injury data. The motivation for this work is inspired by two studies which are discussed in Section 2.4. Elements from each of these studies are included in this work to test the reproducibility of previous modelling approaches on a new data set.

More specifically, the aim of this thesis is defined in terms of the following objectives:

• To assess several machine learning models in their ability to predict whether a player will get injured in the next session.

• To assess whether different definitions of injury affect the ability of a model to detect injury [5].

• To assess how a model's predictive performance is affected when the data set is reduced to a subset of previously identified features [39].

• To assess whether the use of feature extraction improves a model’s predictive performance [39].


5.1.1 Supervised Learning

Statistical learning problems typically fall into one of two categories, namely supervised or unsupervised learning. The modelling problem in this thesis naturally falls into the category of supervised learning, as each predictor measurement x_i is associated with a response measurement y_i [22]. More specifically, each set of GPS features is associated with an injury response measurement, representing whether or not the given player got injured. The objective of learning is to fit a model that relates the response to the predictors with the aim of accurately predicting future responses [22]. Supervised learning problems are further characterised by whether the response variable is quantitative or qualitative. In the case of this thesis, the response variable is considered qualitative, meaning that each response falls into one of K classes. Furthermore, as the response variable falls into one of two classes, injured or not injured, the learning problem for this thesis is characterised as a binomial classification problem.

There exist a wide variety of classification models, each of which presents a trade-off between accuracy and interpretability [22]. More restrictive modelling techniques are often easy to interpret, a quality which is of great practical value for both the coaching staff and players. The ease of understanding offered by such techniques often comes at the expense of lower prediction accuracy when compared to more flexible "black box" approaches. As predictive accuracy is of vital importance, and interpretability is highly desirable, a variety of modelling techniques are employed in this thesis. Previous work has been able to achieve good results using decision trees, a technique which provides a good balance between interpretability and accuracy [39].

A fundamental aspect of supervised classification involves splitting the data set into a training and a test set. One of the major challenges encountered in this study is that of an unbalanced data set, meaning that there is a disproportionately large number of non-injury cases compared to the number of injury cases. Skewed data sets compromise a classifier's ability to learn because of a model's tendency to focus on the prevalent event and ignore the rarer one [33]. Methods for correcting the issue of class imbalance can be grouped into two general strategies: correction techniques at the learning level and correction techniques at the data level. Techniques at the learning level aim to strengthen a learning model's ability to identify the minority class. Data-level techniques focus on altering the class distribution by randomly oversampling the minority class, undersampling the majority class, or creating artificial samples of the minority class [33]. This thesis adopts the strategy of generating synthetic data in an attempt to correct the imbalance between injury and non-injury cases. Specifically, an adaptive synthetic sampling (ADASYN) approach is taken, in an effort to avoid the possible loss of useful data which may result from undersampling, or the potential risk of overfitting which may occur from oversampling.

5.1.2 Injury Definition

This study focuses on a sub-set of the season's injury data known as non-contact injuries. These are defined as injuries occurring without extrinsic contact with another player or an object on the field. Furthermore, this thesis considers two different types of non-contact injury [5]. The first of these, non-contact (NC), includes all injuries which fall into the category of the definition above. The second, non-contact resulting in time loss (NCTL), is defined as "an injury that occurred during a scheduled training session or match that caused absence from the next training session or match" [14]. Separate models are created for both definitions to determine whether a model performs better on specific injury types.


5.1.3 Data Granularity

One of the advantages provided by a data warehouse is the ease with which data can be extracted at different levels of granularity. The granularity chosen for this study is the session level, meaning that workload data is aggregated for each session completed by an individual player. This is a natural choice, as injury data is recorded with the date of its occurrence being the finest level of time granularity available. Positive injury labels are thus created for the last session completed by a given player before the date of their next recorded injury.
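A rough pandas sketch of this labelling step is given below. The column names are assumptions, and because the text does not state whether a session on the injury date itself counts as the last session before the injury, the sketch uses sessions strictly earlier than the injury date.

import pandas as pd

def label_sessions(sessions, injuries):
    """Mark, per player, the last completed session before each recorded injury.

    `sessions` is assumed to hold one row per player-session with columns player_id
    and session_date; `injuries` holds columns player_id and injury_date."""
    labels = pd.Series(0, index=sessions.index)
    for _, inj in injuries.iterrows():
        prior = sessions[(sessions.player_id == inj.player_id) &
                         (sessions.session_date < inj.injury_date)]
        if not prior.empty:
            labels.loc[prior.session_date.idxmax()] = 1
    return labels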

5.2 Data Understanding

In this section, workload and injury data are extracted from the data warehouse developed in the previous chapter, and exploratory data analysis (EDA) is performed to gain insights into the data. A summary of workload data is initially presented before smoothing is applied to correct GPS recording errors. Several summary statistics and a visual representation of the season's injuries are then presented. Finally, comparisons between the workloads of injured and non-injured players are made in an attempt to gain insight into possible relationships that might exist between workload and injury data.

5.2.1 Summary Statistics for the Workload Features

A summary of workload data aggregated at the session level is presented in Table 5.1. Data is summarised from 3005 sessions completed by 38 players during the 2019 season. The data represented in the table includes summary statistics for the six GPS features as well as the number of sessions completed by players during the season.

It is evident from the summary data that several data quality issues exist within the data set. This is highlighted by the highly skewed values seen for features such as distance, acceleration load, V5 distance, and HSR distance. The highly skewed distributions are primarily accounted for by the unrealistically large maximum values of these features. Closer inspection of the data suggests that these values can be explained by faulty readings produced by a single GPS unit on three separate occasions.

Another issue highlighted in the summary table is that of GPS units under-recording, illustrated by the presence of minimum values of zero. On closer inspection, 174 sessions are identified as having unrealistically low values across all GPS features.

Variables            Mean      Stdev      Median    Min     Max         Range       Skew
Distance             8965.82   131905.50  4493.01   0.00    4657235.00  4657235.00  32.46
Player Load          495.51    276.58     460.18    0.00    1761.80     1761.80     0.83
Acceleration Load    1482.30   801.07     1415.06   0.00    20280.90    20280.90    4.61
V4 Distance          174.47    183.78     121.84    0.00    1328.64     1328.64     1.66
V5 Distance          71.23     999.07     16.28     0.00    54472.26    54472.26    53.77
HSR Distance         245.70    1023.56    146.43    0.00    54472.26    54472.26    49.55
Session Count        81.22     51.31      75        1       167         166         0.13

Table 5.1: Summary statistics


These low values can be explained by GPS units turning off during sessions as a result of device faults or low battery levels. The data issues mentioned above are classified as noise and missing values respectively and may prove harmful to prediction models if not corrected in advance.

Correction of data quality issues is typically carried out during the data preparation phase of the data mining cycle. The data quality problems are, however, corrected in this phase to facilitate ease of reading. There is a wide variety of techniques for correcting noisy data and missing values [17]. Two approaches are considered in this thesis, namely removal of incorrect values and smoothing of incorrect values by bin means. In the next phase of the data mining cycle, engineered features are introduced. Due to the dependence of these features upon values from previous training sessions, it is considered more harmful to exclude values than to replace them with smoothed averages. For this reason, the approach of smoothing by bin means is adopted, and training sessions with recorded values falling outside of a defined range are replaced by the mean values of all correct recordings from the same session.

Table 5.2 presents a summary of the GPS data after smoothing by bin averages has been applied. It can be seen that the smoothing technique results in a drastic improvement in skewness values for all of the problematic features identified above. The improvement is attributed to correcting unreasonably high maximum values and unreasonably low minimum values. An advantage of the smoothing technique used is that outliers from an individual session are only corrected when they belong to a team session which falls within the range of normal. In other words, the technique targets individual outliers which may represent an incorrect recording of a single GPS unit. If the average values for an entire team session were to fall outside of the range of normal, this would suggest a correctly recorded session, and the recorded values of individual sessions would remain unchanged. This concept is highlighted by the distance feature, which has a minimum value of 500 m despite a lower bound of 1000 m being specified. This value remains unchanged as it belongs to a team session in which all values were unusually low.

The V4, V5 and HSR distance features only record running at speeds above 20 km/h, and players do not achieve these speeds in every session. Hence a minimum value of zero is still observed for these features after smoothing has been applied. Finally, the majority of features have distributions which are considered highly positively skewed (skewness >= 1), meaning that the mass of each distribution is concentrated towards the left and that the distributions have long right tails.
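A minimal sketch of the smoothing step is shown below, assuming a session-level DataFrame with a session_id column; the per-feature bounds (such as the 1000 m lower limit on distance mentioned above) are assumptions chosen per feature rather than values fixed by the text.

def smooth_by_session_mean(df, col, lower, upper):
    """Replace out-of-range recordings of `col` with the mean of the in-range
    recordings from the same team session."""
    in_range = df[col].between(lower, upper)
    session_means = df.loc[in_range].groupby("session_id")[col].mean()
    replacement = df["session_id"].map(session_means)
    # a value is only replaced when other units in the same session recorded normal values
    return df[col].where(in_range | replacement.isna(), replacement)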

Variables            Mean      Stdev    Median    Min      Max       Range     Skew
Distance             5132.23   2546.09  4640.99   500.06   14209.09  13709.03  1.08
Player Load          526.10    261.44   478.47    31.33    1761.80   1730.47   1.00
Acceleration Load    1555.72   655.87   1461.20   138.37   3982.75   3844.38   0.57
V4 Distance          186.57    183.43   133.63    0.00     1328.64   1328.64   1.57
V5 Distance          56.49     112.38   19.93     0.00     2501.57   2501.57   10.10
HSR Distance         243.07    261.29   163.54    0.00     3191.37   3191.37   2.38
Session Count        81.22     51.31    75        1        167       166       0.13

Table 5.2: Summary statistics after smoothing by bin means


Statistic                            Non-contact (NC)    Non-contact time-loss (NCTL)
Total Count                          45                  14
Players Injured                      22                  11
Players with 1 Injury                11                  8
Players with 2 Injuries              3                   3
Players with 3 Injuries              5                   0
Players with more than 3 Injuries    3                   0
Days missed                          551                 551

Table 5.3: Injury statistics

5.2.2 Summary Statistics for the Injury Data

A summary of injury data for both definitions of non-contact injury is presented in Table 5.3. As described in Section 5.1, two definitions of injury are considered, namely non-contact (NC) and non-contact causing time loss (NCTL). The former definition naturally has a higher number of injuries when compared with the stricter definition, and it can be seen that there are over three times as many NC injuries as there are NCTL injuries. In addition to NC injuries occurring more frequently, a higher incidence of players incurring multiple injuries is also observed for this definition of injury. Half of all the players that received NC injuries were injured on more than one occasion.

Figure 5.1 provides a visual perspective of player injuries in the form of a Gantt chart. NCTL injuries are represented in orange, whereas NC injuries are represented by both the blue and orange bars in the chart. Each bar in the chart represents the date and duration of an injury for a given player. The figure provides a clear indication of the sequence of injury occurrence through the season.

The span of the orange bars represents both the duration of a player's injury and the number of days a given player was absent from training and competition. The span of the blue bars, however, indicates only the duration of a player's injury, and not the number of missed sessions. The chart highlights several important issues that must be considered in the next phase of the data mining cycle. Firstly, multiple incidents of injury occur very early in the season and may need to be excluded from the final data set as a result of there being insufficient workload data prior to their occurrence. Secondly, two injuries resulted in the affected players being absent from training and competition for approximately half of the season's duration. As a result of their absence, there is limited workload data for these players, a factor which could influence model performance. Another consideration worth noting is that 8 of the 14 NCTL injuries are preceded by another injury.

5.2.3 Comparison of Workload Features for Injured and Non-Injured Players

A comparison of load features for non-injured players and players that received NC injuries is presented using boxplots in Figure 5.2. Workloads of players who did not get injured during the season are represented in orange, whereas the workloads of players who did receive injuries are represented in blue. It can be seen in both Figures 5.2a and 5.2b that the load features for injured and non-injured players are almost identical. The similarity between the figures' median values, interquartile ranges (IQR), as well as their minimum and maximum values, does not provide much insight into possible relationships between load features and non-contact injuries. One point worth noting is that players with injuries had a greater number of outliers for both workload features.


Figure 5.1: Visual overview of injury occurrence for the 2019 season

The fact that these outliers are present in both features highlights the correlation between the two features, both of which are calculated from acceleration data provided by the GPS unit's accelerometer.

Similarly, a comparison of distance features for non-injured players and players that received NC injuries is presented using boxplots in Figure 5.3. Median distance values in Figure 5.3a are very similar for both injured and non-injured players. This is particularly surprising, as players carrying injuries are expected to have lower distance values as a result of reduced participation caused by injury. Despite the apparent similarity between the boxplots presented in Figure 5.3b, the observed differences between the HSR distances of non-injured and injured players are the most noteworthy of all workload features in this study. This is an interesting observation, as higher HSR values have been associated with the incidence of injury [39].

Finally, a comparison of distance features for injured and non-injured players with respect to the NCTL definition of injury is presented. It is seen in Figure 5.4 that the stricter definition of injury yields almost identical boxplots for both distance features. This is again surprising for the distance feature, as players with more severe injuries are expected to have lower average mileages as a result of lower training loads after returning from injury. In contrast to the differences in HSR distance observed for players with NC injuries, almost no difference is seen between the HSR distances of non-injured players and players with NCTL injuries. Due to the similarity of load values for players with NCTL injuries and non-injured players, no boxplots are included.

An initial comparison between non-injured players and players with NC and NCTL injuries thus does not provide much insight into potential relationships that may exist between workload features and injury. Workload values are similar for all features and both definitions of injury, with the difference in HSR distance between non-injured players and players with NC injuries being the most notable.


(a) Player Load (b) Acceleration Load

Figure 5.2: Boxplots comparing the workload features of injured and non-injured players with respect to NC injury

(a) Distance (b) HSR Distance

Figure 5.3: Boxplots comparing the distance features of injured and non-injured players with respect to NC injury

(a) Distance (b) HSR Distance

Figure 5.4: Boxplots comparing the distance features of injured and non-injured players with respect to NCTL injury


5.3 Data Preparation

The data preparation phase involves constructing a final data set from the data presented in the previous phase. The final data set consists of the six original GPS workload features, twelve additional relative workload features, and a single injury feature calculated from the player injury data. Given the discrete nature of the original data set, data transformations are required to provide an indication of a player's accumulated workload. Several well-studied methods for calculating the relative workload of a player are discussed in detail below.

5.3.1 The Exponentially Weighted Moving Average (EWMA)

The exponentially weighted moving average was first introduced as a control scheme for detecting small shifts in the mean of a process [37]. More recently, it has also been proposed as a method for calculating the relative workloads of athletes, to account for the decaying nature of fitness and fatigue over time [49]. This is achieved using a decay factor that weights the importance of recent workloads, with recent workloads weighted more heavily than older ones. The result is thus a smoothed average of a given workload. A simple representation of the formula is given as:

EWMA_today = λ × Load_today + (1 − λ) × EWMA_yesterday        (5.1)

where 0 ≤ λ ≤ 1.

λ represents the degree of decay, with higher values discounting older observations at a faster rate. It is commonly represented as:

λ = 2 / (N + 1)        (5.2)

where N is referred to as the span and represents the chosen decay constant.

EWMA features are created for each of the six GPS workload features using a span of 6 [39; 5]. The EWMA calculations account for non-training days by using a value of zero. Thus players with fewer sessions will have lower EWMA values for a given period than players with more sessions (assuming that both sets of players have similar workloads for each session).
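Using pandas, the EWMA features can be sketched as below; player_df is assumed to hold one row per calendar day for a single player, with zero loads on non-training days, and the workload column names are assumptions.

import pandas as pd

WORKLOAD_COLS = ["total_distance", "player_load", "acceleration_load",
                 "v4_distance", "v5_distance", "hsr_distance"]

ewma_features = (player_df[WORKLOAD_COLS]
                 .ewm(span=6, adjust=False)   # span = 6, i.e. lambda = 2 / (6 + 1) as in Equation 5.2
                 .mean()
                 .add_suffix("_ewma"))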

5.3.2 The Mean Standard Deviation Ratio (MSWR)

The mean standard deviation ratio is a commonly used technique to quantify the monotony of a player's workloads [39; 4; 1; 5]. It is defined as the ratio between the mean and the standard deviation of a given workload for a specified period. MSWR values thus reflect the variation in a player's workload, with higher values indicating more monotonous (less varied) workloads for the specified period.

MSWR_t = μ_t / σ_t        (5.3)


where μ_t is the mean of a workload feature for time period t and σ_t is the standard deviation of the same workload feature for time period t.

MSWR features are created for each of the GPS features as the average of a feature's workload for the previous seven days divided by the standard deviation of the workload for the same period [39; 4; 1; 5]. The MSWR calculation includes all calendar days since the first recorded session for each player and uses a value of zero for days on which no workload data is recorded. As the feature requires seven calendar days of workload data, the first six calendar days since a player's first session are not included in the final data set.
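Continuing the sketch above, the MSWR features can be computed with a seven-day rolling window; the first six days are NaN because a full window is not yet available, matching the exclusion of those days from the final data set.

rolling = player_df[WORKLOAD_COLS].rolling(window=7)   # seven calendar days, zeros on rest days
mswr_features = (rolling.mean() / rolling.std()).add_suffix("_mswr")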

5.3.3 Injury Feature

The injury feature created in this study takes into account the number of previous injuries a player has received, as well as the number of days since a player's return to play after recovering from their most recent injury [39]. Its calculation involves modifying the EWMA presented above, which can be written as a moving average of past and current observations [31]:

Z_i = λ · Σ_{j=0}^{i−1} (1 − λ)^j X_{i−j} + (1 − λ)^i Z_0        (5.4)

where Z_i is the calculated EWMA after i sessions, X_i is the recorded value for session number i, λ is a value between 0 and 1, and Z_0 is the starting EWMA value.

By setting X_i to the number of injuries a player has had to date, and Z_0 to X_i − 1, the injury feature can be calculated as follows:

I_d = λ · Σ_{j=0}^{d−1} (1 − λ)^j X_i + (1 − λ)^d (X_i − 1)        (5.5)

where I_d > 0 for X_i ≥ 1, I_d = 0 for X_i = 0, and d is the number of days since a player's return to play after their last injury.

A feature value of zero thus means that a player has never been injured, whereas a value greater than zero indicates that a player has had at least one injury during the season. For players with one or more injuries, the feature increases for each day that passes since the player's return to play. For a player with a single injury, the value of the feature starts from 0 and increases for each day that passes until it eventually reaches a value of 1. Similarly, for a player with two injuries, the value of the feature grows from 1 to 2, increasing every day until the value of 2 is reached. The rate of growth of the feature is determined by λ, with larger values resulting in faster growth rates.


The injury feature thus serves as an indication of both the number of injuries a player has received and the number of days since returning to play after their last injury.
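Because the geometric sum in Equation 5.5 can be collapsed, the feature has a simple closed form, sketched below; the value of λ is an assumption, as the thesis does not state the decay rate used.

def injury_feature(n_injuries, days_since_return, lam=0.1):
    """Closed form of Equation 5.5: starts at n_injuries - 1 on the day of return
    to play and approaches n_injuries as days pass."""
    if n_injuries == 0:
        return 0.0
    return n_injuries - (1 - lam) ** days_since_return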

5.3.4 Creation of the Final Data Set

Two data sets are constructed for the modelling phase, one for the modelling of NC injuries and the other for the modelling of NCTL injuries. Both data sets include six GPS workload features, six EWMA workload features, six MSWR workload features, one injury feature, and a label indicating whether the player got injured after the current session. Each of the data sets consists of data from a total of 2848 sessions completed by 32 players and represents a subset of the workload data presented in the previous phase. Table 5.4 presents an overview of several differences between the data sets generated for modelling and the data set presented in the section on data understanding.

Statistic                  ML Data Set    Data Warehouse
Number of sessions         2848           3005
Number of players          32             38
Number of NC injuries      34             45
Number of NCTL injuries    13             14

Table 5.4: Comparison of ML data set and data from data warehouse

From the table above, it can be seen that multiple tuples are excluded in the process of creating the data set to be used for modelling. Among the data excluded is workload data from 6 of the 38 players. Data from these players is excluded because they have an insufficient number of recordings. Additionally, all recorded sessions from January are excluded because of the extensive time gap between the last session in January and the first session in February. As mentioned in the section on MSWR workloads, all sessions within the first six days since a player's first recorded session are excluded from the final data set. This is done to remove MSWR values of zero.

Furthermore, it can be seen that neither the NC nor the NCTL data set includes all injuries from the data warehouse. In the case of the NC data set, injuries are excluded if they took place very early in the season, or if an injury is associated with a player that is not included in the final data set. The NCTL data set loses one case of injury as a result of the exclusion of a player with insufficient workload data.

The final data set generated in the data preparation phase is thus a matrix of 2848 vectors, where each vector contains 18 workload features, an injury feature, and a label. A visual representation of the data set is presented in Figure 5.5.

WF_{1,1}     WF_{1,2}     ...   WF_{1,18}     IF_{1,19}     L_{1,20}
WF_{2,1}     WF_{2,2}     ...   WF_{2,18}     IF_{2,19}     L_{2,20}
   ...          ...       ...      ...           ...           ...
WF_{2848,1}  WF_{2848,2}  ...   WF_{2848,18}  IF_{2848,19}  L_{2848,20}

Figure 5.5: Representation of the final data set constructed in the data preparation phase


5.3.5 Correlation Matrix for the Features in the Final Data Set

Correlation is a measure of the extent to which two variables change together. A correlation matrix for the final data set is presented in Figure 5.6 and provides an indication of the degree of correlation between each pair of variables in the data set.

Several interesting relationships are seen with the aid of the figure. Firstly, there is a weak correlation between the injury feature and all of the workload features. The feature is thus considered independent of all other features and may prove useful as a predictor variable in the modelling phase. Secondly, there are 16 highly correlated relationships (> 0.80) between variables, meaning that if highly correlated features were to be removed before modelling, this would result in the removal of eight features from the data set. One set of highly correlated features is that of total distance, player load, and acceleration load. High correlation values are also observed for the EWMA and MSWR values of these features. These relationships are not surprising, as both player load and acceleration load values are expected to increase with an increase in the total distance covered by a player.

Another set of highly correlated features is that of HSR distance and V4 distance. This is also not surprising, given that HSR distance is the sum of the V4 and V5 distances. The fact that V4 distance has a stronger correlation with HSR distance than V5 distance does indicates that V4 distance forms a more significant portion of the HSR distance.

Figure 5.6: Correlation matrix of the injury and workload features in the final data set
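A sketch of how such a matrix and the highly correlated pairs might be computed with pandas is shown below; final_df is assumed to be the data set of Figure 5.5 with a label column.

import numpy as np

corr = final_df.drop(columns=["label"]).corr()                       # pairwise Pearson correlations
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))    # keep each pair only once
highly_correlated = upper.stack().loc[lambda s: s.abs() > 0.80]      # the strongly related pairs discussed above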


5.4 Modelling

This section provides a detailed account of the entire modelling approach taken in this thesis and includes a description of the models and techniques which are used. Four modelling approaches are applied to both the NC and NCTL data sets. Each of these approaches uses four classification algorithms of varying complexity and interpretability to build predictive models for each of the two data sets. The section begins with a brief presentation of the four classification models, before providing an explanation of the techniques used to account for unbalanced data, overfitting, and multicollinearity. Several key concepts which are reproduced from a previous study are then explained. Finally, a presentation of each of the four modelling approaches is provided.

5.4.1 Model Selection

• Decision Tree (DT) classification is a non-parametric approach in which models are fitted by learning simple decision rules inferred from predictor variables. In so doing, the predictor space is segmented into several simple regions. Prediction for a given observation thus involves taking the most commonly occurring class of the training observations in the region to which it belongs [22]. Decision trees are considered an ideal approach for gaining insight into injury and training workloads, due to the relative ease with which they may be interpreted. Additionally, their accuracy is not adversely affected by the presence of redundant attributes, a phenomenon that is likely to be encountered among workload features [44]. Possible drawbacks include the classifier's susceptibility to lower prediction accuracy when compared with other classification techniques [22]. DT classifiers are also susceptible to the inclusion of irrelevant attributes in the tree-building process [44].

• Random Forests (RF) are an ensemble learning method for classification which builds upon the concept of decision trees in an effort to build more powerful prediction models. The algorithm works by building several decision trees based on bootstrapped training samples. In contrast to building classical decision trees, each time a split in a given tree is considered, a random sample of predictors is chosen as split candidates. These candidates represent a subset of the full set of available predictors, and a new subset of predictors is chosen for each split. This technique reduces the tendency of a model to overfit to its training set by decorrelating the trees. The average of the resulting trees is thus less variable and more reliable [22].

• Support Vector Machines (SVM) separate classes using a subset of the training observations known as support vectors. The basic principle involves the construction of a hyperplane which aims to maximise the margin between itself and the support vectors. Decision boundaries with larger margins tend to have lower generalisation errors than those with small margins [44]. This principle can be extended to accommodate non-linear decision boundaries by enlarging the feature space to a higher-dimensional space through the use of kernels [22]. Although support vector machines have shown promising results in many practical applications, they can be ineffective in determining class boundaries when dealing with unbalanced data sets [51].

• Logistic Regression (LR) lends itself well to the binary nature of the problem being modelled in this thesis and is a common choice for modelling the outcome of injury [5]. By making use of a logistic function, this algorithm produces an S-shaped prediction curve, which limits predictions to a value between 0 and 1. Model fitting involves estimating a set of regression coefficients that maximise a given likelihood function.


Once this has been done, injury predictions are made by computing the probability of injury from a set of predictor variables [22].
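A minimal sketch of how these four classifiers might be instantiated with scikit-learn is given below; the hyperparameter values are placeholders rather than the settings tuned in this thesis.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

models = {
    "DT": DecisionTreeClassifier(max_depth=5, random_state=42),
    "RF": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM": SVC(kernel="rbf", probability=True, random_state=42),
    "LR": LogisticRegression(max_iter=1000),
}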

5.4.2 Adaptive Synthetic Sampling

Adaptive synthetic sampling (ADASYN) is used before building models to correct the severe class imbalance which is present in both the NC and NCTL data sets. At a high level, ADASYN corrects class imbalance by generating synthetic examples of the minority class. The sampling technique facilitates learning from imbalanced data sets by reducing bias and learning adaptively. A vital concept of the technique involves the use of a density distribution. The distribution is used for determining the number of synthetic samples that need to be generated for each example of the minority class. This particular strategy ensures that more synthetic data is generated for minority-class examples in neighbourhoods dominated by the majority class. Thus, the resulting data set is balanced, and learning algorithms are forced to focus on harder-to-learn examples [18]. One potential weakness that may result from the adaptive nature of this algorithm is that neighbourhoods with few examples of the minority class may end up with a lot of synthetic data which is very similar to that of the majority class. This can affect the precision of learning models as a result of too many false positives.
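In code, the resampling step amounts to a single call to the imbalanced-learn implementation; X_train and y_train are assumed to come from the training folds only, so that synthetic minority examples never leak into the validation data.

from imblearn.over_sampling import ADASYN

X_resampled, y_resampled = ADASYN(random_state=42).fit_resample(X_train, y_train)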

5.4.3 Stratified K-Fold Cross Validation

Cross-validation is a resampling method that can be used to assess a model's generalisation ability on a limited set of data. In k-fold cross-validation, the training data is divided into k disjoint subsets of approximately equal size. The model is fitted on k−1 of the subsets, which together form the training set. The remaining subset is considered the validation set and is used to assess the model's performance. This procedure is repeated k times until each subset has served as the validation set. A model's cross-validated performance is calculated as the mean of the k measures [3]. Stratified cross-validation is particularly important when dealing with unbalanced data sets, as it ensures that each class is approximately equally represented across all subsets. Without this guarantee, there is a risk of one or more subsets not having any examples of the minority class. K-fold cross-validation is typically performed with k = 5 or k = 10, as there is empirical evidence that these values are associated with good bias-variance trade-offs [22]. Furthermore, these values avoid the expensive computational costs that may result from high values of k.
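A minimal sketch of stratified k-fold splitting with scikit-learn is shown below; the toy data is a placeholder with roughly the same class imbalance as the injury data sets.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold

    X, y = make_classification(n_samples=2800, weights=[0.99], random_state=0)

    # Each fold preserves the ~1% injury rate of the full data set.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        print(int(y[test_idx].sum()), "injury examples in this validation fold")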

5.4.4 Feature Selection

Feature selection refers to the process of eliminating predictor variables from a data set which do not contribute to a model's predictive performance. The motivation for reducing the number of features in a data set typically falls into one of two categories. The first is aimed at improving a model's performance by eliminating predictor variables which may negatively affect it. Support vector machines are sensitive to irrelevant predictors, and a model's performance may suffer if they are not removed. A further consideration is that of highly correlated predictors, which can negatively impact models such as logistic regression. The second reason for removing predictor variables is to reduce a model's complexity. Both of the above-mentioned motivating factors are highly desirable in this thesis. More specifically, the elimination of irrelevant or highly correlated variables is desirable because it may improve predictive performance for the logistic regression and SVM models. Alternatively, the elimination of predictors that do not compromise a model's performance is desirable because a reduction in a model's complexity may significantly improve its interpretability. This is particularly relevant for random forest and decision-tree models [29].

Feature selection techniques can be grouped into three classes, namely intrinsic methods, filter methods, and wrapper methods. For intrinsic methods, feature selection forms a part of the modelling process, and there is a direct connection between selecting features and the statistic which the model attempts to optimise. In filter methods, a single supervised search is performed to determine which predictors are of importance. Wrapper methods take an iterative approach by providing a predictor subset to the model and using its evaluation as the selection criteria for the next subset [29].

This thesis uses a wrapper method known as recursive feature elimination (RFE) to remove potentially irrelevant features from the injury/workload data set. RFE is an example of a greedy wrapper that eliminates features in each iteration to achieve the best immediate results. The process begins by ranking each predictor with a measure of importance. Each successive round removes one or more features of low importance before performing the ranking procedure on the new subset of features. Upon termination of the process, a ranking of importance for all predictor variables has been calculated. The subset used for model building is thus a selection of the predictors with the highest importance ratings [29].
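A minimal sketch of recursive feature elimination with scikit-learn is given below; the estimator, data, and number of retained features are illustrative assumptions, not the configuration used in this thesis (the thesis uses RFECV with a decision tree, as described later in this section).

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=19, n_informative=5,
                               random_state=0)

    # Recursively drop the least important feature until five remain,
    # producing an importance ranking for every predictor.
    rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
              n_features_to_select=5, step=1)
    rfe.fit(X, y)
    print("Kept features:", rfe.support_.nonzero()[0])
    print("Ranking (1 = kept):", rfe.ranking_)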

5.4.5 Features From Previous Work

As already mentioned, this thesis is largely inspired by previous work, which was able to predict the onset of injury among football players with reasonable success [39]. The results from the work mentioned above are achieved using data gathered from a single professional football team during one season of training and competition. A subset of only three features was identified as relevant for the prediction of injury. The features identified were the exponentially weighted moving average of a player's high-speed running distance (HSR_EWMA), the mean standard deviation ratio of a player's total distance (Dist_MSWR), and the exponentially weighted moving average of a previous injury feature (PI_EWMA). Part of this thesis uses a combination of techniques and features from the work mentioned above to test their reproducibility on a new set of data. There are, however, several important differences between the data set presented in this study and the data set from the study mentioned above. The differences noted are as follows:

• The previous study used an NCTL definition of injury, whereas this study takes into consideration both NCTL and NC injury definitions.

• The previous study reported 21 incidents of NCTL injury, whereas this study has only 13.

• The previous study reports 931 individual sessions, whereas this study includes 2848 individual sessions.

5.4.6 Modelling Approach

Four modelling approaches are taken to compare the effects of the data processing techniques discussed above. All four classifiers and both the NC and NCTL data sets are used in each of the four approaches. Starting with a simple naive approach, each successive approach incorporates additional techniques in an attempt to improve model performance and gain insight into the effects of a feature's contribution to the incidence of injury. Each approach involves building models using one portion of the data set and testing them on the remaining portion. Data sets are stratified to ensure the correct representation of the injury class in both training and testing data sets. Additionally, all approaches are subjected to 1000 iterations of training and testing, and results are reported as an average of the 1000 performance measurements. A description of the different approaches follows.

Naive Approach
The first of the four approaches involves building and testing models using each of the two data sets without modifying the data. In each iteration, the data is split into a test and a training set. Models are then built using the training data, after which they are evaluated on the designated test data. The splitting ratio differs for the two data sets to ensure that the test sets for both NC and NCTL data contain approximately the same number of test examples. A train-test ratio of 80:20 is used for the NCTL data, whereas a ratio of 55:45 is used for the NC data. These ratios ensure approximately six to seven test cases for both sets of data. An illustration of the test-train split for the NC data is presented in Figure 5.7.

Figure 5.7: An illustration of a test/train split
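As a minimal sketch of this naive approach, the snippet below performs a stratified 55:45 train-test split (the NC ratio) and evaluates a single classifier; the data is a synthetic placeholder with an injury rate similar to the NC data set.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import precision_score, recall_score

    # Toy stand-in for the NC data set (~1% injuries, 19 features).
    X, y = make_classification(n_samples=2848, n_features=19, n_informative=5,
                               weights=[0.99], random_state=0)

    # Stratification keeps injury examples in both partitions.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.45,
                                              stratify=y, random_state=0)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    print("Precision:", precision_score(y_te, y_pred, zero_division=0))
    print("Recall:   ", recall_score(y_te, y_pred, zero_division=0))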

ADASYN and Stratified K-Fold Cross Validation
The second approach aims to compensate for the class imbalance by generating synthetic data. Additionally, stratified k-fold cross validation is used in an attempt to avoid models overfitting to the training set. The approach is presented in Figure 5.8, which illustrates how ADASYN and stratified 5-fold cross validation are used in combination with one another. From the figure, it can be seen that the initial data set consists of 2814 sessions which are not associated with injury and 34 sessions which are associated with an injury. The data is then split into five folds of approximately equal injury to non-injury ratios. In the next step, one fold is selected as a test set, while the remaining folds are used together as the training set. Before training is initiated, ADASYN is used to balance the training set by generating synthetic data for the injury class. The resulting training set thus has an approximately equal number of examples of injury and non-injury. Once training is completed, a model's performance is tested using the test fold. The step involving the selection of a testing set and the generation of synthetic data is repeated five times until each of the five folds has served as the test set. Performance measurements (discussed in the evaluation phase) are calculated as an average of the measurements from the five steps. The entire process is repeated 1000 times for each of the four models and both of the two data sets. To maintain a similar number of injury test cases among all test sets, the value of k is set to five for NC data and two for NCTL data.


Figure 5.8: An illustration of 5-Fold Cross Validation with ADASYN
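The sketch below illustrates the key point of this approach: ADASYN is applied only to the training folds inside the cross-validation loop, never to the held-out fold. The data, classifier, and fold count are illustrative placeholders rather than the exact configuration of the thesis.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from imblearn.over_sampling import ADASYN

    X, y = make_classification(n_samples=2848, n_features=19, n_informative=5,
                               weights=[0.99], random_state=0)

    aucs = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        # Oversample the minority class in the training folds only.
        X_bal, y_bal = ADASYN(random_state=0).fit_resample(X[train_idx], y[train_idx])
        model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
        # Evaluate on the untouched validation fold.
        scores = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))

    print("Mean AUC over folds:", np.mean(aucs))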


Previously Identified Features
The third approach is almost identical to the second, with the exception being the choice of predictor variables included in the data set. A subset consisting of the predictor variables presented in Section 5.4.5 is used to determine whether they influence a model's ability to detect injury.

Feature Elimination
The final approach combines the second approach described above with the process of feature elimination. This is essentially a replication of the entire modelling approach taken in the study presented in Section 5.4.5. There are two objectives to this approach. The first is aimed at improving a model's performance by removing features that may negatively affect the model's ability to learn. The other involves identifying a combination of features that are associated with an injury. Due to there being too few cases of injury in the NCTL data set, the feature elimination approach is carried out using only the NC data set. As with the approaches presented above, performance measures are reported as an average of 1000 iterations. The entire approach is illustrated in Figure 5.9 and summarised as follows:

1. Step 1 involves splitting the data set. A 30:70 split ratio is used, with 30% of the data being used for the first three steps, while 70% is reserved for constructing and testing prediction models.

2. In Step 2, ADASYN is used to correct the class imbalance by generating synthetic data for the minority class. In the case of the NC injury data, this results in approximately 844 examples for each class.

3. Feature selection is then applied in Step 3 using RFECV, and the predictor variables most relevant for classification are recorded. RFECV is performed using a Decision Tree classifier, the number of folds is set to k = 5, and the performance measure is specified as AUC ROC.

4. Step 4 marks the beginning of the second phase of the modelling approach and involves building and testing models using the second portion of data from the split in Step 1.

5. In Step 5, predictor variables identified as irrelevant in Step 3 are removed from the data set.

6. In Step 6, the filtered data is used to perform stratified k-fold cross validation, using ADASYN to compensate for class imbalance in the training set. The process is identical to that illustrated in Figure 5.8, with the exception that k = 3 is used.


Figure 5.9: An illustration of Feature Elimination combined with K-Fold Cross Validation and ADASYN
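A condensed sketch of the two-phase approach summarised above is given below, using synthetic placeholder data: 30% of the data drives RFECV-based selection, and the remaining 70% is used for the ADASYN and stratified cross-validation modelling loop shown earlier. The class balance and parameters are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, StratifiedKFold
    from sklearn.feature_selection import RFECV
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score
    from imblearn.over_sampling import ADASYN

    X, y = make_classification(n_samples=2848, n_features=19, n_informative=5,
                               weights=[0.98], random_state=0)

    # Step 1: 30% for feature selection, 70% for model building and testing.
    X_sel, X_mod, y_sel, y_mod = train_test_split(X, y, train_size=0.30,
                                                  stratify=y, random_state=0)

    # Steps 2-3: balance the selection partition, then run RFECV with a DT.
    X_bal, y_bal = ADASYN(random_state=0).fit_resample(X_sel, y_sel)
    rfecv = RFECV(DecisionTreeClassifier(random_state=0),
                  cv=StratifiedKFold(n_splits=5), scoring="roc_auc").fit(X_bal, y_bal)

    # Steps 4-6: drop rejected features, then repeat the ADASYN + stratified CV loop.
    X_mod = X_mod[:, rfecv.support_]
    aucs = []
    for tr, te in StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X_mod, y_mod):
        Xb, yb = ADASYN(random_state=0).fit_resample(X_mod[tr], y_mod[tr])
        clf = DecisionTreeClassifier(random_state=0).fit(Xb, yb)
        aucs.append(roc_auc_score(y_mod[te], clf.predict_proba(X_mod[te])[:, 1]))

    print("Selected features:", rfecv.support_.nonzero()[0], "mean AUC:", np.mean(aucs))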

5.5 Evaluation

5.5.1 Evaluation Metrics

The use of common performance measures such as accuracy or error rate may produce misleading results when dealing with unbalanced data sets due to their dependence upon class distribution [19]. This can be illustrated using the example of the NC injury data set. As the injury class represents approximately 1% of the data, a classifier may miss every case of injury and still achieve an accuracy of 99%. An additional issue associated with common performance measurements is the cost of misclassification. As it is often more important to be able to identify the minority class than the majority class, misidentifying the minority class has a more significant consequence [10]. This is of particular importance for this study, which places more emphasis on a classifier being able to identify cases of injury correctly. It is for this reason that the performance measures used in this thesis are directed towards class-independent quantities.

As there appears to be no consensus among previous studies as to which performance measures should be used [39; 5; 23], this thesis makes use of a variety of widely used evaluation techniques. Table 5.5 presents a confusion matrix that forms the basis of the majority of measures used in this study. In binary classification problems, predictions fall into one of two categories, either positive or negative. This gives rise to four possible outcomes, namely True Positives, False Positives, True Negatives, and False Negatives. The values of these outcomes are used for calculating the measures presented below.

                     Predicted Positive    Predicted Negative
Actual Positive      TP                    FN
Actual Negative      FP                    TN

Table 5.5: A confusion matrix for binary classification

Precision represents the fraction of cases classified as positive that are, in fact, positive [33]. It serves as an indication of a model's trustworthiness. In the case of injury prediction, a high precision means that the model identifies injuries with a great deal of certainty, whereas a low precision means that the model raises many false alarms.

Precision = TruePositives / (TruePositives + FalsePositives)    (5.6)

Recall is an indication of a model's ability to identify the injury class. The higher the recall, the better a model's ability to detect the injury class. A low recall results in an inability to predict a player getting injured.

Recall = TruePositives / (TruePositives + FalseNegatives)    (5.7)

The F1 Score is the harmonic mean of recall and precision, combining the two into one measure.

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)    (5.8)

Area under the curve (AUC): The Receiver Operating Characteristic (ROC) curve is a frequently used tool for evaluating the performance of a classifier in the presence of unbalanced data. It is a graphical plot of a classifier's true positive rate (recall) versus its false positive rate at all classification thresholds. Better performance is associated with steeper curves, whereas a model's inability to differentiate between classes would result in a diagonal curve from the bottom left to the top right of the plot [33]. The area under the curve (AUC) provides an aggregate measure of performance, which measures the entire two-dimensional area under the ROC curve. Scores close to 1 are associated with good performance, whereas scores around 0.5 are equivalent to random guessing.
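All four metrics can be computed with scikit-learn (see Appendix A). The snippet below is a small illustration; the labels, predictions, and probability scores are placeholders, not results from this study.

    from sklearn.metrics import (precision_score, recall_score, f1_score,
                                 roc_auc_score, confusion_matrix)

    # Illustrative labels and predictions only.
    y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
    y_pred  = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]                     # hard class predictions
    y_score = [0.1, 0.2, 0.6, 0.1, 0.3, 0.2, 0.8, 0.4, 0.1, 0.9]  # predicted probabilities

    print(confusion_matrix(y_true, y_pred))            # [[TN, FP], [FN, TP]]
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall:   ", recall_score(y_true, y_pred))
    print("F1:       ", f1_score(y_true, y_pred))
    print("AUC:      ", roc_auc_score(y_true, y_score))  # uses scores, not hard labels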


5.5.2 Results

A summary of all results is presented in Table 5.6.

Approach                         Classifier  Injury  Precision    Recall       F1           AUC
Naive Approach                   SVM         NC      0.00         0.00         0.00         0.50
                                             NCTL    0.00         0.00         0.00         0.50
                                 LR          NC      0.00         0.00         0.00         0.50
                                             NCTL    0.00         0.00         0.00         0.50
                                 RF          NC      0.00         0.00         0.00         0.50
                                             NCTL    0.00         0.00         0.00         0.50
                                 DT          NC      0.02±0.05    0.01±0.04    0.01±0.04    0.50±0.02
                                             NCTL    0.01±0.03    0.00±0.03    0.01±0.03    0.50±0.01
ADASYN & Stratified K-Fold CV    SVM         NC      0.52±0.27    0.02±0.01    0.03±0.02    0.50±0.01
                                             NCTL    0.08±0.08    0.00±0.00    0.00±0.00    0.50±0.00
                                 LR          NC      0.44±0.24    0.02±0.01    0.03±0.02    0.50±0.01
                                             NCTL    0.38±0.05    0.01±0.00    0.02±0.01    0.50±0.00
                                 RF          NC      0.06±0.09    0.02±0.03    0.03±0.04    0.50±0.01
                                             NCTL    0.04±0.01    0.02±0.03    0.03±0.04    0.51±0.01
                                 DT          NC      0.10±0.02    0.02±0.01    0.03±0.02    0.50±0.01
                                             NCTL    0.08±0.09    0.02±0.02    0.02±0.03    0.51±0.01
Previously Identified Features   SVM         NC      0.45±0.24    0.01±0.01    0.02±0.01    0.50±0.01
                                             NCTL    0.45±0.12    0.00         0.01         0.50
                                 LR          NC      0.33±0.18    0.01±0.01    0.02±0.01    0.50±0.01
                                             NCTL    0.46±0.04    0.01         0.01         0.50
                                 RF          NC      0.15±0.07    0.02±0.01    0.02±0.01    0.50±0.00
                                             NCTL    0.17±0.08    0.01±0.01    0.01±0.01    0.50
                                 DT          NC      0.17±0.08    0.02±0.01    0.03±0.03    0.50±0.01
                                             NCTL    0.19±0.10    0.01±0.01    0.02±0.01    0.50
Feature Selection & ADASYN &     SVM         NC      0.65±0.17    0.02±0.01    0.04±0.01    0.51±0.01
Stratified K-Fold CV             LR          NC      0.48±0.10    0.02±0.01    0.04±0.02    0.51
                                 RF          NC      0.06±0.08    0.02±0.02    0.03±0.04    0.50±0.01
                                 DT          NC      0.08±0.13    0.01±0.01    0.02±0.02    0.50±0.01

Table 5.6: Summary results of all models. No NCTL results are reported for the feature selection approach, which is carried out on the NC data set only (Section 5.4.6).


An explanation of the results presented in Table 5.6 is provided below.

Naive Approach
It can be seen from Table 5.6 that by not employing techniques to account for the class imbalance and collinearity present in the data sets, the performance achieved by all models is extremely poor. Apart from decision trees, all of the models achieve precision, recall, and F1 scores of 0.00 for both definitions of injury. DT models are the only models able to produce prediction scores above zero, which appear to be marginally higher for NC injuries when compared to NCTL injuries. The performance measures for both data sets are associated with a high degree of variation. AUC scores of 0.50 are achieved for all models, which indicates that models are unable to distinguish between injury and non-injury classes. Prediction of injury is thus considered equivalent to guessing. Table 5.9 presents the confusion matrices for SVM and DT models after being tested on NC data. These are identified as the poorest and the best performers, respectively. Here it is evident that the SVM models create decision boundaries which classify all test data as belonging to the non-injury class. Hence, no predictions of the injury class are seen. The decision boundaries created by the DT models, however, do result in positive predictions, but with a very low degree of precision. Here it can be seen that only 117 of 8797 predictions of injury are, in fact, true cases of injury.

                     Predicted Positive    Predicted Negative
Actual Positive      0                     7000
Actual Negative      0                     563000

Table 5.7: Confusion matrix for SVM

                     Predicted Positive    Predicted Negative
Actual Positive      117                   6883
Actual Negative      8680                  554320

Table 5.8: Confusion matrix for DT

Table 5.9: A comparison of confusion matrices for SVM and DT models using NC data

ADASYN and Stratified K-Fold Cross Validation
The introduction of cross validation and synthetic data generation produces higher precision, recall, and F1 scores for all models and both data sets. The most notable performance improvement is observed among precision scores. SVM models achieve the highest average precision score of 0.52 for NC injuries, a drastic improvement from the average precision score achieved using the naive approach. The SVM models are, however, unable to reproduce these results for NCTL injuries, for which an average precision score of 0.08 is achieved. All models achieve better precision results in the case of NC injuries, and hence fewer false alarms are raised for this definition of injury. Recall scores are low for all models. Low recall scores are the result of a high proportion of false negatives, meaning that models are unable to identify incidents of injury. These low values of recall are responsible for the poor F1 scores seen for all models. It could be said that F1 scores are marginally better in the case of NC injuries, but the values are so low that they are of no practical importance.


Figure 5.10: Boxplots comparing the area under the curve for models using the ADASYN and Stratified K-Fold approach. (a) NC, (b) NCTL.

A comparison of AUC scores for NC and NCTL injuries is presented in Figure 5.10 using boxplots. Here it can be seen that there is minimal variation between scores for both NC and NCTL injuries. An average AUC score of 0.50 indicates that none of the models are able to distinguish between classes, despite attempts being made to compensate for imbalance and overfitting.

Previously Identified Features
The reduction of the feature set does not result in improved performance, with the exception of several results seen among precision scores. In particular, models built using the NCTL data set show an improvement in precision scores for all models. Despite this improvement, none of the precision scores exceed 0.50, which is by no means a trustworthy indicator of injury. Furthermore, recall scores remain low, highlighting the models' inability to recognise incidents of injury. Figure 5.11 presents a comparison of AUC scores for models tested on both the NC and NCTL data sets. Near-perfect alignment of AUC scores around a value of 0.50 illustrates that models are unable to learn from the data. When compared to the boxplots presented in Figure 5.10, it can be seen that the subset of previously identified features does not significantly influence the median AUC scores. The only noticeable difference is that of the variance, which is slightly higher for the full data set when compared to the reduced data set. The difference in variance may be explained by more complex decision boundaries resulting from a larger feature set.


Figure 5.11: Boxplots comparing the area under the curve for models using the previously identified injury features. (a) NC, (b) NCTL.

Feature Selection
The introduction of RFECV produces marginally better results for both the SVM and LR models, whereas no improvement in performance is seen for the RF and DT models. By eliminating "irrelevant" features, the fourth modelling approach taken in this thesis results in the highest precision and F1 scores for both SVM and LR models. Despite these slight improvements, the approach is unable to produce results of any significance, and recall and AUC scores remain very low. A comparison of the AUC scores achieved using the feature elimination modelling approach is presented in Figure 5.12. As seen with all modelling approaches used in this thesis, low scores of around 0.50 are achieved by all models. This illustrates the models' inability to learn from the data, despite the elimination of "irrelevant" features.

Figure 5.12: Comparison of AUC scores for models using RFECV in combination with ADASYN and Stratified K-Fold CV


Figure 5.13 presents eight graphs taken from separate runs of the RFECV algorithm, where each graph plots the number of features against a DT classifier's prediction performance. The graphs are taken from separate iterations of the feature elimination approach and highlight how the number of features required to achieve maximum performance varies from one iteration to another. The figure presents graphs in which four to eleven features are considered optimal for maximising a classifier's performance. The high degree of variation can be explained by the similarity in the shape of the graphs. Classification performance is initially poor when the number of features is small. The performance rapidly improves as the number of features increases, after which it plateaus once approximately four to five features have been reached. The flattening of the curves reflects the algorithm's inability to identify a single subset of features that can be considered optimal, hence the significant variation in the number of features selected. Ideally, a decrease in performance should be seen after the optimal number of features has been reached, resulting in a graph with a clear peak and the curve falling away on both sides.

From 1000 iterations of RFECV, three features are common to all sets of selected features. These are the V5 distance of a player, the mean standard deviation ratio of a player's V4 distance, and a player's previous injury feature. Despite these three features being common to all feature sets, they are never selected without the inclusion of additional features. The additional features, however, differ, which may be caused by several factors. The first may be attributed to the high degree of correlation among many of the features. A high correlation between features may result in a feature being selected in one iteration, but being excluded in another iteration because another highly correlated feature is selected instead. Another factor that influences the variability of the number of features selected is the fact that only 30% of the data set is used for RFECV. As a result of the combination of a small number of injuries and a low degree of correlation between the workload features and injuries, it is logical that the features will differ from one iteration to another. The last factor involves the use of a decision tree classifier in the RFECV process. Decision trees are susceptible to overfitting, which may also contribute to the high degree of variability between iterations. An alternative option would be to use a random forest classifier, which is less susceptible to overfitting.

5.5.3 Discussion of Results

Models of the relationship between GPS workloads, a previous injury feature, and injury show limited ability to predict future injury among players from a professional Norwegian football team. Mean AUC scores are below 0.52 for all modelling approaches, indicating that injury predictions are no better than those expected by random chance. Precision scores are higher than recall scores for all modelling approaches, with the highest score of 0.65±0.17 being achieved by the feature selection approach. The inability to achieve recall scores higher than 0.02 means that all modelling approaches result in a large number of false-negative predictions, indicating that models are unable to identify the injury class. The use of different definitions of injury does not improve model performance, which supports evidence from a similar study on Australian football players [5]. The naive approach adopted in this study produced the worst results, indicating that the techniques adopted to compensate for unbalanced data and irrelevant features improve performance. Despite the use of previously identified features showing no improvement in predictive performance, 1000 iterations of feature elimination result in a subset of three features being selected in every iteration. One of these features, a previous injury feature, is identical to that identified in previous work and suggests that it may be a contributing factor to the risk of injury [39].


Figure 5.13: Comparison of feature selection. Panels (a)–(h) show separate RFECV runs in which four, five, six, seven, eight, nine, ten, and eleven features, respectively, are selected as optimal.


Several limitations that may negatively impact the prediction models' performance are identified. The amount of data collected in this study is significantly less than that collected in many other studies of a similar nature, which typically include a much larger number of players, multiple seasons of data recording, or a combination of both [40; 5; 23; 45; 38]. The limited size of the data set may severely impair a model's ability to generalise cases of injury, and models run the risk of overfitting to the training set. It has been proposed that more than ten seasons of data are needed to create reliable prediction models [5]. Two strategies can be used to increase the size of the data set. One of these involves the inclusion of additional teams and may be achieved by the sharing of data between football clubs. An advantage of this strategy is that a wide variety of players and workloads will improve a model's generalisation ability. Additionally, a large amount of data can be collected over a shorter time. The complexity of coordinating a larger-scale project is, however, a distinct disadvantage of this strategy. Another strategy is to continue collecting data during future seasons. The strategies mentioned above are not mutually exclusive and may be implemented together.

In addition to the limited amount of data, this study has a very low number of injuries with respect to the size of the data set. Using the example of NCTL injuries, similar studies report incidents of injury accounting for 1.7 and 2.2 percent of the data set [5; 39]. In contrast, NCTL injuries account for only 0.5 percent of the entire data set in this study. Severe class imbalance negatively affects the quality of data generated in the process of synthetic data generation, which again may impair a model's ability to generalise. There is little that can be done about the low number of injuries. However, injuries that are excluded from the data set due to being classified as pre-season injuries may be incorporated in future studies if off-season training loads are recorded. Additionally, by incorporating more clubs into the study, and by continuing to collect workload and injury data over multiple seasons, the number of injuries will naturally increase.

Another possible limitation encountered in this study is the definition of injury. All NC and NCTL injuries which occur in-season are included among the injury data, regardless of whether they took place during a session or not. As several incidents of injury occurred outside of training/competition, they may not be directly associated with previous workloads and, for this reason, negatively affect a model's ability to differentiate between classes of injury and non-injury. Similarly, more specific definitions of injury, such as hamstring injury, have been shown to produce better prediction performance [5]. Hamstring injury is, however, not used in this study due to an insufficient number of cases. Again, the collection of additional data using the strategies mentioned above may provide the data necessary to include hamstring injuries in future studies.

Finally, GPS data is also considered a possible limitation in this study. As discussed in the data understanding phase, multiple GPS recordings fall outside of the expected range. This highlights the fact that GPS devices are susceptible to providing inaccurate workload recordings. Despite efforts being made to correct the more apparent outliers, there is a high probability that several more subtle recording inaccuracies remain. An obvious solution to the problem is to replace the GPS units with newer models. This is, however, costly and may not be a viable alternative. Another option is to monitor GPS recordings more frequently, as opposed to at the end of the season after the ETL process has been performed. More frequent checking by coaching staff would provide researchers with a clearer indication of which recordings need to be corrected, thus limiting potential errors.


Chapter 6

Conclusion

Until recently, the majority of research on relationships between athlete workloads and injury has been limited to univariate studies. However, the emergence of machine learning has given rise to numerous multivariate studies. The work covered in this thesis forms part of a partnership between the University of Oslo (UIO) and the Norwegian School of Sports Science (NIH), which aims to study relationships between athlete workloads, injury and illness. With this thesis serving as the first work undertaken in the partnership, two goals are identified. The first is to build a data warehouse to store all available training, competition, injury and illness data generated by a professional Norwegian football team. The data warehouse serves as a unified representation of the club's data, providing data for both the second goal of this thesis as well as future research. The second goal is to conduct a data mining study using player workload and injury data to predict future injury.

This chapter begins in Section 6.1 with a summary of the data warehouse implemented in this thesis. It then proceeds to Section 6.2, which provides a summary of the data mining study and a brief discussion of model performance. Lastly, Section 6.3 provides directions for the future work of the project.

6.1 Data Warehouse Summary

The choice of a data warehouse is motivated by three factors: user needs, the granularity of the source data, and the update frequency of the data store. Data warehouses specialise in handling the types of query intended for this project, which are expected to be relatively complex and to access a large number of tuples. A data warehouse is also well suited to working with multiple levels of granularity using OLAP functions such as roll-up and drill-down. The GPS data for this project is exported at a granularity of one-second intervals, providing the opportunity for multiple levels of aggregation, thus making a data warehouse a logical alternative for its storage. Update policies for data warehouses are typically less frequent than those of traditional database management systems. The data for this project is made available from CSV file exports, a process which is carried out infrequently due to its time-consuming nature. Data updates are thus infrequent, making the data well suited for a data warehouse.

The three data sources available to this project are GPS workload data, a training log completed by players after each session, and an injury/illness log completed by the club's medical staff. Workload data includes four distance features, namely total distance, V4 distance, V5 distance, and HSR distance, and two acceleration features, namely acceleration load and player load. Injury and illness data contains important information such as the date and type of injury/illness incurred by a player, as well as information about its severity, mechanism, duration, and the number of days a player is absent from training/competition. Log data summarises the difference between a player's planned and actual workloads for each session, as well as the difference between a player's planned and actual enjoyment for each session.

Data warehouse design is divided into four phases using an analysis/source-driven approach. This approach takes into account both the analysis needs of users and the data available from the underlying source systems. The process begins with the requirements specification phase, in which the data required to achieve project goals, as well as several data warehouse components, are identified. This is followed by the conceptual design phase, which uses information from the requirements phase to generate a single conceptual schema to facilitate communication between designers and users. Two modelling issues are identified in this phase. The first is many-to-many dimensions, which result from the possibility of players being injured and ill at the same time. This is resolved by splitting injury and illness into two separate dimensions. The other is that of unbalanced hierarchies, which are present in both the Injury and Illness dimensions, and is resolved by using placeholders for missing levels in each of the hierarchies. A star schema comprised of a single fact table and six dimensions is then generated in the logical design phase using a ROLAP approach. The fact table includes a foreign key for each of the six dimension tables, the six workload features represented as measures, a date-time stamp and a unique identifier which serves as the table's primary key. The date-time stamp supports precise time calculations at finer levels of granularity. It is included in the fact table, as opposed to a separate date-time dimension, to prevent dimension tables from becoming excessively large. The time hierarchy is thus represented as a combination of a Date dimension, a Session dimension and the date-time stamp in the fact table. This provides users with the ability to aggregate workload data from the finest granularity at the second level up to the level of an entire season. The other dimensions included in the schema are the Player, Training Log, Injury and Illness dimensions.

In the physical design phase, B-tree indexes are created for all primary keys as well as the session-id and player-id columns in the fact table. B-tree indexes are used as an alternative to join indexes due to there being no support for the latter in PostgreSQL. A materialised view is also created in this phase and includes workload and injury data aggregated at the session level. The data included in the materialised view is selected specifically for the analysis needs of the data mining completed in this thesis and does not take into consideration the analysis needs of future work.

6.2 Data Mining Summary

In the second half of this thesis, a data mining study is conducted using the CRISP-DM process model. With the aim of predicting future injury from workload data, five phases from this framework are carried out, namely business understanding, data understanding, data preparation, modelling, and evaluation. Four objectives are identified in the business understanding phase, all of which aim to predict future injury in professional football players using workload and injury data gathered from a single season of training and competition. The first and primary objective is to predict future injury, while the three remaining objectives investigate whether altering the definition of injury, using a subset of previously identified features, or using a process of feature elimination improves a model's prediction performance. Two definitions of injury are provided in this phase, namely non-contact injuries (NC) and non-contact injuries causing time loss (NCTL). Furthermore, all data is aggregated at the session level.


Exploratory data analysis (EDA) is performed in the data understanding phase using injury and workload data from the materialised view created in the first half of this thesis. GPS recording errors are identified, and 174 sessions are corrected using a smoothing by bin means approach. Additionally, several injuries are identified as occurring before the start of the season and are considered irrelevant for the study. Player workloads of injured and non-injured players are compared for both definitions of injury. The most notable difference is observed between the HSR distances of non-injured players and players with NC injuries. However, the differences are insignificant and do not provide any insight into possible relationships between player workloads and player injury.

The data preparation phase involves the creation of the final data set to be used for predictive modelling. It consists of 19 features and a label indicating whether a player received an injury after the current session. The 19 features consist of the six GPS workload features, six exponentially weighted moving average (EWMA) features, six mean standard deviation ratio (MSWR) features, and a previous injury feature which represents the number of injuries a player has received to date as well as the number of days since returning to training after recovering from their last injury (a small sketch of how such rolling features can be derived is given at the end of this section). Two data sets are created, one for each definition of injury, each of which contains data from 2848 sessions and 32 players. The NC and NCTL data sets contain 34 and 13 injury labels, respectively.

Four models of varying complexity and interpretability are used in the modelling phase: decision trees (DT), random forests (RF), logistic regression (LR), and support vector machines (SVM). Furthermore, each model is used in four modelling approaches which aim to achieve the objectives identified in the business understanding phase. The first of these is a naive approach, which adopts a classic train-test split. The second and third approaches make use of stratified k-fold cross validation and synthetic data generation using ADASYN to improve generalisation and account for the class imbalance present in the data set. The difference between the two approaches is that approach two makes use of the entire data set, whereas approach three uses a subset of previously identified features, namely the exponentially weighted moving average of a player's high-speed running distance (HSR_EWMA), the mean standard deviation ratio of a player's total distance (Dist_MSWR), and the exponentially weighted moving average of a previous injury feature (PI_EWMA). The fourth approach uses 30% of the data set to eliminate features which do not contribute to a model's predictive performance. Recursive feature elimination with cross validation (RFECV) is used for this process. Irrelevant features are removed from the remaining data set before using ADASYN to correct class imbalance and stratified k-fold cross validation to improve generalisation. Each approach is subjected to 1000 iterations of training and testing. Performance metrics are reported as an average of test results from all iterations.

Finally, the evaluation phase involves evaluating a model's predictive performance for each of the four approaches. Four class-independent evaluation metrics are used to assess model performance, namely precision, recall, F1 score and area under the curve (AUC). Models of the relationship between GPS workloads, a previous injury feature, and injury show limited ability to predict future injury among players from a professional Norwegian football team. Mean AUC scores are below 0.52 for all modelling approaches, indicating that injury predictions are no better than those expected by random chance. Precision scores are higher than recall scores for all modelling approaches, with the highest score of 0.65±0.17 being achieved by the feature selection approach. The inability to achieve recall scores higher than 0.02 means that all modelling approaches result in a large number of false-negative predictions, indicating that models are unable to identify the injury class. The use of different definitions of injury does not improve model performance. The previous injury feature is one of three features selected in every iteration of feature elimination, suggesting a possible association with player injury.
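As a rough illustration of the rolling workload features mentioned above, the sketch below derives EWMA- and MSWR-style features with pandas (listed in Appendix A). The data, window lengths, and the exact MSWR definition (rolling mean divided by rolling standard deviation) are illustrative assumptions, not the precise formulations used in the data preparation phase.

    import numpy as np
    import pandas as pd

    # Placeholder session-level total distance for one player.
    dist = pd.Series(np.random.default_rng(1).normal(6000, 1500, 60).clip(0),
                     name="total_distance")

    # Exponentially weighted moving average feature (span is an illustrative choice).
    dist_ewma = dist.ewm(span=6, adjust=False).mean()

    # Mean to standard deviation ratio over a rolling window (illustrative definition).
    roll = dist.rolling(window=6)
    dist_mswr = roll.mean() / roll.std()

    features = pd.DataFrame({"Dist_EWMA": dist_ewma, "Dist_MSWR": dist_mswr})
    print(features.tail())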


6.3 Future Work

As mentioned above, this thesis serves as the first work undertaken in the partnership between UIO and NIH, and numerous opportunities for the future work of the project are identified. The most important element of future work is to gather more data, which is considered essential for achieving good results in the field of injury prediction. This topic is discussed in Section 5.5.3, which highlights two approaches for attaining more data. The first is to include data from additional clubs and has the advantage of accumulating more data over a shorter period. Additionally, data from a large number of players may improve the generalisation ability of predictive models. A potential drawback to this approach is that of added complexity. Increasing the number of clubs is likely to result in heterogeneity among data sources, something which needs to be taken into consideration in the ETL process. Another approach to gathering data is to continue collecting data from the same club over multiple seasons. This approach is advantageous in that the data warehouse and ETL processes are already in place, and no additional consideration of new data formats is required. The disadvantage of this approach is that it will take many years before a substantial amount of data is collected. The two approaches are, however, not mutually exclusive and may be used together.

The data set used for predictive modelling in this thesis is limited to the inclusion of two relative workloads, namely MSWR and EWMA. There is strong evidence that spikes in acute:chronic workload ratios (ACWR) are associated with increases in team injury rates [20; 32]. ACWR workloads are not included in this study because their calculation requires 30 days of accumulated workload data. For this reason, the first 30 days since a player's first session cannot be included in the data set, resulting in a drastic reduction in the size of the data set used for modelling. ACWR workloads may, however, prove very useful for future work if the data set increases in size (a small sketch of such a calculation is given at the end of this section). Furthermore, the inclusion of physical measures of fitness, motor coordination, and neuromuscular measurements has shown promising results in the field of injury prediction [38; 2].

The data warehouse provides several opportunities for improving query performance, simplifying queries and automating ETL processes. The current ETL process is both complex and time-consuming. One of the significant constraints with the current process is that Catapult GPS data is loaded from CSV files. The generation of a CSV file is time-consuming and subject to human error. Additionally, CSV files need to be cleaned and separated by player before their data can be loaded into the warehouse. An alternative approach is to load data using an open API provided by Catapult Sports. Such an approach eliminates time-consuming and complex tasks and enables secure batch loading of the data warehouse with the update frequency being decided by users.

The introduction of injury definitions into the Injury dimension may serve to simplify queries further. The process of deciding which injuries are to be included in a given definition is a manual task involving careful inspection of each incident of injury. The lists produced from these tasks are used in the SQL queries of the data mining chapter of this thesis.
By adding columns to the Injury dimension which classify an injury according to definitions specific to analysis tasks, SQL queries can be simplified.

Another improvement which needs to be considered as the data warehouse increases in size is that of physical modelling. This topic is briefly discussed in this thesis but requires additional work to ensure optimum performance for larger amounts of data. Firstly, indexing techniques such as join indexes are not supported by PostgreSQL but may be implemented manually by data warehouse designers. Additionally, BRIN indexes may be considered for the Injury dimension. Further performance considerations include partitioning and view maintenance.
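As an illustration of the ACWR workloads mentioned earlier in this section, the sketch below computes an EWMA-based acute:chronic workload ratio with pandas (listed in Appendix A). The 7- and 28-day spans are common choices from the literature, not values prescribed by this thesis, and the daily loads are synthetic placeholders.

    import numpy as np
    import pandas as pd

    # Synthetic daily total-distance loads for one player (placeholder data).
    days = pd.date_range("2019-01-01", periods=120, freq="D")
    load = pd.Series(np.random.default_rng(0).normal(6000, 1500, len(days)).clip(0),
                     index=days, name="total_distance")

    # Acute (short-window) and chronic (long-window) exponentially weighted averages.
    acute = load.ewm(span=7, adjust=False).mean()
    chronic = load.ewm(span=28, adjust=False).mean()
    acwr = (acute / chronic).rename("ACWR")
    print(acwr.tail())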


Bibliography

[1] L. Anderson, T. Triplett-McBride, C. Foster, S. Doberstein, and G. Brice. Impact of training patterns on incidence of illness and injury during a women's collegiate basketball season. Journal of Strength and Conditioning Research / National Strength and Conditioning Association, 17:734–8, 12 2003.

[2] F. Ayala, A. Lopez-Valenciano, J. Martín, M. De Ste Croix, F. Vera-Garcia, M. García-Vaquero,I. Ruiz-Pérez, and G. Myer. A preventive model for hamstring injuries in professional soccer:Learning algorithms. International Journal of Sports Medicine, 40, 03 2019.

[3] D. Berrar. Cross-Validation. 01 2018.

[4] M. S. Brink, C. Visscher, S. Arends, J. Zwerver, W. J. Post, and K. A. Lemmink. Monitoring stressand recovery: new insights for the prevention of injuries and illnesses in elite youth soccer players.British Journal of Sports Medicine, 44(11):809–815, 2010.

[5] D. Carey, K.-L. Ong, R. Whiteley, K. Crossley, J. Crow, and M. Morris. Predictive modelling of training loads and injury in australian football. International Journal of Computer Science in Sport, 17, 06 2017.

[6] J. Claudino, D. Capanema, T. Souza, J. Serrão, A. Machado Pereira, and G. Nassis. Currentapproaches to the use of artificial intelligence for injury risk assessment and performance predictionin team sports: a systematic review. Sports Medicine - Open, 5(1):1–12, 2019.

[7] T. Eckard, D. Padua, D. Hearn, B. Pexa, and B. Frank. The relationship between training load andinjury in athletes: A systematic review. Sports Medicine, 48:1–33, 06 2018.

[8] F. Ehrmann, C. Duncan, D. Sindhusake, W. Franzsen, and D. Greene. Gps and injury prevention inprofessional soccer. Journal of strength and conditioning research / National Strength Conditioning

Association, 30, 07 2015.

[9] J. Ekstrand, M. Hägglund, and M. Waldén. Injury incidence and injury patterns in professionalfootball: the uefa injury study. British Journal of Sports Medicine, 45(7):553–558, 2011.

[10] C. G. Weng and J. Poon. A new evaluation measure for imbalanced datasets. volume 87, pages27–32, 01 2008.

[11] T. Gabbett. Reductions in pre-season training loads reduce training injury rates in rugby leagueplayers. British journal of sports medicine, 38:743–9, 12 2004.

[12] T. Gabbett. The training-injury prevention paradox: Should athletes be training smarter and harder?British journal of sports medicine, 50, 01 2016.


[13] T. Gabbett and S. Ullah. Relationship between running loads and soft-tissue injury in elite team sportathletes. Journal of strength and conditioning research / National Strength Conditioning Association,26:953–60, 02 2012.

[14] M. Hägglund, M. Waldén, R. Bahr, and J. Ekstrand. Methods for epidemiological study of injuriesto professional football players: developing the uefa model. British Journal of Sports Medicine,39(6):340–346, 2005.

[15] M. Hägglund, M. Waldén, H. Magnusson, K. Kristenson, H. Bengtsson, and J. Ekstrand. Injuriesaffect team performance negatively in professional football: an 11-year follow-up of the uefachampions league injury study. British Journal of Sports Medicine, 47(12):738–742, 2013.

[16] S. Halson. Monitoring training load to understand fatigue in athletes. Sports medicine (Auckland,

N.Z.), 44, 09 2014.

[17] J. Han, M. Kamber, and J. Pei. 1 - introduction. In J. Han, M. Kamber, and J. Pei, editors, Data

Mining (Third Edition), The Morgan Kaufmann Series in Data Management Systems, pages 1 – 38.Morgan Kaufmann, Boston, third edition edition, 2012.

[18] H. He, Y. Bai, E. A. Garcia, and S. Li. Adasyn: Adaptive synthetic sampling approach for imbalancedlearning. In IJCNN, pages 1322–1328. IEEE, 2008.

[19] H. He and E. Garcia. Learning from imbalanced data. Knowledge and Data Engineering, IEEE

Transactions on, 21:1263 – 1284, 10 2009.

[20] B. Hulin, T. Gabbett, D. Lawson, P. Caputi, and J. Sampson. The acute: Chronic workload ratiopredicts injury: High chronic workload may decrease injury risk in elite rugby league players. British

Journal of Sports Medicine, 0:1–7, 10 2015.

[21] M. Hägglund, M. Waldén, and J. Ekstrand. Previous injury as a risk factor for injury in elite football:A prospective study over two consecutive seasons. British journal of sports medicine, 40:767–72, 092006.

[22] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: With

Applications in R. Springer Publishing Company, Incorporated, 2014.

[23] A. Jaspers, T. Beéck, M. Brink, W. Frencken, F. Staes, J. Davis, and W. Helsen. Relationshipsbetween the external and internal training load in professional soccer: What can we learn frommachine learning? International Journal of Sports Physiology and Performance, 13:1–18, 12 2017.

[24] M. Jordan and T. Mitchell. Machine learning: Trends, perspectives, and prospects. Science (New

York, N.Y.), 349:255–60, 07 2015.

[25] C. Julien. What is acceleration load?, 2019.

[26] C. Julien. What is playerload?, 2019.

[27] R. Kimball. Latest thinking on time dimension tables, 2004.

[28] D. Kirkendall and J. Dvorak. Effective injury prevention in soccer. The Physician and sportsmedicine,38:147–57, 04 2010.


[29] M. Kuhn and K. Johnson. Feature Engineering and Selection: A Practical Approach for Predictive

Models. Chapman and Hall/CRC, 2019.

[30] E. Lehmann and G. Schulze. What does it take to be a star? – the role of performance and the mediafor german soccer players. Applied Economics Quarterly (formerly: Konjunkturpolitik), 54:59–70,02 2008.

[31] J. M. Lucas and M. S. Saccucci. Exponentially weighted moving average control schemes: Propertiesand enhancements. Technometrics, 32(1):1–12, 1990.

[32] S. Malone, A. Owen, M. Newton, B. Mendes, T. Gabbett, and K. Collins. The acute:chonic workloadratio in relation to injury risk in professional soccer. Journal of Science and Medicine in Sport, 20,11 2016.

[33] G. Menardi and N. Torelli. Training and assessing classification rules with unbalanced data. Data

Mining and Knowledge Discovery, 01 2012.

[34] D. Murphy, D. Connolly, and B. Beynnon. Risk factors for lower extremity injury: A review of theliterature. British journal of sports medicine, 37:13–29, 03 2003.

[35] N. B. Murray, T. J. Gabbett, A. D. Townshend, and P. Blanch. Calculating acute:chronic workloadratios using exponentially weighted moving averages provides a more sensitive indicator of injurylikelihood than rolling averages. British Journal of Sports Medicine, 51(9):749–754, 2017.

[36] D. Pfirrmann, M. Herbst, P. Ingelfinger, P. Simon, and S. Tug. Analysis of injury incidences in maleprofessional adult and elite youth soccer players: A systematic review. Journal of Athletic Training,51(5):410–424, 2016. PMID: 27244125.

[37] S. W. Roberts. Control chart tests based on geometric moving averages. Technometrics, 1(3):239–250,1959.

[38] N. Rommers, R. Rössler, E. Verhagen, F. Vandecasteele, S. Verstockt, R. Vaeyens, M. Lenoir, andE. D’Hondt. A machine learning approach to assess injury risk in elite youth football players.Medicine Science in Sports Exercise, page 1, 02 2020.

[39] A. Rossi, L. Pappalardo, P. Cintia, F. M. Iaia, J. Fernãndez, and D. Medina. Effective injury forecasting in soccer with gps training data and machine learning. PLOS ONE, 13(7):1–15, 07 2018.

[40] J. Ruddy, A. Shield, N. Maniar, M. Williams, S. Duhig, R. Timmins, J. Hickey, M. Bourne, andD. Opar. Predictive modeling of hamstring strain injuries in elite australian footballers. Medicine

Science in Sports Exercise, 50:1, 12 2017.

[41] T. Scott, C. R. Black, J. Quinn, and A. Coutts. Validity and reliability of the session-rpe method forquantifying training in australian football: A comparison of the cr10 and cr100 scales. Journal of

strength and conditioning research / National Strength Conditioning Association, 27, 03 2012.

[42] T. Soligard, M. Schwellnus, J.-M. Alonso, R. Bahr, B. Clarsen, H. P. Dijkstra, T. Gabbett, M. Gleeson,M. Hägglund, M. R. Hutchinson, C. Janse van Rensburg, K. M. Khan, R. Meeusen, J. W. Orchard,B. M. Pluim, M. Raftery, R. Budgett, and L. Engebretsen. How much is too much? (part 1)international olympic committee consensus statement on load in sport and risk of injury. British

Journal of Sports Medicine, 50(17):1030–1041, 2016.


[43] M. Stein, H. Janetzko, D. Seebacher, A. Jäger, M. Nagel, J. Hölsch, S. Kosub, T. Schreck, D. A. Keim,and M. Grossniklaus. How to make sense of team sport data: From acquisition to data modeling andresearch aspects. Data, 2(1), 2017.

[44] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson Education, 2006.

[45] H. Thornton, J. Delaney, G. Duthie, and B. Dascombe. Importance of various training load measureson injury incidence of professional rugby league athletes. International Journal of Sports Physiology

and Performance, 12, 10 2016.

[46] A. Vaisman and E. Zimányi. Data Warehouse Systems: Design and Implementation. Springer,Heidelberg, 2014.

[47] P. Vassiliadis and A. Simitsis. Extraction, transformation, and loading, 2009.

[48] L. Wallace, K. Slattery, and A. Coutts. The ecological validity and application of the session-rpemethod for quantifying training loads in swimming. Journal of strength and conditioning research /

National Strength Conditioning Association, 23:33–8, 11 2008.

[49] S. Williams, S. West, M. J. Cross, and K. A. Stokes. Better way to determine the acute:chronicworkload ratio? British Journal of Sports Medicine, 51(3):209–210, 2017.

[50] R. Wirth and J. Hipp. Crisp-dm: Towards a standard process model for data mining. Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, 01 2000.

[51] G. Wu and E. Chang. Class-boundary alignment for imbalanced dataset learning. ICML 2003

Workshop on Learning from Imbalanced Data Sets, 01 2003.


Appendices


Appendix A

Source Code

The source code for this thesis can be found at: https://github.uio.no/garthft/masters-thesis. This includes code for the creation of the data warehouse used in this thesis, code for the ETL processes required to load data into the warehouse, and code for the data mining study conducted in this thesis.

The data warehouse is physically implemented using PostgreSQL 12.2. ETL processes are carried out using Java 13.0.2 and Python 3.7.6. Java is primarily used for cleaning tasks which involve parsing CSV files. Loading tasks, however, are executed using Python together with the PostgreSQL database adapter Psycopg2. Data modelling is carried out using Scikit-learn, a machine learning library for Python. Additionally, data visualisation, data processing, and reporting make use of the following Python libraries: Pandas, NumPy, Imblearn, Matplotlib, and Seaborn.


Appendix B

MultiDim Model for Data Warehouses

Figure B.1: Dimension level

Figure B.2: Fact table


Figure B.3: Cardinalities

Figure B.4: Dimension types

Figure B.5: Balanced hierarchy


Figure B.6: Ragged hierarchy


Appendix C

BPMN Notation

Figure C.1: General notation


Figure C.2: More general notation

Figure C.3: Events

Figure C.4: Gateways
