

Source: cs229.stanford.edu/proj2017/final-posters/5147767.pdf

Because It's the Cup: Predicting the Stanley Cup Playoffs
Mason Swofford, Shuvam Chakraborty, Vineet Kosaraju

The Stanley Cup playoffs have long been known for their drama: most games are close, upsets are common, and teams not considered among the best can win the Cup. However, current predictions are based on traditional NHL statistics, which are not indicative of success because outcomes are heavily influenced by luck. Our project has two main goals: 1) predict regular season and playoff game results, and 2) construct a gambling agent to optimize returns.

Background Features Models Results & Error Analysis

Data Collection

Two main datasets:

• Regular season & playoff games

• Training set amalgamated from data sources #2 (stats: features) and #3 (game results: labels)

• Each training example is 1 game, label = winning team (0 = away, 1 = home)

• Features for each training example consist of team statistics averaged over past games in the season (see the feature tables)
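The feature construction described above can be sketched as a running per-season average. This is a minimal illustration, not the authors' pipeline: the dictionary keys (`season`, `home`, `away`, `home_stats`, `away_stats`, `home_win`) are hypothetical, and the choice to skip games where a team has no earlier games in the season is an assumption.

```python
from collections import defaultdict

def build_examples(games, stat_names):
    """Turn a chronological list of games into (features, label) pairs.

    Features are each team's stats averaged over its earlier games in the
    same season; the label is 0 for an away win and 1 for a home win.
    """
    totals = defaultdict(lambda: defaultdict(float))  # (season, team) -> stat sums
    counts = defaultdict(int)                         # (season, team) -> games played
    examples = []
    for g in games:
        keys = {side: (g["season"], g[side]) for side in ("home", "away")}
        # Only emit an example once both teams have some history (assumption).
        if all(counts[k] > 0 for k in keys.values()):
            features = []
            for side in ("away", "home"):
                k = keys[side]
                features += [totals[k][s] / counts[k] for s in stat_names]
            examples.append((features, 1 if g["home_win"] else 0))
        # Update running sums after building the example, so a game's own
        # stats never leak into its features.
        for side in ("home", "away"):
            k = keys[side]
            for s in stat_names:
                totals[k][s] += g[side + "_stats"][s]
            counts[k] += 1
    return examples

# Tiny made-up schedule, for illustration only.
demo_games = [
    {"season": 2017, "home": "A", "away": "B", "home_win": True,
     "home_stats": {"CF": 50, "GF": 3}, "away_stats": {"CF": 40, "GF": 1}},
    {"season": 2017, "home": "B", "away": "A", "home_win": False,
     "home_stats": {"CF": 45, "GF": 2}, "away_stats": {"CF": 55, "GF": 4}},
]
demo_examples = build_examples(demo_games, ["CF", "GF"])
```

The first game yields no example because neither team has prior games; the second uses only the stats from the first.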

“Basic” Features

Feature  Description
CF       Corsi For: shot attempts for a team, including blocked shots and shots not on goal
CA       Corsi Against: shot attempts against a team, including blocked shots and shots not on goal
GF       Goals For
GA       Goals Against
xGF      Expected Goals For, based on quality of shot attempts for
xGA      Expected Goals Against, based on quality of shot attempts against
PENT     Penalties Taken
PEND     Penalties Drawn

“Advanced” Features

Feature  Description
PDO      Shooting% + Save% (rough measure of luck)
FF       Fenwick For (unblocked shot attempts)
FA       Fenwick Against
SF       Shots For (on goal)
SA       Shots Against
xPDO     Expected PDO
dPDO     PDO difference
OZS      Offensive Zone Starts
DZS      Defensive Zone Starts
NZS      Neutral Zone Starts
ZSR      Zone Start Ratio
FOW      Faceoffs Won
FOL      Faceoffs Lost
GVA      Giveaways
TKA      Takeaways
HF       Hits For
HA       Hits Against
% Win    Winning Percentage
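To make the shot-based definitions concrete, the sketch below derives Corsi For, Fenwick For, and PDO from raw counts. The function and argument names are hypothetical; it assumes `missed_for` and `blocked_for` count the team's own attempts that missed the net or were blocked, and that goals are included in shots on goal.

```python
def shot_metrics(goals_for, goals_against, shots_for, shots_against,
                 missed_for, blocked_for):
    """Derive CF, FF and PDO for one team from raw counts."""
    fenwick_for = shots_for + missed_for      # unblocked shot attempts
    corsi_for = fenwick_for + blocked_for     # all shot attempts
    shooting_pct = goals_for / shots_for
    save_pct = 1 - goals_against / shots_against
    return {"CF": corsi_for, "FF": fenwick_for, "PDO": shooting_pct + save_pct}

# Made-up counts, for illustration only.
demo_metrics = shot_metrics(goals_for=3, goals_against=2, shots_for=30,
                            shots_against=25, missed_for=10, blocked_for=12)
```

A PDO near 1.0 is what the table means by a "rough measure of luck": values well above or below it tend not to persist.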

Regarding Goal 1, classification models attempted include:

• Logistic, softmax regression

• SVM (rbf, linear, poly, sigmoid)

• ANNs (varying hidden layers, activation functions)

Features were chosen using basic feature selection and PCA. Whether team A wins a playoff series is predicted with a binomial distribution, where p is the probability that A wins a single game.
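The binomial step can be illustrated as follows. This is a sketch under stated assumptions, not the poster's exact formula: it assumes a best-of-seven series, independent games, and a constant per-game win probability p (no home-ice adjustment). Summing over all seven hypothetical games gives the same answer as stopping once a team has four wins.

```python
from math import comb

def series_win_probability(p, best_of=7):
    """Probability that a team with per-game win probability p wins the series."""
    wins_needed = best_of // 2 + 1
    return sum(comb(best_of, k) * p**k * (1 - p)**(best_of - k)
               for k in range(wins_needed, best_of + 1))
```

Under these assumptions a series amplifies a small per-game edge, so `series_win_probability(0.55)` comes out larger than 0.55.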

Regarding Goal 2, the gambling problem was formulated as a Markov Decision Process.

• State: (currentMoney, game). Start state: (initialMoney, 0).

• Action: (money, team). Can bet up to current money on Home/Away; betting amounts are discretized.

• T(s, a, s’): Probabilities of transitions are given by our ML model.

• isEnd(s): True if we run out of money or have reached the last game.

• R(s, a, s’): 1 if we have reached an end state with at least Desired Amount, and 0 otherwise.

• Discount: Set to 1.

User Parameters:

• Payoff: A number greater than 1 that corresponds to how much you get back for each dollar bet.

• Bucket Size: Discretization size for betting.

• Desired Amount: Minimum money we want to finish with.
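Because the reward is 1 only when the agent ends with at least Desired Amount and the discount is 1, the value of a state is the probability of reaching that target. The sketch below solves the MDP by memoized recursion over (money, game). The function name and signature are hypothetical, `p_home` stands in for the ML model's per-game home-win probabilities, and it assumes a winning bet returns `bet * payoff` including the stake while a losing bet forfeits the stake.

```python
from functools import lru_cache

def solve_betting_mdp(p_home, initial_money, desired, payoff=2.0, bucket=100):
    """Return (probability of finishing with >= desired, best first action)."""
    n_games = len(p_home)

    @lru_cache(maxsize=None)
    def value(money, game):
        # End states: out of money, or no games left.
        if money <= 0 or game == n_games:
            return (1.0 if money >= desired else 0.0), None
        best = (-1.0, None)
        bet = 0
        while bet <= money:  # bets are multiples of the bucket size
            for team in ("home", "away"):
                p_win = p_home[game] if team == "home" else 1 - p_home[game]
                win_money = round(money - bet + bet * payoff, 2)
                lose_money = round(money - bet, 2)
                v = (p_win * value(win_money, game + 1)[0]
                     + (1 - p_win) * value(lose_money, game + 1)[0])
                if v > best[0]:
                    best = (v, (bet, team))
            bet += bucket
        return best

    return value(initial_money, 0)
```

For example, with one game left, $100 in hand, a $200 target, and a 60% home-win probability at payoff 2, the only way to succeed is to bet everything on the home team, which succeeds with probability 0.6.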

[Figure 1: Game Result Prediction Accuracy using Basic Features. Training Set and Validation Set accuracy (%, axis from 44 to 62) for Baseline, Logistic, Softmax, SVM (Rbf), SVM (Poly), SVM (Sigmoid), SVM (Linear).]

[Figure 2: Game Result Prediction Accuracy using Basic + Advanced Features. Training Set and Validation Set accuracy (%, axis from 44 to 62) for Baseline, Logistic, SVM (Rbf), SVM (Linear), ANN (h=5, relu), ANN (h=15, logistic), ANN (h=5/10, tanh), ANN (h=5/10, identity).]

Figure 1, 2: Training and validation accuracies reported using 10-fold cross validation. For the best model, the accuracy on the test set of playoffs was 54.66% (for reference, ESPN experts were ~51% accurate).
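For readers unfamiliar with the procedure, 10-fold cross validation can be sketched as below. The poster does not say how folds were formed, so the shuffled split is an assumption, and the majority-class model is only a placeholder for the classifiers listed earlier; all function names are hypothetical.

```python
import random

def k_fold_splits(n_examples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation

def cross_validate(X, y, fit, predict, k=10):
    """Average training and validation accuracy over k folds."""
    def accuracy(model, rows):
        return sum(predict(model, X[i]) == y[i] for i in rows) / len(rows)

    train_scores, val_scores = [], []
    for train, validation in k_fold_splits(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        train_scores.append(accuracy(model, train))
        val_scores.append(accuracy(model, validation))
    return sum(train_scores) / k, sum(val_scores) / k

# Placeholder model: always predict the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model
```

Any model can be plugged in through the `fit` and `predict` callables, which keeps the evaluation loop independent of the classifier.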

Ablative analysis of basic features demonstrated that more advanced ones were needed; however, even advanced features didn’t help. The literature mentions a theoretical limit for predicting the result of a single hockey game due to luck/variability. This limit of 60-63% was confirmed in a Monte Carlo simulation running 1000 trials, suggesting that individual games can’t be directly predicted (see Figure 3). This model is similar to those used in the NFL.
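One common way to run such a simulation is to play out seasons in which every game is a coin flip and compare the resulting spread of win percentages with the spread actually observed. The sketch below follows that idea and is not necessarily the authors' exact procedure: it assumes 30 teams playing 82 games each, simulates teams independently rather than pairing opponents, and takes the observed standard deviation as an input because the poster does not state it. Function names are hypothetical.

```python
import random
import statistics

def luck_sd(n_teams=30, n_games=82, trials=1000, seed=0):
    """Mean SD of team win percentages when every game is a coin flip."""
    rng = random.Random(seed)
    sds = []
    for _ in range(trials):
        win_pcts = [sum(rng.random() < 0.5 for _ in range(n_games)) / n_games
                    for _ in range(n_teams)]
        sds.append(statistics.pstdev(win_pcts))
    return statistics.mean(sds)

def luck_share(sd_from_luck, observed_sd):
    """Fraction of the observed variance in win percentage explained by luck."""
    return sd_from_luck ** 2 / observed_sd ** 2
```

With 82 games per team, the pure-luck SD comes out near sqrt(0.25 / 82), roughly 0.055; the closer the observed SD is to that value, the larger the share of results attributable to luck.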

Conclusions

• Hockey is a very challenging sport to predict due to the variability inherent to the sport.

• Perfect stats could allow reaching the theoretical accuracy limit, but incremental progress is needed.

• Reached 70% using an SVM on playoff data, so the model could be fine-tuned.

• Applications in other leagues (other than the NHL) or sports (baseball, etc.).

References

Pischedda, Gianni. Predicting NHL Match Outcomes with ML Models. citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.795&rep=rep1&type=pdf.

Weissbock, Joshua, et al. Use of Performance Metrics to Forecast Success in the National Hockey League. ceur-ws.org/Vol-1969/paper-06.pdf.

Data sources:

1. Playoff Results

2. Daily Team Stats

3. Betting Odds

[Figure 3 plot: Monte Carlo Simulation of Season. Series: SD from Luck, Observed SD.]

Figure 3: Results are accounted for by 73% luck, so when making predictions we can accurately predict 27% of games and guess with 50% accuracy on the remaining 73%, which gives us a 63.5% ceiling.
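The ceiling in the caption follows from weighting the two cases, taking the 73% luck share as given:

```latex
\[
\text{accuracy ceiling} = 0.27 \cdot 1 + 0.73 \cdot 0.5 = 0.27 + 0.365 = 0.635
\]
```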

[Figure: MDP: Example. Amounts shown: $200, $0, $300; changes shown: + $675, - $200, +/- $0. Total: $1475 in 3 games.]

“Basic” features: shot attempt/quality features.

“Advanced” features: include the shot attempt features and add overall team-based metrics.

[Figure: Accuracy vs. “Closeness” of Game. Series: Samples (axis from 0 to 1500) and Accuracy (axis from 0 to 0.8), plotted at closeness values 0.01, 0.02, 0.03, 0.04, 0.05, 0.075, 0.1, 0.2, 0.5.]

Team Type    Number Games  Accuracy
Both Good    1088          0.5588
Both Bad     1306          0.5628
Good & Bad   3146          0.5950

Baseline Predictions

Team Type                Accuracy
Both Good Teams          0.524
Both Bad Teams           0.547
One Good Team, One Bad   0.572