Harness Racing and SAS
Harvard Stats 135 midterm project evaluating SAS techniques.
![Page 1: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/1.jpg)
HARNESS RACING AND SAS
USING SAS TO MODEL HORSE RACES
![Page 2: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/2.jpg)
• “Past Performance” from TrackMaster for races September 26, 2013 at Yonkers Raceway
• Published in advance of the race
• Cost: $1.50
• Comes in XML format – parsed using python
• Contains 10 most recent PPs for each horse racing that day
• 12 races x 8 horses x 10 past performances = 960 records
• Variables of use: Lengths back at each quarter, final time, lead final time, gait, age (meta), track condition, track name, track length
• Created race-level, horse-race-level, and longitudinal data sets for different aspects of this analysis
DATA SET
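A minimal sketch of the parsing step, assuming a made-up XML layout. The element and attribute names (`race`, `horse`, `pp`, `final_time`, `gait`, `condition`) and the sample values are hypothetical; the real TrackMaster schema is not shown in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout and made-up values, for illustration only.
SAMPLE_XML = """
<card track="YR" date="2013-09-26">
  <race number="1">
    <horse name="Horse A">
      <pp final_time="118.2" gait="P" condition="FT"/>
      <pp final_time="119.0" gait="P" condition="GD"/>
    </horse>
    <horse name="Horse B">
      <pp final_time="121.4" gait="P" condition="FT"/>
    </horse>
  </race>
</card>
"""


def flatten_past_performances(xml_text):
    """Return one dict per past-performance line (horse-race level)."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for race in root.iter("race"):
        for horse in race.iter("horse"):
            for pp in horse.iter("pp"):
                records.append({
                    "race": int(race.get("number")),
                    "horse": horse.get("name"),
                    "final_time": float(pp.get("final_time")),
                    "gait": pp.get("gait"),
                    "condition": pp.get("condition"),
                })
    return records


records = flatten_past_performances(SAMPLE_XML)
print(len(records), "past-performance records")
```

A flat list like this can be written to CSV and read into SAS, where the race-level and longitudinal data sets are derived.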
![Page 3: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/3.jpg)
GAIT AND CONDITION
• Hypothesis: Gait and track condition influence race time
• Gait
  • Binary: Pacers and Trotters
  • Each race is one or the other
  • Each horse is one or the other
• Condition
  • Categorical: Fast, Good, or Sloppy
  • Each race categorized into one
• Created and cleaned race-level data set
• Comparing group means showed that mean race time differs across both variables
• T-test showed these differences are statistically significant
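To illustrate the kind of comparison a two-sample t-test makes, here is a small sketch of the unequal-variance (Welch) t statistic. The race times are made up for illustration, and the slides do not say whether a pooled or unequal-variance test was used. Because condition has three levels, a t-test applies to it one pair of levels at a time.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    va, vb = variance(a), variance(b)      # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb                # squared standard error of the difference
    t = (mean(a) - mean(b)) / sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Made-up final times in seconds, NOT values from the project data.
pacer_times = [116.0, 117.0, 118.0, 117.0]
trotter_times = [120.0, 121.0, 122.0, 121.0]

t_stat, dof = welch_t(pacer_times, trotter_times)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

A large absolute t value relative to its degrees of freedom is what makes a difference in means statistically significant.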
![Page 4: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/4.jpg)
REMOVING OUTLIERS
![Page 5: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/5.jpg)
REMOVING OUTLIERS
![Page 6: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/6.jpg)
GAIT T-TEST
![Page 7: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/7.jpg)
CONDITION T-TEST
![Page 8: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/8.jpg)
CORRELATION: LENGTHS BACK AT CALLS
• Some horses pull away early; others seem to wait for the last quarter to go to the front
• TrackMaster reports lengths back from the lead at the call for each quarter
• Lengths are recorded as fractional numbers (to the quarter) and as parts of a horse
  • Nose
  • Head
  • Neck
• Additional complication: “costly breaks” of pace and disqualification
• Still not happy – winners show strange lengths-back values at the final call
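Mixed margins like these have to become numbers before they can be correlated. A hedged sketch of one way to do it follows. The numeric values given to nose, head, and neck are illustrative guesses rather than TrackMaster definitions, and the input strings such as `"2 1/4"` are an assumed format.

```python
# Assumed conversions from margin words to lengths (illustrative only).
MARGIN_WORDS = {"nose": 0.05, "head": 0.10, "neck": 0.25}


def lengths_back(raw):
    """Convert a margin such as '2 1/4', '3', '1/2' or 'neck' to a float."""
    text = raw.strip().lower()
    if text in MARGIN_WORDS:
        return MARGIN_WORDS[text]
    total = 0.0
    for part in text.split():
        if "/" in part:                    # fractional part, e.g. '1/4'
            numerator, denominator = part.split("/")
            total += int(numerator) / int(denominator)
        else:                              # whole lengths, e.g. '2'
            total += float(part)
    return total


for margin in ["2 1/4", "3", "1/2", "Neck"]:
    print(margin, "->", lengths_back(margin))
```

Breaks of pace and disqualifications would still need their own flag, since no margin value represents them sensibly.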
![Page 9: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/9.jpg)
CORRELATION OF LENGTHS BACK BY QUARTER
![Page 10: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/10.jpg)
CORRELATION OF LENGTHS BACK BY QUARTER
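For reference, the statistic behind a correlation table of this kind is the Pearson coefficient. This sketch computes it for two calls, using made-up lengths-back values that are not taken from the project data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up lengths back at the first quarter and at the finish.
first_quarter = [0.0, 1.5, 3.0, 4.0, 6.5]
finish = [2.0, 0.0, 5.25, 3.0, 9.0]

print(round(pearson(first_quarter, finish), 3))
```

A coefficient near 1 would mean early position largely carries through to the finish; a value near 0 would fit horses that wait for the last quarter.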
![Page 11: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/11.jpg)
• Goal: Quantify how much horses slow down with age
• Merged metadata for each horse with past performance data
• Single-variable regression analysis of mean data set
• Found that age is not a great predictor of speed
• Age: Discrete, yet not categorical
AGE AND SPEED
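A single-variable regression of this kind fits a line of time on age and reports how much variance the line explains. The sketch below uses ordinary least squares on made-up ages and times, so the numbers it prints say nothing about the actual horses.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x, with R^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up ages (years) and mean final times (seconds).
ages = [3, 4, 5, 6, 7, 8]
mean_times = [118.4, 117.9, 119.2, 118.1, 119.0, 118.6]

slope, intercept, r_squared = fit_line(ages, mean_times)
print(f"slope={slope:.3f} intercept={intercept:.2f} R^2={r_squared:.3f}")
```

Treating age as a numeric regressor, as here, assumes each extra year changes time by the same amount, which is the trade-off behind "discrete, yet not categorical".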
![Page 12: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/12.jpg)
• Longitudinal data set
• Created dummy variables for past and present track conditions, gaits, and track sizes
• Used SAS’s LAG and LAST features
• Removed disqualified races
• Modeled race time based on current race conditions and two races prior
MULTIVARIATE REGRESSION
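The lagging step can be pictured outside SAS as well. This Python sketch attaches each horse's previous two final times to its current race and never borrows a value from a different horse. In SAS, LAG does not reset at BY-group boundaries on its own, so the DATA step has to handle that case itself. The field names and values are hypothetical.

```python
def add_lags(rows, n_lags=2):
    """Add lag1..lagN of final_time within each horse, ordered by date."""
    history = {}                           # horse -> earlier final times
    out = []
    for row in sorted(rows, key=lambda r: (r["horse"], r["date"])):
        past = history.setdefault(row["horse"], [])
        new_row = dict(row)
        for k in range(1, n_lags + 1):
            # None marks "no such earlier race for this horse"
            new_row[f"lag{k}_final_time"] = past[-k] if len(past) >= k else None
        past.append(row["final_time"])
        out.append(new_row)
    return out


# Made-up longitudinal records (ISO dates sort correctly as strings).
rows = [
    {"horse": "A", "date": "2013-09-15", "final_time": 120.0},
    {"horse": "A", "date": "2013-09-01", "final_time": 118.0},
    {"horse": "B", "date": "2013-09-08", "final_time": 121.0},
    {"horse": "A", "date": "2013-09-08", "final_time": 119.0},
]

for r in add_lags(rows):
    print(r)
```

Rows whose lags are missing cannot enter a model that uses two prior races, which shrinks the usable sample below the raw record count.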
![Page 13: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/13.jpg)
| Label | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
| --- | --- | --- | --- | --- |
| Intercept | 104.67788 | 4.81142 | 21.76 | <.0001 |
| Lag final time | 0.01412 | 0.03120 | 0.45 | 0.6510 |
| Lag2 final time | 0.11361 | 0.02975 | 3.82 | 0.0001 |
| Pacer | -3.68185 | 0.21247 | -17.33 | <.0001 |
| Fast | -0.77005 | 0.38954 | -1.98 | 0.0484 |
| Sloppy | 0.86942 | 0.43605 | 1.99 | 0.0465 |
| Age | 0.05312 | 0.04023 | 1.32 | 0.1871 |
| 5/8 Track | -2.74052 | 0.20313 | -13.49 | <.0001 |
| 1 Track | -3.18411 | 0.47824 | -6.66 | <.0001 |
MULTIVARIATE REGRESSION
| Label | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
| --- | --- | --- | --- | --- |
| Fast lag | 0.35883 | 0.38598 | 0.93 | 0.3528 |
| Sloppy lag | 0.48532 | 0.43151 | 1.12 | 0.2610 |
| Fast lag2 | 0.09472 | 0.37245 | 0.25 | 0.7993 |
| Sloppy lag2 | -0.39904 | 0.42068 | -0.95 | 0.3431 |
| 5/8 Track lag | 0.14639 | 0.23680 | 0.62 | 0.5366 |
| 1 Track lag | 0.40192 | 0.51792 | 0.78 | 0.4379 |
| 5/8 track lag2 | 0.58564 | 0.21764 | 2.69 | 0.0073 |
| 1 track lag2 | 0.67260 | 0.49172 | 1.37 | 0.1717 |
Variables of Interest Control Variables
Final times from previous races are not great determinants of the final time in this race!
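As a worked example of reading these estimates, the sketch below plugs a hypothetical horse into the fitted equation. It assumes both coefficient tables belong to one model, and that the omitted dummy categories (trotter, "good" condition, half-mile track) are the baselines. The dictionary keys are illustrative names, not the SAS variable names.

```python
# Coefficients copied from the regression output on this slide.
COEFFICIENTS = {
    "intercept": 104.67788,
    "lag_final_time": 0.01412,
    "lag2_final_time": 0.11361,
    "pacer": -3.68185,
    "fast": -0.77005,
    "sloppy": 0.86942,
    "age": 0.05312,
    "track_5_8": -2.74052,
    "track_1": -3.18411,
    "fast_lag": 0.35883,
    "sloppy_lag": 0.48532,
    "fast_lag2": 0.09472,
    "sloppy_lag2": -0.39904,
    "track_5_8_lag": 0.14639,
    "track_1_lag": 0.40192,
    "track_5_8_lag2": 0.58564,
    "track_1_lag2": 0.67260,
}


def predict_final_time(features):
    """Intercept plus the sum of coefficient * value over supplied features."""
    total = COEFFICIENTS["intercept"]
    for name, value in features.items():
        total += COEFFICIENTS[name] * value
    return total


# Hypothetical horse: a 5-year-old pacer on a fast baseline-size track,
# whose last two races were also on fast baseline-size tracks.
example_horse = {
    "lag_final_time": 118.0,
    "lag2_final_time": 119.0,
    "pacer": 1,
    "fast": 1,
    "age": 5,
    "fast_lag": 1,
    "fast_lag2": 1,
}

print(round(predict_final_time(example_horse), 2))
```

Changing the previous final time by a full second moves the prediction by only about 0.014, which is one way to see why the lagged times carry so little weight.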
![Page 14: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/14.jpg)
Predicting the Winner
Chart legend: Right, Wrong
• Used the coefficients from my multivariate regression and most recent two races for each horse
• Ranked horses by predicted race values
• My bets weren’t great, but they were better than choosing at random!
• Reason: Very low variance in race times among horses. The model lacks the predictive power to separate them, even with R^2 > 0.5
PREDICTION OF SEPTEMBER 26 RACES
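The ranking step and the "better than random" comparison can be sketched as follows, assuming the horse with the lowest predicted time is the pick. The predicted times and winners are made up; with 8-horse fields, a random pick is right 1 time in 8.

```python
def pick_winner(predicted_times):
    """Return the horse with the lowest predicted final time."""
    return min(predicted_times, key=predicted_times.get)


def hit_rate(races):
    """Share of races where the pick matched the actual winner."""
    hits = sum(pick_winner(predicted) == winner for predicted, winner in races)
    return hits / len(races)


def random_baseline(races):
    """Expected hit rate when picking one horse uniformly at random."""
    return sum(1 / len(predicted) for predicted, _ in races) / len(races)


# Two made-up 8-horse races: (predicted times by horse, actual winner).
races = [
    ({f"horse_{i}": 118.0 + 0.1 * i for i in range(1, 9)}, "horse_1"),
    ({f"horse_{i}": 120.0 - 0.1 * i for i in range(1, 9)}, "horse_3"),
]

print("hit rate:", hit_rate(races))
print("random baseline:", random_baseline(races))
```

When predicted times within a field differ by only tenths of a second, small model errors are enough to reorder the ranking, which fits the low-variance explanation above.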
![Page 15: Harness Racing and SAS](https://reader036.vdocuments.mx/reader036/viewer/2022081413/548cd13db47959bb678b4593/html5/thumbnails/15.jpg)
• SAS’s LAG and LAST features are great for dealing with longitudinal data
• Most work was on the DATA steps, not the PROC steps
• My model was based on only 960 occurrences, 96 horses
• With more data, might model Pacers and Trotters separately, Conditions separately
• Still want to investigate lengths back for winning horses
• Learned much about SAS and about harness racing
FINAL THOUGHTS