data exploration assignment ppt
Data Exploration Thoroughbred Horse Racing
N RAMACHANDRAN
Transform Qualitative Variables to Quantitative
• Qualitative variables that have a significant impact on the handle were transformed into dummy variables:
1. Race_Type: dummy_allowance, dummy_handicap, dummy_stakes, dummy_maiden, dummy_starters, dummy_claiming
2. Age Restriction: dummy_is2allowed, dummy_is3allowed, dummy_is4allowed, dummy_is5allowed, dummy_isg5allowed
3. Surface: dummy_dirt, dummy_turf
4. Track Id: dummy_AD, dummy_CD, dummy_CRC, dummy_FG
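A minimal sketch of this kind of dummy coding, written in Python with pandas rather than the tool used for the slides. The sample rows and the source column names (race_type, surface, track_id) are assumptions for illustration; only the dummy_ naming pattern comes from the slide.

```python
import pandas as pd

# Hypothetical sample rows; column names and values are assumptions.
races = pd.DataFrame({
    "race_type": ["allowance", "stakes", "claiming", "maiden"],
    "surface": ["dirt", "turf", "dirt", "dirt"],
    "track_id": ["CD", "FG", "CRC", "CD"],
})

# One 0/1 column per category level, named dummy_<level>.
dummies = pd.get_dummies(races, prefix="dummy", dtype=int)

print(sorted(dummies.columns))
```

Each original row ends up with exactly one 1 per source variable, so the dummy columns can be used directly as numeric inputs.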
Derived Variables
• Hour of race: the hour of the race in 24-hour format
• Day of race: the day of the week (1: Sunday, 7: Saturday)
• Month of race: the month of the race (1: Jan, 12: Dec)
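The three derived variables could be computed along these lines in Python with pandas. This is a sketch under assumptions: the column name race_datetime and the sample timestamps are hypothetical. Note that pandas numbers weekdays from Monday = 0, so the values are shifted to match the 1 = Sunday, 7 = Saturday convention used here.

```python
import pandas as pd

# Hypothetical race start times; the column name is an assumption.
races = pd.DataFrame({
    "race_datetime": pd.to_datetime(
        ["2004-10-26 13:05", "2004-10-30 19:40", "2004-10-31 11:15"]
    )
})

ts = races["race_datetime"].dt
races["hour_of_race"] = ts.hour                    # 0-23
# pandas: Monday=0 ... Sunday=6; shift to Sunday=1 ... Saturday=7.
races["day_of_race"] = (ts.dayofweek + 1) % 7 + 1
races["month_of_race"] = ts.month                  # 1=Jan ... 12=Dec

print(races)
```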
Summary Statistics
• No missing values. Blank entries in conditions_of_races and sex_restriction are assumed to mean that there are no conditions or restrictions.
• The PROC MEANS and PROC FREQ output is along the expected lines. Nothing unusual to report from the data.
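For readers without SAS, roughly comparable checks can be sketched in Python with pandas: describe() as a loose analogue of PROC MEANS and value_counts() as a loose analogue of PROC FREQ. The sample rows and the handle and surface column names are assumptions; sex_restriction is the field named above.

```python
import pandas as pd

# Hypothetical sample; blank sex_restriction means "no restriction".
races = pd.DataFrame({
    "handle": [120000.0, 450000.0, 98000.0, 310000.0],
    "surface": ["dirt", "turf", "dirt", "dirt"],
    "sex_restriction": ["", "F", "", ""],
})

missing_counts = races.isna().sum()                     # true missing values
blank_counts = (races["sex_restriction"] == "").sum()   # blanks, not missing
numeric_summary = races["handle"].describe()            # count, mean, std, ...
surface_freq = races["surface"].value_counts()          # frequency table

print(missing_counts, blank_counts, numeric_summary, surface_freq, sep="\n")
```

Counting blanks separately from missing values keeps the "blank means no restriction" assumption visible instead of hiding it.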
Graphical Analysis
• Compared different independent variables to the dependent variable handle and generated some charts.
Average Handle vs Day of Week
[Chart: average handle (axis 0 to 600000) by day of week (1 to 7)]
• The chart shows that the average handle peaks on Wed, Fri and Sat (Sun = 1 and Sat = 7).
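A comparison like this one reduces to a grouped mean. The sketch below uses Python with pandas and made-up numbers; the column names day_of_race and handle are assumptions. The same pattern would apply to the other "average handle vs ..." comparisons by changing the grouping column.

```python
import pandas as pd

# Made-up rows for illustration only (1 = Sunday, 7 = Saturday).
races = pd.DataFrame({
    "day_of_race": [1, 4, 4, 6, 7, 7, 2],
    "handle": [200000, 500000, 520000, 480000, 600000, 560000, 150000],
})

# Average handle per day of week, highest first.
avg_handle = (
    races.groupby("day_of_race")["handle"].mean().sort_values(ascending=False)
)
top_days = avg_handle.head(3).index.tolist()

print(avg_handle)
print(top_days)
```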
Average Handle vs Purse_USA
[Chart: average handle (axis 0 to 5000000) by Purse_USA (7000 to 400000)]
• There is a steep increase in the handle when the total prize money (Purse_USA) rises above 125000.
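One way to check a threshold effect like this is to band the purse and compare the average handle per band. The sketch below uses Python with pandas; the data, the column names and the lower band edge of 50000 are assumptions, while the 125000 cut-off is the one quoted above.

```python
import pandas as pd

# Made-up rows for illustration only.
races = pd.DataFrame({
    "purse_usa": [20000, 45000, 80000, 110000, 150000, 400000],
    "handle": [150000, 180000, 220000, 260000, 900000, 4200000],
})

# Right-closed bands: (0, 50000], (50000, 125000], (125000, inf).
bins = [0, 50000, 125000, float("inf")]
labels = ["<=50k", "50k-125k", ">125k"]
races["purse_band"] = pd.cut(races["purse_usa"], bins=bins, labels=labels)

avg_by_band = races.groupby("purse_band", observed=True)["handle"].mean()
print(avg_by_band)
```

A jump between the last two bands that is much larger than the jump between the first two would be consistent with the steep increase described.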
Average Handle vs Race No
[Chart: average handle (axis 0 to 700000) by race no (1 to 15)]
• The average handle increases with the race number up to race no 10 or 11. The client is advised to restrict the number of races to 11 per day; on days with more than 11 races, the returns from the additional races are not as strong. Race no 15 is an outlier.
Average Handle vs No of runners
• The average handle increases as the number of runners rises from 4 to 12, and the client is advised to keep fields within this range to maximize profits.
[Chart: average handle (axis 0 to 1000000) by no of runners (3 to 14)]
Average Handle vs Hour of day
[Chart: average handle (axis 0 to 1200000) by hour of day (11 to 20)]
• The data shows a significantly higher handle when the first race falls in the 11 am-12 pm range and the last race in the 7-8 pm range. The client could schedule races accordingly.
Handle Value Graph
• The highest handle values look like outliers, but most of them fall on weekends (i.e. on holidays), which explains them.
[Chart: handle (axis 0 to 6000000) by record index (11 to 10483)]
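The weekend explanation can be checked by flagging the high-handle rows and measuring how many of them fall on a Saturday or Sunday. This is a sketch in Python with pandas; the sample rows, column names and the 1000000 spike threshold are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
races = pd.DataFrame({
    "race_date": pd.to_datetime(
        ["2004-10-25", "2004-10-27", "2004-10-30", "2004-10-31", "2004-11-06"]
    ),
    "handle": [150000, 210000, 3200000, 2800000, 4100000],
})

SPIKE_THRESHOLD = 1_000_000  # assumed cut-off for a "high" handle

# pandas: Monday=0 ... Sunday=6, so 5 and 6 are Saturday and Sunday.
races["is_weekend"] = races["race_date"].dt.dayofweek >= 5
spikes = races[races["handle"] > SPIKE_THRESHOLD]
weekend_share = spikes["is_weekend"].mean()

print(f"{len(spikes)} spikes, {weekend_share:.0%} on weekends")
```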
Handle vs Track Id
• From the data it can be inferred that the average handle at Churchill Downs (CD) in the state of Kentucky is significantly greater than that of its peers.
[Chart: count of handle and average of handle by track id (AP, CD, CRC, FG); axis 0 to 900000]
Anomaly Detection
• In the handle graph (11th slide), there are some spikes in the values. These turn out to be weekends, when the handle is high, so they cannot be termed outliers.
• There is only one day (26-Oct-04) on which the number of races is 15, so that day can be treated as an outlier.
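Days with an unusually long card can be found by taking the highest race number per date. The sketch below uses Python with pandas; the rows and column names are hypothetical, and the cut-off of 11 races is the limit suggested for the client elsewhere in these slides.

```python
import pandas as pd

# Made-up card: 15 races on one day, 10 on another.
races = pd.DataFrame({
    "race_date": pd.to_datetime(["2004-10-26"] * 15 + ["2004-10-30"] * 10),
    "race_no": list(range(1, 16)) + list(range(1, 11)),
})

# Highest race number per day, then the days above the cut-off.
races_per_day = races.groupby("race_date")["race_no"].max()
long_days = races_per_day[races_per_day > 11]

print(long_days)
```

With real data, a single flagged date would support treating that day as an isolated case rather than a pattern.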
Suggestions for client(Summary)
• As described in the graphs and histograms, some of the things the client should take into account are:
• 1. Wed, Fri, Sat and Sun are the highest gross handle days in a week.
• 2. Steep increase in handle when the purse is higher than $150000.
• 3. Restrict the number of races to 11 per day.
• 4. Average handle increases when the number of runners is in the 4-12 range.
• 5. The handle is significantly higher if the first race is in the 11 am-12 pm range and the last in the 7-8 pm range.