Predict cancellation based on past cab booking data

Post on 20-Jan-2017

Category:

Data & Analytics

TRANSCRIPT

<ul><li><p>Predict cancellation based on past cab booking data </p><p>Neeraj Tiwary Sep-16 Page 1 of 17 </p><p>Author: Neeraj Tiwary </p><p>Abstract </p><p>Based on past cab booking data, we need to predict the probability that a given booking will be cancelled. </p><p>Data source: The training dataset contains 9999 records, whereas the test dataset contains 10000 records. </p><p>The main attributes in the training/test datasets were: </p><ul><li><p>driver_id </p></li><li><p>customer_id </p></li><li><p>booking_date </p></li><li><p>booking_from_long </p></li><li><p>booking_from_lat </p></li><li><p>booking_to_long </p></li><li><p>booking_to_lat </p></li><li><p>booking_status </p></li><li><p>original_rate_cost </p></li><li><p>amount_cust_discount </p></li><li><p>is_rush_hour_price </p></li></ul><p>The model was developed in the R statistical language. </p><p>Keywords: Booking, Predict, Cancellation </p><p>Introduction, Motivation and Questions </p><p>Loaded the data into R. </p></li>
<li><p>Data Cleansing </p><ul><li><p>Converted some features into categorical variables. </p></li><li><p>From the booking_date feature, derived the following new attributes: </p><ul><li><p>Hour (hr) </p></li><li><p>Day (day) </p></li></ul></li><li><p>Since the training dataset contains just 2 days of data, no other new features could be derived here. </p></li><li><p>Dropped the booking_date feature from the data frame. </p></li><li><p>Removed duplicates. </p></li><li><p>Verified the initial summary of the data. </p></li></ul><p>Output </p></li>
<li><p>Feature Engineering </p><p>This is the main step of any model development activity. We need to enhance our features to improve predictability. </p><ul><li><p>The previous step produced two new features (hr and day). </p></li><li><p>In this step, I first developed a feature named Rate_Cost, which is the difference between original_rate_cost and amount_cust_discount. </p></li><li><p>I checked the histogram of that feature and observed that it would benefit from a logarithmic transformation. </p></li><li><p>We have latitude/longitude details for the source and destination of each booking. </p></li><li><p>From these details, I derived 3 new features: </p><ul><li><p>Dist_KM: The total distance in kilometers between the source and destination of the booking. This may not be completely accurate, as the attribute is derived via Euclidean distance. </p></li><li><p>Price_KM: The ratio of Rate_Cost to Dist_KM. </p></li><li><p>Log_Price_KM: The logarithmic transformation of Price_KM. </p></li></ul></li></ul></li>
<li><p>Exploratory data analysis </p><p>Here I did an exploratory data analysis of the attributes. </p><p>Boxplots: train_data$Log_Rate_Cost, train_data$Log_Price_KM, train_data$Dist_KM </p><p>Histograms </p></li>
<li><p>Qplot </p><p>Plot of cancellation over source location </p><p>Plot of cancellation over destination location </p><p>Plot of cancellation against rate vs peak hour </p><p>Plot of cancellation against Price_KM vs Dist_KM </p></li>
<li><p>Plot of cancellation against Price_KM vs hour </p><p>Plot of cancellation on different hours </p><p>Chart of cancellation on different hours </p><p>Dropped unnecessary variables. </p></li>
<li><p>Supervised Model </p><p>Information Gain </p><p>Output </p><p>Decision Tree </p><p>Developed a decision tree model. </p><p>Confusion matrix of the model </p><p>Model Structure </p></li>
<li><p>Naive Bayes </p><p>Developed a Naive Bayes model. </p><p>Confusion matrix of the model </p><p>Model Structure </p></li>
<li><p>Naive Bayes </p><p>Developed another version of the Naive Bayes model. </p><p>Confusion matrix of the model </p><p>Model Structure </p></li>
<li><p>Random Forest </p><p>Developed a random forest model. </p><p>Confusion matrix of the model </p><p>ROC Curve </p><p>Importance of variables in the model </p></li>
<li><p>Support Vector Machines </p><p>Developed a linear SVM model with cost = 500 and gamma = 1. </p><p>Confusion matrix of the model </p><p>Model Structure </p><p>Developed a nonlinear SVM model with cost = 500 and gamma = 1. </p><p>Confusion matrix of the model </p><p>Model Structure </p></li>
<li><p>Support Vector Machines (nonlinear) </p><p>Methodology </p><p>Supervised machine learning models </p><p>Results </p><p>Observation: </p><ul><li><p>The models built with the support vector machine algorithm were the best of all the models that I developed. </p></li><li><p>Within SVM, the model developed in nonlinear space has good accuracy. </p></li></ul><p>Conclusions and Next Steps </p><p>After a thorough understanding of the problem, these are my recommendations for further work: </p><ul><li><p>From the longitude/latitude details, we can derive the geolocation or locality of each booking and then group the localities or examine the relationships between them. That information might be helpful for the model. </p></li><li><p>Further details about the customer or driver would help improve the accuracy of the model. </p></li></ul><p>Constraints </p><p>I tried to find the locality through reverse geocoding, but all the APIs have some kind of daily usage limit. </p></li>
<li><p>Code </p><p>R Code: </p></li></ul>
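<p>The feature engineering steps described above (hr, day, Rate_Cost, Dist_KM, Price_KM and their logarithms) can be sketched roughly as follows. This is a minimal illustration in Python, not the author's R code, which is not reproduced in this transcript. Several details are assumptions: the booking_date format, reading "day" as day of month, the constant KM_PER_DEGREE, the flat-earth conversion of degrees to kilometers, and the helper names safe_log, euclidean_km and engineer_features. The sample booking is made up.</p>

```python
import math
from datetime import datetime

# Assumption: approximate length of one degree of latitude in kilometers.
KM_PER_DEGREE = 111.32


def safe_log(value):
    """Natural log, or None where the log is undefined (None or value <= 0)."""
    if value is None or value <= 0:
        return None
    return math.log(value)


def euclidean_km(from_lat, from_long, to_lat, to_long):
    """Straight-line (Euclidean) distance in km, treating a small area as flat."""
    mean_lat = math.radians((from_lat + to_lat) / 2.0)
    d_lat_km = (to_lat - from_lat) * KM_PER_DEGREE
    # Degrees of longitude shrink with latitude, so scale by cos(latitude).
    d_long_km = (to_long - from_long) * KM_PER_DEGREE * math.cos(mean_lat)
    return math.hypot(d_lat_km, d_long_km)


def engineer_features(booking, date_format="%Y-%m-%d %H:%M:%S"):
    """Derive the features named in the report from one booking record (a dict)."""
    # Assumption: booking_date is a string in date_format.
    when = datetime.strptime(booking["booking_date"], date_format)
    rate_cost = booking["original_rate_cost"] - booking["amount_cust_discount"]
    dist_km = euclidean_km(
        booking["booking_from_lat"], booking["booking_from_long"],
        booking["booking_to_lat"], booking["booking_to_long"],
    )
    # Price per km is undefined when source and destination coincide.
    price_km = rate_cost / dist_km if dist_km > 0 else None
    return {
        "hr": when.hour,
        "day": when.day,  # assumption: day of month
        "Rate_Cost": rate_cost,
        "Log_Rate_Cost": safe_log(rate_cost),
        "Dist_KM": dist_km,
        "Price_KM": price_km,
        "Log_Price_KM": safe_log(price_km),
    }


# Made-up booking, used only to show the call.
sample = {
    "booking_date": "2016-09-01 08:30:00",
    "booking_from_lat": 12.0, "booking_from_long": 77.0,
    "booking_to_lat": 13.0, "booking_to_long": 77.0,
    "original_rate_cost": 250.0, "amount_cust_discount": 27.36,
}
features = engineer_features(sample)
print(features)
```

<p>Returning None for undefined logarithms and zero-distance trips keeps such rows visible, so they can be dropped or imputed explicitly before modeling instead of silently producing infinities.</p>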