Predict cancellation based on past cab booking data
Neeraj Tiwary Sep-16 Page 1 of 17
Author: Neeraj Tiwary
Abstract
Based on past cab booking data, we need to predict the probability of cancellation of any booking.
Data source: The training dataset contains 9999 records, whereas the test dataset contains 10000 records.
The main attributes in the training/test datasets were:
driver_id
customer_id
booking_date
booking_from_long
booking_from_lat
booking_to_long
booking_to_lat
booking_status
original_rate_cost
amount_cust_discount
is_rush_hour_price
The model was developed in the R statistical language.
Keywords: Booking, Predict, Cancellation
Introduction, Motivation and Questions
Loaded the data into R
Data Cleansing
Converted some features into categorical variables
From the booking_date feature, derived the following new attributes:
o Hour (hr)
o Day (day)
Since the training dataset contains just 2 days of data, no other new features could be derived here.
Dropped booking_date feature from the data frame
Removed duplicates
Verified the initial summary of the data
Output
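The cleansing steps above can be sketched as follows. The report's work was done in R; this is only a minimal Python illustration. It assumes booking_date is a string of the form "YYYY-MM-DD HH:MM:SS" and that day means day of month, neither of which is stated in the report. The sample rows and the helper name clean_bookings are hypothetical.

```python
from datetime import datetime

# Hypothetical sample rows; the real dataset has more columns.
raw_rows = [
    {"driver_id": 1, "customer_id": 10, "booking_date": "2016-09-01 08:15:00"},
    {"driver_id": 1, "customer_id": 10, "booking_date": "2016-09-01 08:15:00"},  # duplicate
    {"driver_id": 2, "customer_id": 11, "booking_date": "2016-09-02 18:40:00"},
]

def clean_bookings(rows, date_format="%Y-%m-%d %H:%M:%S"):
    """Derive hr and day, drop booking_date, then remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        ts = datetime.strptime(row["booking_date"], date_format)
        # Copy every column except booking_date, which is replaced by hr/day.
        new_row = {k: v for k, v in row.items() if k != "booking_date"}
        new_row["hr"] = ts.hour   # hour of day, 0-23
        new_row["day"] = ts.day   # day of month (assumption)
        key = tuple(sorted(new_row.items()))
        if key not in seen:       # keep only the first copy of each row
            seen.add(key)
            cleaned.append(new_row)
    return cleaned

cleaned_rows = clean_bookings(raw_rows)
```

Duplicates are removed after booking_date is dropped, which mirrors the order of the steps listed above.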
Feature Engineering
This is the main step of any model development activity. We need to enhance our features to achieve
better predictability.
The previous step produced two new features (hr and day).
In this step, I first developed a feature termed “Rate_Cost”, which is the difference between
“original_rate_cost” and “amount_cust_discount”.
Checked the histogram of that feature and observed that a logarithmic transformation
of this new feature would be beneficial.
We have latitude/longitude details for the source and destination of each booking.
From these details, I derived 3 new features.
o Dist_KM: This is the total distance in kilometers between the source and destination of the
booking. This may not be completely accurate, as this attribute is derived via Euclidean
distance.
o Price_KM: This is the ratio of Rate_Cost to Dist_KM.
o Log_Price_KM: Logarithmic transformation of the above derived attribute.
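The derived features above can be sketched as follows. The report's work was done in R; this is a Python illustration only. The report does not say how degrees were converted to kilometers, so euclidean_km assumes a locally flat (equirectangular) projection, and haversine_km is included purely for comparison. The function names, the guards against non-positive values and the example booking are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def euclidean_km(lat1, lon1, lat2, lon2):
    """Straight-line distance on a locally flat (equirectangular) projection."""
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat) * EARTH_RADIUS_KM
    dy = math.radians(lat2 - lat1) * EARTH_RADIUS_KM
    return math.hypot(dx, dy)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance, shown only for comparison."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def add_price_features(row):
    """Return a copy of row with Rate_Cost, Dist_KM, Price_KM and their logs."""
    out = dict(row)
    out["Rate_Cost"] = row["original_rate_cost"] - row["amount_cust_discount"]
    out["Dist_KM"] = euclidean_km(
        row["booking_from_lat"], row["booking_from_long"],
        row["booking_to_lat"], row["booking_to_long"],
    )
    # Guard against zero distance and non-positive values before taking logs.
    out["Price_KM"] = out["Rate_Cost"] / out["Dist_KM"] if out["Dist_KM"] > 0 else None
    out["Log_Rate_Cost"] = math.log(out["Rate_Cost"]) if out["Rate_Cost"] > 0 else None
    price = out["Price_KM"]
    out["Log_Price_KM"] = math.log(price) if price is not None and price > 0 else None
    return out

# Hypothetical booking used only to exercise the functions.
example = add_price_features({
    "original_rate_cost": 120.0, "amount_cust_discount": 20.0,
    "booking_from_lat": 12.97, "booking_from_long": 77.59,
    "booking_to_lat": 13.00, "booking_to_long": 77.62,
})
```

For trips of a few kilometers the flat approximation and the great-circle distance differ only marginally. Either way, a straight-line distance understates the distance actually driven.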
Exploratory data analysis
Here I did an exploratory data analysis of the attributes
Boxplots: train_data$Log_Rate_Cost, train_data$Log_Price_KM, train_data$Dist_KM
Histograms
Qplot
Plot of cancellation over source location
Plot of cancellation over destination location
Plot of cancellation against rate vs peak hour
Plot of cancellation against Price_KM vs Dist_KM
Plot of cancellation against Price_KM vs hour
Plot of cancellation on different hours
Chart of cancellation on different hours
Dropped unnecessary variables
Supervised Model
Information Gain
Output
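For reference, information gain measures how much knowing a feature reduces the entropy of booking_status. The report's figures came from R; the following is a minimal Python sketch of the standard entropy-based definition for a categorical feature. The label values "cancelled" and "completed" and the toy data are hypothetical, and continuous features such as Dist_KM would need to be binned first.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of labels minus the weighted entropy within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data: the first feature separates the classes perfectly,
# the second carries no information about the outcome.
booking_status = ["cancelled", "cancelled", "completed", "completed"]
gain_perfect = information_gain([1, 1, 0, 0], booking_status)
gain_none = information_gain([1, 0, 1, 0], booking_status)
```

Features with higher gain are stronger candidates for the models that follow.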
Decision Tree
Developed a decision tree model
Confusion matrix of the model
Model Structure
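Each model in this report is assessed with a confusion matrix. As a reminder of how one is read, here is a minimal Python sketch (the report itself used R). The label value "cancelled", the helper names and the toy predictions are hypothetical and are not results from the report.

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive="cancelled"):
    """Count true/false positives and negatives for a binary outcome."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return {k: counts[k] for k in ("TP", "FP", "FN", "TN")}

def accuracy(cm):
    """Share of all predictions that were correct."""
    total = sum(cm.values())
    return (cm["TP"] + cm["TN"]) / total if total else 0.0

# Hypothetical toy labels, not taken from the report.
actual = ["cancelled", "cancelled", "completed", "completed", "completed"]
predicted = ["cancelled", "completed", "completed", "completed", "cancelled"]
cm = confusion_matrix(actual, predicted)
acc = accuracy(cm)
```

When cancellations are rare, accuracy alone can be misleading, so the FN and FP counts are worth reading alongside it.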
Naïve Bayes
Developed a Naïve Bayes model
Confusion matrix of the model
Model Structure
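To illustrate what a Naïve Bayes model computes, the following is a minimal Python sketch for categorical features with Laplace smoothing. It is not the R model used in the report: the class name CategoricalNaiveBayes, the smoothing parameter alpha and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        n_features = len(rows[0])
        # value_counts[i][label][value] -> number of training rows
        self.value_counts = [defaultdict(Counter) for _ in range(n_features)]
        self.vocab = [set() for _ in range(n_features)]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[i][label][value] += 1
                self.vocab[i].add(value)
        return self

    def predict_proba(self, row):
        log_scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / self.total)  # log prior
            for i, value in enumerate(row):
                num = self.value_counts[i][label][value] + self.alpha
                den = count + self.alpha * len(self.vocab[i])
                score += math.log(num / den)      # log likelihood per feature
            log_scores[label] = score
        # Normalise in log space to avoid underflow.
        peak = max(log_scores.values())
        exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
        norm = sum(exp_scores.values())
        return {k: v / norm for k, v in exp_scores.items()}

# Hypothetical toy data: (is_rush_hour_price, part of day) per booking.
rows = [(1, "evening"), (1, "evening"), (0, "morning"), (0, "morning"), (1, "morning")]
labels = ["cancelled", "cancelled", "completed", "completed", "completed"]
model = CategoricalNaiveBayes().fit(rows, labels)
proba = model.predict_proba((1, "evening"))
```

The returned dictionary gives a probability per class, which matches the stated goal of predicting the probability of cancellation.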
Naïve Bayes
Developed another version of the Naïve Bayes model
Confusion matrix of the model
Model Structure
Random Forest
Developed a random forest model
Confusion matrix of the model
ROC Curve
Importance of variables in the model
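An ROC curve is usually summarised by the area under it (AUC): the probability that a randomly chosen cancelled booking receives a higher score than a randomly chosen non-cancelled one. The following Python sketch computes it by direct pairwise comparison. It is an illustration only, the helper name roc_auc and the toy scores are hypothetical, and the report's own curve was produced in R.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels use 1 for the positive class (cancelled) and 0 otherwise;
    tied scores count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted cancellation probabilities and true outcomes.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
```

The pairwise loop is quadratic in the number of records, which is acceptable for a dataset of roughly 10000 rows but would be replaced by a rank-based formula for larger data.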
Support Vector Machines
Developed a linear SVM model with cost = 500 and gamma = 1
Confusion matrix of the model
Model Structure
Developed a nonlinear SVM model with cost = 500 and gamma = 1
Confusion matrix of the model
Model Structure
Support Vector Machines – Non Linear
Methodology
Supervised machine learning models
Results
Observation:
Among all the models I developed, those built with the support vector machine algorithm were
the better performers.
Within SVM, the model developed in non-linear space also has good accuracy.
Conclusions and Next Steps
After a thorough understanding of the problem, my recommendations for further work are as follows:
From the longitude / latitude details, we can derive the geo locations / locality and then group
them or see the relationship between them. That information might be helpful for the model.
Further details about the customer / driver, if available, would help improve the
accuracy of the model.
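One way to group locations without an external service is to bin coordinates into a fixed grid and compare cancellation rates per cell. This is a swapped-in approximation of "locality", not reverse geocoding. The sketch below is Python for illustration only; the cell size of 0.01 degrees, the label value "cancelled", the helper names and the toy bookings are all hypothetical.

```python
import math
from collections import Counter

def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to an integer (row, col) grid cell of cell_deg degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def cancellation_rate_by_cell(bookings, cell_deg=0.01):
    """Share of cancelled bookings per pickup grid cell."""
    totals, cancelled = Counter(), Counter()
    for b in bookings:
        cell = grid_cell(b["booking_from_lat"], b["booking_from_long"], cell_deg)
        totals[cell] += 1
        if b["booking_status"] == "cancelled":
            cancelled[cell] += 1
    return {cell: cancelled[cell] / n for cell, n in totals.items()}

# Hypothetical bookings: the first two pickups fall in the same cell.
bookings = [
    {"booking_from_lat": 12.975, "booking_from_long": 77.595, "booking_status": "cancelled"},
    {"booking_from_lat": 12.978, "booking_from_long": 77.598, "booking_status": "completed"},
    {"booking_from_lat": 13.005, "booking_from_long": 77.625, "booking_status": "completed"},
]
rates = cancellation_rate_by_cell(bookings)
```

The cell identifier, or the per-cell cancellation rate computed on the training data only, could then be tried as an additional feature.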
Constraints
I tried to determine the locality via reverse geocoding, but all APIs have some kind of daily usage
limit.