Predict cancellation based on past cab booking data

Neeraj Tiwary Sep-16 Page 1 of 17

Author: Neeraj Tiwary

Abstract

Based on past cab booking data, we need to predict the probability of cancellation of any booking.

Data source: The training dataset contains 9999 records and the test dataset contains 10000 records.

The main attributes in the training and test datasets were:

driver_id

customer_id

booking_date

booking_from_long

booking_from_lat

booking_to_long

booking_to_lat

booking_status

original_rate_cost

amount_cust_discount

is_rush_hour_price

The models were developed in the R statistical language.

Keywords: Booking, Predict, Cancellation

Introduction, Motivation and Questions

Loaded the data into R

Data Cleansing

Converted some features into categorical variables

From the booking_date feature, derived the following new attributes:

o Hour (hr)

o Day (day)

Since the training dataset contains just 2 days of data, no other new features could be derived from the date.

Dropped booking_date feature from the data frame

Removed duplicates

Verified the initial summary of the data

Output
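The cleansing steps listed above can be sketched in base R as follows. This is a minimal illustration, not the report's actual code: the sample rows, the booking_status labels, and the timestamp format of booking_date are all assumptions.

```r
# Hypothetical sample rows; the real dataset is not reproduced in this report.
train_data <- data.frame(
  driver_id          = c(11, 12, 12, 13),
  customer_id        = c(501, 502, 502, 503),
  booking_date       = c("2016-09-01 08:15:00", "2016-09-01 18:40:00",
                         "2016-09-01 18:40:00", "2016-09-02 23:05:00"),
  booking_status     = c("completed", "cancelled", "cancelled", "completed"),
  is_rush_hour_price = c(1, 1, 1, 0),
  stringsAsFactors   = FALSE
)

# Convert selected features into categorical variables (factors).
train_data$booking_status     <- factor(train_data$booking_status)
train_data$is_rush_hour_price <- factor(train_data$is_rush_hour_price)

# Derive hour (hr) and day from booking_date.
# The "%Y-%m-%d %H:%M:%S" format is an assumption about the raw data.
booked_at <- as.POSIXct(train_data$booking_date,
                        format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
train_data$hr  <- as.integer(format(booked_at, "%H"))
train_data$day <- as.integer(format(booked_at, "%d"))

# Drop booking_date and remove duplicate rows.
train_data$booking_date <- NULL
train_data <- unique(train_data)

# Initial summary of the cleaned data.
summary(train_data)
```

In this toy input the second and third rows are identical, so one of them is dropped by unique().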

Feature Engineering

This is the main step of any model development activity. We need to enhance our features to improve predictability.

The previous step produced two new features (hr and day).

In this step, first developed a feature called “Rate_Cost”, which is the difference between “original_rate_cost” and “amount_cust_discount”.

Checked the histogram of that feature and observed that it would benefit from a logarithmic transformation (Log_Rate_Cost).

We have latitude/longitude details for source and destination of booking.

From these details, I derived 3 new features.

o Dist_KM: The distance in kilometers between the source and destination of the booking. This may not be completely accurate, as it is derived as a Euclidean (straight-line) distance.

o Price_KM: Rate_Cost divided by Dist_KM.

o Log_Price_KM: Logarithmic transformation of Price_KM.
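The derived features described above could be computed along the following lines. This is a sketch under stated assumptions: the coordinates and amounts are made up, and the conversion from degrees to kilometers (about 111 km per degree, with longitude scaled by the cosine of the latitude) is one plausible way to get a Euclidean distance in kilometers, not necessarily the one used in the report.

```r
# Hypothetical bookings; coordinates and amounts are illustrative only.
train_data <- data.frame(
  booking_from_long    = c(77.60, 77.55),
  booking_from_lat     = c(12.97, 12.90),
  booking_to_long      = c(77.65, 77.70),
  booking_to_lat       = c(12.93, 13.00),
  original_rate_cost   = c(250, 600),
  amount_cust_discount = c(50, 0)
)

# Net fare after the customer discount, and its log transform.
train_data$Rate_Cost     <- train_data$original_rate_cost -
  train_data$amount_cust_discount
train_data$Log_Rate_Cost <- log(train_data$Rate_Cost)

# Straight-line distance on a locally flat projection (an assumption):
# one degree of latitude is roughly 111 km, and a degree of longitude
# shrinks by cos(latitude).
km_per_deg <- 111
mean_lat   <- (train_data$booking_from_lat + train_data$booking_to_lat) / 2
dx <- (train_data$booking_to_long - train_data$booking_from_long) *
  km_per_deg * cos(mean_lat * pi / 180)
dy <- (train_data$booking_to_lat - train_data$booking_from_lat) * km_per_deg
train_data$Dist_KM <- sqrt(dx^2 + dy^2)

# Price per kilometer and its log transform.
train_data$Price_KM     <- train_data$Rate_Cost / train_data$Dist_KM
train_data$Log_Price_KM <- log(train_data$Price_KM)
```

Rows with a zero Rate_Cost or a zero Dist_KM would produce infinite values under the log or the division, so they would need separate handling before modelling.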

Exploratory data analysis

Here I did an exploratory data analysis of the attributes

Boxplots: train_data$Log_Rate_Cost, train_data$Log_Price_KM, train_data$Dist_KM

Histograms

Qplot

Plot of cancellation over source location

Plot of cancellation over destination location

Plot of cancellation against rate vs peak hour

Plot of cancellation against Price_KM vs Dist_KM

Plot of cancellation against Price_KM vs hour

Plot of cancellation on different hours

Chart of cancellation on different hours

Dropped unnecessary variables

Supervised Model

Information Gain

Output
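Information gain measures how much knowing a feature reduces the entropy of the target. The report's own code and output are not shown here, so the following is only a base R sketch of the calculation for categorical features. The helper names entropy and info_gain and the toy vectors are hypothetical.

```r
# Shannon entropy (in bits) of a categorical vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]            # ignore empty levels to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain of a categorical feature with respect to the target:
# entropy of the target minus the weighted entropy of the target
# within each level of the feature.
info_gain <- function(feature, target) {
  groups  <- split(target, feature, drop = TRUE)
  weights <- sapply(groups, length) / length(target)
  entropy(target) - sum(weights * sapply(groups, entropy))
}

# Hypothetical toy data.
booking_status     <- c("cancelled", "cancelled", "completed", "completed")
is_rush_hour_price <- c(1, 1, 0, 0)  # separates the two classes perfectly
day                <- c(1, 2, 1, 2)  # carries no information about the class

info_gain(is_rush_hour_price, booking_status)
info_gain(day, booking_status)
```

Numeric features such as Dist_KM would need to be binned before this calculation applies to them.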

Decision Tree

Developed a decision tree model

Confusion matrix of the model

Model Structure
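A decision tree with a confusion matrix might be built as in the sketch below. It assumes the rpart package, which may or may not be what the report used, and it runs on synthetic data with a made-up relationship between Dist_KM and cancellation.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(42)
n <- 200
# Synthetic stand-in for the engineered features; the rule linking
# Dist_KM to cancellation is invented for illustration.
train_data <- data.frame(
  hr           = sample(0:23, n, replace = TRUE),
  Dist_KM      = runif(n, 1, 30),
  Log_Price_KM = rnorm(n, 3, 0.5)
)
train_data$booking_status <- factor(
  ifelse(train_data$Dist_KM > 15, "cancelled", "completed")
)

# Fit a classification tree.
tree_model <- rpart(booking_status ~ hr + Dist_KM + Log_Price_KM,
                    data = train_data, method = "class")

# Confusion matrix and accuracy on the training data.
pred        <- predict(tree_model, train_data, type = "class")
conf_matrix <- table(Predicted = pred, Actual = train_data$booking_status)
accuracy    <- sum(diag(conf_matrix)) / sum(conf_matrix)

print(conf_matrix)
print(tree_model)  # text view of the model structure
```

Accuracy measured on the training rows is optimistic; a held-out split would give a fairer estimate.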

Naïve Bayes

Developed a Naïve Bayes model

Confusion matrix of the model

Model Structure
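A Naïve Bayes classifier can be fitted in a few lines. The sketch below assumes the e1071 package is installed and uses synthetic data; the feature set and the class-dependent distributions are invented, and the report does not say how its two Naïve Bayes versions differ.

```r
library(e1071)  # assumed to be installed; provides naiveBayes()

set.seed(1)
n <- 200
# Synthetic target and features; the class-dependent means and
# probabilities are invented for illustration.
status <- factor(sample(c("cancelled", "completed"), n, replace = TRUE))
train_data <- data.frame(
  booking_status     = status,
  Log_Rate_Cost      = rnorm(n, mean = ifelse(status == "cancelled", 6, 5),
                             sd = 0.3),
  is_rush_hour_price = factor(rbinom(n, 1,
                                     ifelse(status == "cancelled", 0.8, 0.2)))
)

# Fit the model on all remaining columns.
nb_model <- naiveBayes(booking_status ~ ., data = train_data)

# Confusion matrix and accuracy on the training data.
pred        <- predict(nb_model, train_data)
conf_matrix <- table(Predicted = pred, Actual = train_data$booking_status)
accuracy    <- sum(diag(conf_matrix)) / sum(conf_matrix)

print(conf_matrix)
print(nb_model)  # prior and conditional distributions per feature
```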

Naïve Bayes

Developed another version of Naïve Bayes model

Confusion matrix of the model

Model Structure

Random Forest

Developed a random forest model

Confusion matrix of the model

ROC Curve

Importance of variable in the model
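The random forest step, including the confusion matrix and variable importance, might look like the following. This sketch assumes the randomForest package is installed and again uses synthetic data in which only Log_Price_KM drives the outcome.

```r
library(randomForest)  # assumed to be installed

set.seed(7)
n <- 300
# Synthetic features; only Log_Price_KM determines the (invented) target.
train_data <- data.frame(
  hr           = sample(0:23, n, replace = TRUE),
  Dist_KM      = runif(n, 1, 30),
  Log_Price_KM = rnorm(n, 3, 0.5)
)
train_data$booking_status <- factor(
  ifelse(train_data$Log_Price_KM > 3, "cancelled", "completed")
)

# Fit the forest and keep importance measures.
rf_model <- randomForest(booking_status ~ ., data = train_data,
                         ntree = 200, importance = TRUE)

# Out-of-bag confusion matrix (with per-class error).
print(rf_model$confusion)

# Variable importance (mean decrease in accuracy and in Gini impurity).
imp <- importance(rf_model)
print(imp)
```

The out-of-bag vote fractions stored in rf_model$votes could serve as scores for a ROC curve.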

Support Vector Machines

Developed a linear SVM model with cost = 500 and gamma = 1

Confusion matrix of the model

Model Structure

Developed a nonlinear SVM model with cost = 500 and gamma = 1

Confusion matrix of the model

Model Structure
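The two SVM variants can be compared as sketched below, assuming the e1071 package is installed. The cost and gamma values come from the report; the data, the generic feature names x1 and x2, and the accuracy helper are hypothetical. In e1071's svm(), gamma has no effect when the kernel is linear, so it only matters for the non-linear (radial) model.

```r
library(e1071)  # assumed to be installed; provides svm()

set.seed(3)
n <- 200
# Synthetic features with a target that is not linearly separable:
# points inside a circle are labelled "cancelled".
train_data <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
train_data$booking_status <- factor(
  ifelse(train_data$x1^2 + train_data$x2^2 < 0.5, "cancelled", "completed")
)

# Linear kernel: gamma is passed to mirror the report but is not used.
svm_linear <- svm(booking_status ~ ., data = train_data,
                  kernel = "linear", cost = 500, gamma = 1)

# Non-linear (radial basis) kernel with the same cost and gamma.
svm_radial <- svm(booking_status ~ ., data = train_data,
                  kernel = "radial", cost = 500, gamma = 1)

# Training accuracy of a fitted model (hypothetical helper).
accuracy <- function(model) {
  mean(predict(model, train_data) == train_data$booking_status)
}

# Confusion matrix for the non-linear model.
conf_matrix <- table(Predicted = predict(svm_radial, train_data),
                     Actual = train_data$booking_status)
print(conf_matrix)

accuracy(svm_linear)
accuracy(svm_radial)
```

On data like this, where the class boundary is curved, the radial kernel should fit the training set much more closely than the linear one. A cost as high as 500 penalizes misclassification heavily, so checking accuracy on held-out data is advisable.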

Support Vector Machines – Non Linear

Methodology

Supervised machine learning models

Results

Observation:

The models developed with the support vector machine algorithm performed best among all the models that I developed.

Within SVM, developing the model in a non-linear space also gives good accuracy.

Conclusions and Next Steps

After a thorough understanding of the problem, these are my recommendations for proceeding further:

From the longitude / latitude details, we can derive the geo locations / localities and then group them or examine the relationships between them. That information might be helpful for the model.

If we had further details about the customer / driver, they would help improve the accuracy of the model.

Constraints

I tried to find the locality through reverse geocoding, but all the APIs have some kind of daily usage limit.

Code

R Code:
