Predicting Average Basket Value
Business Analytics Using Data Mining Group 9A Aashish Sharma: 61310403 Abhishek Agrawal: 61310548 Saurabh Malhotra: 61310161 Madhav Pathak: 61310330 Madhur Chadha: 61310185
Table of Contents
I. Executive Summary
   A. Problem Description
   B. Brief Description of the Data, Its Source, Key Characteristics, and Charts
   C. High-Level Description (Prediction Methodology)
   D. Technical Summary
   E. Performance Metrics
   F. Limitations
   G. Recommendations
II. Appendix
   A. CART Model
   B. Multiple Regression Model
   C. KNN Model
   D. Naïve Model
I. Executive Summary
A. Problem Description
Retailers spend considerable time, effort, and money to acquire a new customer. Once a customer has been acquired, however, maximum value is derived only if that customer becomes a repeat buyer whose purchase amounts grow over time. Identifying which customers should qualify for a promotion is therefore key, and our study attempts to solve this problem. Our model predicts a customer's future shopping basket value. Based on this prediction, the top 10% of customers will be identified each week and the store will email promotional discount coupons to them. Model accuracy is another deciding factor in the overall business strategy: missing a potentially loyal customer could damage the associated long-term customer lifetime value. As the hypermarket, we are interested in predicting the average basket value of the next customer who walks in, based on his/her demographic data as well as purchase patterns prior to this visit.
B. Brief Description of the Data, Its Source, Key Characteristics, and Charts
The transaction data for ABC include information at the SKU level: customer ID, purchase date, extended price, quantity sold, item description, department, sub-department, class, and subclass. The data cover a period of 13 months, from Aug 2011 to Aug 2012. Customer demographic information has also been provided, including enrolment date, date of birth, sex, marital status, and customer ID.
Source: Data collected by Hansa Cequity Solutions
Key Characteristics: The available data were at the SKU and customer level, but we needed basket-level data for our analysis. To efficiently filter and sort the data for basket-level analysis, we used BASE SAS v9.1.
• Compiled data at the basket level for each customer: merged customer information with the transaction data to obtain data at the customer_id and transaction_date level
• Computed the lagged (i.e., up to the previous purchase) 'Average Basket Price' and 'Average Purchase Quantity' for each customer at the transaction level, using all purchases prior to the current transaction date
• Included four dummy variables in the data set: Sex, Married, Mobile Status, and Email Flag
Variables: Customer_No, Transaction_Date, Day_of_Week, Quantity_Sold, Extended_Price, Avg_Quantity_Sold_Lag, Avg_Ext_Price_Lag, Count_Footfall, Clean_Email_Flag, Total_qty_sold_tilldate, Total_cart_value_tilldate, Sex_M, Mobile_Status, Married, Enrollment_Age
Missing Values Handling:
At the summary level, approximately 4,000 rows had SEX and MARITAL status missing. This subset was removed from our analysis. This was done deliberately: we want a robust model only for those customers for whom all information has been captured (it will also prompt customers to provide their information next time in order to avail discounts).
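The project built this driver file in BASE SAS v9.1. As an illustrative sketch only (not the original code), the same basket-level roll-up and lagged averages could look like this in pandas; the input column names `Quantity` and `Extended_Price` are assumptions about the raw SKU-level feed:

```python
import pandas as pd

def prepare_driver_file(txn: pd.DataFrame) -> pd.DataFrame:
    """Roll SKU-level rows up to one row per customer visit (basket),
    then add lagged averages computed over all prior visits only."""
    baskets = (txn.groupby(["Customer_No", "Transaction_Date"], as_index=False)
                  .agg(Quantity_Sold=("Quantity", "sum"),
                       Extended_Price=("Extended_Price", "sum")))
    baskets = baskets.sort_values(["Customer_No", "Transaction_Date"])
    grp = baskets.groupby("Customer_No")
    # Expanding mean of *previous* baskets only: shift(1) before expanding,
    # so the current basket never leaks into its own lagged predictor.
    baskets["Avg_Ext_Price_Lag"] = grp["Extended_Price"].transform(
        lambda s: s.shift(1).expanding().mean())
    baskets["Avg_Quantity_Sold_Lag"] = grp["Quantity_Sold"].transform(
        lambda s: s.shift(1).expanding().mean())
    baskets["Count_Footfall"] = grp.cumcount()  # number of prior visits
    return baskets
```

The first visit of each customer gets a missing lagged value by construction, mirroring the "till previous purchase" definition above.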
C. High-Level Description (Prediction Methodology)
The following four methods were used to analyze the data and to create predictive models on the driver file (created above):

Multiple Linear Regression: Y = 2066.58 - 76.29*Day_of_Week + 0.5505*Avg_Ext_Price_Lag - 23.30*Count_Footfall + 16.93*AGE - 494.82*SEX_M + 100.08*ENROLLMENT_AGE + 159.20*MARRIED - 154.86*CLEAN_EMAIL_FLAG

Naïve Forecast (Benchmark): Y = 1736.7644 + 0.6333*Avg_Ext_Price_Lag

k-NN Model: Input variables: Day_of_Week, Avg_Ext_Price_Lag, Count_Footfall, AGE, SEX_M, ENROLLMENT_AGE, MARRIED (k = 20)

CART: Input variables: Day_of_Week, Avg_Ext_Price_Lag, Count_Footfall, AGE, SEX_M, ENROLLMENT_AGE, MARRIED (best pruned tree)

The Appendix shows the model workings and other technical details.
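As an illustration of how the two fitted equations above score a customer, here is a minimal sketch. The coefficients are copied from the report (the MARRIED coefficient is taken as 159.20 per the appendix regression table); the feature dictionary passed in is a made-up example, not a real customer record:

```python
# Fitted MLR coefficients from the report's regression output.
MLR_COEFS = {
    "Intercept": 2066.58, "Day_of_Week": -76.29, "Avg_Ext_Price_Lag": 0.5505,
    "Count_Footfall": -23.30, "AGE": 16.93, "SEX_M": -494.82,
    "ENROLLMENT_AGE": 100.08, "MARRIED": 159.20, "CLEAN_EMAIL_FLAG": -154.86,
}

def predict_mlr(features: dict) -> float:
    """Score one customer with the multiple linear regression equation.
    Any variable missing from `features` is treated as zero."""
    return MLR_COEFS["Intercept"] + sum(
        MLR_COEFS[name] * value for name, value in features.items())

def predict_naive(avg_ext_price_lag: float) -> float:
    """Benchmark: regress next basket value on lagged average price only."""
    return 1736.7644 + 0.6333 * avg_ext_price_lag
```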
[Figure: residual histograms for the above models]
D. Technical Summary
The lift chart and RMSE were used to evaluate the performance of the models above. [Figure: lift charts for all models, followed by an RMSE comparison]
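As a sketch of the two evaluation measures named here, RMSE and a simple decile-based lift table (mean actual basket value per predicted-value decile) can be computed as follows; this is illustrative code, not the tool output used in the study:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired actual/predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def decile_lift(actual, predicted, bins=10):
    """Mean actual value within each predicted-value decile, best decile
    first; a well-ranking model shows a steeply decreasing sequence."""
    ranked = [a for _, a in sorted(zip(predicted, actual), reverse=True)]
    size = max(1, len(ranked) // bins)
    return [sum(ranked[i:i + size]) / len(ranked[i:i + size])
            for i in range(0, size * bins, size)]
```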
E. Performance Metrics
The cost of offering a discount to a customer whom the model flagged incorrectly is negligible. The cost of failing to identify the right customer, however, can be very high: the hypermarket might lose that customer's average basket value in sales, less the discount that would have been offered. The underlying assumption in this evaluation is that by not offering the discount promotion to the 'correct' customer, we lose the opportunity to bring that customer back to the store. We also assume that on the next purchase this customer's basket size would approximately equal his/her average past purchase value (the Naïve forecast). This cost works out to roughly Rs. 500 per misclassified customer (based on weekly prediction vs. actual data for the misidentified customer set).
[Figure: final CART model (best pruned tree)]
Also, we cannot offer this promotion to every customer at the hypermarket: offering the promotion raises the probability of a re-visit tremendously, so the total discount cost would be much higher.
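The cost asymmetry argued above can be sketched as two small helper functions. This is a hedged illustration of the reasoning, not the study's calculation; the discount rate and redemption probability passed in are hypothetical parameters, not figures from the report:

```python
def miss_cost(avg_basket_value: float, discount_rate: float) -> float:
    """Opportunity cost of NOT couponing a customer who would have
    returned: the lost basket, net of the discount we would have given."""
    return avg_basket_value * (1.0 - discount_rate)

def false_alarm_cost(avg_basket_value: float, discount_rate: float,
                     redeem_prob: float) -> float:
    """Expected discount given away to a customer flagged in error;
    nothing is lost unless the coupon is actually redeemed."""
    return avg_basket_value * discount_rate * redeem_prob
```

With any plausible inputs, the miss cost dwarfs the false-alarm cost, which is why the model is tuned to avoid missing probable loyal customers.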
F. Limitations
Macro-economic factors have been ignored for the purpose of building this model. To improve the model's prediction quality, the hypermarket should add more predictor variables by collecting further information from customers. Demographic data might be misleading, as many households share one loyalty card that is used by multiple family members. All customers are given a loyalty card when they make a purchase at the hypermarket.
G. Recommendations
We recommend that the hypermarket adopt and deploy our CART model to predict the basket value of each customer's next purchase. Using this model, the hypermarket will be better able to identify the top 10 percent of customers, incentivize them to visit more frequently, and reward them for their loyalty. The hypermarket should run this model weekly, updating the misclassification cost and taking proactive action in estimating the probability of repeat purchase for all discounts offered. Tracking the actual increase in customer footfall will help the hypermarket evaluate whether the strategy is translating into real value. The model should be refreshed on a near real-time basis, since it relies extensively on customers' prior purchases as input variables.
Two ways to measure the success of our proposal:
• % increase in sales and % increase in re-purchase rate for existing customers. Prerequisite for this benchmark: the % increase in sales should be greater than 5%.
• % increase in profitability: the ideal measure, but since the data are not available we are not using this metric for evaluation.
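The weekly selection step recommended above (score everyone, then coupon the top 10% by predicted next-basket value) can be sketched as follows; `predictions` is a hypothetical mapping from customer ID to the model's predicted basket value:

```python
def top_decile(predictions: dict, frac: float = 0.10) -> list:
    """Return the customer IDs in the top `frac` of predicted basket
    value, highest first; these customers receive the weekly coupon."""
    n = max(1, int(len(predictions) * frac))
    ranked = sorted(predictions, key=predictions.get, reverse=True)
    return ranked[:n]
```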
II. APPENDIX
A. CART Model
Full Tree Rules (Using Training Data): 6 decision nodes, 7 terminal nodes

Level  NodeID  ParentID  SplitVar           SplitValue  Cases  LeftChild  RightChild  PredVal      Node Type
0      0       N/A       Avg_Ext_Price_Lag  5676.87     9999   1          2           3584.464788  Decision
1      1       0         Count_Footfall     0.5         8897   3          4           3094.471514  Decision
1      2       0         N/A                N/A         1102   N/A        N/A         7540.426824  Terminal
2      3       1         ENROLLMENT_AGE     2.5         4448   7          8           3797.778932  Decision
2      4       1         Avg_Ext_Price_Lag  2715.57     4449   5          6           2391.322178  Decision
3      5       4         Avg_Ext_Price_Lag  1006.0053   3153   11         12          1712.634843  Decision
3      6       4         N/A                N/A         1296   N/A        N/A         4042.480486  Terminal
3      7       3         AGE                32.8236     3357   9          10          3257.41286   Decision
3      8       3         N/A                N/A         1091   N/A        N/A         5460.48187   Terminal
4      9       7         N/A                N/A         1806   N/A        N/A         2685.129934  Terminal
4      10      7         N/A                N/A         1551   N/A        N/A         3923.784855  Terminal
4      11      5         N/A                N/A         1489   N/A        N/A         1134.395803  Terminal
4      12      5         N/A                N/A         1664   N/A        N/A         2230.061484  Terminal
Pruned Tree Rules (Using Validation Data): 4 decision nodes, 5 terminal nodes

Level  NodeID  ParentID  SplitVar           SplitValue  Cases  LeftChild  RightChild  PredVal      Node Type
0      0       N/A       Avg_Ext_Price_Lag  5676.87     21520  1          2           3584.464788  Decision
1      1       0         Count_Footfall     0.5         18575  3          4           3094.471514  Decision
1      2       0         N/A                N/A         2945   N/A        N/A         7540.426824  Terminal
2      3       1         ENROLLMENT_AGE     2.5         5735   7          8           3797.778932  Decision
2      4       1         Avg_Ext_Price_Lag  2715.57     12840  5          6           2391.322178  Decision
3      5       4         N/A                N/A         8518   N/A        N/A         1712.634843  Terminal
3      6       4         N/A                N/A         4322   N/A        N/A         4042.480486  Terminal
3      7       3         N/A                N/A         5390   N/A        N/A         3257.41286   Terminal
3      8       3         N/A                N/A         345    N/A        N/A         5460.48187   Terminal
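The pruned rules above can be applied directly as nested conditions to score a single record. This sketch assumes the left branch takes records with the split variable below the split value (the usual convention in this tool's output, but an assumption here):

```python
def predict_pruned_cart(x: dict) -> float:
    """Score one record with the pruned tree rules tabulated above;
    returns the predicted average basket value of the matched leaf."""
    if x["Avg_Ext_Price_Lag"] >= 5676.87:
        return 7540.426824            # node 2: heavy prior spenders
    if x["Count_Footfall"] < 0.5:     # node 3: first recorded visit
        if x["ENROLLMENT_AGE"] < 2.5:
            return 3257.41286         # node 7
        return 5460.48187             # node 8: long-enrolled customers
    # node 4: repeat visitors, split again on lagged average price
    if x["Avg_Ext_Price_Lag"] < 2715.57:
        return 1712.634843            # node 5
    return 4042.480486                # node 6
```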
Training Data Scoring: Summary Report (Using Full Tree)
Total sum of squared errors: 2.02691E+11
RMS error: 4502.342301
Average error: -1.02611E-07
Validation Data Scoring: Summary Report (Using Pruned Tree)
Total sum of squared errors: 3.31232E+11
RMS error: 3923.240319
Average error: -81.441941
Test Data Scoring: Summary Report (Using Pruned Tree)
Total sum of squared errors: 2.26221E+11
RMS error: 3971.011282
Average error: 207.933656
We checked multiple box plots to see how this model performs with respect to various predictors. One of them, predicted value vs. day of week, is shown here. [Figure: box plot of predicted value by day of week]
B. Multiple Regression Model
The Regression Model

Input variables     Coefficient   Std. Error   p-value     SS
Constant term       2066.584473   171.1040497  0           1.2436E+11
Day of Week         -76.2850494   17.81961441  0.00002235  152441504
Avg_Ext_Price_Lag   0.55048656    0.01074641   0           47134162944
Count_Footfall      -23.3044643   3.39011359   0           707978816
AGE                 16.93166161   3.94331098   0.00002114  539239808
SEX_M               -494.817383   91.99887085  0.00000012  542590080
ENROLLMENT_AGE      100.0760422   50.31545258  0.04725318  73395968
MARRIED             159.1952515   88.49066925  0.07262547  44065160
CLEAN_EMAIL_FLAG    -154.867523   93.62989807  0.09875221  43385896

Residual df: 9991
Multiple R-squared: 0.23708501
Std. Dev. estimate: 3982.249512
Residual SS: 1.5844E+11

Training Data Scoring: Summary Report
Total sum of squared errors: 1.5844E+11
RMS error: 3980.457149
Average error: -0.00011155

Validation Data Scoring: Summary Report
Total sum of squared errors: 5.79565E+11
RMS error: 4019.905343
Average error: 18.35766434

[Figure: box plot of predicted value by day of week]
C. KNN Model
Validation error log for different k
k    Training RMS Error    Validation RMS Error
1 24.44349947 5503.977377
2 24.44349947 4776.979308
3 24.44349947 4484.350744
4 24.44349947 4341.368449
5 24.44349947 4241.777322
6 24.44349947 4179.116395
7 24.44349947 4133.049977
8 24.44349947 4103.822495
9 24.44349947 4074.922732
10 24.44349947 4052.492928
11 24.44349947 4033.628783
12 24.44349947 4018.640439
13 24.44349947 4007.154024
14 24.44349947 3994.31554
15 24.44349947 3986.065351
16 24.44349947 3976.562105
17 24.44349947 3972.204232
18 24.44349947 3966.046008
19 24.44349947 3960.786713
20 24.44349947 3957.36427 <--- Best k
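The choice of k = 20 above follows from picking the k with the smallest validation RMS error in the log. A minimal sketch of that selection, using a few illustrative entries from the table:

```python
def best_k(validation_rmse: dict) -> int:
    """Given a mapping k -> validation RMS error, return the k that
    minimizes validation error (the model actually deployed)."""
    return min(validation_rmse, key=validation_rmse.get)
```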
Training Data Scoring: Summary Report (for k=20)
Total sum of squared errors: 5974249.178
RMS error: 24.44349947
Average error: 0
Validation Data Scoring: Summary Report (for k=20)
Total sum of squared errors: 3.37019E+11
RMS error: 3957.36427
Average error: 215.4206761
Test Data Scoring: Summary Report (for k=20)
Total sum of squared errors: 2.39481E+11
RMS error: 4085.737877
Average error: 685.1382098
As with the MLR model, the box plot of predicted value vs. day of week is shown below. [Figure: box plot of predicted value by day of week]
D. Naïve Model
The Regression Model

Input variables     Coefficient   Std. Error   p-value  SS
Constant term       1736.764404   54.34583664  0        84894556160
Avg_Ext_Price_Lag   0.63322592    0.01248606   0        43268575232

Training Data Scoring: Summary Report
Total sum of squared errors: 1.362E+11
RMS error: 4101.090313
Average error: 4.9201E-06

Validation Data Scoring: Summary Report
Total sum of squared errors: 93906645535
RMS error: 4170.91869
Average error: 1049.05211