Predicting Average Basket Value
Business Analytics Using Data Mining Group 9A Aashish Sharma: 61310403 Abhishek Agrawal: 61310548 Saurabh Malhotra: 61310161 Madhav Pathak: 61310330 Madhur Chadha: 61310185
Table of Contents
I. Executive Summary
   A. Problem Description
   B. Brief Description of the Data, Its Source, Key Characteristics, and Charts
   C. High-Level Description (Prediction Methodology)
   D. Technical Summary
   E. Performance Metrics
   F. Limitations
   G. Recommendations
II. Appendix
   A. CART Model
   B. Multiple Regression Model
   C. KNN Model
   D. Naïve Model
I. Executive Summary
A. Problem Description
Retailers spend considerable time, effort, and money to acquire a new customer. Once a customer has been acquired, however, maximum value is derived only if that customer becomes a repeat buyer whose purchase amounts grow over time. Identifying which customers should qualify for a promotion is therefore key, and our study attempts to solve this problem. Our model predicts a customer's future shopping basket value. Based on this prediction, the top 10% of customers will be identified each week and the store will email promotional discount coupons to them. Model accuracy is another deciding factor in the overall business strategy: missing a potentially loyal customer could damage the associated long-term customer lifetime value. As the hypermarket, we are interested in predicting the average basket value of the next customer who walks in, based on his/her demographic data as well as purchase patterns prior to this visit.
B. Brief Description of the Data, Its Source, Key Characteristics, and Charts
The transaction data for ABC include information at the SKU level: customer ID, purchase date, extended price, quantity sold, item description, department, sub-department, class, and subclass. The data cover a period of 13 months, from Aug 2011 to Aug 2012. Customer demographic information has also been provided, including enrolment date, date of birth, sex, marital status, and customer ID.
Source: Data collected by Hansa Cequity Solutions
Key Characteristics: The available data were at the SKU and customer level, but we needed basket-level data for our analysis. To efficiently filter and sort the data for basket-level analysis, we used BASE SAS v9.1.
• Compiled data at the basket level for each customer: merged customer information with the transaction data to obtain data at the customer_id and transaction_date level
• Computed the lagged (i.e., up to the previous purchase) 'Average Basket Price' and 'Average Purchase Quantity' for each customer at the transaction level, using all purchases prior to the current transaction date
• Included four dummy variables in the data set: Sex, Married, Mobile Status, and Email Flag
Variables: Customer_No, Transaction_Date, Day_of_Week, Quantity_Sold, Extended_Price, Avg_Quantity_Sold_Lag, Avg_Ext_Price_Lag, Count_Footfall, Clean_Email_Flag, Total_qty_sold_tilldate, Total_cart_value_tilldate, Sex_M, Mobile_Status, Married, Enrollment_Age
Missing Values Handling:
At the summary level, approximately 4,000 rows had SEX and MARITAL status missing. This subset was removed from our analysis. This was done deliberately: we want a robust model only for those customers for whom all information has been captured (it will also prompt customers to provide their information next time in order to avail discounts).
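The project built this driver file in BASE SAS v9.1. As an illustrative sketch only (not the original code), the same basket-level roll-up and lagged averages could look like this in pandas; the input column names `Quantity` and `Extended_Price` are assumptions about the raw SKU-level feed:

```python
import pandas as pd

def prepare_driver_file(txn: pd.DataFrame) -> pd.DataFrame:
    """Roll SKU-level rows up to one row per customer visit (basket),
    then add lagged averages computed over all prior visits only."""
    baskets = (txn.groupby(["Customer_No", "Transaction_Date"], as_index=False)
                  .agg(Quantity_Sold=("Quantity", "sum"),
                       Extended_Price=("Extended_Price", "sum")))
    baskets = baskets.sort_values(["Customer_No", "Transaction_Date"])
    grp = baskets.groupby("Customer_No")
    # Expanding mean of *previous* baskets only: shift(1) before expanding,
    # so the current basket never leaks into its own lagged predictor.
    baskets["Avg_Ext_Price_Lag"] = grp["Extended_Price"].transform(
        lambda s: s.shift(1).expanding().mean())
    baskets["Avg_Quantity_Sold_Lag"] = grp["Quantity_Sold"].transform(
        lambda s: s.shift(1).expanding().mean())
    baskets["Count_Footfall"] = grp.cumcount()  # number of prior visits
    return baskets
```

The first visit of each customer gets a missing lagged value by construction, mirroring the "till previous purchase" definition above.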
C. High-Level Description (Prediction Methodology)
The following four methods were used to analyze the data and to create predictive models on the driver file (created above):

Multiple Linear Regression: Y = 2066.58 - 76.29*Day_of_Week + 0.5505*Avg_Ext_Price_Lag - 23.30*Count_Footfall + 16.93*AGE - 494.82*SEX_M + 100.08*ENROLLMENT_AGE + 159.20*MARRIED - 154.86*CLEAN_EMAIL_FLAG

Naïve Forecast (Benchmark): Y = 1736.7644 + 0.6333*Avg_Ext_Price_Lag

k-NN Model: Input variables: Day_of_Week, Avg_Ext_Price_Lag, Count_Footfall, AGE, SEX_M, ENROLLMENT_AGE, MARRIED (k = 20)

CART: Input variables: Day_of_Week, Avg_Ext_Price_Lag, Count_Footfall, AGE, SEX_M, ENROLLMENT_AGE, MARRIED (best pruned tree)

The Appendix shows the model workings and other technical details.
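As an illustration of how the two fitted equations above score a customer, here is a minimal sketch. The coefficients are copied from the report (the MARRIED coefficient is taken as 159.20 per the appendix regression table); the feature dictionary passed in is a made-up example, not a real customer record:

```python
# Fitted MLR coefficients from the report's regression output.
MLR_COEFS = {
    "Intercept": 2066.58, "Day_of_Week": -76.29, "Avg_Ext_Price_Lag": 0.5505,
    "Count_Footfall": -23.30, "AGE": 16.93, "SEX_M": -494.82,
    "ENROLLMENT_AGE": 100.08, "MARRIED": 159.20, "CLEAN_EMAIL_FLAG": -154.86,
}

def predict_mlr(features: dict) -> float:
    """Score one customer with the multiple linear regression equation.
    Any variable missing from `features` is treated as zero."""
    return MLR_COEFS["Intercept"] + sum(
        MLR_COEFS[name] * value for name, value in features.items())

def predict_naive(avg_ext_price_lag: float) -> float:
    """Benchmark: regress next basket value on lagged average price only."""
    return 1736.7644 + 0.6333 * avg_ext_price_lag
```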
[Figure: residual histograms for the above models]
D. Technical Summary
The lift chart and RMSE were used to evaluate the performance of the models above. [Figure: lift charts for all models, followed by an RMSE comparison]
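As a sketch of the two evaluation measures named here, RMSE and a simple decile-based lift table (mean actual basket value per predicted-value decile) can be computed as follows; this is illustrative code, not the tool output used in the study:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired actual/predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def decile_lift(actual, predicted, bins=10):
    """Mean actual value within each predicted-value decile, best decile
    first; a well-ranking model shows a steeply decreasing sequence."""
    ranked = [a for _, a in sorted(zip(predicted, actual), reverse=True)]
    size = max(1, len(ranked) // bins)
    return [sum(ranked[i:i + size]) / len(ranked[i:i + size])
            for i in range(0, size * bins, size)]
```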
E. Performance Metrics
The cost of offering a discount to a customer whom the model flagged incorrectly is negligible. The cost of failing to identify the right customer, however, can be very high: the hypermarket might lose that customer's average basket value in sales, less the discount that would have been offered. The underlying assumption in this evaluation is that by not offering the discount promotion to the 'correct' customer, we lose the opportunity to bring that customer back to the store. We also assume that on the next purchase this customer's basket size would approximately equal his/her average past purchase value (the Naïve forecast). This cost works out to roughly Rs. 500 per misclassified customer (based on weekly prediction vs. actual data for the misidentified customer set).
[Figure: final CART model (best pruned tree)]
Also, we cannot offer this promotion to every customer at the hypermarket: offering the promotion raises the probability of a re-visit tremendously, so the total discount cost would be much higher.
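The cost asymmetry argued above can be sketched as two small helper functions. This is a hedged illustration of the reasoning, not the study's calculation; the discount rate and redemption probability passed in are hypothetical parameters, not figures from the report:

```python
def miss_cost(avg_basket_value: float, discount_rate: float) -> float:
    """Opportunity cost of NOT couponing a customer who would have
    returned: the lost basket, net of the discount we would have given."""
    return avg_basket_value * (1.0 - discount_rate)

def false_alarm_cost(avg_basket_value: float, discount_rate: float,
                     redeem_prob: float) -> float:
    """Expected discount given away to a customer flagged in error;
    nothing is lost unless the coupon is actually redeemed."""
    return avg_basket_value * discount_rate * redeem_prob
```

With any plausible inputs, the miss cost dwarfs the false-alarm cost, which is why the model is tuned to avoid missing probable loyal customers.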
F. Limitations
Macro-economic factors have been ignored for the purpose of building this model. To improve the model's prediction quality, the hypermarket should add more predictor variables by collecting further information from customers. Demographic data might be misleading, as many households share one loyalty card that is used by multiple family members. All customers are given a loyalty card when they make a purchase at the hypermarket.
G. Recommendations
We recommend that the hypermarket adopt and deploy our CART model to predict the basket value of each customer's next purchase. Using this model, the hypermarket will be better able to identify the top 10 percent of customers, incentivize them to visit more frequently, and reward them for their loyalty. The hypermarket should run this model weekly, updating the misclassification cost and taking proactive action in estimating the probability of repeat purchase for all discounts offered. Tracking the actual increase in customer footfall will help the hypermarket evaluate whether the strategy is translating into real value. The model should be refreshed on a near real-time basis, since it relies extensively on customers' prior purchases as input variables.
Two ways to measure the success of our proposal:
• % increase in sales and % increase in re-purchase rate for existing customers. Prerequisite for this benchmark: the % increase in sales should be greater than 5%.
• % increase in profitability: the ideal measure, but since the data are not available we are not using this metric for evaluation.
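The weekly selection step recommended above (score everyone, then coupon the top 10% by predicted next-basket value) can be sketched as follows; `predictions` is a hypothetical mapping from customer ID to the model's predicted basket value:

```python
def top_decile(predictions: dict, frac: float = 0.10) -> list:
    """Return the customer IDs in the top `frac` of predicted basket
    value, highest first; these customers receive the weekly coupon."""
    n = max(1, int(len(predictions) * frac))
    ranked = sorted(predictions, key=predictions.get, reverse=True)
    return ranked[:n]
```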
II. APPENDIX
A. CART Model
Full Tree Rules (Using Training Data): 6 decision nodes, 7 terminal nodes

Level  NodeID  ParentID  SplitVar           SplitValue  Cases  LeftChild  RightChild  PredVal      Node Type
0      0       N/A       Avg_Ext_Price_Lag  5676.87     9999   1          2           3584.464788  Decision
1      1       0         Count_Footfall     0.5         8897   3          4           3094.471514  Decision
1      2       0         N/A                N/A         1102   N/A        N/A         7540.426824  Terminal
2      3       1         ENROLLMENT_AGE     2.5         4448   7          8           3797.778932  Decision
2      4       1         Avg_Ext_Price_Lag  2715.57     4449   5          6           2391.322178  Decision
3      5       4         Avg_Ext_Price_Lag  1006.0053   3153   11         12          1712.634843  Decision
3      6       4         N/A                N/A         1296   N/A        N/A         4042.480486  Terminal
3      7       3         AGE                32.8236     3357   9          10          3257.41286   Decision
3      8       3         N/A                N/A         1091   N/A        N/A         5460.48187   Terminal
4      9       7         N/A                N/A         1806   N/A        N/A         2685.129934  Terminal
4      10      7         N/A                N/A         1551   N/A        N/A         3923.784855  Terminal
4      11      5         N/A                N/A         1489   N/A        N/A         1134.395803  Terminal
4      12      5         N/A                N/A         1664   N/A        N/A         2230.061484  Terminal
Pruned Tree Rules (Using Validation Data): 4 decision nodes, 5 terminal nodes

Level  NodeID  ParentID  SplitVar           SplitValue  Cases  LeftChild  RightChild  PredVal      Node Type
0      0       N/A       Avg_Ext_Price_Lag  5676.87     21520  1          2           3584.464788  Decision
1      1       0         Count_Footfall     0.5         18575  3          4           3094.471514  Decision
1      2       0         N/A                N/A         2945   N/A        N/A         7540.426824  Terminal
2      3       1         ENROLLMENT_AGE     2.5         5735   7          8           3797.778932  Decision
2      4       1         Avg_Ext_Price_Lag  2715.57     12840  5          6           2391.322178  Decision
3      5       4         N/A                N/A         8518   N/A        N/A         1712.634843  Terminal
3      6       4         N/A                N/A         4322   N/A        N/A         4042.480486  Terminal
3      7       3         N/A                N/A         5390   N/A        N/A         3257.41286   Terminal
3      8       3         N/A                N/A         345    N/A        N/A         5460.48187   Terminal
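The pruned rules above can be applied directly as nested conditions to score a single record. This sketch assumes the left branch takes records with the split variable below the split value (the usual convention in this tool's output, but an assumption here):

```python
def predict_pruned_cart(x: dict) -> float:
    """Score one record with the pruned tree rules tabulated above;
    returns the predicted average basket value of the matched leaf."""
    if x["Avg_Ext_Price_Lag"] >= 5676.87:
        return 7540.426824            # node 2: heavy prior spenders
    if x["Count_Footfall"] < 0.5:     # node 3: first recorded visit
        if x["ENROLLMENT_AGE"] < 2.5:
            return 3257.41286         # node 7
        return 5460.48187             # node 8: long-enrolled customers
    # node 4: repeat visitors, split again on lagged average price
    if x["Avg_Ext_Price_Lag"] < 2715.57:
        return 1712.634843            # node 5
    return 4042.480486                # node 6
```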
Training Data Scoring: Summary Report (Using Full Tree)
Total sum of squared errors: 2.02691E+11
RMS error: 4502.342301
Average error: -1.02611E-07
Validation Data Scoring: Summary Report (Using Pruned Tree)
Total sum of squared errors: 3.31232E+11
RMS error: 3923.240319
Average error: -81.441941
Test Data Scoring: Summary Report (Using Pruned Tree)
Total sum of squared errors: 2.26221E+11
RMS error: 3971.011282
Average error: 207.933656
We checked multiple box plots to see how this model performs with respect to various predictors. One of them, predicted value vs. day of week, is shown here. [Figure: box plot of predicted value by day of week]
B. Multiple Regression Model
The Regression Model

Input variables     Coefficient   Std. Error   p-value     SS
Constant term       2066.584473   171.1040497  0           1.2436E+11
Day of Week         -76.2850494   17.81961441  0.00002235  152441504
Avg_Ext_Price_Lag   0.55048656    0.01074641   0           47134162944
Count_Footfall      -23.3044643   3.39011359   0           707978816
AGE                 16.93166161   3.94331098   0.00002114  539239808
SEX_M               -494.817383   91.99887085  0.00000012  542590080
ENROLLMENT_AGE      100.0760422   50.31545258  0.04725318  73395968
MARRIED             159.1952515   88.49066925  0.07262547  44065160
CLEAN_EMAIL_FLAG    -154.867523   93.62989807  0.09875221  43385896

Residual df: 9991
Multiple R-squared: 0.23708501
Std. Dev. estimate: 3982.249512
Residual SS: 1.5844E+11

Training Data Scoring: Summary Report
Total sum of squared errors: 1.5844E+11
RMS error: 3980.457149
Average error: -0.00011155

Validation Data Scoring: Summary Report
Total sum of squared errors: 5.79565E+11
RMS error: 4019.905343
Average error: 18.35766434

[Figure: box plot of predicted value by day of week]
C. KNN Model
Validation error log for different k
k    Training RMS Error    Validation RMS Error
1 24.44349947 5503.977377
2 24.44349947 4776.979308
3 24.44349947 4484.350744
4 24.44349947 4341.368449
5 24.44349947 4241.777322
6 24.44349947 4179.116395
7 24.44349947 4133.049977
8 24.44349947 4103.822495
9 24.44349947 4074.922732
10 24.44349947 4052.492928
11 24.44349947 4033.628783
12 24.44349947 4018.640439
13 24.44349947 4007.154024
14 24.44349947 3994.31554
15 24.44349947 3986.065351
16 24.44349947 3976.562105
17 24.44349947 3972.204232
18 24.44349947 3966.046008
19 24.44349947 3960.786713
20 24.44349947 3957.36427 <--- Best k
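The choice of k = 20 above follows from picking the k with the smallest validation RMS error in the log. A minimal sketch of that selection, using a few illustrative entries from the table:

```python
def best_k(validation_rmse: dict) -> int:
    """Given a mapping k -> validation RMS error, return the k that
    minimizes validation error (the model actually deployed)."""
    return min(validation_rmse, key=validation_rmse.get)
```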
Training Data Scoring: Summary Report (for k=20)
Total sum of squared errors: 5974249.178
RMS error: 24.44349947
Average error: 0
Validation Data Scoring: Summary Report (for k=20)
Total sum of squared errors: 3.37019E+11
RMS error: 3957.36427
Average error: 215.4206761
Test Data Scoring: Summary Report (for k=20)
Total sum of squared errors: 2.39481E+11
RMS error: 4085.737877
Average error: 685.1382098
As with the MLR model, the box plot of predicted value vs. day of week is shown below. [Figure: box plot of predicted value by day of week]
D. Naïve Model
The Regression Model

Input variables     Coefficient   Std. Error   p-value  SS
Constant term       1736.764404   54.34583664  0        84894556160
Avg_Ext_Price_Lag   0.63322592    0.01248606   0        43268575232

Training Data Scoring: Summary Report
Total sum of squared errors: 1.362E+11
RMS error: 4101.090313
Average error: 4.9201E-06

Validation Data Scoring: Summary Report
Total sum of squared errors: 93906645535
RMS error: 4170.91869
Average error: 1049.05211