
Customer Life Time Value Prediction Using Embeddings

Benjamin Paul Chamberlain, Department of Computing, Imperial College London, [email protected]

Ângelo Cardoso, ASOS.com, London, United Kingdom, [email protected]

C.H. Bryan Liu, ASOS.com, London, United Kingdom, [email protected]

Roberto Pagliari, ASOS.com, London, United Kingdom, [email protected]

Marc Peter Deisenroth, Department of Computing, Imperial College London, [email protected]

ABSTRACT
We describe the Customer Life Time Value (CLTV) prediction system deployed at ASOS.com, a global online fashion retailer. CLTV prediction is an important problem in e-commerce where an accurate estimate of future value allows retailers to effectively allocate marketing spend, identify and nurture high value customers and mitigate exposure to losses. The system at ASOS provides daily estimates of the future value of every customer and is one of the cornerstones of the personalised shopping experience. The state of the art in this domain uses large numbers of handcrafted features and ensemble regressors to forecast value, predict churn and evaluate customer loyalty. We describe our system, which adopts this approach, and our ongoing efforts to further improve it. Recently, domains including language, vision and speech have shown dramatic advances by replacing handcrafted features with features that are learned automatically from data. We show that learning feature representations is a promising extension to the state of the art in CLTV modelling. We propose a novel way to generate embeddings of customers, which addresses the issue of the ever changing product catalogue, and obtain a significant improvement over an exhaustive set of handcrafted features.

CCS CONCEPTS
• Computing methodologies → Supervised learning by regression; Dimensionality reduction and manifold learning; Classification and regression trees; Neural networks; Supervised learning by classification; • Applied computing → Forecasting; Marketing; E-commerce infrastructure; Online shopping; Consumer products;

KEYWORDS
Customer Lifetime Value, E-commerce, Random Forests, Neural Networks, Embeddings



1 INTRODUCTION
ASOS is a global e-commerce company, based in the UK and specialising in fashion and beauty. The business is entirely online, and products are sold through eight country-specific websites and mobile apps. At the time of writing, there were 12.5 million active customers and the product catalog contained more than 85,000 items. Products are shipped to 240 countries and territories and the annual revenue for 2016 was £1.4B, making ASOS one of Europe's largest pure play online retailers.

An integral element of the business model at ASOS is free delivery and returns. Free shipping is vital in online clothing retail because customers need to try on items without being charged. As ASOS do not recoup delivery costs for returned items, customers can easily have negative lifetime value. For this reason the Customer Lifetime Value (CLTV) problem is particularly important in online clothing retail.

Our CLTV system addresses two tightly coupled problems: CLTV and churn prediction. We define a customer as churned if they have not placed an order in the past year. We define CLTV as the sales, net of returns, of a customer over a one year period. The objective is to improve three key business metrics: (1) the average customer shopping frequency, (2) the average order size, and (3) the customer churn rate. The model supports the first two objectives by allowing ASOS to rapidly identify and nurture high value customers, who will go on to have high frequency, high order size, or both. The third objective is achieved by identifying customers at high risk of churn and controlling the amount spent on retention activities.

State-of-the-art CLTV systems use large numbers of handcrafted features and ensemble classifiers (typically random forest regressors), which have been shown to perform well in highly stochastic problems of this kind [11, 20]. However, handcrafted features introduce a human bottleneck, can be difficult to maintain and often fail to utilise the full richness of the data.



Figure 1: Training and prediction timescales for CLTV. The model is retrained every day using customer data from the past two years. Labels are the net customer spend over the previous year. Model parameters are learned in the training period and used to predict CLTV from new features in the live system.

We show how automatically learned features can be combined with handcrafted features to produce a model that is both domain knowledge aware and able to learn rich patterns of customer behaviour from raw data.

The deployed ASOS CLTV system uses the state-of-the-art architecture, which is a Random Forest (RF) regression model with 132 handcrafted features. To train the forest, labels and features are taken from disjoint periods of time, as shown in the top row of Figure 1. The labels are the net spend (sales minus returns) of each customer over the past year. The training labels and features are used to learn the parameters of the RF. The second row of Figure 1 shows the live system, where the RF parameters from the training period are applied to new features generated from the last year's data to produce a prediction of net customer spend over the next year.
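The windowing scheme in Figure 1 can be summarised with a short sketch. This is a minimal, illustrative example assuming a pandas orders table with customer_id, order_date and net_value columns and only two toy features; the deployed system uses 132 handcrafted features and runs on Spark rather than pandas.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def make_training_set(orders: pd.DataFrame, today: pd.Timestamp):
    """Build disjoint feature and label windows as in Figure 1 (illustrative)."""
    feature_start = today - pd.DateOffset(years=2)
    feature_end = today - pd.DateOffset(years=1)   # labels cover [feature_end, today)

    feat = orders[(orders.order_date >= feature_start) & (orders.order_date < feature_end)]
    lab = orders[(orders.order_date >= feature_end) & (orders.order_date < today)]

    # Two toy handcrafted features; the real feature set is far richer.
    X = feat.groupby("customer_id").agg(num_orders=("net_value", "size"),
                                        net_sales=("net_value", "sum"))
    # Label: net spend (sales minus returns) over the following year, zero if churned.
    y = lab.groupby("customer_id").net_value.sum().reindex(X.index, fill_value=0.0)
    return X, y

# X, y = make_training_set(orders, pd.Timestamp("2017-03-01"))
# model = RandomForestRegressor(n_estimators=500).fit(X, y)
```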

We provide a detailed explanation of the challenges and lessons learned in deploying the ASOS CLTV system and describe our latest efforts to improve this system by augmenting it with learned features. Inspired by recent successes of representation learning in the domains of vision, speech and language, we experimented with learning representations directly from the data using two different approaches: (1) by training a feedforward neural network on the handcrafted features in a supervised setting (see Section 4.2); (2) by learning unsupervised embeddings of customers using web and app browsing data and using these to augment our set of RF features (see Section 4.1). The novel customer level embeddings are shown to significantly improve CLTV prediction performance against our benchmark. Incorporating embeddings into long term prediction models is challenging because, unlike handcrafted features, the dimensions are not easy to correspond between training and prediction periods. We describe how this problem can be solved within the neural embedding framework.

Therefore, our main contributions are:

(1) A detailed description of the large-scale CLTV system deployed at ASOS, including a discussion of the architecture, deployment challenges and lessons learned.
(2) Analysis of two candidate architectures for hybrid systems that incorporate both handcrafted and learned features.
(3) Introduction of customer level embeddings, which we show produce significantly better performance against a benchmark classifier.
(4) A demonstration of how to use neural embeddings for long term prediction tasks.

2 RELATED WORK
Statistical models of customer purchasing behaviour have been studied for decades. Early models were hampered by a lack of data and were often restricted to fitting simple parametric statistical models such as the Dirichlet [15]. It was only at the turn of the century, with the advent of large-scale e-commerce platforms, that this problem began to attract the attention of machine learning researchers.

2.1 Distribution Fitting Approaches
The first statistical models of CLTV were known as "Buy 'Til You Die" (BTYD) models. BTYD models place separate parametric distributions on the customer lifetime and the purchase frequency and only require customer recency and frequency as inputs. One of the first models was the Pareto/NBD model [19], which assumes an exponentially distributed active duration and a Poisson distributed purchase frequency for each customer. This yields a Pareto-II customer lifetime and negative-binomial (NBD) distributed purchase frequency. Fader et al. [7] replaced the Pareto-II with a Beta-geometric distribution for easier implementation, assuming each customer has a certain probability of becoming inactive after every purchase. Bemmaor et al. [4] proposed the Gamma/Gompertz distribution to model the customer active duration, which is more flexible than the Pareto-II as it can have a non-zero mode and be skewed in both directions.

Recency-Frequency-Monetary Value (RFM) models expand on BTYD by including an additional feature. Fader et al. [8] linked the RFM model, which captures the time of last purchase (recency), number of purchases (frequency) and purchase values (monetary value) of each customer, to CLTV estimation. Their model assumes a Pareto/NBD model for recency and frequency, with purchase values following an independent gamma/gamma distribution.

While successful for the problems that they were applied to, it is difficult to incorporate the vast majority of the customer data available to modern e-commerce companies into the RFM/BTYD framework. It is particularly difficult to incorporate automatically learned or highly sparse features. This motivates the need for machine learning approaches to the problem.

2.2 Machine Learning Methods
The most closely related work to ours is that of Vanderveld et al. [20], which is the first work to explicitly include customer engagement features in a CLTV prediction model. They also address the challenges of learning the complex, non-linear CLTV distribution by solving several simpler sub-problems. Firstly, a binary classifier is run to identify customers with predicted CLTV greater than zero. Secondly, the customers predicted to shop again are split into five groups and independent regressors are trained for each group. To our knowledge, deep neural networks have not yet been successfully applied to the CLTV problem, but there is work on the


related churn problem (CLTV greater than zero) by Wangperawong et al. [22] for telecommunications. They create an array for each customer where the columns are days of the year and the rows describe different kinds of communication (text, call etc.). They used this to train deep convolutional neural networks. The model also used auto-encoders [21] to learn low dimensional representations of the customer arrays.

2.3 Neural Embeddings
Neural embeddings are a technique pioneered in Natural Language Processing (NLP) for learning distributed representations of words [14]. They provide an alternative to the one-of-k (or one-hot) representation. Unlike one-of-k, which uses large sparse vectors that are all equally similar, embeddings are compact, dense representations that encapsulate similarity. Embedded representations have been shown to give superior performance to sparse representations in several downstream tasks in language and graph analysis [14, 16]. All embedding methods define a context that groups objects. Typically, the data is a sequence and the context is a fixed length window that is passed along the sequence. The model learns that objects frequently occurring in the same context are similar and will have embeddings with high cosine similarity. One of the most popular embedding models is SkipGram with Negative Sampling (SGNS), which was developed by Mikolov et al. [14]. It has found a broad range of applications and we describe those most relevant to our work. Barkan and Koenigstein [3] used SGNS for item-level embeddings in item-based collaborative filtering, which they called item2vec. As a context the authors use a basket of items that were purchased together¹. Grbovic et al. [10] used SGNS to generate a set of product embeddings, which they called prod2vec, by mining receipts from email data. For each customer they build a sequence of products (with arbitrary ordering for those bought together) and then run a context window over it. The goal was to predict products that are co-purchased by a single customer within a given number of purchases. They also proposed a hierarchical extension called bagged-prod2vec that groups products appearing together in a single email. Baeza-Yates et al. [2] used a variant of SGNS to predict the next app a mobile phone user would open. The key idea is to consider sequences of app usage within mobile sessions. Their data had associated time stamps, which they used to modify the original model. Instead of including every pairwise set of apps within the context, the selection probability was controlled by a Gaussian sampling scheme based on the inter-open duration.

3 CUSTOMER LIFETIME VALUE MODEL
The ASOS CLTV model uses a rich set of features to predict the net spend of customers over the next 12 months by training a random forest regressor on historic data. One of the major challenges of predicting CLTV is the unusual distribution of the target variable. A large percentage of customers have a CLTV of zero. Of the customers with greater than zero CLTV, the values range over several orders of magnitude. To manage this problem we explicitly model CLTV percentiles using a random forest regressor.

¹ Items are synonymous with products, but the term item is used in the recommender systems literature.

Table 1: Feature importance (top)

Feature Name                              Importance
Number of orders                          0.206
Standard deviation of the order dates     0.115
Number of sessions in the last quarter    0.114
Country                                   0.064
Number of items from new collection       0.055
Number of items kept                      0.049
Net sales                                 0.039
Days between first and last session       0.039
Number of sessions                        0.035
Customer tenure                           0.033
Total number of items ordered             0.025
Days since last order                     0.021
Days since last session                   0.019
Standard deviation of the session dates   0.018
Orders in last quarter                    0.016
Age                                       0.014
Average date of order                     0.009
Total ordered value                       0.008
Number of products viewed                 0.007
Days since first order in last year       0.006
Average session date                      0.006
Number of sessions in previous quarter    0.005

Having predicted percentiles, the outputs are then mapped back to real value ranges for use in downstream tasks.
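As an illustration of the percentile trick, the sketch below converts raw net spend into percentile targets with pandas; variable names are placeholders and the production pipeline runs on Spark.

```python
import pandas as pd

# `net_spend`: one-year net spend per customer (a pandas Series, illustrative).
def to_percentile_targets(net_spend: pd.Series) -> pd.Series:
    # Rank customers by spend and rescale to [0, 100]; ties share a rank.
    return net_spend.rank(pct=True) * 100.0

# The forest is trained against these percentile targets; a separate mapping
# (a decision tree, see Section 3.4) converts predicted percentiles back to GBP.
```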

The remainder of this section describes the features used by the model, the architecture that allows the model to scale to over 10m customers, and our training and evaluation methodology.

3.1 Features
The model incorporates features from the full spectrum of customer information available at ASOS. There are four broad classes of data: (1) customer demographics, (2) purchase history, (3) returns history and (4) web and app session logs. Of these, by far the largest and richest are the session logs.

We apply random forest feature importance [9] to rank the 132 handcrafted features. Table 1 shows the top features. As expected, the number of orders, the number of sessions in the last quarter and the country of the customer were very important in CLTV prediction. However, we were surprised by the importance of the standard deviation of the order and session dates, particularly because the maximum spans of these variables were also features. We also did not expect purchasing from the latest collection to be one of the most predictive features.
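The ranking in Table 1 can be reproduced from any fitted forest's impurity-based importances; a small sketch, assuming a scikit-learn regressor and a list of the 132 feature names (both illustrative stand-ins for the Spark ML model):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rank_features(model: RandomForestRegressor, feature_names: list) -> pd.Series:
    """Return features sorted by impurity-based importance (as in Table 1)."""
    return pd.Series(model.feature_importances_,
                     index=feature_names).sort_values(ascending=False)

# rank_features(model, feature_names).head(20)
```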

3.2 Architecture
The high-level system architecture is shown in Figure 2. Raw customer data (see Section 3.1) is pre-processed in our data warehouse and stored in Microsoft Azure blob storage. From blob storage, data flows through two work streams to generate customer features: a handcrafted feature generator on Apache Spark and an experimental customer embedding generator on GPU machines running TensorFlow, which uses only the web/app sessions as input.


Figure 2: High-level overview of the CLTV system. The solid arrows represent the flow of data, and the dashed arrows represent interaction between stakeholders and systems/data. Customer data is collected and pre-processed by our data warehouse and stored on Azure blob storage. The processed data is used to generate handcrafted features in Spark clusters, with web/app sessions additionally used to produce experimental customer embeddings in TensorFlow. The handcrafted features and customer embeddings are then fed through the machine learning pipeline on Spark, which trains calibrated random forests for churn classification and CLTV regression. The resulting predictions are piped to operational systems.

The model is trained in two stages using an Apache Spark ML pipeline. The first stage pre-processes the features and trains random forests for churn classification and CLTV regression on percentiles. The second stage performs calibration and maps percentiles to real values. Finally, the predictions are presented to a range of business systems that trigger personalised engagement strategies with ASOS customers.
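A simplified PySpark sketch of the first stage is given below; the column names, tree counts and single assembler stage are illustrative rather than the production configuration.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.regression import RandomForestRegressor

# `feature_columns` and `training_df` are assumed to exist (placeholders).
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
churn_rf = RandomForestClassifier(featuresCol="features", labelCol="churned",
                                  probabilityCol="churn_prob", numTrees=100)
cltv_rf = RandomForestRegressor(featuresCol="features", labelCol="cltv_percentile",
                                predictionCol="predicted_percentile", numTrees=100)

# Stage 1: feature assembly plus the churn and CLTV forests.
stage_one = Pipeline(stages=[assembler, churn_rf, cltv_rf]).fit(training_df)
scored = stage_one.transform(training_df)
# Stage 2 (calibration and percentile-to-value mapping) is described in Section 3.4.
```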

3.3 Training and Evaluation Process
Our live system uses a calibrated random forest model with features from the past twelve months and is re-trained every day. We use twelve months because there are strong seasonality effects in retail, and failing to use an exact number of years would cause model fluctuations.

To train the model we use historic net sales over the last year as a proxy for CLTV labels and learn the forest parameters using features generated from a disjoint period prior to the label period, as illustrated in Figure 1. Therefore, every day we generate:

(1) A set of aggregate features and product view-based embeddings for each customer, based on their demographics, purchases, returns and web/app sessions between two years and one year ago.
(2) Corresponding target labels, including the churn status and net one year spend, for each customer, based on data from the last year.

The feature and label periods are disjoint to prevent information leakage. As the predictive accuracy of our live system can only be evaluated in a year's time, we establish our expectation of the model's performance by forecasting for points in the past for which we already know the actual values, as illustrated in Figure 1. We use the area under the Receiver Operating Characteristic (ROC) curve as a performance measure.

3.4 Calibration
In this context, calibration refers to our efforts to ensure that the model outputs, which are derived from the RF leaf distributions of the training set, are consistent with the realised statistics of both the customer churn rate and CLTV. For the churn problem, choosing RF classifier parameters that maximise the area under the ROC curve does not guarantee that the probabilities will be consistent with the realised churn rate [23]. To generate consistent probabilities we calibrate by learning a mapping between the estimates and the realised probabilities. This is done by training a one-dimensional logistic regression classifier to predict churn based only on the probabilities output by the random forest. The logistic regression output is interpreted as a calibrated probability. Similarly, for CLTV we have no guarantee that the regression estimates obtained by minimising the Root Mean Squared Error (RMSE) loss function will match the realised CLTV distribution. To address this problem, analogously to churn probability calibration, we first forecast the CLTV percentile and then map the predicted percentiles into monetary values. In this case the mapping is learnt using a decision tree. We observe two advantages in performing calibration: (1) the model becomes more robust to the existence of outliers, and (2) we get predictions which, when aggregated over a set of customers, better match the true values.
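A minimal sketch of both calibration steps, assuming held-out arrays of forest outputs and realised outcomes (all variable names are placeholders, and the tree depth is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor

# Churn: one-dimensional logistic regression from raw forest probabilities
# to calibrated probabilities.
churn_calibrator = LogisticRegression()
churn_calibrator.fit(rf_churn_probs.reshape(-1, 1), churned)        # churned in {0, 1}
calibrated = churn_calibrator.predict_proba(new_probs.reshape(-1, 1))[:, 1]

# CLTV: map predicted percentiles to monetary values with a shallow decision tree.
value_mapper = DecisionTreeRegressor(max_depth=4)
value_mapper.fit(predicted_percentiles.reshape(-1, 1), realised_spend)
predicted_value_gbp = value_mapper.predict(new_percentiles.reshape(-1, 1))
```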

3.5 Results
To find the optimal meta-parameters for the RFs we use cross-validation on a sample of the data. For CLTV predictions (see Figure 3) we obtain Spearman rank-order correlation coefficients of 0.56 (for all customers) and 0.46 (excluding customers with a CLTV of 0). We can see in Figure 3 that the range and density of the predicted CLTV matches the actual CLTV (excluding customers with an LTV of 0).



Figure 3: Predicted CLTV against actual CLTV (excluding customers with an actual CLTV of 0). Units (horizontal and vertical axes) are the average CLTV value in GBP. The distributions of the predicted and the actual CLTV are similar in log scale (top and right density plots). The central plot shows the fit between the predictions and the actual values, which have a Spearman rank-order correlation coefficient of 0.46.

For the churn predictions (see Figure 4) we obtain an area under the ROC curve of 0.798 and calibrated probabilities.

4 IMPROVING THE CLTV MODEL WITH FEATURE LEARNING

In the remainder of the paper we describe our ongoing efforts to supplement the handcrafted features in our deployed system with automatic feature learning. Feature learning is the process of learning features directly from data, so as to maximise the objective function of the classification or regression task. This technique is frequently used within the realms of deep learning [12] and dimensionality reduction [18] to overcome some of the limitations of engineered features. Despite being more difficult to interpret, learnt features avoid the resource-intensive task of constructing features manually from raw data and have been shown to outperform the best handcrafted features in the domains of speech, vision and natural language.

We experiment with two distinct approaches. Firstly, we learn unsupervised neural embeddings using customer product views and use these to supplement the feature set of the RF model. Secondly, we train a Deep Neural Network (DNN) on top of the handcrafted features to learn higher order feature representations.

[Figure 4 annotations: regions labelled "sure the customer is staying", "not sure either way" and "sure the customer is leaving"; calibrated probabilities reflect actual probabilities, e.g. if we take a large group of customers whose forecasted churn risk is 75%, then 75 out of every 100 of them will churn.]

Figure 4: Churn prediction density (horizontal axis) and match between predicted probabilities and actual probabilities (black line) vs. the optimal calibration (dashed grey line). The predicted probabilities match closely with the actual probabilities.

4.1 Embeddings of Customers Using Sessions
We learn embeddings of ASOS customers using NLP neural embedding models, replacing sequences of words with sequences of customers viewing a product (see Figure 6). Intuitively, high-value customers tend to browse products of higher value, less popular products and products that may not be at the lowest price on the market. By contrast, low-value customers will tend to appear together in product sequences during sale periods or for products that are priced below the market. This information is difficult to incorporate into the model using handcrafted features as the number of sequences of product views grows combinatorially.

Figure 5 shows the neural architecture of our customer embedding model. The model has two large weight matrices, W_in and W_out, that learn distributed representations of customers. The output of the model is W_in and, after training, each row of W_in is the embedded representation of a different customer. The inputs to the model are pairs of customers (C_in, C_out) and the loss function is the negative log probability of observing the output customer C_out given C_in:

E = -\log P(C_{\text{out}} \mid C_{\text{in}}) = -\log \frac{\exp\left(v'^{\top}_{\text{out}} v_{\text{in}}\right)}{\sum_{j=1}^{|C|} \exp\left(v'^{\top}_{j} v_{\text{in}}\right)}, \qquad (1)

where v'_j denotes the j-th row of W_out, v'_out denotes the row of W_out that corresponds to the customer C_out, v_in denotes the row of W_in that corresponds to the customer C_in, and |C| is the total number of customers.

The output softmax layer in Figure 5 has a unit for every customer, which must be evaluated for each training pair. This would be prohibitively expensive to do for 10 million customers. However, it can be approximated using SkipGram with Negative Sampling (SGNS), by only evaluating a small random selection of the total customers at each training step [14].


Figure 5: Neural network architecture and matrix representation of the network's input/output weights (embedding representation) in the Skipgram model on customer embeddings. The Skipgram model uses a neural network with one hidden layer (of linear activation units). Customers are represented by one-hot vectors in the input and output layers, and the weights between the input (output) and hidden layers are represented by a randomly initialised customer input (output) embedding matrix W_in (W_out). Each row of the matrix represents a customer embedding in R^n, with lighter cells representing a more negative value and darker cells representing a more positive value.

Applying SGNS requires three key design decisions:
(1) How to define a context.
(2) How to generate pairs of customers from within the context.
(3) How to generate negative samples.

In NLP, the context is usually a sliding window over word sequences within a document. The word at the centre of the window is the input word and (word_in, word_out) pairs are formed with every other word in the context. The negative samples are drawn from a modified unigram distribution of the training corpus.²

Figure 6 shows how we define a context and generate customer pairs. Each product in the product catalog is associated with a sequence of customer views. A sliding context window is then passed over the sequence of customers. For every position of the context window, the central customer is used as C_in and all other customers in the window are used to form (C_in, C_out) pairs.

² Modified means that word frequencies are raised to a power before normalisation (usually 0.75).

Figure 6: Customer pair generation in the skip-gram model based on product views. Each row represents a product sold on ASOS and the sequence of customer views of that product. In this example the context window is of length two and considers only adjacent view events (each pair in this example) of the same product. Hence, the exact time a product is viewed is ignored. Customers who often appear in the same context window will be close to each other in the embedding representation.

In this way, a window of length three containing (C1, C2, C3) would generate customer pairs (C2, C1) and (C2, C3). We empirically found that a window of length 11 worked well.
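A sketch of the pair generation of Figure 6, together with a negative sampler; the 0.75 exponent follows the usual SGNS choice for words (footnote 2), and applying it to customer view counts is our assumption here.

```python
import numpy as np
from collections import Counter

def generate_pairs(view_sequences, window=11):
    """Yield (C_in, C_out) pairs: for each per-product sequence of customer views,
    the centre customer of a sliding window is paired with every other customer
    in that window (Figure 6)."""
    half = window // 2
    for seq in view_sequences:                       # one sequence per product
        for i, c_in in enumerate(seq):
            for j in range(max(0, i - half), min(len(seq), i + half + 1)):
                if j != i:
                    yield c_in, seq[j]

def make_negative_sampler(view_sequences, power=0.75):
    """Sample negative customers from view counts raised to a power before normalising."""
    counts = Counter(c for seq in view_sequences for c in seq)
    customers = np.array(list(counts.keys()))
    probs = np.array([counts[c] for c in customers], dtype=float) ** power
    probs /= probs.sum()
    return lambda k: np.random.choice(customers, size=k, p=probs)
```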

The algorithm begins by randomly initialising W_in and W_out. Then, for each customer pair (C_in, C_out) with embedded representations (v_in, v'_out), k negative customer samples C_neg are drawn. After the forward pass, k + 1 rows of W_out are updated via gradient descent, using backpropagation:

v'^{\text{new}}_{j} =
\begin{cases}
v'^{\text{old}}_{j} - \eta \left(\sigma\left(v'^{\top}_{j} v_{\text{in}}\right) - t_j\right) v_{\text{in}} & \forall j : C_j \in C_{\text{out}} \cup C_{\text{neg}} \\
v'^{\text{old}}_{j} & \text{otherwise},
\end{cases} \qquad (2)

where η is the update rate, σ is the logistic sigmoid, and t_j = 1 if C_j = C_out and 0 otherwise.

Finally, only one row of W_in, corresponding to v_in, is updated according to:

v^{\text{new}}_{\text{in}} = v^{\text{old}}_{\text{in}} - \eta \sum_{j : C_j \in C_{\text{out}} \cup C_{\text{neg}}} \left(\sigma\left(v'^{\top}_{j} v_{\text{in}}\right) - t_j\right) v'_{j}. \qquad (3)
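The two update rules translate directly into code. The sketch below performs one SGNS step for a single (C_in, C_out) pair with k negative samples, using customer row indices into W_in and W_out; the learning rate is illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(W_in, W_out, c_in, c_out, negatives, eta=0.025):
    """One gradient step implementing Equations (2) and (3)."""
    v_in = W_in[c_in].copy()
    grad_in = np.zeros_like(v_in)
    for j, t_j in [(c_out, 1.0)] + [(n, 0.0) for n in negatives]:
        v_j = W_out[j].copy()
        g = sigmoid(v_j @ v_in) - t_j
        W_out[j] = v_j - eta * g * v_in       # Equation (2): update k + 1 rows of W_out
        grad_in += g * v_j                    # accumulate using the pre-update v'_j
    W_in[c_in] = v_in - eta * grad_in         # Equation (3): update one row of W_in
```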

Figure 7 shows that we obtained a significant uplift in AUC using embeddings of customers. We experimented with a range of embedding dimensions and found the best performance to be in the range of 32-128 dimensions. This result shows that this approach is highly relevant, and we are working to incorporate the technique into our live system at the time of writing this paper.

4.1.1 Embeddings for the Live System. To use embeddings in the deployed system it is necessary to make a correspondence between the embedding dimensions in the training period and those in the live system's feature generation period.



Figure 7: Uplift in AUC achieved on random test sets of 20,000 customers with product view-based embeddings, against the number of neurons in the hidden layer of the Skipgram model. The error bars represent the 95% confidence interval of the sample mean.

Figure 1 shows that the features for training the CLTV model and the features used for the live system come from disjoint time periods. As the embedding dimensions are unlabelled, randomly initialised and exchangeable in the SGNS loss function (see Equation (1)), the parameters learned in the training period cannot be assumed to match the embeddings used in the live system. To solve this problem, instead of randomly initialising W_in and W_out, we perform the following initialisation:

• Customers that were present in the training period: initialise with their training embeddings.
• New customers: initialise to uniform random values with absolute values that are small compared to the training embeddings.

In the live system there are four types of (C_in, C_out) customer pairs: (C_new, C_new), (C_new, C_old), (C_old, C_new) and (C_old, C_old). Equation (3) shows that the update to v_in is a linear combination of v'_out and the negative output vectors. Therefore a single update of a (C_new, C_old) pair is guaranteed to be a linear combination of embedding vectors from the training period. To generate embeddings that are consistent across the two time periods, we order the data in each training epoch as: (1) (C_old, C_old) to update the representation of the old customers, (2) (C_new, C_old) to initialise the new customers with linear combinations of embeddings from old customers, (3) (C_old, C_new), (4) (C_new, C_new).
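A sketch of this warm-start initialisation and epoch ordering, where the matrices are NumPy arrays and the scale of the random initialisation is an illustrative choice:

```python
import numpy as np

def initialise_live_embeddings(old_W_in, old_index, live_customers, dim, scale=0.01):
    """Customers seen in training keep their trained vectors; new customers get
    small random vectors (W_out is initialised analogously)."""
    W_in = np.random.uniform(-scale, scale, size=(len(live_customers), dim))
    for row, customer in enumerate(live_customers):
        if customer in old_index:                 # old_index: customer -> training row
            W_in[row] = old_W_in[old_index[customer]]
    return W_in

def order_pairs_for_epoch(pairs, old_customers):
    """Order pairs as (old, old), (new, old), (old, new), (new, new)."""
    rank = {(False, False): 0, (True, False): 1, (False, True): 2, (True, True): 3}
    key = lambda p: rank[(p[0] not in old_customers, p[1] not in old_customers)]
    return sorted(pairs, key=key)
```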

4.2 Embeddings of Handcrafted Features
We also investigate to what extent our deployed system could be improved by replacing the RF with a Deep Neural Network (DNN). This is motivated by the recent successes of DNNs in vision, speech recognition and recommendation systems [6, 13]. Our results indicate that while incorporating DNNs into our system may improve the performance, the monetary cost of training the model outweighs the performance benefits.

We limit our experiments with DNNs to the churn problem (predicting a CLTV of zero for the next 12 months). This reduces the CLTV regression problem to a binary classification problem, with more interpretable predictions and performance metrics.

We experiment with (1) deep feed-forward neural networks and (2) hybrid models combining logistic regression and a deep feed-forward neural network, similar to that used in [5]. The deep feed-forward neural networks accept all continuous-valued features and dense embeddings of categorical features, described in Section 3.1, as input. We use Rectified Linear Unit (ReLU) activations in the hidden units and a sigmoid activation in the output unit. The hybrid models are logistic regression models incorporating a deep feed-forward neural network: the output of the neural network's final hidden layer is used as input alongside all continuous-valued features and sparse categorical features. This is akin to the skip connections in neural networks described in [17], with the inputs connected directly to the output instead of the next hidden layer. Training of the neural network part of the models is done via mini-batch stochastic gradient descent, with the change of weights backpropagated to all layers of the network.
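The following Keras-style sketch shows one way to express the hybrid arrangement; it is not the deployed code, the categorical-feature embeddings are omitted for brevity, and the layer sizes, optimiser and input split are illustrative assumptions.

```python
import tensorflow as tf

def build_hybrid_model(num_dense_features, num_sparse_features, hidden=(50, 25)):
    """Deep feed-forward network over dense features whose last hidden layer is
    concatenated with the raw inputs and fed to a single logistic output."""
    dense_in = tf.keras.Input(shape=(num_dense_features,), name="dense_features")
    sparse_in = tf.keras.Input(shape=(num_sparse_features,), name="sparse_features")

    x = dense_in
    for units in hidden:
        x = tf.keras.layers.Dense(units, activation="relu")(x)

    # Skip-connection-style concatenation of the deep output with the raw inputs.
    combined = tf.keras.layers.Concatenate()([x, dense_in, sparse_in])
    churn_prob = tf.keras.layers.Dense(1, activation="sigmoid")(combined)

    model = tf.keras.Model(inputs=[dense_in, sparse_in], outputs=churn_prob)
    model.compile(optimizer="sgd", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model
```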

We evaluate the performance and scalability of the two models with different architectures and compare them with other machine learning techniques. We experimented with neural networks with two, three and four hidden layers, each with different combinations of numbers of neurons. For each architecture, we train the models using a subset of customer data (the training set), and record (1) the maximum ROC AUC achieved when evaluating on a separate subset of customer data (the test set), and (2) the (wall clock) time taken to complete a pre-specified number of training steps. We repeat the training/testing multiple times for each architecture to obtain an estimate of the maximum AUC achieved and the training time. All training/testing is done using the TensorFlow library [1] on a Tesla K80 GPU machine.

Introducing bypass connections in the hybrid models improves the predictive performance compared to a deep feed-forward neural network with the same architecture. Figure 8 shows a comparison of the maximum AUC achieved by DNNs and the hybrid logistic regression and DNN models on a test set of 50,000 customers. Our experiments show a statistically significant uplift (at least 1.4 × 10⁻³ in every configuration of neurons that we experimented with). We believe the uplift is due to the hybrid models' ability to memorise the relationship between a set of customer attributes and their churn status in the logistic regression part. This is complementary to a deep neural network's ability to generalise customers based on other customers who have similar aggregation features, as described by Cheng et al. [5].

We tried to estimate the size (in number of neurons) of the hybrid model required to outperform the AUC of the RF model. Figure 9 shows that there is a linear relationship between the maximum AUC achieved on the same evaluation set of customers and the number of neurons in each hidden layer on a logarithmic scale. We notice that a hybrid model with a small number of neurons in each hidden layer already gives a statistically significant improvement in maximum AUC compared with a vanilla LR, but within the range of our experiments we could not exceed the performance of the RF model.



Figure 8: Maximum AUC achieved on a test set of 50,000 customers by deep feed-forward neural networks and hybrid models with different numbers of hidden-layer neurons. The error bars represent the 95% confidence interval of the sample mean. The numbers of hidden-layer neurons are recorded in the following format: [x, y] denotes a neural network with x and y neurons in the first and second hidden layers respectively; [x, y, z] denotes a neural network with x, y, z neurons in the first, second and third hidden layers.

The shaded regions and dashed lines in Figure 9 provide three estimates of the number of neurons that would be required to outperform the RF model.

While the experiments suggest it is possible for our hybrid model, which incorporates a deep neural network, to outperform our calibrated RF model in churn classification, we believe the monetary cost required to perform such training outweighs the performance gain. Figure 10 shows the relationship between the monetary cost of training the hybrid models and the number of neurons in each hidden layer. The costs of training a vanilla LR model and our calibrated RF model are indicated by horizontal lines. The cost rises exponentially with the number of neurons, indicating that a hybrid model that outperformed our calibrated RF model would not be practical on cost grounds.

We believe the case is similar for CLTV prediction, in which a hybrid model on handcrafted features (whether as numerical values or categorical feature embeddings) can achieve better performance than our currently deployed random forest model, though at a much higher cost that is not commercially viable. This is supported in principle by our preliminary experiments, which measure the root mean squared error (RMSE) between the hybrid models' predicted percentile and the actual percentile of each customer's spend. We observe that increasing the number of neurons in hybrid models decreases the RMSE, but we are unable to train hybrid models with tens of thousands of neurons due to prohibitive runtimes.

5 DISCUSSION AND CONCLUSIONS
We have described the CLTV system deployed at ASOS and the main issues we faced while building it.


Figure 9: Maximum Area Under the Curve (AUC) achieved on an evaluation set of 50,000 customers by hybrid models, against the number of neurons in the hidden layers (in log scale). We only consider hybrid models with two hidden layers, each having the same number of neurons. The error bars represent the 95% confidence interval of the sample mean. The bottom (green) and top (red) horizontal lines represent the maximum AUC achieved by a vanilla logistic regression model (LR) and our random forest model (RF) on the same set of customers. The dashed lines in the shaded region represent different forecasted scenarios for larger architectures.

In the first half of the paper we described our baseline architecture, which achieves state-of-the-art performance for this problem, and discussed an important issue that is often overlooked in the literature: model calibration. Given the recent success of representation learning across a wide range of domains, in the second half of the paper we focused on our ongoing efforts to further improve the model by learning two additional types of representations: training a feedforward neural network on the handcrafted features in a supervised setting (Section 4.2), and learning an embedding of customers from session data in an unsupervised setting to augment our set of RF features (Section 4.1). We showed that learning an embedding of a rich source of data (products viewed by a customer) in an unsupervised setting can improve performance over using only handcrafted features, and we plan to incorporate this embedding into the live system as well as apply the same approach to other types of events (e.g. products bought by a customer). The main alternative to the two described ways of learning representations would be to use a deep network to learn end-to-end from raw data sources, as opposed to using handcrafted features as inputs. We are starting to explore this approach which, while extremely challenging, we believe might also provide large improvements over the state of the art.

Figure 10: Mean monetary cost to train hybrid models on a training set of 100,000 customers against the number of neurons in the hidden layers (both in log scale). The training cost shown is relative to the cost of training our random forest (RF) model. Here we only consider hybrid models with two hidden layers, each having the same number of neurons. The bottom (green) and top (red) horizontal lines represent the mean cost to train a vanilla logistic regression model (LR) and our RF model on the same set of customers. The cost shown is based on the time required to train each of the models and the cost of using the computational resources in Microsoft Azure: Spark clusters to train RF models, and GPU VMs to train LR and hybrid models. The vertical dash-dotted (grey) line represents the estimated number of neurons in each layer required for a two-hidden-layer hybrid model to outperform our random forest model.

REFERENCES
[1] Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Jon Shlens, Benoit Steiner, Ilya Sutskever, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Oriol Vinyals, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. http://download.tensorflow.org/paper/whitepaper2015.pdf

[2] Ricardo Baeza-Yates, Di Jiang, and Beverly Harrison. 2015. Predicting The Next App That You Are Going To Use. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. 285–294. DOI: http://dx.doi.org/10.1145/2684822.2685302
[3] Oren Barkan and Noam Koenigstein. 2016. Item2Vec: Neural Item Embedding for Collaborative Filtering. arXiv preprint arXiv:1603.04259 (2016), 1–8.
[4] Albert C. Bemmaor and Nicolas Glady. 2012. Modeling Purchasing Behavior with Sudden "Death": A Flexible Customer Lifetime Model. Management Science 58, 5 (2012), 1012–1021. DOI: http://dx.doi.org/10.1287/mnsc.1110.1461
[5] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. arXiv preprint (2016), 1–4. DOI: http://dx.doi.org/10.1145/2988450.2988454
[6] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16). ACM, New York, NY, USA, 191–198. DOI: http://dx.doi.org/10.1145/2959100.2959190
[7] Peter S. Fader, Bruce G. S. Hardie, and Ka Lok Lee. 2005. "Counting Your Customers" the Easy Way: An Alternative to the Pareto/NBD Model. Marketing Science 24, 2 (2005), 275–284. DOI: http://dx.doi.org/10.1287/mksc.1040.0098
[8] Peter S. Fader, Bruce G. S. Hardie, and Ka Lok Lee. 2005. RFM and CLV: Using Iso-Value Curves for Customer Base Analysis. Journal of Marketing Research XLII, November (2005), 415–430. DOI: http://dx.doi.org/10.1509/jmkr.2005.42.4.415
[9] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics, Springer, Berlin.
[10] Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-commerce in Your Inbox: Product Recommendations at Scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1809–1818. DOI: http://dx.doi.org/10.1145/2783258.2788627
[11] B. Hardie, N. Lin, V. Kumar, S. Gupta, N. Ravishanker, S. Sriram, D. Hanssens, and W. Kahn. 2006. Modeling Customer Lifetime Value. Journal of Service Research 9, 2 (2006), 139–155. DOI: http://dx.doi.org/10.1177/1094670506293810
[12] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 675–678.
[13] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep Learning. Nature 521, 7553 (2015), 436–444.
[14] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems (NIPS). 1–9.
[15] Donald G. Morrison and David C. Schmittlein. 1988. Generalizing the NBD Model for Customer Purchases: What Are the Implications and Is It Worth the Effort? Journal of Business & Economic Statistics 6, 1 (1988), 129–145. DOI: http://dx.doi.org/10.1080/07350015.1988.10509648
[16] Bryan Perozzi and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In SIGKDD (2014). 701–710. DOI: http://dx.doi.org/10.1145/2623330.2623732
[17] Tapani Raiko, Harri Valpola, and Yann LeCun. 2012. Deep Learning Made Easier by Linear Transformations in Perceptrons. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS). JMLR, 924–932.
[18] Sam T. Roweis and Lawrence K. Saul. 2000. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290, 5500 (2000), 2323–2326.
[19] David C. Schmittlein, Donald G. Morrison, and Richard Colombo. 1987. Counting Your Customers: Who Are They and What Will They Do Next? Management Science 33, 1 (1987), 1–24. DOI: http://dx.doi.org/10.1287/mnsc.33.1.1
[20] Ali Vanderveld, Addhyan Pandey, Angela Han, and Rajesh Parekh. 2016. An Engagement-Based Customer Lifetime Value System for E-commerce. (2016).
[21] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of the 25th International Conference on Machine Learning. ACM, 1096–1103.
[22] Artit Wangperawong, Cyrille Brun, and Rujikorn Pavasuthipaisit. 2016. Churn Analysis Using Deep Convolutional Neural Networks and Autoencoders. arXiv preprint (2016), 1–6.
[23] Bianca Zadrozny and Charles Elkan. 2001. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. In ICML, Vol. 1. Citeseer, 609–616.

