DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Machine Learning Based Prediction and Classification for Uplift Modeling

LOVISA BÖRTHAS
JESSICA KRANGE SJÖLANDER

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES



Machine Learning Based Prediction and Classification for Uplift Modeling

LOVISA BÖRTHAS

JESSICA KRANGE SJÖLANDER

Degree Projects in Mathematical Statistics (30 ECTS credits)

Degree Programme in Applied and Computational Mathematics (120 credits)

KTH Royal Institute of Technology year 2020

Supervisor at KTH: Tatjana Pavlenko

Examiner at KTH: Tatjana Pavlenko


TRITA-SCI-GRU 2020:002

MAT-E 2020:02

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

The desire to model the true gain from targeting an individual for marketing purposes has led to the common use of uplift modeling. Uplift modeling requires the existence of a treatment group as well as a control group, and the objective hence becomes estimating the difference between the success probabilities in the two groups. Statistical machine learning methods are efficient tools for estimating the probabilities in uplift models. In this project the uplift modeling approaches Subtraction of Two Models, Modeling Uplift Directly and the Class Variable Transformation are investigated. The statistical machine learning methods applied are Random Forests and Neural Networks, along with the standard method Logistic Regression. The data is collected from a well-established retail company, and the purpose of the project is thus to investigate which uplift modeling approach and statistical machine learning method yield the best performance given the data used in this project. The variable selection step was shown to be a crucial component in the modeling process, as was the amount of control data in each data set. For the uplift modeling to be successful, the method of choice should be either Modeling Uplift Directly using Random Forests, or the Class Variable Transformation using Logistic Regression. Neural network based approaches are sensitive to uneven class distributions and were hence not able to produce stable models given the data used in this project. Furthermore, the Subtraction of Two Models approach did not perform well because each model tended to focus too much on modeling the class in each data set separately instead of modeling the difference between the class probabilities. The conclusion is hence to use an approach that models the uplift directly, and to use a large amount of control data in the data sets.

Keywords

Uplift Modeling, Data Pre-Processing, Predictive Modeling, Random Forests, Ensemble Methods, Logistic Regression, Machine Learning, Multi-Layer Perceptron, Neural Networks.


Abstract

The need to model the true gain from targeted marketing has led to the commonly used method of uplift modeling (incremental response analysis). Performing this type of analysis requires the existence of a treatment group as well as a control group, and the goal is thus to compute the difference between the positive outcomes in the two groups. The probabilities of the positive outcomes in the two groups can be efficiently estimated with statistical machine learning methods. The uplift modeling approaches investigated in this project are subtraction of two models, modeling the uplift directly, and a class variable transformation. The statistical machine learning methods applied are random forests and neural networks, along with the standard method logistic regression. The data is collected from a well-established retail company, and the goal is thus to investigate which uplift modeling approach and machine learning method perform best given the data in this project. The most decisive aspects for obtaining a good result turned out to be the variable selection and the amount of control data in each data set. For a successful result, the method of choice should be random forests used to model the uplift directly, or logistic regression together with a class variable transformation. Neural network methods are sensitive to uneven class distributions and are hence not able to produce stable models given the data. Furthermore, subtraction of two models performed poorly because each model tended to focus too much on modeling the class in each data set separately, instead of modeling the difference between them. The conclusion is thus that an approach which models the uplift directly, together with a relatively large control group, is preferable for obtaining a stable result.


Acknowledgements

We would like to thank Mattias Andersson at Friends & Insights, who is the key person who made this project happen to begin with. A great thanks for introducing us to the uplift modeling technique, and for suggesting our thesis project to the CRM department at the retail company. We would also like to thank Elin Thiberg at the retail company, who supervised us when needed and gladly answered every question we had regarding the structure of the different data sets. Another person at the retail company who supported and guided us in the right direction was Sara Grünewald, and for that we are truly grateful. Last but not least, we would like to send a great thank you to our examiner and supervisor, Professor Tatjana Pavlenko, for providing professional advice and for guiding us during our meetings.


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose and Goal
      1.3.1 Ethics
  1.4 Data
  1.5 Methodology
  1.6 Delimitations and Challenges
  1.7 Outline

2 Theoretical Background and Related Work

3 Data
  3.1 Markets and Campaigns
  3.2 Variables

4 Methods and Theory
  4.1 Data Pre-Processing
      4.1.1 Data Cleaning
      4.1.2 Variable Selection and Dimension Reduction
      4.1.3 Binning of Variables
  4.2 Uplift Modeling
      4.2.1 Subtraction of Two Models
      4.2.2 Modeling Uplift Directly
      4.2.3 Class Variable Transformation
  4.3 Classification and Prediction
      4.3.1 Logistic Regression
      4.3.2 Random Forests
      4.3.3 Neural Networks
      4.3.4 Cross Validation
  4.4 Evaluation
      4.4.1 ROC Curve
      4.4.2 Qini Curve
  4.5 Programming Environment of Choice

5 Experiments and Results
  5.1 Data Pre-Processing
      5.1.1 Data Cleaning
  5.2 Uplift Modeling and Classification
      5.2.1 Random Forests
      5.2.2 Logistic Regression
      5.2.3 Neural Networks
      5.2.4 Cutoff for Classification of Customers

6 Conclusions
  6.1 Discussion
  6.2 Future Work
  6.3 Final Words

References


1 Introduction

This thesis begins with a general introduction to the area of the degree project, presented in the following subsections.

1.1 Background

In retail and marketing, predictive modeling is a common tool for targeting customers and evaluating the response from individuals when an action is taken. The action is normally referred to as a campaign or offer that is sent out to the customers, and the response to model is the likelihood that a specific customer will act on the offer.

Put differently, in traditional response models the objective is to predict the conditional class probability

P(Y = 1|X = x)

where the response Y ∈ {0, 1} reflects whether a customer responded positively to an action (i.e. made a purchase) or not (i.e. did not make a purchase). X = (X1, ..., Xp) denotes the quantitative and qualitative attributes of the customer, and x is one observation.

Using traditional response modeling, the resulting classifier can then be used to select which customers to target when sending out campaigns or offers for marketing purposes. In reality, this is not always the desirable approach, since the targeted customers are those who are most likely to react positively to the offer after the offer has been sent out. The solution is thus to use a second-order approach recognized as uplift modeling.

The original idea behind uplift modeling is to use two separate training and test sets: one training and test set containing a treatment group, and one training and test set containing a control group. The customers in the treatment group are subject to an action, whereas the customers in the control group are not. Uplift modeling thus aims at modeling the difference between the conditional class probabilities in the treatment and control groups, instead of just modeling one class probability:

P^T(Y = 1|X = x) − P^C(Y = 1|X = x)    (1)

where the superscript T denotes the treatment group and the superscript C denotes the control group. This method is called Subtraction of Two Models and is presented in Section 4.2.1. Each probability in (1) is estimated using the statistical machine learning methods presented in Section 1.5. If the result of (1) is negative, it indicates that the probability that a customer makes a purchase is larger when the customer belongs to the control group than when the customer belongs to the treatment group. This is called a negative effect, and it is very important to include it in the models in order to investigate how the campaigns affect the customers; see Section 4.4.2 for more details. There also exist other approaches to uplift modeling which model the uplift directly by using one data set instead of two. This data set includes both the treatment data and the control data, and is split into training and test sets. The methods for modeling the uplift directly either use a tree-based method, Section 4.2.2, or a class variable transformation, Section 4.2.3.
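As an illustration, the Subtraction of Two Models approach in (1) can be sketched as follows: fit one classifier on the treatment set, one on the control set, and subtract the predicted class probabilities. This is a minimal sketch on synthetic data, with logistic regression standing in for the learners of Section 1.5; it is not the implementation used in the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic treatment and control sets with a single customer attribute.
# The treatment group is given a stronger feature effect for illustration.
X_t = rng.normal(size=(200, 1))
y_t = (X_t[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_c = rng.normal(size=(200, 1))
y_c = (0.3 * X_c[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model_t = LogisticRegression().fit(X_t, y_t)  # estimates P^T(Y = 1 | x)
model_c = LogisticRegression().fit(X_c, y_c)  # estimates P^C(Y = 1 | x)

# Predicted uplift for new customers, i.e. equation (1).
X_new = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
uplift = (model_t.predict_proba(X_new)[:, 1]
          - model_c.predict_proba(X_new)[:, 1])
# A negative entry would indicate a negative effect of the campaign.
```

Since the two models are fitted independently, each can be accurate on its own group while their difference is a poor uplift estimate, which is the weakness of this approach discussed later in the thesis.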


Using the uplift modeling approach, the true gain from targeting an individual can be modeled. The purpose of using uplift modeling is hence to optimize customer targeting when applying it in the marketing domain.

1.2 Problem

There is a problem that arises when using uplift modeling, i.e. when using one treatment group and one control group: for every individual in the experiment, only one outcome can be observed. Either the individual belongs to the treatment group or to the control group; one individual can never belong to both. Put differently, it is not possible to know for sure that there is a causal connection, i.e. that a customer in the treatment group responds because of the treatment, since the same customer cannot be in the control group at the same time. Thus it is not possible to evaluate the decisions at the individual observational unit, as is possible in for example classification problems where the class of the individual is actually known. This in turn makes evaluating uplift models trickier.

Furthermore, uplift modeling has not yet been tested on the data used in this project, or on similar data belonging to the company that owns this data. Thus it is not clear whether it is even possible to apply the uplift modeling technique to this data and obtain applicable results.

The question to be answered is hence how to optimize customer targeting in the marketing domain by using the uplift modeling approach, while at the same time being able to model the true gain from targeting one specific individual. Furthermore, how should the uplift modeling technique be implemented to obtain the most applicable results given this kind of data?

1.3 Purpose and Goal

The purpose of this thesis is to present methods for optimizing customer targeting in marketing campaigns in the area of retail. The thesis presents investigations and discussions of different statistical machine learning methods that can be used when the aim is to estimate (1).

The goal of the degree project is to present the uplift modeling approach, in combination with the statistical machine learning method, that yields the best performance given the data used in this project. The result of the project should provide guidance towards which approach is best suited, and thus is the best method of choice, for analyses that fall into the same category as those in this project.

1.3.1 Ethics

Today there exists a lot of data on the internet that provides powerful tools when it comes to marketing and other predictions of behaviour and personality. Our goal with this project is to find the subgroup of customers that will respond best to retail campaigns. This task can look quite harmless on its own. However, in recent years it has been shown that when similar techniques are used in other circumstances, they can have serious consequences. For example [9], the company Cambridge Analytica used the behavioural data of millions of people from Facebook, without their permission, for the 2016 election in the USA. The data was then used to build models to find persuadable voters who could be manipulated through fake information from ads on Facebook without their knowledge. This is obviously a serious threat to democracy and a new, effective way of spreading propaganda.

The uplift modeling technique was also used for the Obama campaign in 2012 [20]. Using the technique for that campaign was considered acceptable at the time, since the data was not illegally collected. Also, the result from the models was used to choose which people to target with campaign commercials. Hence, it is important to question for what purposes it is ethically correct to use this technique. One also has to question whether the data that is used is acceptable to include in the models: is it legally collected, and would every person find it reasonable that their data is used for the purpose of the task?

The laws regarding personal data get stricter over time, which means that companies cannot use people's data any way they want. This makes it easier to draw boundaries around what kind of data can be used when applying the uplift modeling technique. Nevertheless, it is still important to always question the purpose and understand the power of the technique.

1.4 Data

The data is collected from the retail company's database and includes qualitative and quantitative attributes of the customers. The data describes, among other things, the behaviour of different customers in terms of how many purchases have been made in different time periods, how many returns have been made, as well as the different amounts that the customers have spent on online purchases and in stores. There is also one binary response variable that shows whether a customer made a purchase during a campaign period or not.

Each data set used in this project corresponds to one specific campaign, hence one customer can occur in several data sets. There is one variable that describes whether a customer belongs to the control group or the treatment group. Customers belonging to the control group are customers who did not receive any campaign offer, while customers belonging to the treatment group did receive the offer.
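Given this group indicator and the binary response variable, a simple baseline estimate of a campaign's effect, before any modeling, is the difference in observed purchase rates between the two groups. A minimal sketch with an illustrative record layout (the field names are hypothetical, not the company's actual schema):

```python
# Toy campaign data: each record holds the treatment indicator and the
# binary purchase response for one customer. Values are illustrative.
customers = [
    {"treated": True,  "purchased": 1},
    {"treated": True,  "purchased": 0},
    {"treated": True,  "purchased": 1},
    {"treated": False, "purchased": 0},
    {"treated": False, "purchased": 1},
    {"treated": False, "purchased": 0},
]

def response_rate(rows):
    # Fraction of customers in the group who made a purchase.
    return sum(r["purchased"] for r in rows) / len(rows)

treatment = [r for r in customers if r["treated"]]
control = [r for r in customers if not r["treated"]]

# Raw observed uplift for the campaign: P^T(Y=1) - P^C(Y=1).
observed_uplift = response_rate(treatment) - response_rate(control)
```

This aggregate number says nothing about which individual customers to target, which is why the conditional (per-customer) uplift models of Section 4.2 are needed.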

1.5 Methodology

The uplift modeling approach assumes the use of a statistical machine learning method to model predictions of the actions of individuals in the treatment group as well as individuals in the control group.

There are three overall approaches to uplift modeling. The first one is recognized as Subtraction of Two Models, i.e. using (1) as it is to model the difference between the class probabilities. The second approach is to model the uplift directly by using a conditional divergence measure as the splitting criterion in a tree-based method. The third approach is to use a Class Variable Transformation that allows for a conversion of an arbitrary probabilistic classification model into a model that predicts uplift directly. As there are advantages and disadvantages with each of the uplift modeling approaches, all of them will be examined in this project.

Furthermore, each uplift modeling approach requires the use of a suitable statistical machine learning method. For both the Class Variable Transformation and Subtraction of Two Models, it is possible to use almost any statistical machine learning method that can predict conditional class probabilities. Examples of such methods are Logistic Regression, Support Vector Machines, Multilayer Perceptrons (Neural Networks), tree-based methods and K-Nearest Neighbours. For the purpose of comparing model performance between a simple model and a more complex model, Logistic Regression and Multilayer Perceptrons will be used in these uplift modeling settings. Using a conditional divergence measure as a splitting criterion obviously requires a tree-based method. The method of choice for this approach in this project is thus the ensemble learning method Random Forests.
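To make the Class Variable Transformation concrete, one common formulation (detailed in Section 4.2.3) defines a new class variable Z that equals 1 for treated responders and for non-responding controls; with equally sized treatment and control groups, the uplift can then be recovered from a single classifier as 2·P(Z = 1|x) − 1. The snippet below checks this identity on a toy example with constant features; the data is illustrative, not from the thesis.

```python
def transform(y, treated):
    # Z = 1 when a treated customer responds, OR when a control
    # customer does not respond; Z = 0 otherwise.
    return [int(yi == 1) if t else int(yi == 0) for yi, t in zip(y, treated)]

# Toy example: 4 treated and 4 control customers (balanced groups).
y       = [1, 1, 0, 0,  1, 0, 0, 0]
treated = [True] * 4 + [False] * 4

z = transform(y, treated)

# With balanced groups the uplift is 2 * P(Z = 1) - 1.
p_z = sum(z) / len(z)
uplift = 2 * p_z - 1

# Direct estimate for comparison:
# P^T(Y=1) - P^C(Y=1) = 2/4 - 1/4 = 0.25, matching `uplift`.
```

In practice P(Z = 1|x) is estimated by any probabilistic classifier (here, Logistic Regression), which is what makes the transformation attractive: no specialized uplift learner is required.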

1.6 Delimitations and Challenges

The first note that needs to be made is that in order to use uplift modeling, there is a need for an existing treatment group and control group related to a certain campaign or offer¹. Not only must a control group exist, it also needs to be large enough for uplift modeling to be beneficial. The control group needs to be at least ten times larger than it needs to be when measuring simple incremental response. Also, when modeling binary outcomes, the treatment group and control group together need to be quite large.

An issue to take into consideration when performing uplift modeling is complex customer influence. If a customer interacts with a company in several ways, such as different marketing activities, advertisements, communications etc., it can be hard to isolate the effect of the specific marketing activity that is intended to be modeled, unlike when there are fewer kinds of interactions between the customer and the company.

Lastly, uplift modeling models the difference between two outcomes rather than just one outcome. Radcliffe et al. [15] point out that this leads to a higher sensitivity to overfitting the data. Thus, even methods that normally do not require variable selection and the like need it before any uplift modeling can be done.

1.7 Outline

In Section 2, the idea behind uplift modeling is explained along with some related work that has already been done in the area. Section 3 contains a detailed description of the data, i.e. statistics of the different campaigns that are used and what types of variables are collected in the different data sets. The variables are listed in a table where no variable is excluded, meaning that the table contains all the variables that are used before any kind of variable selection is made.

¹ Since this thesis uses uplift modeling applied to the area of retail, the action that has a related response which is modeled will be referred to as a campaign or offer throughout the whole thesis.

The description of the data is followed by Section 4, which contains all the theory related to this thesis. Here, a theoretical description of how to pre-process data is presented, along with some theory of variable selection. Furthermore, the three different approaches to uplift modeling, as well as the statistical machine learning methods used to perform uplift modeling, are described. The three uplift modeling approaches used in this thesis are Subtraction of Two Models, Modeling Uplift Directly and the Class Variable Transformation. The statistical machine learning methods used for uplift modeling are Logistic Regression, Random Forests and Neural Networks. Moreover, a description of the resampling method Cross Validation is presented. The evaluation metrics used in this project are also presented, namely Receiver Operating Characteristic curves and Qini curves. Finally, Section 4 ends with a description of the programming languages used for the different approaches and methods, and why these languages are well suited for this kind of problem.

In Section 5 all the experimental results are presented. Firstly, the results from the pre-processing of the data are presented. Secondly, each implementation is described, along with tables and figures of the results of the best performing models. The report ends with Section 6, in which the conclusions from the results are discussed.


2 Theoretical Background and RelatedWork

Machine learning is an area within computer science and statistics which often aims to, given some attributes, classify a specific instance into some category, or to estimate the conditional probability that it belongs to each of the classes. This technique can be used in many areas, one of them being marketing. In reality, though, this regular kind of classification technique is not really well suited for marketing. For instance, consider a marketing campaign where an offer is sent out to a (randomly) selected subgroup of potential customers. Using the results of the actions taken by the customers, a classifier can be built on top of them, and the resulting classifier is used to select which customers to send the campaign to. The result will be that the customers who are most likely to react positively to the offer after the campaign has been sent out will be targeted. This is not desirable for the marketer.

Some customers would have made a purchase whether or not they were targeted by the campaign, and thus unnecessary expenses are wasted when sending the offer to this kind of customer. Then there are customers who actually react in a negative way to getting a campaign offer. Some might find it disturbing to receive campaign offers from the company in question, or stop being a customer for some other reason just because they received the offer. When a customer stops doing business with a company it is called customer churn, which is something the company in question really wants to avoid. In other words, this is not a customer the marketer wants to target, since sending out the campaign is an unnecessary expense in this case and the company needlessly loses a customer. The first kind of customer just described is called a Sure Thing, and the second one is commonly referred to as a Do-Not-Disturb.

Then there are two more categories of customers, namely the Lost Cause and the Persuadable. As the name suggests, the Lost Cause is someone who would not make any purchase at all, whether targeted or not. Reaching out to this kind of customer is also a waste of money. The Persuadable, on the other hand, is the customer that the marketer wants to find and target. This kind of customer is a person who would not have made any purchase without the campaign offer, but who would make a purchase after receiving it. These are the customers the marketer can affect in a positive direction. An overview of the different types of customers can be seen in Table 2.1.

The solution to this kind of problem is called Uplift Modeling. The original idea behind uplift modeling is to use two separate training sets, namely one data set containing the control group and one containing the treatment group. The control group contains the customers who were not targeted by the campaign, and the treatment group contains the customers who received the campaign. Uplift modeling thus aims at modeling the difference between the conditional class probabilities in the control and treatment groups, instead of just modeling one class probability. Thus the true gain from targeting an individual can be modeled. A more detailed and theoretical description of uplift modeling can be found in Section 4.2. Uplift modeling is already applied frequently in the marketing domain according to [19], although it has not received as much attention in the literature as one might believe.


                            Response if not treated
                            Yes              No
Response      No            Do-Not-Disturb   Lost Cause
if treated    Yes           Sure Thing       Persuadable

Table 2.1: The four categories of individuals considered when applying the uplift modeling technique.

In an article about uplift modeling in direct marketing, Rzepakowski et al. [17] use decision trees to model the uplift for e-mail campaigns. The decision tree based models are also compared to simpler standard response based models, such that three uplift models and three standard response models are used in total. The data that is modeled reflects the customers of a retail company. The goal is thereby to classify customers as persuadable, where the response reflects whether or not they go to the retail company's website because of the campaign. The result of the study is that they find it possible, and more effective, to use uplift modeling rather than response models to predict the persuadables, i.e. which customers have a positive response to the campaigns. The standard response models were good at predicting whether a customer would go to the website or not, but performed very badly at predicting whether they responded to the campaign or not. Rzepakowski et al. also show that uplift modeling done with decision trees (Modeling Uplift Directly) yields a better result than using Subtraction of Two Models. This is hence the reason why this project will focus solely on comparing different approaches to uplift modeling, and not include traditional response or purchase models, since they have in many cases been proven to perform worse.

The same subject is discussed by Radcliffe et al. [15], who write about uplift modeling and why it performs better than other traditional response methods. They also discuss thoroughly many important aspects, such as evaluation of uplift models as well as variable selection for uplift modeling, which is very helpful for gaining deeper insights into the matter. They also write about Subtraction of Two Models and why this approach does not work well compared to Modeling Uplift Directly. Radcliffe et al. indicate that it is important to understand that just because the Subtraction of Two Models approach is capable of building two good separate models that perform well on unseen data, this does not necessarily yield a good uplift when taking the difference of the two models.


3 Data

The data used in all the statistical machine learning methods in this project is collected from a well-established retail company that has physical stores as well as a website where customers can place orders online.

The behaviour of different customers can vary a lot when it comes to how a purchase is made. Some customers only place orders online, while some might only shop in a physical store. Furthermore, there are customers who make purchases both online and in a store. The data used in the methods in this project considers all kinds of purchases (both in-store and online).

In the following subsections, the markets and campaigns used in this thesis will be presented along with a table of descriptions of all the variables.

3.1 Markets and Campaigns

The customer base can be segmented into different categories depending on the customers' purchase behaviour, and the company works actively with encouraging frequent customers to make more purchases. For the purpose of not losing a frequent customer to a silent stage, the data used in the methods in this thesis will only consider campaigns sent to frequent customers. The focus is on this category of customers since the wish is to use uplift modeling so that campaigns will mainly be sent to customers of the persuadable type. Also, the campaigns differ depending on the stage of the customer, and thus by focusing on the frequent customers there will be consistency in what kind of campaign is used in the methods of this project.

Today the company is present in more than 70 retail markets. One specific market is chosen for this project, and thus all the data used in the uplift models is generated from this market.

The campaigns in question are actual postcards sent to the customers' mailboxes, each valid for one online purchase only. All the campaigns contain an offer of a 10% discount on one purchase at the company's webshop. The only thing that differs between campaigns is the time period in which they were sent out.

In this thesis, six different campaigns that were sent out to customers in the chosen market will be considered. The campaigns and their start dates, along with other information, can be seen in Table 3.1.

3.2 Variables

Campaign   CampaignStartDate   AddressFileDate   N         T        C
1          2017-10-30          2017-09-25        181 221   97.24%   2.67%
2          2018-02-05          2018-01-18         82 828   90.34%   9.66%
3          2018-02-12          2018-01-18        155 096   90.34%   9.66%
4          2018-04-02          2018-03-05         62 121   90.34%   9.66%
5          2018-06-25          2018-05-31        310 607   90.34%   9.66%
6          2019-03-04          2019-02-18        207 071   90.34%   9.66%

Table 3.1: The six different campaigns used in the uplift models. AddressFileDate is the date the customers were chosen to be part of the campaign and N is the total number of customers in each data set. T and C are the percentages of customers that belong to the treatment group and the control group, respectively.

Following is a table of all the variables used in the data set before variable selection is made. Each row is related to a different customer and each column in the data set contains one of the variables. If a variable is not specified to concern online purchases only, then it concerns purchases made both online and in stores. Note that the data concerning customers who have made a purchase in a store only covers customers who are also club members (or staff), since the company is not able to collect data about store customers who do not have a membership and are not staff. Purchases made online, on the other hand, can concern members and non-members as well as staff.

All amounts are in EUR, at the most recent currency rate. The gross amount is the price a piece is set to cost before any kind of reduction or discount is made. If a reduction is made, i.e. if a specific item is on sale or has a new reduced price or equivalent, the new price is then the brutto amount. Discount implies that a customer has a personal discount that is used on a purchase, i.e. it is not the same as an overall reduction set on an item, but some kind of discount used by a specific customer. The final amount paid by the customer is then the net amount.

Variable: Description

group: 0 if the customer belongs to the control group, 1 if the treatment group.

gender: Gender of the customer, 1 if female, 0 if male, set to missing if unknown.

age: Age of the customer.

resp_dd_flag: Response flag which is 1 if the customer has made a purchase online within the response window², 0 otherwise.

resp_dd_pieces: Number of pieces ordered online in total during the response window.

resp_dd_price_net: Total net price on online orders during the response window.

Clubmember: Club membership, 1 if the customer was a member at the address file date³, 0 otherwise.

IsStaff: 1 if the customer is staff, 0 otherwise.

lastPurchaseDate: Date of the latest purchase in the observation period.

Has_Purch_i: 1 if the customer has made a purchase within the last i = 3, 12, 24⁴ months before the address file date, 0 otherwise.

Has_Purch_Child_i: 1 if the customer has made a purchase from children within the last i = 3, 12, 24 months before the address file date, 0 otherwise.

Has_Purch_Ladies_i: 1 if the customer has made a purchase from ladies within the last i = 3, 12, 24 months before the address file date, 0 otherwise.

Has_Purch_LadiesAcc_i: 1 if the customer has made a purchase from ladies accessories within the last i = 3, 12, 24 months before the address file date, 0 otherwise.

Has_Purch_Men_i: 1 if the customer has made a purchase from men within the last i = 3, 12, 24 months before the address file date, 0 otherwise.

orders_i: Number of orders the past i = 3, 12, 24 months before the address file date.

orders_red_or_dis_i: Number of orders with reduction or discount the past i = 3, 12, 24 months before the address file date.

orders_ret_i: Number of returned orders the past i = 3, 12, 24 months before the address file date.

share_red_or_dis_order_i: Share of orders with reduction or discount the past i = 3, 12, 24 months before the address file date.

share_ret_order_i: Share of orders with returned pieces the past i = 3, 12, 24 months before the address file date.

dd_pcs_i: Number of pieces in total the past i = 3, 12, 24 months before the address file date.

dd_net_amt_i: Net amount in total the past i = 3, 12, 24 months before the address file date.

dd_red_or_dis_pcs_i: Number of pieces with reduction or discount in total the past i = 3, 12, 24 months before the address file date.

dd_ret_pcs_i: Number of returned pieces the past i = 3, 12, 24 months before the address file date.

dd_ret_net_amt_i: Returned net amount the past i = 3, 12, 24 months before the address file date.

Table 3.2: Table of all the variables used in the data set, as well as the description of each variable. There are 54 variables in total in each data set.

² The response window is from the day the campaign started plus 14 days.
³ The address file date is the date the customer was chosen to be a part of the campaign.
⁴ i = 3, 12, 24 indicates that there are three different variables with the same kind of information, but for 3, 12 and 24 months.


4 Methods and Theory

Uplift modeling is a data mining/predictive modeling technique that directly models the incremental impact of a treatment on an individual's behaviour. This will be the underlying model for constructing the statistical machine learning methods. There is a great number of statistical machine learning methods that can be used for regression or classification. In this case, when using a statistical machine learning method with the purpose of applying it in an uplift modeling setting, suitable models are Logistic Regression, Random Forests and Multilayer Perceptrons (Neural Networks), as these perform binary classification.

The following sections hence include the theoretical background for data pre-processing, uplift modeling, classification and evaluation metrics. Last but not least, the programming environments of choice are presented along with some arguments for their compatibility with the data and the statistical machine learning methods used in this project.

In this project the input variables are denoted X_m ∈ {X_1, ..., X_p}, also called input "nodes" for Neural Networks, where p is the number of attributes in the data and m is an index corresponding to one variable. The response variable is denoted Y and a prediction is denoted Ŷ, which takes on values within [0, 1]. The values represent the probability that an observation belongs to a certain class (0 or 1). A vector with all the variables is defined as X = (X_1, ..., X_p), where an observation x_i is a column vector of p elements. Furthermore, a matrix with N observations and p variables is denoted with a bold letter X ∈ R^{N×p} and the response vector is denoted y = (y_1, ..., y_N). One observation of X is then the row vector x_i^T = (x_{i,1}, ..., x_{i,p}), where i = 1, ..., N.

4.1 Data Pre-Processing

The data produced nowadays is large in size and usually has a very high dimension. The data is also likely to include errors such as missing values and outliers. Pre-processing data is about removing and manipulating these values so that the data is a good representation of the desired objects. A part of the process may also include dimension reduction when needed. The management of data can be a very challenging task, since manual pre-processing of data takes a lot of time, see [8].

Moreover, the variables in the data can be in very different ranges and can have different amounts of impact on the prediction. When making a predictive analysis (and other analyses as well), it is of high importance to have a data set of good representation and quality in order to get an acceptable result. It is also important to choose the variables that are best associated with the response and not to let the dimension of the data become too large. To obtain this, data pre-processing is performed in several ways to form a good data representation.


4.1.1 Data Cleaning

Cleaning the raw data is a crucial step in order to obtain data representations of good quality. It is important to identify and remove incorrect and incomplete data and also, if needed, to replace and modify bad data points. In the following subsections, different ways to handle missing values and outliers are presented.

Missing Values

It is very common that some features in a data set have missing values, and thus it is of high importance to handle the missing data somehow. Deleting the columns or rows that have a missing value is one way to handle it, but depending on what kind of missing value it is, there exist other techniques that might be more suitable.

Overall, missing values can be divided into three categories according to [5], namely missing at random (MAR), missing completely at random (MCAR) and missing not at random (NMAR).

The missing data is MAR if, for example, respondents in a certain profession are less likely to report their income in a survey; the missing value thus depends on other variables than the one that is missing. If the data is said to be MCAR, then the missing value does not depend on the rest of the data. This can for example be the case if some questionnaires in a survey accidentally get deleted. If the missing data depends on the variable that is missing, the data is said to be NMAR. An example of this can be if respondents with high income are less likely to report their income in a survey. Having this kind of missing data causes the observed training data to give a corrupted picture of the true population, and imputation methods are under these conditions dangerous.

It is possible to use imputation methods both on data that is assumed to be MAR and on data assumed to be MCAR, although MCAR is a stronger assumption. Whether or not the data is MCAR often needs to be determined in the process of collecting the data.

As mentioned before, there are several ways to handle missing data, and the simplest one is to delete the observations that contain missing values. This method is usually called the listwise deletion method, and it is only workable if the proportion of deleted observations is small relative to the entire data set. Furthermore, it can only be used under the assumption that the missing values are MAR or MCAR.

On the other hand, if the amount of missing data is large compared to the entire data set, the method just mentioned is not good enough. In such cases it is possible to fill in an estimated value for each missing value by using a Single Imputation method such as Mean Imputation, which means that the missing value is replaced with the mean of all the completely recorded values for that variable. Another way to handle missing values is to use a more sophisticated algorithm such as the EM algorithm or Multiple Imputation. The latter fills in the missing values m > 1 times and thus creates m different data sets, which are analyzed separately; the m results are then combined to estimate the model parameters, standard errors and confidence intervals. Each time the values are imputed they are generated from a distribution that might be different for each missing value, see [6].
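As a concrete illustration, the two simplest strategies just mentioned, listwise deletion and mean imputation, can be sketched in a few lines of Python. This is a minimal sketch with made-up data, not the procedure used in this project (where SAS handles the imputation):

```python
import numpy as np

def listwise_deletion(X):
    """Drop every row that contains at least one missing value (NaN)."""
    return X[~np.isnan(X).any(axis=1)]

def mean_imputation(X):
    """Replace each NaN with the column mean of the observed values."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)          # means over non-missing entries
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
print(listwise_deletion(X))   # only the first row survives
print(mean_imputation(X))     # NaNs replaced by the column means 2.0 and 3.0
```

Note how listwise deletion discards two thirds of this toy data set, which is exactly why it is only workable when the proportion of incomplete observations is small.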


In this project, the statistical software suite SAS is used for all pre-processing of the data, and hence the existing procedure MI is used for handling some of the missing values. The MI procedure is a multiple imputation method that has a few different statements to choose from depending on what type of variables need to be imputed. The FCS statement is used in this project along with the imputation methods LOGISTIC (used for binary classification variables) and REG (used for continuous variables). FCS stands for Fully Conditional Specification, and the statement determines how variables with an arbitrary missing data pattern are imputed; LOGISTIC and REG are two of the available methods related to this statement.

The procedure yields m separately imputed data sets with appropriate variability across the m imputations. These imputed data sets then need to be analyzed using a standard SAS procedure, which in this project is the MIXED procedure, since it is valid for a mixture of binary and continuous variables. Once the analyses from the m imputed data sets are obtained, they are combined in the MIANALYZE procedure to derive valid inferences. The procedure is described in detail in [18].
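The combination step across the m imputed data sets follows Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance adds the between-imputation variance to the average within-imputation variance. A minimal Python sketch of this pooling logic (the coefficient and variance values below are hypothetical, purely for illustration):

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Combine results from m imputed data sets using Rubin's rules.

    estimates : length-m sequence of the point estimate from each imputed set
    variances : length-m sequence of the corresponding sampling variances
    Returns the pooled estimate and its total variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()                 # pooled point estimate
    w = variances.mean()                     # within-imputation variance
    b = estimates.var(ddof=1)                # between-imputation variance
    total_var = w + (1.0 + 1.0 / m) * b      # Rubin's total variance
    return q_bar, total_var

# e.g. a regression coefficient estimated on m = 3 imputed data sets
est, var = pool_estimates([0.52, 0.48, 0.50], [0.010, 0.012, 0.011])
```

The penalty term (1 + 1/m)·b is what propagates the uncertainty caused by the missing data into the final standard errors.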

Outliers

Another important step in the process of cleaning the data is the handling of outliers. An outlier is an observation located at an abnormal distance from the rest of the data, i.e. an observation that does not seem to fit the other data values. What counts as an abnormal distance can be decided by, for example, comparing with the mean or median of similar historical data sets, see [6]. The handling of outliers is not necessary for all kinds of statistical machine learning methods, as some methods are immune to the existence of outliers. In this project, the detection and handling of outliers is done for Logistic Regression and Neural Networks, as these methods are sensitive to predictor outliers. Decision trees are immune to outliers, and thus outlier detection is not done as a part of the data pre-processing step for Random Forests.

The method used for dealing with outliers in this project is Hidden Extrapolation, which can be used in multivariate regression cases. The idea is to define a convex set called the regressor variable hull (RVH). If an observation lies outside this set, it can be confirmed to be an outlier. In Figure 4.1 it can be seen, for a two-variable case, that the point (x01, x02) lies within the range of the variables X1 and X2 but not within the convex area. Hence, this observation is an outlier of the data set that is used to fit the model.

To determine the RVH, let us define the hat matrix

H = X(X^T X)^{-1} X^T    (2)

where X is the N × p matrix of the data set used to fit the model. The diagonal elements h_ii of the hat matrix can be used to determine whether an observation is an outlier or not. h_ii depends on the Euclidean distance between the observation x_i and the centroid, and also on the density of observations in the RVH. The diagonal value that lies on the boundary of the RVH is called h_max, and it is the largest of all the diagonal elements. If an observation x_i satisfies

x_i^T (X^T X)^{-1} x_i ≤ h_max    (3)

that observation lies within the ellipsoid consisting of all the observations in the RVH. For example, to determine whether an observation x_0 is an outlier or not, h_00 can simply be calculated and checked against h_max. If

h_00 = x_0^T (X^T X)^{-1} x_0 ≤ h_max

holds, the observation is not an outlier since it lies within the RVH, see [11].

Figure 4.1: A visualisation of the idea behind hidden extrapolation. The gray area is the ellipsoid that includes all observations of the RVH. The figure is taken from [11].
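The check in (3) is straightforward to implement. A minimal Python sketch on hypothetical data (illustrative only, not the project's SAS implementation):

```python
import numpy as np

def hidden_extrapolation_outliers(X, X_new):
    """Flag rows of X_new lying outside the ellipsoid enclosing the RVH of X.

    X     : N x p matrix of the data used to fit the model
    X_new : rows to test
    A point x0 is flagged when x0^T (X^T X)^{-1} x0 > h_max, where h_max is
    the largest diagonal element of the hat matrix H = X (X^T X)^{-1} X^T.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    # diagonal of the hat matrix, h_ii = x_i^T (X^T X)^{-1} x_i
    h_diag = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    h_max = h_diag.max()
    h_new = np.einsum('ij,jk,ik->i', X_new, XtX_inv, X_new)
    return h_new > h_max

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # hypothetical training data
inside = np.zeros((1, 2))              # the centroid, clearly inside the RVH
outside = np.array([[10.0, 10.0]])     # far outside the data cloud
```

Here `inside` passes the test in (3) while `outside` is flagged, matching the intuition in Figure 4.1.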

4.1.2 Variable Selection and Dimension Reduction

Selection of variables is an important part of the data pre-processing step. Statistical machine learning methods are used to find relationships between the response variable and the input variables in the form of a function Y = f(X) + ε, where ε is an error term. If there are too many variables compared to the amount of training data, it is hard for the model to find the underlying function, and the model becomes overfitted. If the final model only includes variables that are truly associated with the response, the model accuracy is improved by adding them. In reality this is usually not the case, since most variables are noisy and not completely associated with the response. Adding many noisy variables deteriorates the model, and it will as a consequence perform worse on unseen data.

Some statistical machine learning methods, like decision tree learners, perform variable selection as a part of the modeling process and are thus often not in need of separate variable selection. However, for statistical machine learning methods used in an uplift modeling setting, variable selection needs to be done, as the difference between two outcomes is modeled, and in many cases the uplift is small relative to the direct outcomes, which heavily increases the risk of overfitting the data according to [15].


Net Information Value

A common technique for variable selection when performing uplift modeling, i.e. estimating (1), is the Net Information Value, NIV, which is demonstrated in [15]. The method ranks the variables and is used for every method in this project.

The NIV is formed from the Weight of Evidence, WOE. Each continuous and categorical predictor is split into bins i, where i = 1, ..., G. G is the number of bins created for continuous predictors or the number of categories for categorical predictors; the predictors are thus turned into discrete predictors. For each bin i, the WOE is defined as

WOE_i = ln( P(X_m = i | Y = 1) / P(X_m = i | Y = 0) )

where Y ∈ {0, 1} is the label that tells whether a customer made a purchase or not, and X_m is one predictor from the vector X = (X_1, ..., X_p) with index m. Further, the Net Weight of Evidence NWOE_i is defined as

NWOE_i = WOE_i^T − WOE_i^C

where T again denotes the treatment group and C denotes the control group. Using NWOE_i, the NIV for each variable in the data set can be calculated as

NIV = Σ_{i=1}^{G} NWOE_i · ( P^T(X_m = i | Y = 1) · P^C(X_m = i | Y = 0) − P^T(X_m = i | Y = 0) · P^C(X_m = i | Y = 1) )

The uplift package [3] in R calculates the NIV in the following way:

Algorithm 1: Net Information Value in the uplift package [3].

1. Take B bootstrap samples and compute the NIV for each variable on each sample according to:

   NIV = 100 · Σ_{i=1}^{G} NWOE_i · ( P^T(X_m = i | Y = 1) · P^C(X_m = i | Y = 0) − P^T(X_m = i | Y = 0) · P^C(X_m = i | Y = 1) )

2. Compute the average of the NIV (μ_NIV) and the sample standard deviation of the NIV (σ_NIV) for each variable over all the B bootstrap samples.

3. The adjusted NIV for a given variable is computed by adding a penalty term to μ_NIV:

   NIV = μ_NIV − σ_NIV / √B

If a variable has a high NIV it can be considered a good predictor: the higher the NIV is for a variable, the better predictor it can be considered to be.
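As an illustration, the NIV formula can be computed for one already-binned predictor as follows. This is a simplified single-sample Python sketch, without the bootstrap averaging of Algorithm 1, and it is not the uplift package's implementation:

```python
import numpy as np

def woe(x, y, n_bins):
    """Weight of evidence per bin for one already-discretized predictor x.

    Returns WOE_i together with P(X=i|Y=1) and P(X=i|Y=0) for each bin i.
    Assumes every bin is observed for both classes (no zero probabilities).
    """
    p1 = np.array([np.mean(x[y == 1] == b) for b in range(n_bins)])
    p0 = np.array([np.mean(x[y == 0] == b) for b in range(n_bins)])
    return np.log(p1 / p0), p1, p0

def net_information_value(x_t, y_t, x_c, y_c, n_bins):
    """NIV for one predictor from treatment (t) and control (c) samples."""
    woe_t, p1_t, p0_t = woe(x_t, y_t, n_bins)
    woe_c, p1_c, p0_c = woe(x_c, y_c, n_bins)
    nwoe = woe_t - woe_c
    return 100.0 * np.sum(nwoe * (p1_t * p0_c - p0_t * p1_c))

# A sanity check: identical treatment and control samples give NIV = 0,
# since NWOE_i vanishes in every bin.
x = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y = np.array([1, 0, 1, 0, 1, 0, 0, 1])
print(net_information_value(x, y, x, y, 2))   # 0.0
```

A predictor whose class-conditional distribution shifts under treatment but not under control would instead yield a large NIV, which is exactly what the ranking exploits.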

Variable Selection using Random Forests

Random Forests performs variable selection as a part of the modeling process and can thus be used to evaluate the Variable Importance (VI) in a data set. Random Forests is an ensemble learning method that works by constructing multiple decision trees during training, and which outputs the most commonly occurring class among the different predictions (in classification settings) or the mean prediction (in regression settings). Using decision trees, one aims at creating a model that predicts the label/target using some input variables.

A decision tree consists of a tree structure with one root node which is split into two daughter nodes, where node m represents the corresponding region R_m. The process is then repeated for all the new regions.

The splitting is based on a splitting criterion on the input variables. Put differently, the variable chosen at each step is the one that splits the region in the best manner. Using the so-called Gini index in a classification tree, it is possible to get an overall summary of the VI, which is an output of the Random Forests algorithm and which shows the variables that have been chosen at each split. The Gini index is thus used to evaluate the quality of each split, and is defined in [5] in the following way, for each node m:

G_m = Σ_{k ≠ k′} p_mk · p_mk′ = Σ_{k=1}^{2} p_mk (1 − p_mk)

where p_mk is the proportion of the training observations from the kth class in the mth region. As the target only has two outcomes in this project, i.e. Y ∈ {0, 1}, there are only two classes k. The proportion p_mk is defined as

p_mk = (1 / N_m) Σ_{x_i ∈ R_m} I(y_i = k)

where y_i is one response observation and x_i is the vector corresponding to one observation in the region R_m. The node m represents a region R_m with N_m observations, and an observation in node m is classified according to the majority class in node m:

k(m) = argmax_k p_mk

A large VI value indicates that the variable is an important predictor, and it is thus possible to rank the variables accordingly when the VI is measured using Random Forests. In this project, this method is used separately to rank the variables according to the VI, and the best ranked variables are then used as input to the Random Forests method that performs uplift modeling.
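The Gini computations above can be illustrated directly. A minimal Python sketch with made-up data (the actual VI ranking in this project is produced by the Random Forests implementation itself):

```python
import numpy as np

def gini_index(y_region):
    """Gini index of a node for binary labels: sum_k p_mk (1 - p_mk)."""
    if len(y_region) == 0:
        return 0.0
    p1 = np.mean(y_region)              # proportion of class 1 in the node
    return 2.0 * p1 * (1.0 - p1)        # equals p0(1-p0) + p1(1-p1)

def gini_split_quality(x, y, threshold):
    """Weighted Gini of the two daughter nodes after splitting x at threshold.

    Lower is better, so the variable and threshold minimizing this value
    are chosen at each split.
    """
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini_index(left) + len(right) / n * gini_index(right)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
print(gini_index(y))                    # 0.5, a maximally impure node
print(gini_split_quality(x, y, 2.5))    # 0.0, a perfectly separating split
```

The VI of a variable can then be understood as the total Gini decrease it produces over all the splits where it is chosen, accumulated across the trees of the forest.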


Dimension Reduction using Principal Component Analysis (PCA)

Dimension reduction is made using Principal Component Analysis (PCA) for both Neural Networks and Logistic Regression on non-binary variables. PCA reduces the dimension of the data into principal components in the directions where the variance is maximized. The resulting components are orthogonal, i.e. they become mutually uncorrelated. The following theory is taken from [21].

Let the original data matrix be given by X with p variables and N observations, i.e. X ∈ R^{N×p}. A one-dimensional projection of the data, Xα with N elements, can be made using any unit-norm vector α ∈ R^{p×1}. The sample variance of that projection is given by equation (4), assuming the variables of X are centered and where x_1, ..., x_N ∈ R^{p×1} are the observations of X:

Var(Xα) = (1/N) Σ_{i=1}^{N} (x_i^T α)^2    (4)

The direction of maximum sample variance, also called a loading vector, is given by v_1 in equation (5), where (X^T X)/N is the sample covariance:

v_1 = argmax_{||α||_2 = 1} Var(Xα) = argmax_{||α||_2 = 1} α^T ((X^T X)/N) α    (5)

The loading vector v_1 is the eigenvector corresponding to the largest eigenvalue of the sample covariance, and it gives the first principal component z_1 = X v_1. The next principal component is generated by calculating another vector v_2 using (5) that is uncorrelated with v_1. This is repeated r times, which generates the following optimization problem, where the matrix V_r consists of all the optimal loading vectors:

V_r = argmax_{A : A^T A = I_r} trace(A^T X^T X A)    (6)

The matrix A consists of the unit-norm vectors α that optimize the problem, and "trace" is the sum of the diagonal elements of the resulting matrix A^T X^T X A. V_r also maximizes the total variance of the resulting components, even though the loading vectors are defined sequentially.

4.1.3 Binning of Variables

Binning of variables is the procedure of converting continuous variables into discrete variables. Usually, discretization of a continuous variable can cause the variable to lose some information. In this project, binning is nevertheless implemented for some variables, since there is an advantage in doing so for linearly dependent variables.

Since SAS is used for all the pre-processing of data in this project, the HPBIN procedure is used for the binning of some variables, see [18] for more details. This procedure simply creates a data set in which the binned variables get saved. The procedure has an option called NUMBIN, which is used to decide the number of bins, i.e. the number of categories that the variables are discretized into.

There exist several different binning methods, and in this project bucket binning is used. Bucket binning means that evenly spaced cut points are used in the binning process. For example, if the number of bins is 3 and the continuous variable is in the range [0, 1], the cut points are 0.33, 0.67 and 1. The resulting discrete variable then takes on values in the range [1, 3].
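Bucket binning with evenly spaced cut points over a variable's range can be sketched in a few lines of Python. This is an illustration of the technique, not the HPBIN procedure itself:

```python
import numpy as np

def bucket_bin(x, n_bins):
    """Discretize x using evenly spaced cut points over [min(x), max(x)].

    With n_bins = 3 on the range [0, 1], the interior cut points are 1/3 and
    2/3, and values are mapped to the bin numbers 1, 2, 3, as in the example
    in the text.
    """
    lo, hi = x.min(), x.max()
    edges = lo + (hi - lo) * np.arange(1, n_bins) / n_bins  # interior cut points
    return np.digitize(x, edges) + 1                        # bins numbered 1..n_bins

x = np.array([0.0, 0.2, 0.5, 0.9, 1.0])
print(bucket_bin(x, 3))   # [1 1 2 3 3]
```

Because the cut points ignore how the observations are distributed, a heavily skewed variable can end up with almost all observations in one bucket; this is the information loss mentioned above.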

4.2 Uplift Modeling

In this section, the uplift modeling problem will be formulated, and furthermore three common approaches to the uplift problem are discussed.

To distinguish between the treatment group and the control group, notation with the superscript T will denote quantities related to the treatment group, while notation with the superscript C will denote quantities related to the control group. As an example, the probabilities in the treatment group will be denoted P^T and, likewise, the probabilities in the control group will be denoted P^C. In addition, the notation M_U will denote the resulting uplift model.

The response variable takes on values Y ∈ {0, 1}, where 1 corresponds to a positive response to the treatment and 0 corresponds to a negative response. Put differently, 1 means that the individual has made a purchase while 0 means that the individual has not made a purchase. The input attributes are the same for both models, i.e. both for the model containing the treatment data and for the model containing the control data. The expected uplift is defined as the difference between the success probabilities in the treatment and control groups according to equation (1), i.e. the uplift is caused by taking the action conditional on X = (X_1, ..., X_p). If the result is negative, it indicates that the probability that a customer makes a purchase when belonging to the control group is larger than when the customer belongs to the treatment group. This is called a negative effect, and it is very important to include in the models in order to investigate how the campaigns affect the customers, see Section 4.4.2 for more details.

Whether uplift modeling is an instance of a classification or a regression problem is not fully clear, as it can be treated as both. Uplift modeling can be viewed as a regression task when the conditional net gain (1) is treated as a numerical quantity to be measured. It can also be viewed as a classification task, as the class to predict is whether a specific individual will respond positively to an action or not: if the expected uplift is greater than zero for a given individual, the action should be taken. However, as mentioned earlier, it is not possible to evaluate the correctness of an uplift model on an individual level, see [19]. For simplicity, uplift modeling will be referred to as a classifier throughout this thesis.


4.2.1 Subtraction of Two Models

When creating the algorithms for estimating equation (1) described in the introduction, there are three overall approaches that are commonly used. The first approach consists of building two separate classification models, one for the data in the treatment group, P^T, and one for the data in the control group, P^C. The uplift model of the Subtraction of Two Models approach can hence be defined as

M_U = P^T(Y = 1 | X = x) − P^C(Y = 1 | X = x)

which means that for each classified object, the class probability predicted by the model fitted on the control group data is subtracted from the class probability predicted by the model fitted on the treatment group data. This way, the difference in class probabilities caused by the treatment is estimated directly (demonstrated in [7]). The input X = (X_1, ..., X_p) is the same for both models but originates from two different data sets, which means that the model parameters in P^T will differ from the model parameters in P^C.

The advantage of this approach is that it can be applied using any classification model, and it makes the uplift easy to estimate. The disadvantage is that the approach does not always work well in practice, since the difference between two independently accurate models does not necessarily lead to an accurate model itself, see [4]. Put differently, the risk is that each model focuses too much on modeling the class in its own data set, instead of modeling the difference between the two class probabilities. Also, the variation in the difference between the class probabilities is usually much smaller than the variability in the class probabilities themselves, which in turn can lead to an even worse accuracy, see [17].

Despite the disadvantages just mentioned, there are some cases where this approach is competitive. According to Sołtys et al. [19], this can be either when the uplift is correlated with the class variable (e.g. when individuals who are likely to make a purchase are also likely to respond positively to an offer related to the purchase), or when the amount of training data is large enough to make a proper estimation of the conditional class probabilities in both groups.

Since this approach can be applied with any classification model, and for the purpose of having a simple approach to compare with when investigating more advanced approaches, it will be implemented using both Logistic Regression and Neural Networks. Logistic Regression is a linear statistical machine learning method that is easy to implement, while the more complex Multilayer Perceptron (MLP) is a class of feedforward artificial Neural Networks. By implementing both of these methods, it is possible to analyze whether the simpler method, Logistic Regression, performs better or worse than the more complex Neural Network.
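As a minimal illustration of the Subtraction of Two Models idea, the sketch below fits one classifier per group and subtracts the predicted class probabilities. The data is hypothetical and a simple conditional-frequency estimator stands in for Logistic Regression or an MLP; it is not the project's actual implementation.

```python
# Sketch of Subtraction of Two Models: fit one classifier per group and
# subtract predicted probabilities. A frequency-based "model" stands in for
# Logistic Regression / MLP; the (x, y) pairs below are hypothetical.

def fit_freq_model(data):
    """Estimate P(Y = 1 | X = x) by the empirical frequency for each x."""
    counts = {}
    for x, y in data:
        n, pos = counts.get(x, (0, 0))
        counts[x] = (n + 1, pos + y)
    return {x: pos / n for x, (n, pos) in counts.items()}

# x is a single categorical feature, y the binary response.
treatment = [("a", 1), ("a", 1), ("a", 0), ("b", 0), ("b", 1), ("b", 1), ("b", 1)]
control   = [("a", 1), ("a", 0), ("a", 0), ("b", 1), ("b", 0), ("b", 0), ("b", 1)]

p_t = fit_freq_model(treatment)   # model on treatment data, P^T
p_c = fit_freq_model(control)     # model on control data,   P^C

# M_U(x) = P^T(Y = 1 | X = x) - P^C(Y = 1 | X = x)
uplift = {x: p_t[x] - p_c[x] for x in p_t}
print(uplift)
```

The two models never see each other's data, which is exactly why the subtraction can be inaccurate even when each model is accurate on its own group.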

4.2.2 Modeling Uplift Directly

The second approach that is commonly used for uplift modeling is to model the uplift directly by modifying existing statistical machine learning algorithms, see [19]. The drawback of this approach is hence the need for modification, since the model of choice needs to be adapted to differentiate between samples belonging to the control and treatment groups. The advantage, on the other hand, is the possibility to optimize the estimation of the uplift directly.

Decision trees are well suited for modeling uplift directly because of the nature of the splitting criteria in the trees. A splitting criterion is used to select the tests in the nonleaf nodes of the tree. To maximize the differences between the class distributions in the control and treatment data sets, Rzepakowski et al. [16] propose that the splitting criterion should be based on conditional distribution divergences, which measure how two probability distributions differ. Put differently, using this approach, at each level of the tree the test is selected so that the divergence between the class distributions in the treatment group and the control group is maximized after a split has been made.

The divergence measure used for this project is the squared Euclidean distance. Given the probabilities P = {p_1, p_2} and Q = {q_1, q_2}, the divergence is defined as

E(P, Q) = \sum_{k=1}^{2} (p_k − q_k)^2

where k takes the values 1 and 2 since the classification in this project is binary, i.e. the response has two classes Y ∈ {0, 1}. In this case, p_1 and p_2 equal the treatment probabilities P^T(Y = 0) and P^T(Y = 1), while q_1 and q_2 equal the control probabilities P^C(Y = 0) and P^C(Y = 1).

For any divergence measure D, the proposed splitting criterion is defined in (7), and the node is split on the test yielding the largest value of D_gain:

D_gain = D_after_split(P^T(Y), P^C(Y)) − D_before_split(P^T(Y), P^C(Y))    (7)

where P^T and P^C are the class probabilities in the treatment and control groups before and after the split. The resulting divergence measure after a split has been made is defined as:

D_after_split(P^T(Y), P^C(Y)) = \sum_{a=a_1}^{a_2} (N_a / N) D(P^T(Y | a), P^C(Y | a))    (8)

where N is the number of observations before the split has been made, a ∈ {a_1, a_2} denotes the left and right leaf of that split, and N_a is the number of observations in each leaf after the split has been made. E.g. if the split is made on a binary variable A ∈ {0, 1}, the left leaf a_1 corresponds to A = 0 and the right leaf a_2 corresponds to A = 1 in (8).

This uplift modeling approach will be implemented using decision tree learners, which in this project are chosen to be the ensemble learning method Random Forests.
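The split criterion (7)–(8) with the squared Euclidean divergence can be sketched as below. The group counts for the candidate split are hypothetical; the function is a sketch of the criterion, not the project's implementation.

```python
# Sketch of the squared-Euclidean split criterion (7)-(8) for an uplift tree.
# All counts below are hypothetical (pos, total) pairs per group.

def sq_euclid(p, q):
    """E(P, Q) = sum_k (p_k - q_k)^2 over two class distributions."""
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q))

def class_dist(pos, n):
    """Return (P(Y = 0), P(Y = 1)) from a positive count and a total count."""
    return ((n - pos) / n, pos / n)

def divergence_gain(parent_t, parent_c, leaves_t, leaves_c):
    """D_gain = D_after_split - D_before_split, leaves weighted by N_a / N."""
    d_before = sq_euclid(class_dist(*parent_t), class_dist(*parent_c))
    n = parent_t[1] + parent_c[1]          # observations before the split
    d_after = 0.0
    for lt, lc in zip(leaves_t, leaves_c):
        n_a = lt[1] + lc[1]                # observations falling in leaf a
        d_after += (n_a / n) * sq_euclid(class_dist(*lt), class_dist(*lc))
    return d_after - d_before

# Hypothetical node: 100 treatment obs (30 positive), 100 control obs
# (20 positive); a candidate binary split sends half of each group per leaf.
gain = divergence_gain(
    parent_t=(30, 100), parent_c=(20, 100),
    leaves_t=[(25, 50), (5, 50)], leaves_c=[(10, 50), (10, 50)],
)
print(gain)
```

A split that separates treatment and control class distributions more strongly than the parent node gives a positive gain, and the candidate with the largest gain is chosen.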

4.2.3 Class Variable Transformation

The third approach, like the one described in Section 4.2.2, models the uplift directly. Jaskowski et al. [7] propose the introduction of a Class Variable Transformation, i.e. define Z ∈ {0, 1} such that

Z = 1 if Y = 1 and T,
Z = 1 if Y = 0 and C,
Z = 0 otherwise    (9)

where T denotes the treatment group data and C denotes the control group data. (9) allows for the conversion of an arbitrary probabilistic classification model into a model which predicts uplift. In other words, if the customer has made a purchase, i.e. Y = 1, and belongs to the treatment group, Z is set to 1. This kind of person is then either a sure thing or a persuadable, see Table 2.1. If the customer on the other hand has not made a purchase, i.e. Y = 0, and belongs to the control group, Z is also set to 1. The customer is then either a lost cause or a persuadable. In all other cases Z is set to 0, which means that all do-not-disturbs belong to this group, i.e. there is no risk of approaching the do-not-disturbs with a campaign.

Note that this approach does not exclusively target the persuadables, as would be the optimal thing to do. The reason is simply that one individual can never belong to both the treatment group and the control group, so only one outcome for that individual can be observed. Therefore, it is not possible to use the Class Variable Transformation to target the persuadables exclusively.

By assuming that T and C are independent of X = (X_1, ..., X_p), and that P(C) = P(T) = 1/2, Jaskowski et al. show that

P^T(Y = 1 | X = x) − P^C(Y = 1 | X = x) = 2P(Z = 1 | X = x) − 1

which means that modeling the conditional uplift of Y is the same as modeling the conditional distribution of Z (see [7] for more details). It is thereby possible to use (9) to combine the treatment and control training data sets, apply any standard classification method to the new data set, and thus obtain an uplift model for Y. Jaskowski et al. also show that the assumption P(C) = P(T) = 1/2 need not hold in practice: the training data sets can be rewritten so that the assumption becomes valid, and such a transformation does not affect the conditional class distributions. Put differently, this approach can still be beneficial in cases where the control and treatment groups are imbalanced.
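The transformation (9) and the conversion of P(Z = 1 | x) into an uplift can be sketched as follows, using a few hypothetical samples (any fitted classifier could supply the probability passed to the conversion):

```python
# Sketch of the Class Variable Transformation (9): build Z from (Y, group),
# then convert any estimate of P(Z = 1 | x) into an uplift via 2*P - 1.
# The samples are hypothetical; group is "T" (treatment) or "C" (control).

def transform(y, group):
    """Z = 1 if (Y = 1 and T) or (Y = 0 and C), else 0."""
    if group == "T":
        return 1 if y == 1 else 0
    return 1 if y == 0 else 0

samples = [(1, "T"), (0, "T"), (1, "C"), (0, "C")]
z = [transform(y, g) for y, g in samples]
print(z)  # -> [1, 0, 0, 1]

def uplift_from_z_prob(p_z1):
    """P^T(Y=1|x) - P^C(Y=1|x) = 2 P(Z=1|x) - 1, assuming P(T) = P(C) = 1/2."""
    return 2 * p_z1 - 1
```

Note that P(Z = 1 | x) = 0.5 maps to zero uplift, so the single combined classifier directly ranks individuals by estimated treatment effect.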

In this project, the campaigns are actual postcards rather than the phone calls that are widely used in, for example, the insurance or telecommunication business. Many uplift modeling approaches rely on the importance of not targeting the do-not-disturbs, but this group of individuals is most probably larger when customers are approached with actual phone calls than with advertisements sent out by, for example, email or text message. Hence in this project, the group of do-not-disturbs can be argued to be smaller than it might have been if the offer had instead been given via a phone call. Furthermore, recall from Section 3 that the share of observations belonging to the control group in each data set is relatively small compared to the share of observations in the treatment group. Considering these two facts, the transformation suggested in (9) will be slightly modified to fit this project.

The first modification is to exclude all the negative samples in the control group, i.e. when Y = 0 and C. Such a sample is, as mentioned earlier, either a lost cause or a persuadable. Since the control group is very small in relation to the treatment group, this modification is not expected to affect the persuadables in the uplift in a crucial manner. Furthermore, when introducing the modified Class Variable Transformation to the reduced data set, the focus will lie on targeting only the persuadables and sure things in the treatment group, namely in the following manner

Z = 1 if Y = 1 and T,
Z = 0 if Y = 0 and T,
Z = 0 if Y = 1 and C    (10)

where once again, T and C denote the treatment and control group data, respectively.

Put differently, the resulting classification model can be defined as

M_U = P_Z(X = x) = 2P(Z = 1 | X = x) − 1    (11)

where x is an observation that may come from either the treatment group or the control group, modified using the Z transformation (10). The probability P_Z can then be estimated with any classification method.

As for the uplift modeling approach Subtraction of Two Models (Section 4.2.1), the Class Variable Transformation will be implemented using both Logistic Regression and Neural Networks. The aim is thus to compare the two uplift modeling approaches, as well as the two different statistical machine learning methods, in order to conclude which uplift modeling approach and learning algorithm are best suited for this kind of problem.

4.3 Classification and Prediction

When the pre-processing step is done, the data is ready for training classification models. Statistical machine learning methods are used to find relationships between the response and the variables in the form of a function, i.e. Y = f(X) + ε where ε is some error term.

In the following subsections, the three statistical machine learning methods Logistic Regression, Random Forests and Neural Networks are described theoretically. These are the classification methods used for uplift modeling in this project. Logistic Regression is the simplest one, as it is linear and easy to implement. Random Forests is a type of ensemble classifier, and it is tested to see whether a more complex classifier yields a better result. Neural Networks have the ability to model even more complex decision boundaries, and are hence the most complex method that will be tested.

Logistic Regression and Neural Networks are thus used to estimate the probabilities from Sections 4.2.1 and 4.2.3, i.e. for Subtraction of Two Models:

M_U = P^T(Y = 1 | X = x) − P^C(Y = 1 | X = x)    (12)

and for the Class Variable Transformation:

M_U = P_Z(X = x) = 2P(Z = 1 | X = x) − 1    (13)


Random Forests is also used to estimate (12), but using the splitting criterion described in Section 4.2.2. M_U is hence the estimated uplift model.

4.3.1 Logistic Regression

Logistic Regression is a so-called generalized linear model and one of the most widely used classifiers. According to [21], when having a binary response as in this project, one typically uses Logistic Regression to estimate the conditional probability P(Y = 1 | X = x) = E[Y | X = x], where X = (X_1, ..., X_p). The linear logistic model models the log-odds:

log [ P(Y = 1 | X = x) / P(Y = 0 | X = x) ] = β_0 + β^T x    (14)

where β_0 ∈ R is the intercept term, β ∈ R^p is the vector of regression coefficients and x is one observation. After some manipulation of (14), the following expression for the conditional probability can be obtained:

P(Y = 1 | X = x) = e^{β_0 + β^T x} / (1 + e^{β_0 + β^T x})    (15)

The model is fit by maximizing the binomial log-likelihood of the data, which is equivalent to minimizing the negative log-likelihood. Minimization of the negative log-likelihood, with the addition of an ℓ1-penalty (regularization), takes the form:

min_{β_0, β} { −(1/N) L(β_0, β; y, X) + λ ||β||_1 }    (16)

where y is the response vector, X is the N × p data matrix of predictors and L is the log-likelihood. Put differently, given (14), the negative log-likelihood with ℓ1-penalty can be expressed in the following way:

−(1/N) \sum_{i=1}^{N} { y_i log P(Y = 1 | x_i) + (1 − y_i) log P(Y = 0 | x_i) } + λ ||β||_1 =    (17)

−(1/N) \sum_{i=1}^{N} { y_i (β_0 + β^T x_i) − log(1 + e^{β_0 + β^T x_i}) } + λ ||β||_1    (18)

where λ ≥ 0 is a complexity parameter that controls the amount of shrinkage, i.e. the regularization. The objective thus becomes finding the estimates of β_0 and β that minimize (18).
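The equality of the two forms (17) and (18) follows from log P(Y = 1 | x) = η − log(1 + e^η) and log P(Y = 0 | x) = −log(1 + e^η) with η = β_0 + β^T x. A small numerical sanity check (hypothetical data and coefficients) confirms it:

```python
# Numerical check (on hypothetical data) that the two forms of the
# l1-penalized negative log-likelihood, (17) and (18), agree.
import math

def nll_form17(b0, b, xs, ys, lam):
    s = 0.0
    for x, y in zip(xs, ys):
        eta = b0 + sum(bj * xj for bj, xj in zip(b, x))
        p1 = 1.0 / (1.0 + math.exp(-eta))        # P(Y = 1 | x), eq. (15)
        s += y * math.log(p1) + (1 - y) * math.log(1 - p1)
    return -s / len(ys) + lam * sum(abs(bj) for bj in b)

def nll_form18(b0, b, xs, ys, lam):
    s = 0.0
    for x, y in zip(xs, ys):
        eta = b0 + sum(bj * xj for bj, xj in zip(b, x))
        s += y * eta - math.log(1 + math.exp(eta))
    return -s / len(ys) + lam * sum(abs(bj) for bj in b)

xs = [(0.5, -1.2), (1.0, 0.3), (-0.7, 2.0)]      # hypothetical predictors
ys = [1, 0, 1]
v17 = nll_form17(0.1, [0.4, -0.2], xs, ys, lam=0.05)
v18 = nll_form18(0.1, [0.4, -0.2], xs, ys, lam=0.05)
print(v17, v18)
```

Note that the ℓ1 term only penalizes β, not the intercept β_0, matching (16).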

The addition of the ℓ1-penalty is a regularization technique called the Lasso, where the ℓ1 norm of a coefficient vector β is defined as ||β||_1 = \sum_{j=1}^{p} |β_j|. Using this regularization technique shrinks the coefficient estimates towards zero, and forces some of them to become exactly zero when λ is large enough. The optimal value of λ can be obtained with the use of Cross Validation (Section 4.3.4). The usage of the Lasso requires a standardization of the predictors so that they are all on the same scale, i.e. have mean 0 and standard deviation 1.

Hence, by using the Lasso, variable selection is performed and a sparse model can be obtained [21].

4.3.2 Random Forests

Random Forests is a statistical machine learning method that can perform both regression and classification. The following theory is taken from [5]. The method builds many decision trees whose predictions are averaged to obtain the final prediction. The technique of averaging a statistical machine learning model is called bagging, and it improves stability and counteracts overfitting. Normally, decision trees are not that competitive with the best supervised learning approaches in terms of prediction accuracy, since they tend to have high variance and low bias: building two decision trees on slightly different data can yield two very different trees. Bagging is therefore well suited for decision trees, since it reduces the variance.

The idea behind Random Forests is to draw B bootstrap samples from the training data set and then build a decision tree on each of the B training samples. The method is called Random Forests because it chooses a random subset of the input variables before every split when building each tree. By doing this, the trees become less correlated, which in turn lowers the overall variance even further. The algorithm for Random Forests, for both regression and classification, is given in Algorithm 2 (taken from [5]).

Algorithm 2 Random Forest
1. for b = 1 to B do
   (a) Draw a bootstrap sample Z* of size N from the training data with replacement.
   (b) Grow a random-forest tree T_b on the bootstrapped data Z*, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached:
       i. Select m variables at random from the p variables.
       ii. Pick the best variable/split-point among the m.
       iii. Split the node into two daughter nodes.
   end
2. Output the ensemble of trees {T_b}_{b=1}^{B}.
3. To make a prediction at a new point x:
   Regression: f_rf^B(x) = (1/B) \sum_{b=1}^{B} T_b(x)
   Classification: Let C_b(x) be the class prediction of the bth random-forest tree; then C_rf^B(x) = majority vote {C_b(x)}_{1}^{B}.

When a split is made, a random sample of m variables is chosen as split candidates from the p predictors. Typically m is set to approximately √p in classification settings, or p/3 in regression settings. The reason is that a small value of m reduces the variance when there are many correlated variables. In each split, only one of the m variables is used, and since m is small, not even a majority of the available variables are considered.

Guelman et al. [4] propose an algorithm for Random Forests for uplift modeling where the data from both the treatment and control groups are included in the training data. The uplift predictions of the individual trees are averaged, and thus an uplift estimate is obtained. There are two tuning parameters, namely the number of trees in the forest and the number of predictors in the random subset at each node. The proposed algorithm, which is a modification of Algorithm 2, is presented in Algorithm 3.

To grow a tree, recursive binary splitting is used, and thus a criterion is needed for making these binary splits. There are several different splitting criteria that are possible to use for this purpose. In the case of decision trees in an uplift modeling setting, Rzepakowski et al. [16] propose that the splitting criterion should be based on conditional distribution divergences, recall Section 4.2.2.

Algorithm 3 Random Forest for Uplift
1. for b = 1 to B do
   (a) Draw a bootstrap sample Z* of size N from the training data with replacement.
   (b) Grow an uplift decision tree UT_b on the bootstrapped data Z*, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached:
       i. Select m variables at random from the p variables.
       ii. Pick the best variable/split-point among the m. The split criterion should be based on a conditional divergence measure.
       iii. Split the node into two daughter nodes.
   end
2. Output the ensemble of uplift trees {UT_b}_{b=1}^{B}.

The predicted uplift for a new data point x is obtained by averaging the uplift predictions of the individual trees in the ensemble: f_uplift^B(x) = (1/B) \sum_{b=1}^{B} UT_b(x).
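The ensemble mechanics of Algorithm 3 (bootstrap sampling, a random subset of m ≈ √p split candidates, and averaged per-tree uplift predictions) can be sketched as below. The trees themselves are stood in for by hypothetical per-tree predictions, since growing the uplift trees is covered by the split criterion above.

```python
# Sketch of the ensemble steps in Algorithm 3: bootstrap sample of size N,
# random feature subset of size m ~ sqrt(p), averaged uplift predictions.
# The data sizes and per-tree predictions are hypothetical.
import math
import random

random.seed(0)
N, p = 8, 9
m = round(math.sqrt(p))                         # ~sqrt(p) split candidates

# 1(a): bootstrap sample of size N, drawn with replacement
bootstrap = [random.randrange(N) for _ in range(N)]

# 1(b)i: random subset of m of the p variables, considered at one split
split_candidates = random.sample(range(p), m)

# Prediction step: average the per-tree uplift predictions UT_b(x)
tree_predictions = [0.10, 0.02, -0.04, 0.08]    # hypothetical UT_b(x) values
uplift = sum(tree_predictions) / len(tree_predictions)
print(len(bootstrap), len(split_candidates), uplift)
```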

4.3.3 Neural Networks

Artificial Neural Networks (ANN) were first developed to mimic the human brain and are widely used for artificial intelligence tasks such as face recognition, speech recognition etc. [1] defines an ANN as a network of units receiving inputs, representing the network of neurons in a human brain. Neural Networks are simply nonlinear statistical models and work well for both regression and classification problems according to [5]. They are usually used when the data is high-dimensional with a large sample size and the modeling needs to be highly complex. This is therefore the most complex model implemented in this project.

The Neural Network method that is chosen is the Multilayer Perceptron (MLP). The architecture consists of input nodes, hidden layers of nodes and a layer of output nodes. An example of a two-layer MLP can be seen in Figure 4.2. The number of hidden layers and nodes determines the complexity of the model. The MLP performs a nonlinear mapping of the input to the output, with hidden layers in between, using activation functions. The activation functions can be linear functions, but are usually chosen to be the sigmoid function to obtain nonlinear modeling. The following is the mathematical description of a two-layer MLP, i.e. an MLP with one hidden layer, originally demonstrated in [1].

Figure 4.2: The architecture of the Multilayer Perceptron with three input nodes, one hidden layer with two nodes and an output layer with two nodes. The functions f and g are the activation functions.

Assume that the input consists of p nodes (X_m, m = 1, 2, ..., p), the hidden layer of t hidden nodes (Z_j, j = 1, 2, ..., t) and the output layer of s nodes (Y_k, k = 1, 2, ..., s). The weights of the connections between the input and the hidden layer are β_mj, and the weights of the connections between the hidden layer and the output layer are α_jk. The weights also have the bias terms β_0j and α_0k, respectively. Furthermore, suppose that the input and hidden nodes form the vectors X = (X_1, ..., X_p)^T and Z = (Z_1, ..., Z_t)^T. Moreover, the weights form the vectors β_j = (β_1j, ..., β_pj) and α_k = (α_1k, ..., α_tk). Then let U_j = β_0j + X^T β_j and V_k = α_0k + Z^T α_k [1]. Using this, the activation functions f_j(·) and g_k(·) can be introduced as follows:

Z_j = f_j(U_j),   j = 1, 2, ..., t

μ_k(X) = g_k(V_k) = g_k( α_0k + \sum_{j=1}^{t} α_jk f_j( β_0j + \sum_{m=1}^{p} β_mj X_m ) ),   k = 1, 2, ..., s

where the activation function f_j(·) belongs to the hidden layer and g_k(·) to the output layer [5]. The activation functions are usually chosen to be the sigmoid function σ(v) = 1/(1 + e^{−v}) (Figure 4.3), which works very well for classification problems.

The kth generated output node is then the prediction Ŷ_k, and the true output is the same plus an error term ε_k:

Ŷ_k = μ_k(X)

Y_k = μ_k(X) + ε_k
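The forward pass defined by these equations can be sketched directly. The weights and input below are hypothetical; sigmoid activations are used for both layers, as in the text.

```python
# Forward pass of a two-layer MLP (one hidden layer) in the notation above:
# U_j = beta_0j + X^T beta_j, Z_j = f(U_j), V_k = alpha_0k + Z^T alpha_k,
# mu_k = g(V_k), with sigmoid activations. Weights and input are hypothetical.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def mlp_forward(x, beta0, beta, alpha0, alpha):
    """x: p inputs; beta: t rows of p weights; alpha: s rows of t weights."""
    z = [sigmoid(b0 + sum(bm * xm for bm, xm in zip(bj, x)))
         for b0, bj in zip(beta0, beta)]                    # hidden layer Z_j
    return [sigmoid(a0 + sum(aj * zj for aj, zj in zip(ak, z)))
            for a0, ak in zip(alpha0, alpha)]               # outputs mu_k

x = [1.0, -2.0, 0.5]                                        # p = 3 input nodes
beta0, beta = [0.1, -0.3], [[0.2, 0.4, -0.1], [-0.5, 0.3, 0.8]]   # t = 2
alpha0, alpha = [0.05], [[1.5, -0.7]]                             # s = 1
mu = mlp_forward(x, beta0, beta, alpha0, alpha)
print(mu)
```

With sigmoid output activation, each μ_k lies in (0, 1) and can be read as a class probability for binary classification.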


Figure 4.3: An illustration of the sigmoid function that can be used as the activation function in the Multilayer Perceptron. The red curve is the sigmoid function σ(v) and the dashed curves are the σ(sv) functions, where s is a scale parameter that controls the activation rate. When s = 1/2 the appearance is like the blue curve, and when s = 10 the appearance is instead like the purple curve. The figure is taken from [5].

The MLP performs supervised learning using the backpropagation learning rule when updating the weights. Backpropagation is an iterative gradient-descent method that updates the weights in the direction in which the error sum of squares decreases. The mathematical background of backpropagation learning for a two-layer MLP is presented below; the theory can be found in [1].

The following is the error sum of squares over the output nodes for the ith observation, where K is the set of output nodes:

E_i = (1/2) \sum_{k∈K} (Y_{i,k} − Ŷ_{i,k})^2 = (1/2) \sum_{k∈K} e_{i,k}^2,   i = 1, 2, ..., n

The new term, e_{i,k} = Y_{i,k} − Ŷ_{i,k}, is the error signal at the kth output node. For binary classification problems, the output is only one node, i.e. Y_{i,k} = Y_i and Ŷ_{i,k} = Ŷ_i. The error sum of squares of the whole data set is then the average of all E_i:

ESS = (1/n) \sum_{i=1}^{n} E_i = (1/(2n)) \sum_{i=1}^{n} \sum_{k∈K} e_{i,k}^2

The algorithm then updates the weights in the direction in which the error is minimized. Concerning the weights α_{i,jk} between the hidden layer and the output layer, the updating formula is as follows:

α_{i+1,jk} = α_{i,jk} + Δα_{i,jk}

Δα_{i,jk} = −η ∂E_i / ∂α_{i,jk}

where η is the learning rate, which determines the size of each learning step. If η is too large, the algorithm might miss a local minimum of the error; on the other hand, if η is too small, the computing time becomes very large.

The following is the derivative of the error sum of squares, obtained with the chain rule, assuming that the activation function g_k(·) is differentiable:

∂E_i / ∂α_{i,jk} = (∂E_i / ∂e_{i,k}) · (∂e_{i,k} / ∂Ŷ_{i,k}) · (∂Ŷ_{i,k} / ∂V_{i,k}) · (∂V_{i,k} / ∂α_{i,jk})
                 = e_{i,k} · (−1) · μ'_k(X_i) · Z_{i,j}
                 = −e_{i,k} · g'_k(V_{i,k}) · Z_{i,j}
                 = −e_{i,k} · g'_k(α_{i,0k} + Z_i^T α_{i,k}) · Z_{i,j}

A similar updating formula holds for the weights β_{i,mj} between the input nodes and the hidden layer, namely:

β_{i+1,mj} = β_{i,mj} + Δβ_{i,mj}

Δβ_{i,mj} = −η ∂E_i / ∂β_{i,mj}

The derivative can again be obtained using the chain rule:

∂E_i / ∂β_{i,mj} = (∂E_i / ∂Z_{i,j}) · (∂Z_{i,j} / ∂U_{i,j}) · (∂U_{i,j} / ∂β_{i,mj})

where the three factors are given by equations (19), (20) and (21). Note that it is again assumed that the activation functions f_j(·) and g_k(·) are differentiable.

∂E_i / ∂Z_{i,j} = \sum_{k∈K} e_{i,k} · ∂e_{i,k} / ∂Z_{i,j} = \sum_{k∈K} e_{i,k} · (∂e_{i,k} / ∂V_{i,k}) · (∂V_{i,k} / ∂Z_{i,j}) = −\sum_{k∈K} e_{i,k} · g'_k(V_{i,k}) · α_{i,jk}    (19)

∂Z_{i,j} / ∂U_{i,j} = f'_j(U_{i,j}) = f'_j(β_{i,0j} + X_i^T β_{i,j})    (20)

∂U_{i,j} / ∂β_{i,mj} = X_{i,m}    (21)

Putting this together, the gradient-descent updating rules for the weights α_{i,jk} and β_{i,mj} become:

α_{i+1,jk} = α_{i,jk} − η ∂E_i / ∂α_{i,jk} = α_{i,jk} + η e_{i,k} g'_k(V_{i,k}) Z_{i,j}    (22)

β_{i+1,mj} = β_{i,mj} − η ∂E_i / ∂β_{i,mj} = β_{i,mj} + η \sum_{k∈K} e_{i,k} g'_k(V_{i,k}) α_{i,jk} f'_j(U_{i,j}) X_{i,m}    (23)

The sensitivities δ_{i,k} and δ_{i,j} of the ith observation, also called local gradients, are now introduced, where k refers to the kth node of the output layer and j to the jth node of the hidden layer:

δ_{i,k} = e_{i,k} g'_k(V_{i,k})    (24)

δ_{i,j} = f'_j(U_{i,j}) \sum_{k∈K} δ_{i,k} α_{i,kj}    (25)

Using (24) and (25) in equations (22) and (23) yields the following updating formulas for the weights:

α_{i+1,jk} = α_{i,jk} + η δ_{i,k} Z_{i,j}    (26)

β_{i+1,mj} = β_{i,mj} + η δ_{i,j} X_{i,m}    (27)
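One online backpropagation step following (24)–(27) can be sketched as below for a single-output two-layer MLP. The weights, data and learning rate are hypothetical, and the bias updates (not written out in (26)–(27)) follow the same pattern; the check simply verifies that the squared error decreases after the step.

```python
# One online backpropagation step for a two-layer MLP with a single output,
# following (24)-(27). Sigmoid activations; all values are hypothetical.
import math

def sig(v):  return 1.0 / (1.0 + math.exp(-v))
def dsig(v): s = sig(v); return s * (1 - s)

eta = 0.5
x, y = [1.0, -1.0], 1.0                        # one observation (p = 2)
beta0, beta = [0.0, 0.0], [[0.1, -0.2], [0.3, 0.4]]   # hidden layer, t = 2
alpha0, alpha = 0.0, [0.2, -0.1]                       # single output node

def forward():
    u = [b0 + sum(bm * xm for bm, xm in zip(bj, x)) for b0, bj in zip(beta0, beta)]
    z = [sig(uj) for uj in u]
    v = alpha0 + sum(aj * zj for aj, zj in zip(alpha, z))
    return u, z, v, sig(v)

u, z, v, y_hat = forward()
e = y - y_hat                                  # error signal e = Y - Y_hat
delta_k = e * dsig(v)                          # (24): delta_k = e * g'(V)
delta_j = [dsig(u[j]) * delta_k * alpha[j] for j in range(2)]  # (25)

err_before = 0.5 * e ** 2
alpha = [alpha[j] + eta * delta_k * z[j] for j in range(2)]    # (26)
alpha0 += eta * delta_k                        # bias update, same pattern
beta = [[beta[j][m] + eta * delta_j[j] * x[m] for m in range(2)]
        for j in range(2)]                                     # (27)
beta0 = [beta0[j] + eta * delta_j[j] for j in range(2)]

_, _, _, y_hat_new = forward()
err_after = 0.5 * (y - y_hat_new) ** 2
print(err_before, err_after)
```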

The weights are initialized with randomly generated, uniformly distributed numbers close to zero. The goal is to make the algorithm converge to a global minimum. If the algorithm does not converge, it might be stuck at a local minimum. To address this problem, training can be performed again with new random weights, although it is not always possible for the algorithm to converge.

Next, training is done for some number of epochs, where one epoch means that training has been performed once on the whole training set. Training can be carried out in two different ways, namely online training or batch learning. Online training is when the weights get updated for each observation, one at a time. Hence, the updating formulas (26) and (27) describe online training with observations i; when the updates have been done for all observations, one epoch is completed. Online training is usually better than batch learning, since learning is faster when data observations are similar, and it is also better at avoiding local minima during training.

Batch learning is when the weights get updated simultaneously for the whole training set, i.e. once per epoch. The updating formulas for the weights then include a summation of the derivatives over the whole training set, where n is the number of observations and i indexes each epoch.

α_{i+1,jk} = α_{i,jk} + η \sum_{h=1}^{n} δ_{h,k} Z_{h,j}    (28)

β_{i+1,mj} = β_{i,mj} + η \sum_{h=1}^{n} δ_{h,j} X_{h,m}    (29)

The training of the network then runs for a number of epochs so that the learning converges towards the global minimum. It is important not to let it run for too many epochs, since the model might then become overfitted. An overfitted model performs very well on training data, but very poorly on unseen data (test data). Since the model is used to model the uplift, the risk of overfitting is even larger (Section 4.1.2). In order for the network to have good generalization power, overfitting needs to be avoided.

One method for handling overfitting is to construct a network that is not too complex. In this project, trial and error is used to find a suitable number of layers and nodes; finding the optimal number of layers and nodes is also important in order for the network to find an underlying function. Furthermore, weight decay, a ridge-regression-like penalty, is another technique used in this project to avoid overfitting. This technique shrinks some weights towards zero, so that the complexity of the network gets adjusted [5]. The input dimension is also reduced using PCA (Section 4.1.2) to avoid overfitting. The input data used for the MLP can come on many different scales, and in order for PCA to work, the data is standardized. After standardization, the variables have mean 0 and standard deviation 1, see [1] for more details.


4.3.4 Cross Validation

Cross Validation (CV) is a commonly used resampling method, performed by repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the model. In this project, CV is used as part of the modeling process for obtaining the optimal values of different tuning parameters, or the optimal number of variables to use in the models.

When using CV, the data is split into K equal parts, where one part is used as the test set while the remaining parts are used as the training set. The prediction error of the fitted model, f^{−k}(x), is then calculated using the kth part as hold-out set (test set). All K parts of the data are used as hold-out sets one at a time, and the K resulting estimates of the prediction error are combined, giving a CV estimate of the prediction error. According to [5], by letting κ: {1, ..., M} → {1, ..., K} be an indexing function, the CV estimate can be calculated as:

CV(f̂) = (1/M) \sum_{i=1}^{M} L(y_i, f̂^{−κ(i)}(x_i))    (30)

where L is the prediction error associated with each fitted model. When K = N, the method is known as Leave One Out Cross Validation; the learning procedure is then fit N times, i.e. as many times as there are observations in the data set. Normally K-fold Cross Validation is used with K set to 5 or 10, due to the lower computational cost compared to a larger number of folds, and because it yields a good bias-variance trade-off.
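The K-fold procedure and the averaged error (30) can be sketched as follows, with a trivial mean predictor standing in for the refitted model f̂^{−k} and squared error as the loss L (data and K are hypothetical):

```python
# Sketch of K-fold CV: split indices into K folds, refit on the data outside
# each fold, and average the hold-out losses as in (30). The "model" is a
# trivial mean predictor and the loss is squared error; data is hypothetical.
M, K = 10, 5
ys = [1.2, 0.7, 1.9, 1.1, 0.4, 1.6, 0.9, 1.3, 0.8, 1.5]

folds = [list(range(k, M, K)) for k in range(K)]    # kappa(i) = i mod K

cv_losses = []
for k in range(K):
    train = [ys[i] for i in range(M) if i % K != k]
    mean = sum(train) / len(train)                  # f^(-k): fit without fold k
    cv_losses += [(ys[i] - mean) ** 2 for i in folds[k]]

cv_estimate = sum(cv_losses) / M                    # CV estimate, eq. (30)
print(cv_estimate)
```

Each observation is used exactly once as hold-out data, so the average runs over all M points.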

When the objective is to find the optimal value of a tuning parameter γ, given a set of models f^{−k}(x, γ), formula (30) can be modified to:

CV(f̂, γ) = (1/M) \sum_{i=1}^{M} L(y_i, f̂^{−κ(i)}(x_i, γ))    (31)

The CV estimate in (31) yields a test error curve, and the objective hence becomes finding the value of γ that minimizes the CV function.

4.4 Evaluation

There are several ways to evaluate classification models, and one of the most common illustrative tools used for this purpose is the so-called Receiver Operating Characteristic (ROC) curve. The ROC curve is used to evaluate the performance of a classification model, i.e. how well the model classifies positive individuals as positives and negative individuals as negatives. Usually in classification, the prediction can be compared to the true answer (test target data) for each individual. An uplift model cannot have a true answer of the uplift for each individual, since the same individual can never belong to both the treatment and the control group at the same time. To solve this, the evaluation metric commonly used for uplift models is the Qini curve, which evaluates on a population level.


4.4.1 ROC Curve

The result of a classification model has four different outcomes and can be explained with a confusion matrix (Figure 4.4, from [2]). A true positive (TP) is when the model classifies an instance as positive and the instance is indeed positive; the model correctly classifies a positive. A true negative (TN) is when the model correctly classifies a negative. A false positive (FP) is when the model classifies an instance as positive although it is negative, i.e. the model assigns the wrong class. The same holds for a false negative (FN), where the model classifies an instance as negative although it is positive.

Figure 4.4: A confusion matrix shows the result of a classification model.

These four outcomes can be used to calculate different performance metrics that are used to plot the ROC curve. The true positive rate (TP rate), also called sensitivity or recall, is on the y-axis of the ROC curve and is defined as:

TP rate = TP / P = (number of positives correctly classified) / (total number of positives)

The false positive rate (FP rate) is on the x-axis of the ROC curve and is defined as:

FP rate = FP / N = (number of negatives incorrectly classified) / (total number of negatives)

Hence, the ROC curve plots the probability that a positive instance is classified as positive against the probability that a negative instance is classified as positive.

Other important performance metrics obtained from the confusion matrix are the following:

accuracy = (TP + TN) / (P + N)

precision = TP / (TP + FP)
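Computing the four metrics from confusion-matrix counts is a one-liner each; the counts below are hypothetical:

```python
# TP rate, FP rate, accuracy and precision from hypothetical confusion-matrix
# counts, where P = TP + FN positives and N = FP + TN negatives.
TP, FN, FP, TN = 40, 10, 20, 30
P, N = TP + FN, FP + TN

tp_rate   = TP / P                  # sensitivity / recall (ROC y-axis)
fp_rate   = FP / N                  # ROC x-axis
accuracy  = (TP + TN) / (P + N)
precision = TP / (TP + FP)
print(tp_rate, fp_rate, accuracy, precision)
```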

For discrete classifiers, such as decision trees, the result is a point in ROC space. This is because such a classifier produces a single class for each individual instead of a probability or score, which yields a single confusion matrix as a result. A confusion matrix gives one value for the FP rate and one value for the TP rate, and hence one point in ROC space.

Probabilistic classifiers, such as Neural Networks, produce a probability that an instance belongs to a certain class. In such cases a threshold is needed to get the final class prediction. Many models use 0.5 as the default value for the threshold, but it is not always the case
that this value yields the best result. One threshold value gives one point in the ROC space, as for the discrete classifiers. To obtain the ROC curve for probabilistic classifiers, one can produce the result of a classifier using many different threshold values. As a consequence, this results in many different points, i.e. a curve, in the ROC space. The accuracy can be calculated for each threshold, and the threshold with the highest accuracy is thus the best threshold.

An algorithm that is more efficient for big data sets is instead to use the monotonicity of the threshold classification: any prediction that is classified as positive at a certain threshold will also be classified as positive at any lower threshold. However, this algorithm is not implemented from scratch in this project; built-in functions for it in R and Python are used instead.
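The threshold-sweep idea can be sketched as follows: sorting by score once and walking down the list produces every ROC point without recomputing a confusion matrix per threshold. This is a simplified Python sketch of the idea (ties between scores are handled naively), not the built-in R/Python routines actually used in the project.

```python
def roc_points(scores, labels):
    """ROC points via the monotonicity of threshold classification:
    sort by score descending; each instance moves the curve one step
    up (if positive) or right (if negative)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

auc(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # a perfect ranking gives AUC 1.0
```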

Figure 4.5: Example of two ROC curves of two different models.

The diagonal line y = x in the ROC space represents a random classifier. A classifier that performs better than a random classifier will get a result in the upper triangle above the diagonal. The best model is therefore the one that has the largest distance above the diagonal line. Another way of comparing different models using the ROC curve is to calculate the area under the curve (AUC). The ROC space is the unit square, and the AUC values consequently lie in [0, 1]. For a classification model to perform better than a random classifier, the value of AUC needs to be greater than 0.5. The largest AUC value among different models represents the model with the best average performance. An important note, though, is that the model with the largest AUC is not always better than other models with lower AUC, see [2] for more details.

4.4.2 Qini Curve

The Qini curve is a good tool for comparing different uplift models. It is constructed using a gains chart. It sorts the predictions from best to worst score and divides the result into segments where different numbers of individuals are treated. The vertical axis of the Qini curve is the uplift, or the cumulative number of incremental sales achieved, and the horizontal axis is the number of individuals treated, i.e. the different segments. The estimated number of incremental sales achieved per segment is calculated using

u = Rt − Rc · Nt/Nc    (32)


For each segment, Rt and Rc are the number of individuals predicted to make a purchase, for the treatment group and the control group respectively. Further, Nt and Nc are the total number of individuals in the treatment group and control group respectively. An example of Qini curves is shown in Figure 4.6, as demonstrated in [14]. The purple curve has its maximum before all individuals are treated. This means that some individuals are influenced negatively by the treatment, and one should thereby choose a smaller treatment group for best uplift.
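The per-segment estimate (32) scales the control-group purchases up to the size of the treatment group before subtracting. A one-line Python sketch with made-up segment counts:

```python
def incremental_gains(r_t, r_c, n_t, n_c):
    """u = Rt - Rc * Nt / Nc: purchases in the treatment group minus
    control-group purchases rescaled to the treatment-group size."""
    return r_t - r_c * n_t / n_c

# Hypothetical segment: 120 treated buyers, 40 control buyers,
# 2000 treated and 1000 control individuals -> 40 incremental sales.
incremental_gains(120, 40, 2000, 1000)
```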

Figure 4.6: An example of Qini curves where the red curve is the optimal uplift model. The blue line is a random classifier, as used in the ROC curve. The purple and the green curves are two different uplift models, where the purple one is the best in this specific case.

For each model, the Qini curve is used to decide the optimal cutoff that gives the best profit. E.g. the purple curve and its corresponding model in Figure 4.6 has its maximum profit around 60%. At that point, there is a certain cutoff which can be used to make the decision of whether or not the customer will make a purchase because of the campaign offer. Furthermore, when the model is used for a new group of customers, that same cutoff is used to classify the persuadables, i.e. whom to target for a campaign, see [15] for more details.

Another way of comparing models is to compute the Qini value. The Qini value is defined as the difference between the area under the actual incremental gains curve from the fitted model and the area under the diagonal corresponding to a random model. A negative Qini value indicates that the result of an action is worse than doing nothing, while a positive value indicates the opposite, see [3].
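Given the cumulative incremental gains at equally spaced targeted fractions, the Qini value can be computed by the trapezoidal rule. This is only a sketch under stated assumptions (the curve starts at 0 for 0% targeted; the random-model diagonal runs to the same endpoint), not the exact routine of the uplift package.

```python
def qini_value(cum_gains):
    """Area between the cumulative incremental-gains curve and the diagonal
    of a random model.  cum_gains[k] is the cumulative gain when the first
    k/n of the population is targeted, with cum_gains[0] == 0."""
    n = len(cum_gains) - 1
    area_curve = sum((g1 + g2) / (2 * n) for g1, g2 in zip(cum_gains, cum_gains[1:]))
    area_diagonal = cum_gains[-1] / 2  # triangle up to the curve's endpoint
    return area_curve - area_diagonal

qini_value([0.0, 1.0, 1.0])   # all gains in the first half -> positive Qini value
qini_value([0.0, 0.5, 1.0])   # linear curve, i.e. a random model -> 0.0
```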

4.5 Programming Environment of Choice

The natural choice of software for generation of data and handling of missing values is the software suite SAS, developed by SAS Institute. More specifically, SAS Studio is used. It is thereby possible to use procedures for writing SQL code, and thus to modify tables directly in the SAS software environment. For detection and deletion of outliers, the
software environment R is used [13]. In R it is possible both to implement statistical machine learning methods and to perform matrix operations. R is therefore best suited for the outlier detection and deletion: the influence statistics can be obtained from the implementation of a statistical machine learning method, and the extrapolation observations can be identified using some matrix operations.

There is no standard way of implementing Random Forests or Neural Networks in SAS Studio. As there exists a package in R called uplift [3], which includes several ready implementations of different uplift modeling approaches, it is used for Modeling Uplift Directly using Random Forests. The uplift package includes the algorithm for modeling uplift with Random Forests proposed by Guelman et al. [4].

Python 3.6 is used for implementing Neural Networks, as there exist suitable packages for machine learning in Python. The package used for this purpose is scikit-learn [12], and the method for Neural Networks is MLPClassifier(). Logistic Regression is implemented in R using glm for the Class Variable Transformation, and in Python using LogisticRegressionCV() for the Subtraction of Two Models.
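For orientation, a minimal example of the two scikit-learn estimators mentioned above, fitted on synthetic data (the data and parameters here are illustrative, not the thesis configuration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Logistic Regression with cross-validated regularization strength.
logit = LogisticRegressionCV(cv=5, max_iter=1000).fit(X, y)

# Multilayer perceptron via MLPClassifier.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X, y)

p_logit = logit.predict_proba(X)[:, 1]
p_mlp = mlp.predict_proba(X)[:, 1]
```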


5 Experiments and Results

In this section, the practical implementation of the project is presented and described, along with the results of the different statistical machine learning methods, which are presented using figures and tables.

5.1 Data Pre-Processing

Cleaning the raw data is a crucial step in order to get good quality data representations. The pre-processing of the data in this project is about removing and manipulating bad data values so that the data is a good representation of the desired objects. Put differently, the following subsection presents the handling of missing values, outliers and the binning of linearly dependent variables.

5.1.1 Data Cleaning

The first step of the data cleaning process is to identify and handle missing values. Depending on the number of missing observations, as well as whether they are MAR, MCAR or NMAR (recall the definitions in Section 4.1.1), the technique for handling the missing values might differ. Thus, the descriptive statistics for each data set are determined to get an overall view of the variables that have missing values. The descriptive statistics related to Campaign 1 can be found in Table 5.1. The other campaigns follow the same pattern, meaning that the variables that have missing values are the same for each data set, with the number of missing values being greatest for gender and smallest for share_red_or_dis_order_3 and share_ret_order_3.

Variable                  NMiss   Mean        Min         Median      Max
gender                    16 785  0.94        0.00        1.00        1.00
age                       392     35.41       15.00       34.00       89.00
Clubmember                375     0.55        0.00        1.00        1.00
IsStaff                   375     0.00        0.00        0.00        1.00
lastPurchaseDate          183     2017-01-27  2015-09-25  2017-04-18  2017-09-25
share_red_or_dis_order_3  14      0.60        0.00        0.89        1.00
share_ret_order_3         14      0.50        0.00        0.50        1.00

Table 5.1: Descriptive statistics for Campaign 1. Only the variables with missing data are presented. The total number of customers in this data set is 181 221 and the campaign start date is 2017-10-30. NMiss is the number of missing observations for each corresponding variable.

The data sets related to the first five campaigns contain variables that have a relatively high number of missing data in relation to N, the total number of customers in each data set. Hence, a Multiple Imputation method would be well suited to handle these missing values. For the sake of simplicity, all data sets are treated the same way.

The variables gender and age contain information that is voluntary for the customers to enter. This might be the reason for the missing data. A quick glance at Clubmember and IsStaff shows that the number of missing values is the same for these two in every data
set, respectively. When investigating this further, it is possible to conclude that it is the same observations that have these variables missing. Furthermore, the variable age is missing for the same observations as for these variables. The conclusion that can be made is that some information related to these observations is missing in some tables due to some systematic error in the building of the data set. It can also be due to missing information in some tables related to certain districts in the market of choice. Put differently, this data is most probably MCAR.

The idea is, as just mentioned, to perform Multiple Imputation for these variables if valid. To begin with, the MI procedure is used with a statement that tells the procedure not to perform any imputations, but to print the pattern of the missing data. From the output, it can be concluded that the missing-data pattern is arbitrary, and thus the use of the FCS statement is valid. Moreover, the data is concluded to be under the MCAR assumption and hence Multiple Imputation using the MI procedure is performed. m is set to 25, since this is the recommended number of imputations that is set by default in the procedure. Next, the imputed data sets are analyzed using the MIXED procedure and lastly, the m = 25 individual analyses are combined using the MIANALYZE procedure, resulting in one pooled estimate for each parameter. The resulting parameter estimates can be seen in Table 5.2.

Campaign  gender  age  Clubmember  IsStaff
1         1       35   1           0
2         1       37   0           0
3         1       35   0           0
4         1       36   1           0
5         1       35   1           0
6         1       36   1           0

Table 5.2: Summary of the resulting parameter estimates derived using the Multiple Imputation method. Recall that the variable lastPurchaseDate is deleted for the observations that have a missing value. Moreover, share_red_or_dis_order_3 and share_ret_order_3 are set to zero when the data is missing.
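The imputation itself is run with SAS procedures (MI, MIXED, MIANALYZE). As a rough Python analogue only, assuming scikit-learn's experimental IterativeImputer as a stand-in for PROC MI (an assumption, not the method actually used), the m = 25 imputations and the pooling step could be sketched as:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Tiny made-up data: two columns (e.g. age, Clubmember) with missing entries.
X = np.array([[35.0, 1.0],
              [np.nan, 1.0],
              [40.0, np.nan],
              [30.0, 0.0]])

imputed = []
for m in range(25):  # m = 25 imputations, as in the MI procedure
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    imputed.append(imp.fit_transform(X))

# Pool the 25 completed data sets (the role MIANALYZE plays for the estimates).
pooled = np.mean(imputed, axis=0)
```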

The missing values for the variable lastPurchaseDate can most likely be explained by these customers not having made any purchase at all in the last 24 months. Recall from Section 3 that the data in this project only reflects customers in the frequent stage. Thus, the customers related to the missing values for this variable are most likely in the lost stage, but have come to be a part of the campaigns by mistake. This means that these values are MCAR. Considering this, these customers should be removed from the data sets, as they are not suitable candidates for the campaigns and thus risk contributing inappropriate influence to the models. In other words, the Listwise Deletion method is used for this variable.

Further, to be able to use the information from the variable lastPurchaseDate, another variable named lastPurchase is added. This new variable is the difference in days between lastPurchaseDate and addressFileDate5.
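A minimal sketch of the derived variable (the dates are examples only):

```python
from datetime import date

def last_purchase_days(last_purchase_date, address_file_date):
    """lastPurchase: number of days between lastPurchaseDate and the
    address file date (the date the customer entered the campaign)."""
    return (address_file_date - last_purchase_date).days

last_purchase_days(date(2017, 4, 18), date(2017, 10, 30))  # -> 195
```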

Lastly, the variables share_red_or_dis_order_3 and share_ret_order_3 have the exact same amount of missing data in each data set, respectively. The simple explanation is that none of the customers related to the missing data for these variables has made any purchase within the past 3 months before the address file date. Hence the denominator in the expression for calculating the share, which is the total number of purchases, is zero. The result thus becomes NULL, i.e. the value is set to missing. The intuitive way to handle the missing values for these variables is thereby to replace them with a zero in every data set: they actually are zero, but have been set to NULL because of the zero denominator in the expression used for calculating the share.

5 Recall that addressFileDate is the date the customers were chosen to be a part of the campaign.

Moreover, concerning share_red_or_dis_order_3 and share_ret_order_3, it is advantageous to bin them in order to be able to model non-linear effects. These variables are calculated from other variables used in the data sets, and are thus linearly dependent on those variables. Therefore, these variables as well as share_red_or_dis_order_12, share_red_or_dis_order_24, share_ret_order_12 and share_ret_order_24 are binned using the Bucket Binning method. Recall from Section 4.1.1 that bucket binning means that evenly spaced cut points are used in the binning process. In this case the number of bins is set to 5 for each variable, and thus the new values for these variables are discrete and range over [1, 5].
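A minimal Python sketch of bucket binning with 5 evenly spaced bins (the input values are made up; the actual binning is done in the pre-processing pipeline):

```python
def bucket_bin(values, n_bins=5):
    """Bucket binning: evenly spaced cut points between the minimum and
    maximum, mapping each value to a discrete bin in 1..n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [1] * len(values)
    # The maximum itself falls on the last cut point; clamp it into bin n_bins.
    return [min(int((v - lo) / width) + 1, n_bins) for v in values]

bucket_bin([0, 10, 45, 60, 89, 100])  # -> [1, 1, 3, 4, 5, 5]
```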

Campaign  N        NRF      Nsub     NZ
1         181 221  181 038  181 035  176 867
2         82 828   82 746   82 743   76 101
3         155 096  153 345  153 343  140 178
4         62 121   62 098   62 097   57 437
5         310 607  309 300  309 294  284 276
6         207 071  207 066  189 061  191 441

Table 5.3: The resulting number of observations in each table when missing values and outliers have been handled. N is the number of observations in each data set before any data pre-processing has been made. NRF is the resulting number of observations in each table when missing values have been handled; this is also the number of observations for the data sets used in the Modeling Uplift Directly approach. Nsub and NZ are the numbers of observations left after missing values have been handled and outliers have been removed for the data sets used in Subtraction of Two Models and the Class Variable Transformation approach, respectively. Note that NZ is smaller than Nsub since the Class Variable Transformation has been applied to the data sets related to NZ, meaning that all the negative samples in the control group have been deleted according to (10).

Once the missing values have been handled and the binning has been made, it is time for the detection and handling of outliers. There is a risk that the imputation methods used for handling missing values might have replaced some of the missing values with values that count as outliers. This is not the case in this project, since the imputed values are approximately the same as the mean value of the corresponding variable. However, even if this had been the case, it would not have been a problem, since such hypothetical extreme values would have been taken care of in the outlier detection.

As mentioned in Section 4.2, uplift modeling can be viewed as an instance of regression as well as an instance of classification. Thus, recall from Section 4.1.1 that the method used for detecting and removing outliers in this project is Hidden Extrapolation. Further,
recall from Section 4.3.2 that Random Forests is immune to the effect of outliers. Hence, Hidden Extrapolation is implemented only for the data sets used in Logistic Regression and Neural Networks. Put differently, the data sets used in Random Forests are only pre-processed such that missing values are handled. Since Logistic Regression and Neural Networks are implemented using the same uplift modeling approaches in this project, namely Subtraction of Two Models and the Class Variable Transformation, Hidden Extrapolation is applied to each data set used in these two approaches.

To perform Hidden Extrapolation, each data set is fitted using the method glm in R, which fits generalized linear models. The influence statistics are then collected so that the hat matrix H and hmax can be obtained for each data set, recall (2) and (3). For each data set, once hmax is obtained, the extrapolating observations can be identified and removed. The reduced data sets resulting from the handling of missing values and outliers can be seen in Table 5.3.
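The leverage computation behind this check can be sketched in a few lines of Python (the design matrix here is a toy example; the thesis obtains the influence statistics from R's glm fit):

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^(-1) X'.  A point x0 with
    x0' (X'X)^(-1) x0 > h_max (the largest leverage in the data) would be
    flagged as hidden extrapolation."""
    xtx_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, xtx_inv, X)

# Toy design matrix with an intercept column and one extreme observation.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 10.0]])
h = leverages(X)
h_max = h.max()  # the extreme point dominates the leverage
```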

In this project, variable selection is handled in different ways for the different statistical machine learning methods, and is presented in the subsection for each method in Section 5.2.

5.2 Uplift Modeling and Classification

In the following subsections, the different variable selection methods are presented along with how the different statistical machine learning methods are built and examined. The resulting models with the best model performance are also presented.

5.2.1 Random Forests

Uplift for Random Forests is implemented in R using the uplift package, which has a ready implementation of Random Forests in an uplift modeling setting. Before the implementation is done, all variables are converted to the right data types, i.e. categorical variables are converted to factors, integers to integers and decimals to numeric.

As mentioned in Section 4.1.2, even though decision-tree learners perform variable selection as a part of the modeling process, variable selection is an important part of the process for decision-tree learners used in an uplift modeling setting. This is because such a model captures the difference between the outcomes of two models and thus easily overfits the data. Variable selection is therefore a critical step in the modeling process for uplift Random Forests, and two different methods are tested to obtain the best model for each data set.

The first method applied to each data set is to use Variable Importance (VI) to rank the variables according to their importance, along with Cross Validation for Random Forests to obtain the optimal number of variables to use in the model. The optimal number of variables, i.e. the number of variables that yields the lowest error rate, is presented in Table 5.4. It is worth noting that the Cross Validation function for Random Forests in R does not test every number of variables, due to high computational cost. Hence, by looking at the two lowest errors obtained using Cross Validation, it is possible to conclude the approximate number of


Campaign  Variables  Lowest error    Variables  Second lowest error
1         4          1.547 · 10^-8   7          4.640 · 10^-8
2         4          5.693 · 10^-27  7          2.900 · 10^-8
3         4          5.217 · 10^-9   7          9.140 · 10^-8
4         4          4.312 · 10^-27  7          2.319 · 10^-7
5         4          5.554 · 10^-26  7          3.621 · 10^-8
6         4          1.333 · 10^-7   7          2.690 · 10^-7

Table 5.4: The optimal number of variables to use for each data set for Random Forests. This is the number of variables that yields the lowest error rate according to Cross Validation applied to Random Forests. Moreover, the second lowest error and the corresponding number of variables are shown.

variables that yields the lowest error. Using the result in Table 5.4, different numbers of variables are tested, and the variables used are those with the largest VI.
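In scikit-learn terms (the thesis uses R's uplift/Random Forest tooling, so this is only an analogous sketch on synthetic data), ranking by Variable Importance and keeping the top variables looks like:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 0] - X[:, 2] > 0).astype(int)  # only features 0 and 2 matter

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]  # most important first
top_vars = ranking[:4]  # keep e.g. the 4 best variables, as in Table 5.4
```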

The second variable selection method applied to each data set is to use the Adjusted Net Information Value (NIV) to rank the variables accordingly. Recall from Section 4.1.2 that the strength of a variable as a predictor varies with its NIV. Once the NIVs are calculated for each data set, different numbers of variables are tested to obtain the best model performance. The variables used are always those with the largest NIV.

Campaign  Qini using VI  Qini using NIV
1          0.0058        0.0077
2          0.0020        0.0030
3          0.0002        0.0015
4          0.0030        0.0035
5          0.0014        0.0022
6         -0.0010        0.0020

Table 5.5: The largest Qini value obtained for the different data sets when using NIV and VI as variable selection method in Random Forests, respectively.

Once the variable selection is made, different values of the tuning parameters of the Random Forest have to be examined to obtain the optimal model performance. Good model performance is recognized as a Qini curve with an uplift that is greater when treating a subgroup of the population than when treating the entire population (Section 4.4.2). Also, a positive Qini value is preferred, as a negative one indicates that the result of an action is worse than doing nothing. The resulting Qini values for the optimal models on the different data sets can be seen in Table 5.5. In this project, the splitting criterion in Random Forests is based on the squared Euclidean distance, recall Section 4.2.2. The resulting Qini curves from modeling on each data set can be seen in Figure 5.1.


[Figure 5.1 appears here: six Qini curve panels, one per campaign (Campaign 1-6). The x-axis is the proportion of population targeted (%) and the y-axis is the cumulative incremental gains (pcpt).]

Figure 5.1: Resulting Qini curves for each data set when using Random Forests. 20 segments have been used when evaluating the result. The resulting curves using NIV as variable selection method are shown in blue, while the resulting curves using VI are shown in red. The black line represents a random classifier, and the black point on each curve marks the largest value of that curve.

5.2.2 Logistic Regression

Uplift modeling is implemented in two ways when using Logistic Regression: the first is Subtraction of Two Models and the second is the Class Variable Transformation. Subtraction of Two Models is implemented in Python, while the Class Variable Transformation, like Random Forests, is implemented in R using the uplift [13] package among others. The implementation process is described in the two following subsections, along with the corresponding results for each data set.

Subtraction of Two Models

Two different ways of performing variable selection are applied when using this approach, i.e. NIV and Lasso regularization. The reason is that Subtraction of Two Models proved difficult to implement to obtain applicable results, and also that variable selection is a crucial step in the modeling process for uplift models. Thus, the first variable selection method applied to each data set is NIV. The NIV is calculated and
every variable is ranked accordingly.

Next, each control data set is split into training and test sets, with an equal amount of control and treatment data in the test data set, in order to be able to plot the Qini curves later on. Having a small number of observations from the control group in the test data set results in some segments having zero observations from the control group when calculating the incremental gains. The incremental gains then become undefined for those segments, recall equation (32) in Section 4.4.2. Hence, only one test data set containing an equal amount of control and treatment data is used.

For the purpose of reducing the number of linearly dependent variables, Principal Component Analysis (PCA) is performed on both the training and test sets so that the dimensions are reduced. Next, two models are built, one for the treatment data set and one for the control data set. The models are built using Logistic Regression with the Lasso regularization method, which performs variable selection as it shrinks some coefficients towards zero (and some become exactly zero). The optimal value of the penalty term is obtained using Cross Validation.
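On synthetic data, the PCA-then-Lasso pipeline described above can be sketched with scikit-learn as follows (dimensions and data are illustrative; in the project the binary variables are kept aside and only the continuous ones are reduced):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 12))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Reduce the continuous variables to principal components ...
X_pc = PCA(n_components=6).fit_transform(X)

# ... then fit an L1-penalized (Lasso) logistic regression; the penalty
# strength is chosen by cross-validation and some coefficients become zero.
model = LogisticRegressionCV(penalty="l1", solver="saga", cv=5,
                             max_iter=5000).fit(X_pc, y)
```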

The number of variables left after variable selection (NIV) has been made is 27 for campaign 1, where 4 of them are binary and the rest are continuous. The continuous variables are reduced into Principal Components. The result is shown for 11 and 13 Principal Components. The final dimensions are then 15 and 17 for campaign 1. The results for campaign 1 are shown in Figures 5.2 and 5.3.

Figure 5.2: ROC curve (left) and Qini curve (right) for campaign 1 using 11 Principal Components. The red curve is the model for the control group data and the blue curve is the model for the treatment group data. The AUC is 0.7518 and 0.6945 for the red and the blue curve, respectively.

The number of variables left after variable selection has been made for campaign 2 is 26, where 3 of them are binary and the rest are continuous. The result is shown for 6 and 10 Principal Components. The final dimensions are then 9 and 13 for campaign 2. The results for campaign 2 are shown in Figures 5.4 and 5.5.

Like the results for campaigns 1 and 2, the results for the rest of the campaigns yield bad uplifts. Hence, only the results for campaigns 1 and 2 are presented here.


Figure 5.3: ROC curve (left) and Qini curve (right) for campaign 1 using 13 Principal Components. The red curve is the model for the control group data and the blue curve is the model for the treatment group data. The AUC is 0.7101 and 0.6792 for the red and the blue curve, respectively.

Figure 5.4: ROC curve (left) and Qini curve (right) for campaign 2 using 6 Principal Components. The red curve is the model for the control group data and the blue curve is the model for the treatment group data. The AUC is 0.9143 and 0.7592 for the red and the blue curve, respectively.

Figure 5.5: ROC curve (left) and Qini curve (right) for campaign 2 using 10 Principal Components. The red curve is the model for the control group data and the blue curve is the model for the treatment group data. The AUC is 0.7592 and 0.7273 for the red and the blue curve, respectively.

Class Variable Transformation

Before any modeling is done, the variable selection method chosen for the Class Variable Transformation is the NIV. After choosing the best variables according to the NIV, the data sets are split into training and test sets with the same amount of data from the control group and the treatment group in the respective test data set.

Next, the Class Variable Transformation is applied to both the training and test data sets according to (10). After the transformation is completed, resampling is used to increase the


Campaign  Qini for Z  AUC for Z
1         0.0643      0.8619
2         0.0611      0.8642
3         0.0448      0.8872
4         0.0221      0.6054
5         0.0232      0.6520
6         0.0669      0.8292

Table 5.6: The largest Qini value obtained for the different data sets when using Logistic Regression and the Class Variable Transformation.

share of control observations in the training data sets. Next, each training data set is fitted with Logistic Regression using Z as response variable. Different numbers of variables used as predictors are evaluated in order to obtain the best performing model.
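The transformation itself, in one common form (Jaskowski and Jaroszewicz's definition; the exact equation (10) of the thesis may differ in detail), assigns Z = 1 to treated responders and control non-responders. A language-neutral Python sketch:

```python
def class_variable_transform(y, t):
    """Z = Y*T + (1 - Y)*(1 - T): Z = 1 for treated responders and for
    control non-responders, Z = 0 otherwise.  With balanced groups the
    uplift can then be estimated as 2*P(Z = 1 | x) - 1, so a single
    classifier for Z models the uplift directly."""
    return [yi * ti + (1 - yi) * (1 - ti) for yi, ti in zip(y, t)]

class_variable_transform([1, 0, 1, 0], [1, 1, 0, 0])  # -> [1, 0, 0, 1]
```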

[Figure 5.6 appears here: six Qini curve panels, one per campaign (Campaign 1-6). The x-axis is the proportion of population targeted (%) and the y-axis is the cumulative incremental gains (pcpt).]

Figure 5.6: Resulting Qini curves for each data set when using Logistic Regression and the Class Variable Transformation. 20 segments have been used when evaluating the result. The black line represents a random classifier, and the black point on each curve marks the largest value of that curve.

The resulting Qini and AUC values can be seen in Table 5.6, and the resulting Qini curves and ROC curves for the Class Variable Transformation can be seen in Figure 5.6 and Figure 5.7, respectively.


[Figure 5.7 appears here: "ROC curves: class variable transformation", one curve per campaign (Campaign 1-6) plus a random classifier. The x-axis is specificity and the y-axis is sensitivity.]

Figure 5.7: Resulting ROC curves for each data set when using Logistic Regression and the Class Variable Transformation.

5.2.3 Neural Networks

The results from using the uplift modeling approaches Subtraction of Two Models and the Class Variable Transformation, along with a Multilayer Perceptron (MLP), are presented in the following subsections.

Subtraction of Two Models

Using this approach, two models are trained to fit the control data and the treatment data separately using an MLP network. As for Logistic Regression, the resulting models are tested on the same test data set, which contains an equal amount of control data and treatment data. The performance of the two models is then visualized with two ROC curves. The resulting estimate of the uplift is calculated according to Subtraction of Two Models in Section 4.2.2 and visualized with the Qini curve.
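A condensed sketch of the approach on synthetic data (the network sizes and data are made up; the thesis tunes its own MLP configuration):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=(n, 4))
t = rng.integers(0, 2, size=n)                      # 1 = treatment, 0 = control
y = ((X[:, 0] + 0.8 * t * X[:, 1]) > 0).astype(int)

# One MLP per group ...
m_t = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                    random_state=0).fit(X[t == 1], y[t == 1])
m_c = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                    random_state=0).fit(X[t == 0], y[t == 0])

# ... and the uplift estimate is the difference of predicted probabilities.
uplift = m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]
```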

The method gives similar results for the 6 campaigns. It finds good models for the control and treatment groups separately but fails to find any uplift. Hence, only results for campaigns 1 and 2 are presented, with different numbers of Principal Components.

For campaign 1 the number of variables is 27 after the variable selection method NIV has been applied. There are 4 binary variables and the rest are continuous. The result is shown for 8 and 13 components, but the models yield similar results for other numbers of components as well. The final dimension is then 12 and 17 for campaign 1, see Figures 5.8 and 5.9.

Campaign 2 has 26 variables left after the variable selection step. The resulting data set has 3 binary variables and the rest are continuous. The result is again shown for 8 and 13

Figure 5.8: ROC curve (left) and Qini curve (right) for campaign 1 using 8 principal components. The red curve is the model built on the control group data and the blue curve is the model built on the treatment group data. The AUC is 0.7187 and 0.7176 for the red and the blue curve, respectively.

Figure 5.9: ROC curve (left) and Qini curve (right) for campaign 1 using 13 principal components. The red curve is the model built on the control group data and the blue curve is the model built on the treatment group data. The AUC is 0.7388 and 0.7758 for the red and the blue curve, respectively.

Principal Components. The final dimensions for campaign 2 are then 11 and 16, respectively, see Figures 5.10 and 5.11.

Figure 5.10: ROC curve (left) and Qini curve (right) for campaign 2 using 8 principal components. The red curve is the model built on the control group data and the blue curve is the model built on the treatment group data. The AUC is 0.7826 and 0.6825 for the red and the blue curve, respectively.

It is important to clarify that the resulting data sets for the treatment and control training data are not of the same size, since the treatment data is much larger than the control data; recall the shares of treatment and control data in Table 3.1, Section 3. The model trained on the control data consequently has a larger risk of getting overfitted because of the small data

Figure 5.11: ROC curve (left) and Qini curve (right) for campaign 2 using 13 principal components. The red curve is the model built on the control group data and the blue curve is the model built on the treatment group data. The AUC is 0.8066 and 0.7081 for the red and the blue curve, respectively.

size and also because Neural Networks is a more complex method than the other methods used in this project. This is adjusted with different values of the regularization parameter, i.e. Ridge Regression. The models are also regularized by determining the number of layers and nodes for each data set.
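This kind of regularization can be illustrated with scikit-learn, where the `alpha` parameter of `MLPClassifier` is the L2 (ridge) penalty and the network architecture is tuned as part of a hyperparameter search. The sketch below is not the project's actual configuration; the grid values and the synthetic data are illustrative placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Jointly tune the L2 (ridge) penalty `alpha` and the network size;
# the grid values below are illustrative placeholders.
param_grid = {
    "alpha": [1e-3, 1e-1],
    "hidden_layer_sizes": [(8,), (16, 8)],
}
search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)

# Fit on small synthetic data just to exercise the search.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
search.fit(X, y)
```

Scoring by ROC AUC matches how the models in this chapter are compared, and cross-validation guards against picking an over-regularized or under-regularized network by chance.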

Class Variable Transformation

Resampling is made in this method as well, but by using a built-in method in the R uplift package [3]. This method resamples the data in a manner that is well suited for uplift modeling when using the Class Variable Transformation. Overall, it is difficult to resample the data such that the class distribution is balanced while keeping an acceptable amount of control and treatment data, hence this built-in method is used.
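The transformation itself is compact: assuming an (approximately) balanced treatment/control split, the new target Z equals 1 for treated responders and untreated non-responders, and a single model for P(Z = 1 | x) then gives the uplift as 2·P(Z = 1 | x) − 1. A minimal sketch, not the R implementation used in this project:

```python
import numpy as np

def class_variable_transformation(y, treated):
    """Z = 1 for treated responders and untreated non-responders.
    With a balanced 50/50 split, uplift(x) = 2 * P(Z = 1 | x) - 1."""
    y = np.asarray(y)
    treated = np.asarray(treated)
    return (y * treated + (1 - y) * (1 - treated)).astype(int)
```

Any binary classifier can then be trained on (X, Z), which is what makes the transformation attractive compared to fitting two separate models.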

Figure 5.12 shows the resulting Qini curves for each campaign, and the respective ROC curves are presented in Figure 5.13.

Campaign   Qini     AUC      Components (PCA)   Binary Var.
1          0.0202   0.5503    8                  6
2          0.0251   0.5555   12                  4
3          0.0268   0.7934    9                  5
4          0.0264   0.8024    8                  5
5          0.0039   0.5791    8                  4
6          0.0133   0.7052    8                  5

Table 5.7: All the important values for each campaign: the Qini value, AUC, number of Principal Components, and number of binary variables.

Figure 5.12: The Qini curves for campaigns 1, 2 and 3 are on the first row and for campaigns 4, 5 and 6 on the second row. The results are from using the Class Variable Transformation along with Neural Networks.

Figure 5.13: The ROC curves for campaigns 1, 2 and 3 are on the first row and for campaigns 4, 5 and 6 on the second row. The results are from using the Class Variable Transformation along with Neural Networks.

5.2.4 Cutoff for Classification of Customers

The best performing models are used to obtain the cutoff for how to classify customers, i.e. deciding which customers to send the campaigns to. For example, if the cutoff is 0.55, a customer is classified as a persuadable if the predicted probability that the customer is a persuadable is ≥ 0.55.
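The cutoff logic can be sketched as follows; `cutoff_for_fraction` is a hypothetical helper that turns a chosen targeted fraction (read off the Qini curve) into a score threshold, and `classify_persuadables` applies that threshold:

```python
import numpy as np

def cutoff_for_fraction(uplift_scores, fraction):
    """Score threshold such that roughly `fraction` of the population
    (the highest-scored customers) is targeted."""
    scores = np.sort(np.asarray(uplift_scores))[::-1]
    k = max(1, int(round(fraction * len(scores))))
    return scores[k - 1]

def classify_persuadables(uplift_scores, cutoff):
    """A customer is classified as a persuadable (and receives the
    campaign) when the predicted score is >= the cutoff."""
    return np.asarray(uplift_scores) >= cutoff
```

For a new campaign, the trained model scores the customers and the precomputed cutoff decides who is targeted, mirroring the procedure described in this section.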

The uplift is calculated with the respective cutoff for Random Forest (Table 5.8), the Class Variable Transformation using Logistic Regression (Table 5.9), and Neural Networks (Table 5.10). Only the models with good uplift results are included in this section. The corresponding percentage of the total population (in the test data set) to target is included in the tables as well.

Campaign   Cutoff   Incremental Gains   Percentage of Treated
1          0.1755   0.0277              50%
2          0.1938   0.0157              55%
3          0.1287   0.0164              60%
4          0.2399   0.0282              80%
5          0.1975   0.0137              70%
6          0.3356   0.0264              75%

Table 5.8: The chosen cutoffs for Random Forest.

Campaign   Cutoff   Incremental Gains   Percentage of Treated
1          0.0730   0.1821              40%
2          0.0721   0.0771              40%
3          0.0545   0.1247              35%
4          0.1080   0.2383              50%

Table 5.9: The chosen cutoffs for Class Variable Transformation using Logistic Regression.

Campaign   Cutoff   Incremental Gains   Percentage of Treated
3          0.0387   0.1284              75%

Table 5.10: The chosen cutoff for Class Variable Transformation using Neural Networks.

The chosen cutoffs are not necessarily the ones that have the largest incremental gains. In some cases, the largest incremental gain is found when the whole population is targeted; recall the black points in Figures 5.1 and 5.6. Thus, cutoffs with relatively good incremental gains are chosen for the purpose of not targeting the whole population.

6 Conclusions

The overall conclusion is that, for all data sets, each model required many attempts to find the model parameters that yielded the best performance. Generally speaking, the models performed poorly, although some were able to obtain satisfying results. In the following sections, the results are discussed and suggestions for future studies are presented.

6.1 Discussion

When building the uplift models using Random Forests, it was not always possible to obtain models that performed better than a random classifier, which means that the result of an action is worse than doing nothing in those cases. This was the case for some data sets when using VI as variable selection method. Overall, NIV performed better as variable selection method for Random Forests than VI combined with Cross Validation did (Figure 5.1). For some data sets it was easy to capture well performing models using VI, but for most it was not. NIV, on the other hand, was able to capture well performing models for all data sets. Furthermore, by observing Table 5.5 it can be seen that the Qini value is greater for every data set when using NIV as variable selection method. The issue that arises when using Random Forests and the Gini index as splitting criterion, i.e. using VI as in this project, is that the algorithm tends to favor categorical predictors with many categories, which in turn can lead to overfitting. Therefore, such predictors should be avoided according to [5]. Since a great number of categorical predictors are used as input to the Random Forests in this project, this is possibly the reason why selecting variables according to VI did not yield as good model performance as selecting them according to NIV. Hence, NIV is better suited as variable selection method than VI with the Gini index as splitting criterion for Random Forests, when the purpose is to apply it in an uplift modeling setting.

Random Forests is the only statistical machine learning method used in this project that was able to capture applicable models without having to resample the training data set to increase the amount of control data. One reason for this is a parameter in the upliftRF() method in the uplift package [3] in R that defines the minimum number of control observations that must exist in any terminal node. Hence, during the building process, the tree is forced to contain control data in every region. This way, the Qini curves were obtained without any obstacles.

Looking at the results from Subtraction of Two Models, both Logistic Regression and Neural Networks were able to capture good models for treatment data and control data separately. This can be seen in the ROC curves, Figures 5.2 and 5.8, where the curves lie high above the diagonal line and the AUC values are well above 0.5. This means that the models perform well on unseen data (test data). However, this does not always correspond to a good uplift, see the Qini curves in Figures 5.2 and 5.8. The same results are obtained in the article [17]. When training the models separately it is not certain that the models predict a large difference in probability, i.e. if it were possible to train the models in relation to each other, the result

could give a better uplift. The result can be seen in the respective Qini curves, where the uplift for Logistic Regression and Neural Networks becomes negative. This means that the models predict a negative gain of the treatment, i.e. the campaign always makes customers buy less, which is highly unlikely.

Logistic Regression has satisfying results for the Class Variable Transformation. This conclusion can be made since the Qini curves are above the diagonal line and the incremental gains are positive. Also, the AUC ≥ 0.5, which means the models perform better than a random classifier. The results for Neural Networks are also acceptable looking at both the Qini curves and the ROC curves, although Logistic Regression gives a larger uplift for a smaller share of customers targeted for campaigns 1 to 4. The ROC curves also show that Logistic Regression performs better. The main problem for Neural Networks is that they need balanced classes to perform well. When doing the Class Variable Transformation described in Section 4.2.3, it is not easy to obtain training data with balanced classes. This is because the share of the control group was only 9.66% in most of the campaigns. By applying the transformation, the amount of control data becomes even smaller. The test data set needs to contain an equal amount of control and treatment data, and as a consequence the minority class becomes even smaller in the training data set. Thus, over-sampling was used to overcome the issue of a small minority class, but with such a solution another issue arises: resampling can result in an overfitted model.

The resulting Qini curves for the Class Variable Transformation show that the maximum incremental gain is obtained by treating all the customers. This means that there are no negative effects of the treatments, i.e. there are no customers who refrain from making a purchase just because they got the campaign. Moreover, by looking at campaigns 1, 2, 3 and 4 for Logistic Regression, Figure 5.6, it can be seen that the incremental gains are near their maximum when approximately 50% of the population is targeted. This means that only 50% of the customers in the test data need to receive the campaign in order for the company to obtain almost the same profit as if the entire population were targeted. Put differently, the campaigns do not need to be sent to the remaining 50% of the customers, as their purchase behaviour will be similar regardless of whether they received a campaign offer or not.

The final results are the cutoffs obtained from the best performing models for Random Forests as well as Logistic Regression and Neural Networks along with the Class Variable Transformation. The cutoff is decided based on the Qini curves. In Figures 5.1 and 5.6, the maximum value of the incremental gains for each model is marked with a black point in each graph. For example, by looking at campaign 2 in Figure 5.6, one can see that the incremental gain is almost the same at 40% as at 100%. Hence, the cutoff can then be chosen at that point so that the campaigns will only be sent out to a subgroup of the entire population. Put differently, the cutoff is used when a new campaign will be sent out to customers and a decision of which customers to target has to be made. The test data of the customers is simply sent to the trained model, where the output is the probability of which class they belong to. The cutoff is then used to make the final classifications, i.e. the decision of who will get the campaign.

6.2 Future Work

The market evaluated in this project was one country. It would have been interesting to evaluate whether different markets differ from each other. It might be that customers in different markets react differently to marketing campaigns, and even differently to different kinds of campaigns. Thus, possible future work could be to look into whether the marketing campaign/offer should be of a different kind depending on which market is targeted. This could lead to happier and more loyal customers as well as an uplift for the company in terms of greater sales. Furthermore, the segment of customers evaluated in this project is the frequent kind, i.e. the company's most loyal customers. Depending on the stage of the customer, the offers might vary. One question to consider is whether the best offer should be given to the most loyal customers to keep their interest, or to new customers, or even to the least loyal customers. A new customer might need a reason to gain trust in the company, while the least loyal customer needs a reason to start interacting more frequently with the company. Hence, investigating the other stages of customers in different segments of the customer base might also lead to insights into how to interact with different types of customers.

As mentioned throughout this thesis, the sizes of the treatment and control groups are a crucial matter for being able to perform uplift modeling in a satisfying way and to get an appealing result. For uplift modeling to actually be beneficial when sending out campaign offers in the future, the share of the control group should be larger than just 9.66%, which was the share in most of the campaigns in this project. This is especially important for being able to perform the Class Variable Transformation, as some of the observations in the control group get excluded in the final data set. Using a larger control group in future campaigns could hence yield a better predictive model when performing uplift modeling.

Once the persuadables are identified using uplift modeling, the company gets an indicator of which customers to target with campaign offers. The only known attribute these customers share is that they can be considered to belong to the group of persuadables, but nothing more than that. Investigating this segment of customers further would be interesting, and could yield deeper insights about the customer base. For this purpose, unsupervised learning approaches like clustering methods could be useful. Using a clustering method to investigate this group of customers could give insights into what other attributes are similar for these individuals. Knowing which attributes these individuals share could provide the tools to personalize the campaign offers. This could lead to even more loyal customers and thus a gain for the company.

As noted earlier in this project, the variable selection is a crucial component of the data pre-processing part, and it is an area that can be improved when it comes to Random Forests. When using VI and Cross Validation for classification, the default quality measure for node impurity in the randomForest package [10] in R is the Gini index. Radcliffe and Surry [15] propose that the quality measure should be based on a pessimistic Qini estimate. This is supposed to reduce the likelihood of choosing variables that lead to unstable models. Hence, this could be an improvement to apply in future studies.

Another investigation to add to the modeling with Random Forests is to test other splitting criteria than the squared Euclidean distance. Rzepakowski and Jaroszewicz [16] propose

that the splitting criterion can be based on the Kullback-Leibler divergence and the chi-squared divergence as well. This might not lead to better performance, but is worth considering.

6.3 Final Words

The overall conclusion is that, given the data related to the different campaigns in this project, it is possible to perform uplift modeling and obtain models that make it possible to target only a subgroup of the entire customer base instead of the whole customer base with campaign offers. Doing this, the retail company still receives an incremental gain. For the uplift to be successful, the method of choice should be either the Modeling Uplift Directly approach using Random Forests, or the Class Variable Transformation using Logistic Regression. This is due to the fact that Neural Networks are sensitive to uneven class distributions and were thus not able to obtain stable models given the data in this project. Moreover, Subtraction of Two Models was proven not to yield applicable results, as the two separate models of the treatment data and the control data did not result in satisfying models when combined.

The variable selection was proven to be a crucial part of the modeling process, and thus a lot of focus should be put on this step when building the models. Overall, using NIV as variable selection method yielded good performance. Another crucial component related to this project was the amount of treatment and control data in each campaign. Having a larger amount of control data in future studies would yield even better performing and more stable models, so that resampling can be avoided and the risk of overfitting the models can be decreased.

In total, given the data sets used in this project and the market of choice, the uplift approach works given the right circumstances, and it can thus yield a gain for the retail company to start using it.

References

[1] Casella, G., Fienberg, S., and Olkin, I. Modern Multivariate Statistical Techniques. Artificial Neural Networks. Springer, 2008. DOI: 10.1007/978-0-387-78189-1.

[2] Fawcett, Tom. “An Introduction to ROC analysis”. In: Pattern Recognition Letters 27.2 (June 2006), pp. 861–874. URL: https://www.sciencedirect.com/science/article/abs/pii/S016786550500303X.

[3] Guelman, Leo. “uplift: Uplift Modeling”. In: (2014). R package version 0.3.5. URL:https://CRAN.R-project.org/package=uplift.

[4] Guelman, Leo, Guillén, Montserrat, and Pérez-Marín, Ana M. “Random Forests for Uplift Modeling: An Insurance Customer Retention Case”. In: (2012). Ed. by Kurt J. Engemann, Anna M. Gil-Lafuente, and José M. Merigó, pp. 123–133.

[5] Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning. Data Mining, Inference and Prediction. 2nd ed. Springer Series in Statistics. Springer, 2008.

[6] Izenman, Alan J. Modern Multivariate Statistical Techniques. Regression, Classification and Manifold Learning. Springer Series in Statistics. Springer, 2008.

[7] Jaśkowski, Maciej and Jaroszewicz, Szymon. “Uplift modeling for clinical trial data”. In: ICML Workshop on Machine Learning for Clinical Data Analysis (2012).

[8] Tang, Jiliang, Alelyani, Salem, and Liu, Huan. “Feature Selection for Classification: A Review”. Arizona State University (Jan. 2014), p. 1.

[9] Kozlowska, Iga. “Facebook and Data Privacy in the Age of Cambridge Analytica”. In: The Henry M. Jackson School of International Studies, University of Washington (2018). URL: https://jsis.washington.edu/news/facebook-data-privacy-age-cambridge-analytica/.

[10] Liaw, Andy and Wiener, Matthew. “Classification and Regression by randomForest”.In: R News 2.3 (2002), pp. 18–22. URL: https://CRAN.R-project.org/doc/Rnews/.

[11] Montgomery, Douglas C., Peck, Elizabeth A., and Vining, G. Geoffrey. Introduction toLinear Regression Analysis. Fifth Edition. Wiley Series in Probability and Statistics.Wiley, 2012.

[12] Pedregosa, F. et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[13] R Core Team. R: A Language and Environment for Statistical Computing. RFoundation for Statistical Computing. Vienna, Austria, 2019. URL: https://www.R-project.org/.

[14] Radcliffe, Nicholas J. “Using control groups to target on predicted lift: Building and assessing uplift models”. In: Semantic Scholar (2007), pp. 4–7. URL: https://www.semanticscholar.org/paper/Using-control-groups-to-target-on-predicted-lift%3A-Radcliffe/147b32f3d56566c8654a9999c5477dded233328e?citingPapersSort=is-influential#citing-papers.

[15] Radcliffe, Nicholas J. and Surry, Patrick D. “Real-World Uplift Modelling with Significance-Based Uplift Trees”. 2012.

[16] Rzepakowski, Piotr and Jaroszewicz, Szymon. “Decision trees for uplift modeling with single and multiple treatments”. In: Knowledge and Information Systems 32.2 (Aug. 2012), pp. 303–327. ISSN: 0219-3116. DOI: 10.1007/s10115-011-0434-0. URL: https://doi.org/10.1007/s10115-011-0434-0.

[17] Rzepakowski, Piotr and Jaroszewicz, Szymon. “Uplift modeling in direct marketing”.In: Journal of Telecommunications and Information Technology 2012 (Jan. 2012),pp. 43–50.

[18] SAS/STAT 13.1 User’s Guide. SAS Institute Inc., 2013.

[19] Sołtys, Michał, Jaroszewicz, Szymon, and Rzepakowski, Piotr. “Ensemble methodsfor uplift modeling”. In: Data Mining and Knowledge Discovery 29.6 (Nov. 2015),pp. 1531–1559. ISSN: 1573-756X. DOI: 10.1007/s10618-014-0383-9. URL: https://doi.org/10.1007/s10618-014-0383-9.

[20] Stedman, Craig. “How uplift modeling helped Obama’s campaign — and can aid marketers”. In: Predictive Analytics Times (2013). URL: https://www.predictiveanalyticsworld.com/patimes/how-uplift-modeling-helped-obamas-campaign-and-can-aid-marketers/2613/.

[21] Hastie, Trevor, Tibshirani, Robert, and Wainwright, Martin. Statistical Learning with Sparsity: The Lasso and Generalizations. First Edition. CRC Press, 2015. ISBN: 9781498712163.
