
Santander Customer Satisfaction

Derek van den Elsen

2580100

October 30, 2017

Research Paper Business Analytics

Dr. Mark Hoogendoorn (supervisor)


Contents

1 Introduction
2 Related Work
3 Data Exploration
  3.1 Feature Groups
  3.2 Individual Features
    3.2.1 var3 (Nationality)
    3.2.2 var15 (Age)
    3.2.3 var21
    3.2.4 var36
    3.2.5 var38 (Mortgage Value)
    3.2.6 num_var4 (Number of Bank Products)
4 Preprocessing
  4.1 Data Cleaning
  4.2 Correlation
  4.3 Feature Engineering
5 Modeling
  5.1 Performance Measure
  5.2 Solution Methods
    5.2.1 Logistic Regression
    5.2.2 Decision Tree
    5.2.3 Random Forest
    5.2.4 XGBoost
6 Results
  6.1 Tuning RF
  6.2 Tuning XGBoost
  6.3 Results
7 Conclusion
A Appendix Feature Sets


1 Introduction

Having adequate customer relations is paramount to success in any service industry. Identifying and analyzing your customers' contentment to improve customer retention can yield many benefits. The longer a client stays with an organisation, the more value he creates. There are higher costs attached to introducing and attracting new customers. Long-standing clients also have a better understanding of the organisation and can give positive word-of-mouth promotion (Colgate et al., 1996). Data mining is essential in this process and the practice is widely applied across industries, for instance FMCG retail (Buckinx and van den Poel, 2005), telecommunications (Mozer et al., 2000) and banking (Clemes et al., 2010; Xie et al., 2009).

This paper focuses on Santander Bank, a large corporation focusing principally on the market in the northeast United States. Through a Kaggle competition (Santander, 2015), the objective is to find an appropriate model to predict whether a client will be dissatisfied in the future based on certain characteristics. Having this model in place ensures that Santander can take proactive steps to improve a customer's happiness before they take their business elsewhere.

First the paper discusses related work done on this, by now concluded, Kaggle case. Secondly it delves into the data we work with, analyzing groups of variables and individual features to gain insight into what is relevant. Thirdly the cleaning procedures that were employed to obtain better results are outlined. Fourthly we explain the performance measure of this competition and the three models, Logistic Regression, Random Forest and XGBoost, that we utilize to tackle the problem. Lastly the tuning process and results are discussed. We reach an AUC score of 0.823152.

2 Related Work

This section looks into the work of various Kaggle competitors in different sections of the leaderboard. The private leaderboard score is mentioned first, after which a short description details the work done. For reference, the top position had a score of 0.829072.

0.828530: Silva et al. (2016) each seemingly do their own preprocessing, feature engineering and model selection independently and then combine all model predictions in an ensemble using the R package optim. Preprocessing steps taken include replacing certain values by NA, dropping sparse, constant and duplicated features, normalization, log-transforming features and one-hot encoding categorical features. Sophisticated feature engineering methods employed include t-Distributed Stochastic Neighbour Embedding, Principal Component Analysis and K-means. Models explored are: Follow the Proximally Regularized Leader, Regularized Greedy Forest, Adaboost, XGBoost, Neural Networks, Least Absolute Shrinkage and Selection Operator, Support Vector Machine, Random Forest and an Extra Trees Classifier.

0.826826: Yooyen and Ma (2016) are among the few that find and handle duplicated observations in the train set. Furthermore they take standard preprocessing steps like removing duplicated and constant features, normalizing, rescaling and handling missing values. They select features based on Pearson's correlation coefficient with the target and crossvalidation. Attempts at specifically handling the class imbalance and principal component analysis were fruitless. With success they added heuristic rules, like var15 < 23 meaning the target is 0. Their main models are based on decision trees.

0.8249: Wang (2016) applies hardly any preprocessing, as he intends to use only decision trees, which are relatively insensitive to this. He does extensively select features, using the importance in the Gradient Boosting Classifier as criterion. After that he adds percentile changes from feature to feature and applies selection again. Parameters are tuned iteratively, one at a time, with a coarse-to-fine approach. Adaboost, Bagging Classifier, Extra Trees Classifier, Gradient Boosting, Random Forest and XGBoost are ensembled to produce the final scores.

0.824332: Kumar (2015) starts with linear regression on the top ten features and a support vector machine in combination with principal component analysis, with limited success. Neural Networks, a tuned Random Forest and XGBoost yield better results. XGBoost is seen as the obvious best candidate and is used with the top N features, which makes it apparent that anything beyond 5 features only adds very minor improvements to the crossvalidation score.

3 Data Exploration

The train set consists of 76020 observations and 370 features plus 1 binary target. The test set has a roughly equal number of 75818 observations with the same features. There is a large class imbalance: 96.04% of the targets are 0, meaning the customer was not dissatisfied, and 3.96% are 1, signifying that the customer was dissatisfied. This is in line with what one would expect of the customer satisfaction of a successful bank. There are no missing values in the train or test set, but some values might encode 'missing'. All features are numeric or possibly categorical. No features seem to have substantial outliers.

The dataset is semi-anonymized, so it is unclear what a feature represents. The only clue we have is a header with a name for each feature that is clearly not randomly determined. For illustrative purposes, the first seven names are given in order from left to right: ID, var3, var15, imp_ent_var16_ult1, imp_op_var39_comer_ult1, imp_op_var39_comer_ult3 and imp_op_var40_comer_ult1. Some of these words appear to be abbreviations of Spanish words, like 'imp' for importe (amount). A non-comprehensive dictionary is discussed on the Kaggle forum (Andreu, 2015). At first glance one can also infer that imp_op_var39_comer_ult1 and imp_op_var39_comer_ult3 are probably related and that they are likely not related to var3. Looking at the distribution of the data can confirm these suspicions. We first distinguish some groups based on their names and broadly research each group in turn. Then we look into individual features that do not fit into a group in more depth. Aside from the clearly irrelevant ID variable, this covers all the features and is more practical than discussing all 370 features thoroughly.


3.1 Feature Groups

The sheer number of features in this dataset makes it hard to analyze and discuss each feature individually, so instead similar groups have been identified in the dataset, assisted by their corresponding names. Table 1 shows the substrings that were filtered on. There is some overlap between the groups: all 'Meses' features are also 'Num' features, for example, and all 'Delta' features are also either 'Imp' or 'Num' features.

Table 1: Variable Groupings

Substring   Example                      Number (Raw)
'Num'       num_var37_med_ult2           87 (155)
'Ind'       ind_var13_0                  46 (75)
'Saldo'     saldo_medio_var5_hace3       43 (71)
'Imp'       imp_op_var39_comer_ult1      21 (63)
'Delta'     delta_imp_amort_var18_1y3     3 (26)
'Meses'     num_meses_var5_ult3           8 (11)

The substring 'Num' likely stands for numeric variables. Typically, excluding the 'Delta' and 'Meses' subsets, these have 0, 3, 6, 9 and further multiples of 3 as their most common values, which could indicate quarters for example. 0 and 3 tend to be most common and the distribution is usually unbalanced. The substring 'Ind' likely stands for indicator variables, as all of them are 0 or 1. Their distribution is usually unbalanced, but not consistently towards 1 or 0.

The substring 'Saldo' suggests the current actual amount on balance for certain financial products. A lot of these variables have an overwhelming number of zeros, possibly representing finished financial products or products that are not utilized in the first place. Other values are typically numeric and of large scale, like 6119500. Some values are also negative, further suggesting that this is a 'balance' type variable. Very similarly, 'Imp' can stand for importe (Spanish for amount) and the distributions match this conjecture. Notably, the scale tends to be smaller, so perhaps 'Saldo' is a sum over consecutive periods.

The substring 'Delta' signifies a difference of some kind, but the variables' distributions look like ratios, so it is possibly the ratio of an amount between two points in time. Here too there is a massive imbalance towards the value 0. All 'Meses' variables are 'Num' variables, but they are specifically taken as a subgroup because they have a wildly different distribution. 'Meses' is Spanish for months, and in the data these variables only take the values 0, 1, 2 and 3. Some of the 'Meses' variables are even fairly balanced, which is rare within this dataset.

3.2 Individual Features

These features all have a very short name and could be considered a group of their own. However, upon closer inspection of at least two of them, it becomes apparent that they are clearly not related in the same manner as the previous groups of features. This individuality also means we need to consider each feature separately.


3.2.1 var3 (Nationality)

Table 2: var3

Maximum Value              238
Minimum Value              -999999
Unique Observation Count   208
Most common (count)        2 (74165)
2nd most common (count)    8 (138)
3rd most common (count)    -999999 (116)
4th most common (count)    9 (110)
5th most common (count)    3 (108)
Suspected Category         Categorical

var3 is suspected to be nationality or country of residence; 208 unique countries sounds like a plausible number for the bank to serve. 74165 observations are 2, which probably stands for the United States, the main market of this particular bank. A binary feature var3_most_common is made to put emphasis on this, which is 1 if var3 equals 2 and 0 otherwise. The value -999999 likely encodes missing values and another binary feature, var3_missing, accounts for this. After this feature is made, the -999999 values are replaced by the most commonly occurring value 2.
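As a concrete illustration, the var3 handling described above could look roughly as follows in pandas. This is a minimal sketch assuming the Kaggle train.csv has been loaded into a DataFrame called train; it is not a reproduction of the code actually used.

import pandas as pd

train = pd.read_csv("train.csv")  # Kaggle train set containing the raw var3 column

# Flag the dominant value 2 and the -999999 'missing' code as binary features.
train["var3_most_common"] = (train["var3"] == 2).astype(int)
train["var3_missing"] = (train["var3"] == -999999).astype(int)

# Once the flags exist, replace the missing code by the most common value.
train["var3"] = train["var3"].replace(-999999, 2)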

3.2.2 var15 (Age)

Table 3: var15

Maximum Value              105
Minimum Value              5
Mean                       33.212865
Unique Observation Count   100
Most common (count)        23 (20170)
2nd most common (count)    24 (6232)
3rd most common (count)    25 (4217)
4th most common (count)    26 (3270)
5th most common (count)    27 (2861)
Suspected Category         Numeric

var15 is suspected to represent age, as the minimum and maximum values are 5 and 105 respectively, and the majority of the data is over 21. The data seems heavily skewed towards younger people and perhaps 23 is filled in when the age is unknown. For this reason we make another binary feature, var15_most_common, which is 1 if var15 equals 23 and 0 otherwise.


Figure 1: Histogram Age (histogram of var15, counts for all observations and for Target = 1)

3.2.3 var21

Table 4: var21

Maximum Value              30000
Minimum Value              0
Unique Observation Count   25
Most common (count)        0 (75152)
2nd most common (count)    900 (236)
3rd most common (count)    1800 (206)
4th most common (count)    4500 (96)
5th most common (count)    3000 (84)
Suspected Category         Numeric

Not every variable in this anonymized dataset has an easy interpretation, and var21 is a prime example. It is highly imbalanced and the non-zero values do not suggest a likely meaning. Nevertheless it could still be important, as it is clearly distinct from the other variables.


3.2.4 var36

Table 5: var36

Unique Observation Count   5
Most common (count)        99 (30064)
2nd most common (count)    3 (22177)
3rd most common (count)    1 (14664)
4th most common (count)    2 (8704)
5th most common (count)    0 (411)
Suspected Category         Categorical

There is not much to be said about this particular variable, except that it is very likely categorical. One-hot encoding is applied so that it can be better understood by our classifiers. In the same vein, 99, which likely stands for missing values, gets its own proper encoding.
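A minimal pandas sketch of this encoding, reusing the train DataFrame from the var3 example; whether the original column is kept alongside the indicators is an assumption.

import pandas as pd

# One indicator column per level of var36; the var36_99 column doubles as a missing flag.
dummies = pd.get_dummies(train["var36"].astype(int), prefix="var36")
train = pd.concat([train, dummies], axis=1)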

3.2.5 var38 (Mortgage Value)

Table 6: var38

Maximum Value              22034740
Minimum Value              5163.75
Mean                       117235
Unique Observation Count   57736
Most common (count)        117310.979016494 (14868)
2nd most common (count)    451931.22 (16)
3rd most common (count)    463625.16 (12)
4th most common (count)    288997.44 (11)
5th most common (count)    104563.80 (11)
Suspected Category         Numeric

This distribution ranges from low to high positive numbers, with a very large number of observations sharing the same value 117310. Our conjecture is that this represents the mortgage value of a customer, or at least some kind of value-indicating variable, and that the country average is filled in whenever it is unknown. For this reason we create a dummy variable var38_most_common that remembers this information: it is 1 if var38 is 117310 and 0 otherwise. We visualize the distribution of the known values in Figure 2 with a histogram that excludes 117310 and cuts off the range at 350000, which excludes 1559 more observations.


Figure 2: Histogram Mortgage (histogram of var38 up to 350000, excluding the most common value; all observations and Target = 1)

3.2.6 num_var4 (Number of Bank Products)

Table 7: num_var4

Maximum Value              7
Minimum Value              0
Unique Observation Count   8
Most common (count)        1 (38147)
2nd most common (count)    0 (19528)
3rd most common (count)    2 (12692)
4th most common (count)    3 (4377)
5th most common (count)    4 (1031)
Suspected Category         Numeric

According to dmi3kno (2015) this variable represents the number of bank products the client currently has with the bank. The distribution suggests that fewer people have multiple products with the bank and that those who do tend not to be dissatisfied, giving this feature explanatory value. This is also intuitive, as clients investing in multiple products probably have a good relationship with the bank, and conversely the bank has more information to satisfy such a client appropriately.


Figure 3: Histogram Bank Products (histogram of num_var4, counts for all observations and for Target = 1)

4 Preprocessing

4.1 Data Cleaning

At first glance the data is fairly clean; however, there are definite problems with running certain features and observations through a machine learning algorithm. Figure 4 shows the order in which the cleaning steps were taken. The left number represents the number of observations and the right number the number of features.


76020 | 371   (raw train set)
76020 | 337   Removed Constant Features
76020 | 308   Removed Duplicated Features
76020 | 204   Removed Uninformative Features
76020 | 203   Removed Index
71108 | 203   Removed Duplicated Observations
71108 | 200   Removed Delta Features
71108 | 171   Removed Correlated Features

Figure 4: Cleaning Process

There are 34 features that consist of one constant value (zero in all checked cases). Features that do not vary at all cannot teach our classifier anything meaningful, so all features with a standard deviation of zero are omitted. 29 features are exact duplicates of one another as well. We remove the redundancy by keeping only the first feature of each duplicate group. Note that the order of these cleaning procedures influences the number of removed features.
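A sketch of these first two cleaning steps in pandas, using the same train DataFrame as before; this is an assumed implementation of the procedure described above.

# Drop features with zero standard deviation (constant columns).
stds = train.std(numeric_only=True)
train = train.drop(columns=stds.index[stds == 0])

# Drop exact duplicate features, keeping the first column of each duplicate group.
train = train.loc[:, ~train.T.duplicated()]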

Some features have as few as 2 non-zero values. We categorize features that have very few non-zero values as uninformative for our classifier. We remove 104 features from the data by arbitrarily drawing the line at 100, effectively removing all features with fewer than 100 non-zero observations. Several of these were checked by eye and having a non-zero value was not very discriminatory regarding the Target value. For context, Figure 5 shows what different choices would have meant for the dimension of the feature set. Finally we remove the index, and in doing so assume that the training data appears to us in random order rather than some meaningful order.


Figure 5: Information Parameter Range (feature-set dimension versus the non-zero-observation cutoff, 0 to 500)

There are also 4912 observations among the 76020 in the training data that are exact duplicates in their features. Worse, 109 of those duplicates have a differing target, which can only 'confuse' a classifier. One could argue that the classifier would simply attach less confidence to these observations, but to avoid the issue altogether they are all excluded from the dataset. Of the duplicates that share the same target only the first is kept, and 71108 observations remain after this procedure.
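The duplicate handling could be written roughly as follows; the grouping logic is an assumption based on the description above, with TARGET being the competition's label column.

feature_cols = [c for c in train.columns if c != "TARGET"]

# Drop every observation whose feature values occur with more than one distinct target.
n_targets = train.groupby(feature_cols)["TARGET"].transform("nunique")
train = train[n_targets == 1]

# For duplicates that agree on the target, keep only the first occurrence.
train = train.drop_duplicates(subset=feature_cols, keep="first")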

The three 'Delta' variables that are left over are still relatively uninformative. They also do not appear to have great predictive power in any of the related literature studied. For these reasons they are removed as well. They are the only variable group for which this is deemed appropriate.

4.2 Correlation

Correlation can be a powerful tool in machine learning. One of a pair of heavily correlated predictors can be removed without harming the predictive power of a model, and a very high absolute correlation with the target can indicate that a variable is important. We apply the former and also look at the top features in terms of correlation with the target. We first apply normalization, where appropriate, to scale all variables between 0 and 1. We note that several of the features are hugely imbalanced and this type of scaling does not damage that. A feature X becomes the normalized feature Z with the following formula:

z_i = (x_i − min(X)) / (max(X) − min(X)),   ∀ i = 1, ..., length(X)
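In code, this scaling is simply the formula above applied column-wise; the column selection in this small sketch is an assumption mirroring the groups mentioned in the text.

import numpy as np

def min_max_normalize(x):
    """Scale a feature to [0, 1]: z_i = (x_i - min(X)) / (max(X) - min(X))."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros(len(x))

cols = ["var15", "var38"] + [c for c in train.columns
                             if c.startswith(("num", "imp", "saldo"))]
train[cols] = train[cols].apply(min_max_normalize)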

After applying this to var15, var38 and the 'Num', 'Imp' and 'Saldo' features, we plot the correlation matrix in Figure 6 and zoom in on it in Figure 7.

Figure 6: Correlation Matrix

Figure 7: Correlation Matrix Zoom in (var3, var15, imp_ent_var16_ult1 and the imp_op_var39/40/41 features)


Here it is clearly visible that, for instance, imp_op_var39_comer_ult1 is heavily correlated with imp_op_var39_comer_ult3, further confirming that similar variable names are related and justifying the grouping of variables done earlier. Some features even have a correlation extremely close to 1 with each other, like num_var1_0 and ind_var1_0 with 0.9988. All variable pairs with an absolute correlation higher than a conservative 0.99 are filtered out and the first of each pair is deleted. An extension of this could instead delete the member of the pair with the lower absolute correlation with the target. This removes 29 features and brings the cleaned dataset to 171 features in total. Figure 8 visualizes what different correlation thresholds would mean for the dimension of the feature set. We have been relatively conservative, as the tree-based algorithms we use are adept at handling correlated features and the dataset is already comparatively small.

Figure 8: Correlation Thresholds (feature-set dimension versus correlation threshold, 0.70 to 1.00)
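A sketch of the 0.99 filter described above; this is an assumed implementation, and which member of each correlated pair gets dropped follows the text, i.e. the first-named one.

import numpy as np

corr = train.drop(columns="TARGET").corr().abs()

# Consider every pair (i, j) with i < j exactly once and drop the first of the pair.
pairs = np.triu(corr.values > 0.99, k=1)
to_drop = {corr.index[i] for i, j in zip(*np.where(pairs))}
train = train.drop(columns=sorted(to_drop))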

Furthermore, some of the top correlations with the target are shown in Table 8. It is likely that these will be important predictors later on.

Table 8: Top Correlations with Target

Negative                              Positive
ind_var30 (-0.149756)                 var36 (0.102401)
num_meses_var5_ult3 (-0.147362)       var15 (0.097341)
num_var30 (-0.137623)                 ind_var8_0 (0.048493)
num_var42 (-0.134246)                 imp_op_var41_efect_ult1 (0.030599)
ind_var5 (-0.133128)                  ind_var8 (0.029038)


4.3 Feature Engineering

This dataset mostly lent itself to decoding what the existing data means and making sure it is in an appropriate format for machine learning algorithms. Nevertheless this led to the creation of some features, mentioned before, to accommodate expected oddities in the data. For instance the feature var38_most_common was created to ensure that the algorithm can recognize that this is likely a special value outside of the numerical order that var38 has. Beyond the features mentioned before, the 'Meses' group was completely one-hot encoded, because these variables were small enough to warrant such an approach and seemed categorical at first glance.

Furthermore it is noticeable that the data contains a lot of zeros. A zero usually stands for a lack of information or interaction, and precisely that lack of knowing thy customer could hold predictive value for determining whether the customer will be dissatisfied in the future. For this reason n0 was created, which simply counts the number of zeros that appear in the row (dmi3kno, 2015). Special care is taken not to include the Target variable, as that would be a form of information leakage which disturbs the learning process of the models. Conversely, n1 was created to signify that the bank does know its customer and has a lot of interaction with them; because it counts ones rather than everything non-zero, it is not redundant with a variable that simply counts the non-zero entries in the row. This brings the dimensions of our training set to 71108 observations and 215 features to start the modeling process.
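A pandas sketch of the two count features; this is an assumed implementation, with the Target column excluded to avoid the leakage mentioned above.

feature_cols = [c for c in train.columns if c != "TARGET"]

train["n0"] = (train[feature_cols] == 0).sum(axis=1)  # zeros: lack of interaction
train["n1"] = (train[feature_cols] == 1).sum(axis=1)  # ones: indicators that fire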

5 Modeling

5.1 Performance Measure

The performance measure of this Kaggle competition is the area under the receiver operating characteristic curve, AUROC or AUC for short. This metric deals well with the imbalance that is typical in churn prediction (Burez and van den Poel, 2009). We build up to this concept by considering some simpler notions first (Dernoncourt, 2015). A binary classifier has four possible outcomes when we use it to make a binary prediction; the collection of their counts is called a confusion matrix, and an example is shown in Figure 9.

True negative: we predict 0 and the class is actually 0.
False negative: we predict 0 and the class is actually 1.
True positive: we predict 1 and the class is actually 1.
False positive: we predict 1 and the class is actually 0.

Figure 9: Confusion Matrix Example


We define the following ratios. The True Positive Rate is the proportion of positive data points that are correctly classified as positive, with respect to all positive data points; the higher the better, all else equal. The False Positive Rate is the proportion of negative data points that are incorrectly classified as positive, with respect to all negative data points; the lower the better, all else equal.

True Positive Rate (TPR) = TP / (TP + FN)

False Positive Rate (FPR) = FP / (FP + TN)

Binary classifiers usually predict the probability with which they expect a 1 to occur. The threshold above which this probability leads to predicting a 1 is an arbitrary decision. It is possible to consider every single possible threshold and plot the corresponding pairs of TPR and FPR of the resulting predictions as a ROC curve; an example is given in Figure 10. To obtain a single performance metric we take the area under the ROC curve, effectively taking both TPR and FPR into account. The baseline for this metric lies at 0.5, where we predict completely randomly. Anything performing worse can simply be inverted to do better.

Figure 10: AUC Example
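With scikit-learn the curve and the area under it follow directly from predicted probabilities. In the sketch below, y_valid and proba_valid are placeholders for a validation fold and a fitted model's predict_proba output.

from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y_valid, proba_valid)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_valid, proba_valid)                # area under that curve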

5.2 Solution Methods

We use three different methods to come to a solution: Logistic Regression, Random Forest from Scikit-Learn (Pedregosa et al., 2011) and XGBoost (Chen and Guestrin, 2016). We apply 10-fold stratified crossvalidation on the train set to compare models internally, then fit the best model on the entire train set, predict labels for the test set and validate these on the Kaggle site for the final score. Stratification in this context refers to ensuring that each fold of the crossvalidation has roughly the same class distribution as the original whole data set. Stratification tends to improve the accuracy of crossvalidation as a means of testing when dealing with imbalanced data (Kohavi, 1995).
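A sketch of this evaluation scheme with scikit-learn; model, X_train and y_train are placeholders for one of the classifiers below and the cleaned train set.

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())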

5.2.1 Logistic Regression

Logistic regression is a relatively 'simple' machine learning algorithm and we expect fast, but not great, results. It fits constant coefficients that relate the features to the target on the train set by minimizing an error term, and then applies this same formula to the test set. It is pursued here as a baseline against which to compare the more sophisticated models. For more details see Bishop (2006).
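A minimal baseline sketch; the hyperparameters are assumptions, not reported settings.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
proba_test = logreg.predict_proba(X_test)[:, 1]  # probabilities feed into the AUC metric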

5.2.2 Decision Tree

The following two algorithms rely on multiple decision trees, so this section roughly describes what a decision tree is; see Breiman et al. (1984) for more details. To illustrate the concept, we run a basic tree classifier implementation on our data and obtain Figure 11.

Figure 11: Decision Tree Example (depth-2 tree; root split on var15 <= 0.215)

The concept is fairly simple. When a prediction needs to be made, go from the top of the tree to the bottom: go left at the top if var15 is lower than 0.215, right otherwise. Repeat this process at every node until a leaf is reached; the prediction corresponds to the class label of that leaf. The tree is constructed by greedily finding, at each depth level, the best feature to split on, maximizing information gain via the Gini impurity. Let p0 be the proportion of observations that belong to class 0 and p1 the proportion of observations that belong to class 1, out of all observations in the node.

Gini = 1 − (p0^2 + p1^2)


This metric is one way to measure how pure a split is: the lower it is the purer, with a minimum of 0 and a maximum of 0.5. A split is purer if in each branch the proportion of positive examples is close to 0 or 1, in other words if the split is highly discriminatory. The maximum depth of this tree is set to 2 to keep it tractable, which also effectively keeps it from overfitting; a decision tree can otherwise perfectly match a training set, which does not generalize well.
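The impurity itself is a one-line computation; a small illustration:

import numpy as np

def gini_impurity(labels):
    """Gini = 1 - (p0^2 + p1^2) for a binary node."""
    p1 = np.mean(labels)
    p0 = 1.0 - p1
    return 1.0 - (p0 ** 2 + p1 ** 2)

print(gini_impurity([0, 0, 0, 0]))  # 0.0  (pure node)
print(gini_impurity([0, 1, 0, 1]))  # 0.5  (perfectly mixed node)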

5.2.3 Random Forest

The Random Forest model generates multiple decision trees (Breiman, 2001). At each split only a randomly selected subset of the predictive features is considered. Together, the decision trees lead to a single prediction by averaging the predictions they give individually.

The parameters of a Random Forest model are really important and need to be tuned appropriately. N_ESTIMATORS determines the number of trees in the forest. MAX_FEATURES controls the number of randomly selected features to consider when determining the best split. MAX_DEPTH limits the depth of each tree in the forest. MIN_SAMPLES_SPLIT determines the minimum number of observations that need to be in a node for it to be considered for splitting. MIN_SAMPLES_LEAF requires that leaf nodes have at least this number of observations. N_JOBS controls the number of processors; trees in a random forest can be built in parallel, so the more cores working, the less computation time needed. CLASS_WEIGHT can be set to 'balanced' to deal with imbalanced datasets.
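Put together, these parameters map directly onto scikit-learn's RandomForestClassifier. The values below are illustrative (the tuned values appear in Table 9) and do not reproduce the exact configuration used.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=2000,        # number of trees
    max_features="sqrt",      # features considered at each split
    max_depth=20,             # maximum depth per tree
    min_samples_split=2,      # minimum observations in a node to attempt a split
    min_samples_leaf=50,      # minimum observations per leaf
    class_weight="balanced",  # reweight classes for the imbalance
    n_jobs=-1,                # build trees on all available cores
    random_state=0,
)
rf.fit(X_train, y_train)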

5.2.4 XGBoost

XGBoost is a method where the outcome is formed by a combination of multiple trees. The trees are built iteratively: every time a new tree is built, it focuses on the parts where the previous ones make mistakes, by assigning higher weights to these instances. This stands in contrast to Random Forest, where trees are built independently of each other.

The parameters are again of great importance and there are even more of them. N_ESTIMATORS and MAX_DEPTH are the same as for the Random Forest model. LEARNING_RATE controls the speed and precision of the model: the lower it is, the more accurate the model becomes, but the more rounds it needs to converge; it is a constant that scales how much each newly built tree affects the current predictions. SUBSAMPLE is the fraction of observations to be sampled randomly for each tree. COLSAMPLE_BYTREE denotes the fraction of features to take into consideration during the random sampling for each tree. MIN_CHILD_WEIGHT defines the minimum sum of observation weights required in a child. SCALE_POS_WEIGHT is a parameter to combat class imbalance. XGBoost directly avoids overfitting by promoting simplicity of the model in the objective via regularization, unlike Random Forest, which only limits the way trees can grow by imposing constraints (Chen, 2014).

min Obj(Ω) = Training Loss Function + Regularization

min Obj(Ω) = Σ_{i=1}^{n} l(y_i, ŷ_i) + α Σ_{i=1}^{k} |w_i| + λ Σ_{i=1}^{k} w_i^2 + γT


REG_ALPHA is the parameter for l1 regularization, LAMBDA the parameter for l2 regularization, and GAMMA a regularization parameter that is multiplied by the number of leaves T. This regularization penalizes the objective for building overly complex trees, to avoid overfitting.
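The parameters above correspond one-to-one to arguments of the XGBoost scikit-learn wrapper. The sketch below uses illustrative placeholder values, not the tuned settings of Section 6.2.

from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=500,        # number of boosting rounds (trees)
    max_depth=5,             # depth of each tree
    learning_rate=0.05,      # shrinkage applied to every new tree
    subsample=0.8,           # fraction of observations sampled per tree
    colsample_bytree=0.8,    # fraction of features sampled per tree
    min_child_weight=1,      # minimum sum of observation weights in a child
    scale_pos_weight=24,     # roughly the negative/positive ratio (96/4)
    reg_alpha=0.0,           # l1 regularization (alpha)
    reg_lambda=1.0,          # l2 regularization (lambda)
    gamma=0.0,               # penalty per additional leaf
    eval_metric="auc",
)
xgb.fit(X_train, y_train)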

6 Results

6.1 Tuning RF

It is difficult to determine which set of parameters is optimal for a specific problem. We utilize the GridSearchCV() function (Buitinck et al., 2013) with 5-fold stratified crossvalidation and the area under the ROC curve as scoring option to come to an answer. This basically boils down to trying all possible combinations of a set of predetermined parameter spaces. The parameter spaces and the results of the grid search are shown in Table 9; a sketch of this search is given after the table. These parameter spaces were seen as the optimal tradeoff between computation time and accurately tuning the model.

Table 9: GridSearchCV()

Parameter           Range                            Best Value Combination
N_ESTIMATORS        [100, 500, 1000, 2000, 3000]     2000
MAX_FEATURES        ['sqrt', 'log2']                 'sqrt'
MAX_DEPTH           [None, 5, 20]                    20
MIN_SAMPLES_LEAF    [10, 30, 50, 100]                50
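A sketch of the grid search over the Table 9 parameter space; the wiring is assumed, and the parameter names are the lower-case scikit-learn equivalents of those in the table.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "n_estimators": [100, 500, 1000, 2000, 3000],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 5, 20],
    "min_samples_leaf": [10, 30, 50, 100],
}
search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)
print(search.best_params_)  # the best combination reported in Table 9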

With the aforementioned best parameter set we fit the model on the training data and plot how relevant each feature is in Figure 12. These importances represent the total Gini impurity decrease, weighted by the probability of reaching a node that splits on the feature, averaged over all trees. We are pleased to see that several of the features we created, like var15_most_common, appear to be important.

Figure 12: Feature Importance (var15, var15_most_common, n1, saldo_var30 and saldo_medio_var5_ult3 rank among the most important features)


We also plot one of the many trees the forest model creates in Figure 13, namely the very first one, and we can see it is considerable in size.

Figure 13: First tree of the tuned Random Forest (root split on saldo_var5 <= 0.005; full node detail omitted)

class = y[0]

gini = 0.399samples = 0.1%

value = [0.725, 0.275]class = y[0]

gini = -0.0samples = 0.2%value = [1.0, 0.0]

class = y[0]

saldo_var42 <= 0.004gini = 0.384

samples = 1.6%value = [0.741, 0.259]

class = y[0]

num_op_var39_ult1 <= 0.01gini = 0.209

samples = 1.0%value = [0.881, 0.119]

class = y[0]

var15 <= 0.515gini = 0.429

samples = 1.2%value = [0.688, 0.312]

class = y[0]

num_med_var45_ult3 <= 0.017gini = 0.136

samples = 0.4%value = [0.927, 0.073]

class = y[0]

saldo_var42 <= 0.003gini = 0.367

samples = 1.0%value = [0.758, 0.242]

class = y[0]

gini = 0.494samples = 0.2%

value = [0.447, 0.553]class = y[1]

num_var35 <= 0.125gini = 0.464

samples = 0.5%value = [0.635, 0.365]

class = y[0]

num_op_var41_hace2 <= 0.006gini = 0.122

samples = 0.5%value = [0.935, 0.065]

class = y[0]

saldo_var5 <= 0.008gini = 0.496

samples = 0.2%value = [0.546, 0.454]

class = y[0]

imp_trans_var37_ult1 <= 0.0gini = 0.399

samples = 0.3%value = [0.725, 0.275]

class = y[0]

gini = 0.385samples = 0.1%

value = [0.74, 0.26]class = y[0]

gini = 0.493samples = 0.1%

value = [0.442, 0.558]class = y[1]

gini = 0.499samples = 0.1%

value = [0.475, 0.525]class = y[1]

gini = -0.0samples = 0.2%value = [1.0, 0.0]

class = y[0]

var15 <= 0.395gini = 0.148

samples = 0.4%value = [0.919, 0.081]

class = y[0]

gini = 0.0samples = 0.1%value = [1.0, 0.0]

class = y[0]

gini = -0.0samples = 0.3%value = [1.0, 0.0]

class = y[0]

gini = 0.36samples = 0.1%

value = [0.765, 0.235]class = y[0]

gini = 0.365samples = 0.1%

value = [0.76, 0.24]class = y[0]

gini = -0.0samples = 0.3%value = [1.0, 0.0]

class = y[0]

var15 <= 0.395gini = 0.07

samples = 0.9%value = [0.963, 0.037]

class = y[0]

gini = 0.499samples = 0.1%

value = [0.526, 0.474]class = y[0]

n0 <= 165.5gini = 0.139

samples = 0.4%value = [0.925, 0.075]

class = y[0]

gini = -0.0samples = 0.5%value = [1.0, 0.0]

class = y[0]

gini = 0.279samples = 0.2%

value = [0.833, 0.167]class = y[0]

gini = 0.0samples = 0.3%value = [1.0, 0.0]

class = y[0]

n0 <= 144.5gini = 0.222

samples = 0.9%value = [0.873, 0.127]

class = y[0]

gini = 0.458samples = 0.1%

value = [0.644, 0.356]class = y[0]

gini = 0.0samples = 0.3%value = [1.0, 0.0]

class = y[0]

saldo_var30 <= 0.029gini = 0.308

samples = 0.6%value = [0.81, 0.19]

class = y[0]

gini = 0.0samples = 0.3%value = [1.0, 0.0]

class = y[0]

gini = 0.418samples = 0.3%

value = [0.702, 0.298]class = y[0]

saldo_medio_var5_ult1 <= 0.002gini = 0.174

samples = 1.0%value = [0.904, 0.096]

class = y[0]

saldo_medio_var5_hace2 <= 0.0gini = 0.064

samples = 6.0%value = [0.967, 0.033]

class = y[0]

gini = 0.465samples = 0.2%

value = [0.632, 0.368]class = y[0]

gini = -0.0samples = 0.8%value = [1.0, 0.0]

class = y[0]

num_var4 <= 0.214gini = 0.226

samples = 1.2%value = [0.87, 0.13]

class = y[0]

num_op_var41_ult3 <= 0.022gini = 0.014

samples = 4.8%value = [0.993, 0.007]

class = y[0]

saldo_var42 <= 0.002gini = 0.286

samples = 0.9%value = [0.827, 0.173]

class = y[0]

gini = 0.0samples = 0.3%value = [1.0, 0.0]

class = y[0]

saldo_medio_var5_ult1 <= 0.002gini = 0.219

samples = 0.8%value = [0.875, 0.125]

class = y[0]

gini = 0.482samples = 0.1%

value = [0.594, 0.406]class = y[0]

var38 <= 0.003gini = 0.29

samples = 0.5%value = [0.824, 0.176]

class = y[0]

gini = -0.0samples = 0.3%value = [1.0, 0.0]

class = y[0]

gini = 0.5samples = 0.1%

value = [0.49, 0.51]class = y[1]

gini = -0.0samples = 0.4%value = [1.0, 0.0]

class = y[0]

gini = 0.0samples = 4.5%value = [1.0, 0.0]

class = y[0]

imp_op_var39_comer_ult3 <= 0.015gini = 0.187

samples = 0.3%value = [0.896, 0.104]

class = y[0]

gini = 0.0samples = 0.2%value = [1.0, 0.0]

class = y[0]

gini = 0.365samples = 0.1%

value = [0.76, 0.24]class = y[0]

gini = 0.294samples = 0.1%

value = [0.179, 0.821]class = y[1]

n0 <= 141.5gini = 0.492

samples = 1.9%value = [0.563, 0.437]

class = y[0]

num_var22_ult3 <= 0.045gini = 0.498

samples = 1.0%value = [0.466, 0.534]

class = y[1]

num_var43_recib_ult1 <= 0.017gini = 0.408

samples = 0.9%value = [0.715, 0.285]

class = y[0]

imp_var43_emit_ult1 <= 0.0gini = 0.43

samples = 0.7%value = [0.686, 0.314]

class = y[0]

saldo_var5 <= 0.006gini = 0.388

samples = 0.3%value = [0.263, 0.737]

class = y[1]

num_var45_ult1 <= 0.021gini = 0.46

samples = 0.5%value = [0.642, 0.358]

class = y[0]

gini = -0.0samples = 0.1%value = [1.0, 0.0]

class = y[0]

num_op_var41_efect_ult3 <= 0.087gini = 0.499

samples = 0.3%value = [0.526, 0.474]

class = y[0]

imp_op_var41_comer_ult1 <= 0.038gini = 0.312

samples = 0.3%value = [0.807, 0.193]

class = y[0]

gini = 0.333samples = 0.1%

value = [0.789, 0.211]class = y[0]

gini = 0.481samples = 0.1%

value = [0.402, 0.598]class = y[1]

gini = 0.472samples = 0.1%

value = [0.619, 0.381]class = y[0]

gini = -0.0samples = 0.2%value = [1.0, 0.0]

class = y[0]

gini = 0.5samples = 0.2%

value = [0.486, 0.514]class = y[1]

gini = 0.272samples = 0.2%

value = [0.163, 0.837]class = y[1]saldo_var30 <= 0.002

gini = 0.437samples = 0.7%

value = [0.678, 0.322]class = y[0]

gini = 0.231samples = 0.2%

value = [0.867, 0.133]class = y[0]

gini = 0.0samples = 0.2%value = [1.0, 0.0]

class = y[0]

num_op_var41_comer_ult3 <= 0.065gini = 0.483

samples = 0.5%value = [0.592, 0.408]

class = y[0]

gini = 0.0samples = 0.1%value = [1.0, 0.0]

class = y[0]

num_op_var39_comer_ult3 <= 0.078gini = 0.5

samples = 0.3%value = [0.512, 0.488]

class = y[0]

gini = 0.486samples = 0.2%

value = [0.417, 0.583]class = y[1]

gini = 0.416samples = 0.1%

value = [0.705, 0.295]class = y[0]

saldo_var30 <= 0.06gini = 0.107

samples = 1.2%value = [0.943, 0.057]

class = y[0]

gini = 0.488samples = 0.1%

value = [0.578, 0.422]class = y[0]

num_var43_recib_ult1 <= 0.006gini = 0.063

samples = 1.0%value = [0.967, 0.033]

class = y[0]

gini = 0.339samples = 0.1%

value = [0.784, 0.216]class = y[0]

gini = 0.0samples = 0.8%value = [1.0, 0.0]

class = y[0]

gini = 0.239samples = 0.2%

value = [0.861, 0.139]class = y[0]

saldo_medio_var5_hace2 <= 0.0gini = 0.5

samples = 3.4%value = [0.503, 0.497]

class = y[0]

saldo_var42 <= 0.006gini = 0.412

samples = 2.3%value = [0.709, 0.291]

class = y[0]

saldo_medio_var5_hace3 <= 0.0gini = 0.442

samples = 0.6%value = [0.329, 0.671]

class = y[1]

num_var43_recib_ult1 <= 0.051gini = 0.492

samples = 2.8%value = [0.564, 0.436]

class = y[0]

num_var45_ult1 <= 0.009gini = 0.413

samples = 0.4%value = [0.291, 0.709]

class = y[1]

gini = 0.499samples = 0.2%

value = [0.48, 0.52]class = y[1]

gini = 0.295samples = 0.1%

value = [0.18, 0.82]class = y[1]

num_var45_ult1 <= 0.032gini = 0.478

samples = 0.3%value = [0.394, 0.606]

class = y[1]

gini = 0.0samples = 0.1%value = [1.0, 0.0]

class = y[0]

gini = 0.399samples = 0.2%

value = [0.276, 0.724]class = y[1]

var15 <= 0.365gini = 0.482

samples = 2.6%value = [0.595, 0.405]

class = y[0]

gini = 0.404samples = 0.2%

value = [0.281, 0.719]class = y[1]

num_var22_hace3 <= 0.042gini = 0.38

samples = 1.8%value = [0.745, 0.255]

class = y[0]

imp_op_var39_comer_ult1 <= 0.042gini = 0.487

samples = 0.9%value = [0.419, 0.581]

class = y[1]

num_op_var39_ult3 <= 0.119gini = 0.42

samples = 1.4%value = [0.701, 0.299]

class = y[0]

n1 <= 14.5gini = 0.139

samples = 0.4%value = [0.925, 0.075]

class = y[0]

var38 <= 0.004gini = 0.33

samples = 1.2%value = [0.791, 0.209]

class = y[0]

gini = 0.437samples = 0.1%

value = [0.323, 0.677]class = y[1]

num_var45_ult1 <= 0.003gini = 0.115

samples = 0.5%value = [0.939, 0.061]

class = y[0]

imp_op_var41_comer_ult3 <= 0.029gini = 0.419

samples = 0.7%value = [0.702, 0.298]

class = y[0]

gini = 0.35samples = 0.1%

value = [0.773, 0.227]class = y[0]

gini = -0.0samples = 0.4%value = [1.0, 0.0]

class = y[0]

imp_op_var41_comer_ult1 <= 0.0gini = 0.335

samples = 0.5%value = [0.787, 0.213]

class = y[0]

gini = 0.5samples = 0.1%

value = [0.502, 0.498]class = y[0]

gini = 0.426samples = 0.3%

value = [0.693, 0.307]class = y[0]

gini = -0.0samples = 0.2%value = [1.0, 0.0]

class = y[0]saldo_medio_var5_ult3 <= 0.001gini = 0.181

samples = 0.3%value = [0.899, 0.101]

class = y[0]

gini = -0.0samples = 0.1%value = [1.0, 0.0]

class = y[0]

gini = 0.37samples = 0.1%

value = [0.755, 0.245]class = y[0]

gini = -0.0samples = 0.2%value = [1.0, 0.0]

class = y[0]

num_var22_hace2 <= 0.061gini = 0.498

samples = 0.7%value = [0.469, 0.531]

class = y[1]

gini = 0.384samples = 0.1%

value = [0.259, 0.741]class = y[1]

var36_1 <= 0.5gini = 0.474

samples = 0.6%value = [0.613, 0.387]

class = y[0]

gini = 0.312samples = 0.1%

value = [0.194, 0.806]class = y[1]

gini = 0.0samples = 0.2%value = [1.0, 0.0]

class = y[0]

gini = 0.499samples = 0.4%

value = [0.518, 0.482]class = y[0]

num_var37_0 <= 0.066gini = 0.438

samples = 2.0%value = [0.675, 0.325]

class = y[0]

gini = 0.0samples = 0.3%value = [1.0, 0.0]

class = y[0]

num_var22_hace2 <= 0.085gini = 0.291

samples = 1.4%value = [0.823, 0.177]

class = y[0]

saldo_var42 <= 0.002gini = 0.494

samples = 0.5%value = [0.444, 0.556]

class = y[1]

imp_op_var41_ult1 <= 0.038gini = 0.234

samples = 1.3%value = [0.865, 0.135]

class = y[0]

gini = 0.498samples = 0.1%

value = [0.535, 0.465]class = y[0]

num_op_var41_hace2 <= 0.03gini = 0.056

samples = 1.1%value = [0.971, 0.029]

class = y[0]

gini = 0.498samples = 0.2%

value = [0.471, 0.529]class = y[1]

gini = 0.0samples = 0.9%value = [1.0, 0.0]

class = y[0]

saldo_var30 <= 0.002gini = 0.196

samples = 0.3%value = [0.89, 0.11]

class = y[0]

gini = 0.37samples = 0.1%

value = [0.755, 0.245]class = y[0]

gini = 0.0samples = 0.2%value = [1.0, 0.0]

class = y[0]gini = 0.0samples = 0.2%value = [1.0, 0.0]

class = y[0]

saldo_var30 <= 0.002gini = 0.448

samples = 0.4%value = [0.339, 0.661]

class = y[1]

gini = 0.329samples = 0.1%

value = [0.207, 0.793]class = y[1]

num_op_var39_comer_ult1 <= 0.024gini = 0.499

samples = 0.2%value = [0.475, 0.525]

class = y[1]

gini = 0.479samples = 0.1%

value = [0.604, 0.396]class = y[0]

gini = 0.479samples = 0.1%

value = [0.397, 0.603]class = y[1]

Figure 13: Tree 1



6.2 Tuning XGBoost

The numerous parameters in XGBoost make it intractable to simply apply a grid search, so instead we utilize RandomizedSearchCV(). Rather than trying all possible combinations, this function samples a predetermined number of parameter sets from the specified parameter spaces. Instead of fixed values, the parameter spaces typically contain distributions. We run the search once with 75 iterations and, using the results of this first try, narrow the parameter spaces and run it again with 50 iterations. The first run is compared using 5-fold stratified crossvalidation; the second run is made more precise by using 8-fold stratified crossvalidation. The results are shown in Table 10. The starting ranges were chosen as broad ranges around the starting parameters recommended in related work (Jain, 2016). A minimal code sketch of this search is given below Table 10.

Table 10: RandomizedSearchCV()

Parameter          Range 1        Best Value 1   Range 2          Best Value 2
N_ESTIMATORS       [100, 2000]    245            [100, 1000]      917
MAX_DEPTH          [3, 9]         6              [4, 6]           5
LEARNING_RATE      [0.01, 0.21]   0.04864        [0.001, 0.076]   0.01632
SUBSAMPLE          [0.6, 0.9]     0.85635        [0.75, 0.85]     0.75183
COLSAMPLE_BYTREE   [0.6, 0.9]     0.79057        [0.75, 0.85]     0.82017
MIN_CHILD_WEIGHT   [1, 5]         1              1                1
REG_ALPHA          [0, 0.1]       0.02276        [0, 0.05]        0.04970
GAMMA              [0, 0.2]       0.10552        [0, 0.15]        0.038668
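
The first, broad run can be reproduced along the following lines. This is a minimal sketch rather than the exact script behind Table 10; X_train and y_train are placeholder names for the preprocessed training data, and the distributions correspond to the Range 1 column.

# Sketch of the first randomized search (75 iterations, 5-fold stratified CV);
# ranges follow the "Range 1" column of Table 10. X_train/y_train are placeholders.
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from xgboost import XGBClassifier

param_distributions = {
    "n_estimators":     randint(100, 2001),      # integers 100..2000
    "max_depth":        randint(3, 10),          # integers 3..9
    "learning_rate":    uniform(0.01, 0.20),     # continuous on [0.01, 0.21]
    "subsample":        uniform(0.60, 0.30),     # continuous on [0.6, 0.9]
    "colsample_bytree": uniform(0.60, 0.30),
    "min_child_weight": randint(1, 6),           # integers 1..5
    "reg_alpha":        uniform(0.0, 0.1),
    "gamma":            uniform(0.0, 0.2),
}

search = RandomizedSearchCV(
    estimator=XGBClassifier(objective="binary:logistic"),
    param_distributions=param_distributions,
    n_iter=75,                                   # 50 in the second, narrowed run
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
    n_jobs=-1,
)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)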

With the aforementioned best parameter set we fit the model on the training data and plot how relevant each feature is in Figure 14. As with the Random Forest model, we can see that some of the features we constructed ourselves are relevant. There is also significant overlap between what both models consider important, which is encouraging.

[Figure 14: Feature Importance — bar chart of the fitted XGBoost model's per-feature importances (F score).]
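
A plot like Figure 14 can be produced with xgboost's built-in plotting helper. This is a hedged sketch, not the exact plotting script used for the figure; model is assumed to be the fitted XGBClassifier (e.g. search.best_estimator_ from the tuning step above).

# Sketch: plotting per-feature importances of a fitted XGBoost model.
# importance_type="weight" corresponds to the F score shown in Figure 14.
import matplotlib.pyplot as plt
from xgboost import plot_importance

plot_importance(model, importance_type="weight", max_num_features=30, height=0.4)
plt.tight_layout()
plt.show()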



We also show one of the many trees the XGB model creates in Figure 15, namely a randomly selected tree, which happens to be the 300th tree.

[Figure 15: Tree 300 — rendering of a single boosted tree; split conditions, yes/no/missing branches and leaf weights are omitted here.]
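
An individual boosted tree such as the one in Figure 15 can be rendered with xgboost's plot_tree. This is a minimal sketch (the graphviz package must be installed, and model is again the assumed fitted XGBClassifier).

# Sketch: drawing one boosted tree; num_trees selects the 0-based tree index,
# so 299 would correspond to the 300th tree mentioned in the text.
import matplotlib.pyplot as plt
from xgboost import plot_tree

plot_tree(model, num_trees=299)
plt.gcf().set_size_inches(30, 15)  # enlarge so the node labels remain readable
plt.show()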

6.3 Results

There are a number of different performance indicators. Our local stratified 10-fold crossvalidation procedure was used to tune all parameters and select features. Note that a single fold takes at most 10 minutes and several operations can be run in parallel. Kaggle also has a private and a public leaderboard score: approximately half of the submitted test set is used for the private leaderboard score and the other half for the public leaderboard score. Since the competition had already been completed, the difference between these scores is of no practical consequence for us. Nevertheless, the complete results are showcased in Table 11.
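
The local score can be computed along these lines; this is a minimal sketch of stratified 10-fold crossvalidation with AUC as the metric, where clf, X_train and y_train are placeholder names for a configured classifier and the preprocessed training data.

# Sketch: local evaluation with stratified 10-fold crossvalidation and AUC,
# corresponding to the "Local" column of Table 11.
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X_train, y_train, scoring="roc_auc", cv=cv, n_jobs=-1)
print("AUC: %.5f (+/- %.5f)" % (scores.mean(), scores.std()))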

Feature set 1 refers to only the top 30 features that have the highest importance scores according to the XGBoost model that uses all features. Feature set 2 refers to feature set 1, except that some variables that seemed obviously correlated, like 'saldo_medio_var5_hace3' and 'saldo_medio_var5_hace2', have been manually removed; of such a pair, the one deemed most important by the model was kept. Feature set 3 refers to feature set 2 with the addition of some of the variables that were highly correlated with the target, as seen in Section 4.2. For a complete list of all involved features see Appendix A.
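
A top-30 subset like feature set 1 could be derived as follows; this is a sketch under the assumption that model is the XGBClassifier fitted on all features and X_train is a pandas DataFrame (both placeholder names).

# Sketch: keep the 30 features with the highest importance scores of the
# XGBoost model trained on all features (feature set 1).
import numpy as np

importances = model.feature_importances_
top30_idx = np.argsort(importances)[::-1][:30]
feature_set_1 = [X_train.columns[i] for i in top30_idx]
X_train_set1 = X_train[feature_set_1]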



Table 11: Results

Model                Features   Local                   Public     Private
Logistic Regression  All        0.80561 (+/- 0.02337)   0.804893   0.786949
Random Forest        All        0.82739 (+/- 0.01672)   0.822562   0.803286
XGBoost              All        0.84061 (+/- 0.01672)   0.837524   0.823152
XGBoost              Set 1      0.84035 (+/- 0.01763)   0.836194   0.822334
XGBoost              Set 2      0.83931 (+/- 0.01954)   0.836109   0.821644
XGBoost              Set 3      0.83898 (+/- 0.02624)   0.836397   0.821777

It is very peculiar that the private leaderboard score is consistently lower. This indicates a leaderboard shakeup, where the train set is not representative enough of the test set. For reference, the top public leaderboard score is 0.845325 and the top private leaderboard score is 0.829072. Our scores remain some distance from these, but we did not endeavor to reach the top of the leaderboard and treated the test set as unseen until the very end. All data exploration was done solely on the training data, and normalization, for instance, was done using just the observations in the training data. We consider this more realistic, as conclusions can then truly be validated on completely new data, as opposed to the information spillover that would otherwise occur. In a business setting, a single new client tends to appear, rather than an entire population that can, for instance, be appropriately normalized.
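
The train-only normalization described here amounts to fitting the normalizer on the training observations and reusing those statistics for the test set. A minimal sketch, using min-max scaling purely as an example (the exact normalization used in the paper may differ) and placeholder frames X_train / X_test:

# Sketch: fit normalization statistics on the training data only and apply them
# unchanged to the test data, avoiding information spillover from the test set.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_norm = scaler.fit_transform(X_train)  # statistics learned from train only
X_test_norm = scaler.transform(X_test)        # test set scaled with train statistics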

7 Conclusion

This paper researched how to preemptively determine whether customers of Santander will be dissatisfied using machine learning. A semi-anonymized dataset, intended to protect the privacy of the customers, made it difficult to assert what could be relevant, especially in light of a huge feature set. However, a thorough data analysis discerned the meaning and interpretation of several features. A Python implementation utilized the Logistic Regression, Random Forest and XGBoost algorithms, carefully tuned, to produce predictions. Further research could, for example, employ different solution methods, apply more feature engineering, or combine several models instead of trying singular models. More specifically, one could increase the computation time that goes into tuning and, for example, make the correlation filtering dependent on the correlation with the target.



A Appendix Feature Sets

Set 1: var15, var38, saldo_var30, saldo_medio_var5_hace3, saldo_medio_var5_ult3, saldo_medio_var5_hace2, n0, n1, num_var45_hace3, num_var22_ult3, num_var22_ult1, saldo_medio_var5_ult1, imp_op_var41_efect_ult3, imp_op_var41_ult1, num_var45_hace2, var38_most_common, imp_op_var41_efect_ult1, saldo_var42, num_med_var45_ult3, num_var22_hace3, var15_most_common, saldo_var5, var3_most_common, saldo_var37, imp_op_var41_comer_ult3, num_var45_ult1, imp_ent_var16_ult1, imp_op_var39_comer_ult1, var3 and imp_trans_var37_ult1

Set 2: var15, var38, saldo_var30, saldo_medio_var5_hace3, n0, n1, num_var45_hace3, num_var22_ult3, imp_op_var41_efect_ult3, var38_most_common, saldo_var42, num_med_var45_ult3, var15_most_common, saldo_var5, var3_most_common, saldo_var37, imp_ent_var16_ult1, imp_op_var39_comer_ult1, var3, imp_trans_var37_ult1, ind_var8_0, num_meses_var5_ult3, num_meses_var39_vig_ult3_1, num_var4 and var36

Set 3: var15, var38, saldo_var30, saldo_medio_var5_hace3, n0, n1, num_var45_hace3, num_var22_ult3, imp_op_var41_efect_ult3, var38_most_common, saldo_var42, num_med_var45_ult3, var15_most_common, saldo_var5, var3_most_common, saldo_var37, imp_ent_var16_ult1, imp_op_var39_comer_ult1, var3, imp_trans_var37_ult1, ind_var8_0, num_meses_var5_ult3, num_meses_var39_vig_ult3_1, num_var4, var36, ind_var30, num_var42 and ind_var5



References

Andreu (2015). Predicting banking customer satisfaction. https://www.kaggle.com/c/santander-customer-satisfaction/discussion/19291#110414. Accessed: 10-15-2017.

Bishop, C. (2006). Logistic regression. In M. Jordan, J. Kleinberg, and B. Scholkopf (Eds.), Pattern Recognition and Machine Learning, pp. 205–207. Springer-Verlag New York.

Breiman, L. (2001). Random forests. Machine Learning 45(1), 5–32.

Breiman, L., J. Friedman, C. J. Stone, and R. Olshen (1984). Classification and Regression Trees. Wadsworth International Group.

Buckinx, W. and D. van den Poel (2005). Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting. European Journal of Operational Research 164(1), 252–268.

Buitinck, L., G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux (2013). API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122.

Burez, J. and D. van den Poel (2009). Handling class imbalance in customer churn prediction. Expert Systems with Applications 36(3), 4626–4636.

Chen, T. (2014). Introduction to boosted trees. https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf. Accessed: 10-25-2017.

Chen, T. and C. Guestrin (2016). XGBoost: A scalable tree boosting system. arXiv:1603.02754v3 [cs.LG].

Clemes, M. D., C. Gan, and D. Zhang (2010). Customer switching behaviour in the Chinese retail banking industry. International Journal of Bank Marketing 28(7), 519–546.

Colgate, M., K. Stewart, and R. Kinsella (1996). Customer defection: a study of the student market in Ireland. International Journal of Bank Marketing 14(3), 23–29.

Dernoncourt, F. (2015). What does AUC stand for and what is it? https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it. Accessed: 10-15-2017.

dmi3kno (2015). Exploring features. https://www.kaggle.com/cast42/exploring-features. Accessed: 10-15-2017.

Jain, A. (2016). Complete guide to parameter tuning in XGBoost. https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/. Accessed: 10-25-2017.

Kohavi, R. (1995). A study of crossvalidation and bootstrap for accuracy estimation and model selection. In Proceedings of IJCAI 1995.



Kumar, A. (2015). Boosting customer satisfaction with gradient boosting. https://cseweb.ucsd.edu/classes/wi17/cse258-a/reports/a079.pdf. Accessed: 10-16-2017.

Mozer, M., R. Wolniewicz, D. Grimes, E. Johnson, and H. Kaushansky (2000). Predicting subscriber dissatisfaction and improving retention in the wireless telecommunications industry. IEEE Transactions on Neural Networks 11(3), 690–696.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.

Santander (2015). Santander customer satisfaction. https://www.kaggle.com/c/santander-customer-satisfaction. Accessed: 10-15-2017.

Silva, L., G. Titericz, D. Efimov, I. Tanaka, D. Barusauskas, M. Michailidis, M. Muller, D. Polat, S. Semenov, and D. Altukhov (2016). Solution for Santander customer satisfaction competition, 3rd place. https://github.com/diefimov/santander_2016/blob/master/README.pdf. Accessed: 10-16-2017.

Wang, S. (2016). Predicting banking customer satisfaction. https://shuaiw.github.io/assets/data-science-project-workflow/santander-customer-satisfaction.pdf. Accessed: 10-16-2017.

Xie, Y., X. Li, E. Ngai, and W. Ying (2009). Customer churn prediction using improved balanced random forests. Expert Systems with Applications 36(3), 5445–5449.

Yooyen, T. and K.-C. Ma (2016). CSCI 567 Spring 2016 mini-project. https://markcsie.github.io/documents/CSCI567Spring2016Project.pdf. Accessed: 10-16-2017.
