
Super Computer Data Mining Project Entry for the 2006 PAKDD Data Mining Competition: Heterogeneous Classifier Ensemble Approach

A. J. Bagnall1 G. Cawley1 L. Bull2

1 School of Computing Sciences, University of East Anglia, Norwich, UK

2 School of Computer Science, University of the West of England, Bristol, UK

1. Introduction

This document describes the entry of the Super Computer Data Mining (SCDM) Project to the PAKDD 2006 data mining competition. The SCDM project is sponsored by the Engineering and Physical Sciences Research Council of the UK government (funded under grants GR/T18479/01, GR/T18455/01 and GR/T/18462/01) and began in January 2005. The objective of the project is to develop data mining tools, implemented in C++ and MPI, for parallel execution on supercomputer architectures. The main supercomputer facility we use, CSAR, is based at the University of Manchester and forms part of the UK high performance computing (HPC) service. However, the SCDM code will run on any cluster and will be freely available to the academic community. The SCDM toolkit has already contributed to several research projects, with six papers published [1,2,3,4,5,6] and more in preparation. More details can be found at the project website (still under development) [7]. Our motivation for this project is to develop tools able to perform complex analysis of very large data sets with many attributes. We have assembled several data sets with many attributes and many cases to form a standard test suite for the assessment of data mining algorithms on this type of data [8]. The main algorithmic focus of the project is on ensemble techniques, with a particular emphasis on attribute selection. This competition has been particularly useful for us for several reasons:

1. The PAKDD competition data set will make a useful addition to the data collection.

2. It provides a test bed for our implementations of existing algorithms (for this work k-NN, C4.5, Naïve Bayes, Neural Network and Logistic Regression).

3. It allows us to assess new variations of classifiers (Learning Classifier Systems) and ensemble algorithms (FASBIR).

2. Approach and Understanding of the Problem

The 2006 PAKDD data mining competition involves building classifiers to predict whether customers will choose a 3G or 2G phone. Our approach to this type of problem is normally to adopt a data mining methodology such as the Cross Industry Standard Process for Data Mining (CRISP-DM) [9]. Our ability to gain a good business understanding through interaction with the customer is obviously limited by the nature of the competition. Nevertheless, a structured approach is always beneficial.

Business Understanding

Our only source of business understanding, other than the data itself, is the competition website: "An Asian telco operator which has successfully launched a third generation (3G) mobile telecommunications network would like to make use of existing customer usage and demographic data to identify which customers are likely to switch to using their 3G network." There is also some indication of the customer's preferences in terms of the classification loss function: "The data mining task is a classification problem for which the objective is to accurately predict as many current 3G customers as possible (i.e. true positives) from the "holdout" sample provided." This indicates that the objective may be to target marketing towards potential customers in a way where the cost per individual contacted is small, hence the priority is not to miss those most likely to want 3G. It is of course trivial to maximize the true positives by simply predicting everyone as 3G, but such a solution is obviously of no interest. A proper modelling of the situation would involve a cost function covering both those interested and those not interested in 3G, and the choice of cost function will obviously affect the ranking of our final entry. Given the stated assessment criteria, "Entries will be assessed in terms of the number of 3G customers that are correctly classified by their model. A ranking table of the entries will be made available on this website in April 2006.", we assume a low cost for false positives and a high profit for true positives. Since the true costs are unknown, we also present several different cost scenarios in the analysis section, based on the training data.

Data Understanding

There are 18000 cases in the training data and 6000 in the test data; each case is an individual phone user. Of the 250 attributes, 37 are categorical and 213 are numeric. All apart from the basic demographics (sex, age, nationality and marital status) relate to phone use. Of the usage attributes, there are averages and standard deviations for 90 features. Many of these features are obviously related and correlated (for example, average number of off-peak minutes and average number of off-peak calls), and domain knowledge could usefully be employed to derive features (such as heavy user or international traveller). However, it is dangerous and probably fruitless to do so without consultation with domain experts. Hence we only use automated attribute transformations.

Data Preparation

The data requires significant pre-processing; the main steps are listed below, with a short code sketch after the list:

Missing values: in addition to those indicated in the data dictionary, there are several missing values marked with an =#VALUE! key, suggesting a failed calculation somewhere in the data preparation. We have replaced these with missing value indicators.

All-zero attributes: 21 attributes contain only zeros. This should also be investigated; we have removed these attributes completely.

Mismatched attribute values: several categorical values appear only in the training data or only in the test data. For all attributes, we group values that occur in the training set but not the test set as "other", and recode values that occur in the test set but not the training set as "missing".

Clustering of discrete attributes with a large number of values: many of the categorical variables have values that are observed only a handful of times. We have clustered these together as dictated by the data. A full description is given in Appendix A. We also provide class distributions for some of the attributes to give an informal idea of their discriminatory power. Note that the most important of these are subplan and handset. In the given format these attributes have too many possible values to be of much use, but after formatting they prove highly indicative.

Transformation of continuous attributes. Many of the continuous attributes have highly skewed distributions and there is a high degree of multi-collinearity. Given the lack of domain knowledge, we have taken three approaches to the continuous fields to reflect these characteristics.

1. Leave them as they are (after preprocessing described in Appendix A): File Mixed.csv (Version4_1.csv) contains all the formatted discrete attributes and the continuous fields as provided.

2. Discretise the continuous attributes with the MDL method. File AllDiscrete.csv (Version4_2.csv) contains only discrete attributes.

3. Transform into principal components, retaining only the components that explain 95% of the variation in the data. File MixedPCA.csv (Version4_4.csv) contains the formatted discrete attributes and the 67 principal components.

Furthermore, we have derived binary dummy attributes for the discrete fields and thus created two new files for algorithms best suited to problems with only continuous attributes: AllContinuous.csv (Version4_3.csv) and AllContinuousPCA.csv (Version4_5.csv).
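The preparation steps above can be summarised in a minimal sketch using pandas and scikit-learn; this is illustrative only, not the SCDM C++/MPI code used to produce the actual files. The file names, the #VALUE! token and the class column CUSTOMER_TYPE follow the descriptions in this document, while the variable names and the omission of scaling before PCA are our own simplifications.

```python
# Sketch of the data-preparation steps described above, using pandas and
# scikit-learn rather than the SCDM C++/MPI toolkit used for the entry.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Failed-calculation markers become ordinary missing values.
train = train.replace("#VALUE!", np.nan)
test = test.replace("#VALUE!", np.nan)

# Remove the attributes that are all zeros in the training data.
numeric_cols = train.select_dtypes(include=[np.number]).columns
all_zero = [c for c in numeric_cols if (train[c].fillna(0) == 0).all()]
train = train.drop(columns=all_zero)
test = test.drop(columns=[c for c in all_zero if c in test.columns])

# Align categorical values: values seen only in the training data are
# grouped as "other"; values seen only in the test data become "missing".
for col in train.select_dtypes(include=["object"]).columns:
    if col not in test.columns:
        continue
    train_vals = set(train[col].dropna())
    test_vals = set(test[col].dropna())
    train.loc[train[col].notna() & ~train[col].isin(test_vals), col] = "other"
    test.loc[test[col].notna() & ~test[col].isin(train_vals), col] = "missing"

# Binary dummy attributes for the discrete fields (the AllContinuous view).
y_train = train["CUSTOMER_TYPE"]
X_train = pd.get_dummies(train.drop(columns=["CUSTOMER_TYPE"])).fillna(0)
X_test = (pd.get_dummies(test)
            .reindex(columns=X_train.columns, fill_value=0)
            .fillna(0))

# Principal components explaining 95% of the variance (the AllContinuousPCA
# view); scaling of the highly skewed fields is omitted here for brevity.
pca = PCA(n_components=0.95)
Z_train = pca.fit_transform(X_train)
Z_test = pca.transform(X_test)
```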

3. Modelling phase: details of the classification model that was produced

Our approach is to produce a probability estimate using an ensemble of classifiers, some of which may themselves be ensembles. The core classifiers used are:

1. Filtered Attribute Subspace based Bagging with Injected Randomness (FASBIR) [10] is an ensemble k-nearest-neighbour algorithm that filters attributes by information gain and injects randomness through the choice of distance metric and attribute subset. We use the parameter values and implementation of FASBIR described in [6]. The final output of FASBIR is in fact the result of an ensemble of 100 alternative k-NN classifiers. FASBIR is run on AllContinuous.csv and AllContinuousPCA.csv.

2. C4.5 decision tree. Our C4.5 is a standard implementation comparable to the WEKA version. Based on past experience, we set the minimum leaf node size to 50; this is a crude but effective way of avoiding overfitting. C4.5 is run on Mixed.csv and AllDiscrete.csv.

3. Naïve Bayes. Standard implementation assuming normality for real-valued attributes. NB is run on AllDiscrete.csv and MixedPCA.csv.

4. Logistic regression. Parameters estimated with a standard maximum likelihood estimation technique. Run on AllContinuous.csv and AllContinuousPCA.csv.

5. Neural network. The NN has a single hidden layer, initially containing 32 or 64 neurons. The output layer uses a softmax activation function with a cross-entropy error metric and a standard 1-of-c encoding. A Bayesian regularisation scheme with a Laplace prior is used to avoid overfitting. Run on AllContinuous.csv and AllContinuousPCA.csv.

6. Learning Classifier System. LCS as described in [5], using an error-weighted fitness function, a niched genetic algorithm and a Q-learning style Widrow-Hoff update rule. Run on AllDiscrete.csv.
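As an aside on item 1 above, the core FASBIR idea of bagging k-NN classifiers over random attribute subspaces can be sketched in a few lines of scikit-learn. This is not the implementation or parameterisation of [6]: the information-gain filtering and randomised distance metrics are omitted, and k = 5 and the half-size subspaces are arbitrary illustrative choices (Z_train and y_train come from the preparation sketch above).

```python
# Bagged k-NN over random attribute subspaces: the idea behind FASBIR [10].
# The full algorithm also filters attributes by information gain and
# randomises the distance metric, both omitted in this sketch.
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

fasbir_like = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=5),  # illustrative k
    n_estimators=100,    # an ensemble of 100 alternative k-NN classifiers
    max_features=0.5,    # each member sees a random half of the attributes
    bootstrap=True,
    n_jobs=-1,
)
fasbir_like.fit(Z_train, y_train)                   # e.g. AllContinuousPCA
p_fasbir = fasbir_like.predict_proba(Z_train)[:, 1]
```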

This gives us 11 classifiers in the meta-ensemble. Each of the 11 classifiers produces a probability estimate for the training cases. We use a weighted mean of these probability estimates as our final prediction, with the weights determined by estimating the accuracy of each classifier using a 10-fold cross validation. Once a probability estimate is obtained for each training case, a prediction is made using the following profit matrix.

Profit Matrix (rows are predictions, columns are actual classes):

                          Actual 3G (true)   Actual 2G (false)
Predicted 3G (positive)       45 (TP)            -5 (FP)
Predicted 2G (negative)        0 (FN)             0 (TN)

This is designed to simulate the idea of making marketing decisions based on the classifier: the cost of classifying a customer as 3G is the cost of a mail shot. If the individual is classified as 3G (positive) we assume a mail shot costing 5 units will be sent, and if not, no action will be taken (at zero cost). We assume that if the mail shot reaches a 3G customer they will respond (a true positive) and produce a profit of 50, a net return of 45 after the cost of the mail shot, whereas if they do not respond (a false positive) there is a loss of 5. Thus, to decide whether to classify a case as 3G, we take the action that maximizes the expected return. Of course, this may not be the intended use of the analysis, but a simple decision theory framework at least provides some structure to justify the classifications we make.
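A minimal sketch of this decision procedure follows. Here `models` is a placeholder list pairing each fitted base classifier with the feature matrix it uses; it is not part of any library. With the profit matrix above, a mail shot has positive expected return exactly when 45p - 5(1-p) > 0, i.e. when the estimated probability p of 3G exceeds 5/50 = 0.1.

```python
# Weighted-mean meta-ensemble with the profit-matrix decision rule.
import numpy as np
from sklearn.model_selection import cross_val_score

# `models` pairs each fitted base classifier with its feature matrix,
# e.g. [(fasbir_like, Z_train), ...]; 11 classifiers in the actual entry.
weights = np.array([
    cross_val_score(m, X, y_train, cv=10, scoring="accuracy").mean()
    for m, X in models
])
weights /= weights.sum()

probs = np.column_stack([m.predict_proba(X)[:, 1] for m, X in models])
p_3g = probs @ weights            # ensemble probability estimate of 3G

# Expected profit of a mail shot: 45*p - 5*(1-p), positive iff p > 0.1.
predict_3g = p_3g > 5.0 / 50.0
```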

Training Results

The training results indicate that the classifier is effective, although it may be based on information that is not of great use in predicting whether individuals will switch. Predicting whether someone uses a 3G phone based on their phone usage is not the same as predicting whether someone will switch from 2G to 3G. However, for the former task, we believe accurate predictions can be made. The contingency matrix below gives an accuracy of 87% and a balanced error rate of 14%.

                Actual 2G   Actual 3G   Total
Predicted 2G        13197         491   13688
Predicted 3G         1800        2509    4309
Total               14997        3000   17997
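Reading the matrix with rows as predictions and columns as actual classes, the quoted figures can be verified directly:

```python
# Accuracy and balanced error rate from the contingency matrix above.
tn, fn = 13197, 491     # predicted 2G: actual 2G / actual 3G
fp, tp = 1800, 2509     # predicted 3G: actual 2G / actual 3G

acc = (tn + tp) / (tn + fn + fp + tp)            # = 0.873
ber = 0.5 * (fp / (tn + fp) + fn / (fn + tp))    # = 0.142
print(f"ACC = {acc:.1%}, BER = {ber:.1%}")
```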

We can get higher accuracy by changing the cost function. If we change the reward for a true positive to +5 we get the following outcome, essentially trading accuracy on 3G for accuracy on 2G.

                Actual 2G   Actual 3G   Total
Predicted 2G        14459         970   15429
Predicted 3G          538        2030    2568
Total               14997        3000   17997

ACC 91.62%, BER 17.96%.

Our entry for the test data predicts nearly half the customers as 3G. This overweighting towards 3G is because of the competition assessment criteria, which is to judge entries on the number of true positives. A more balanced (and hence more accurate) entry could easily be made.

The ROC curve demonstrates how this decision boundary changes the outcome: altering the reward moves the decision boundary, and hence the operating point on the ROC curve. Lacking knowledge of the true costs, we have chosen the operating point in an ad hoc fashion, based on our interpretation of the competition objectives.

[Figure: ROC curve for the meta-ensemble on the training data; true positive rate against false positive rate, both axes from 0 to 1.]
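The curve can be regenerated from the ensemble probability estimates (`p_3g` and `y_train` from the sketches above); each choice of profit matrix fixes one threshold and hence one operating point on it.

```python
# ROC curve for the ensemble; each profit matrix picks one point on it.
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train, p_3g)

# The two profit matrices used above correspond to thresholds 0.1
# (reward 45, cost 5) and 0.5 (reward 5, cost 5).
```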

4. Evaluation phase: discussion of the insights that can be gained from the model

Table 1 shows the ranking of the top 15 attributes by Information Gain, Information Gain Ratio and Chi-Squared Statistic. Note the importance of Handset model and age (after preprocessing). We would expect handset to be highly predictive of 3G, given that most phones are either 3G or 2G.

However, there is variation in 2G/3G with handsets, so presumably some phones can be used in both contexts. Also of high discriminatory power is handset age, which is again not surprising, as 3G handsets have not been in production as long.

Subplan is important (although incomprehensible to us), as is the amount spent and the number of calls made. Perhaps interestingly, the GAMES fields appear high in the ranking, indicating that game players have a higher likelihood of using 3G services. We could very simplistically characterize a 3G user as a heavy user with a fairly new phone who plays games. These are not surprising results (this report should be seen as a preliminary investigation rather than a complete case study), but they do indicate that customer modeling could yield user profiles that would help in marketing. In terms of identifying those most likely to switch, this may not be a particularly valuable insight. However, targeting those with certain models and older phones could be profitable. We would tentatively make the following observation about switching from 2G to 3G. 2G customers who

have demonstrated some preference for games,

are heavy users, and

have old phones of a predominantly 2G model

should be offered 3G contracts with popular 3G models on popular 3G subplans.

Table 1: Top 15 attributes by IG, IGR and Chi-Squared Statistic

Rank   #    Attribute          IG      IGR     Chi-Sq
1      14   HS_MODEL           0.1837  0.0960  6165
2      29   HS_AgeGroup        0.1333  0.0616  3834
3      38   HS_AGE             0.1320  0.0501  3866
4      63   AVG_BILL_AMT       0.0760  0.0286  1884
5      7    SUBPLAN            0.0706  0.0358  1918
6      184  STD_VAS_GAMES      0.0635  0.0260  1739
7      112  AVG_VAS_GAMES      0.0632  0.0259  1730
8      72   AVG_NO_CALLED      0.0571  0.0241  1512
9      74   AVG_MINS_OB        0.0566  0.0235  1474
10     89   AVG_MINS_MOB       0.0566  0.0253  1468
11     90   AVG_MINS_INTRAN    0.0547  0.0223  1410
12     84   AVG_MINS_OBPK      0.0542  0.0255  1453
13     75   AVG_CALL_OB        0.0527  0.0225  1406
14     71   AVG_CALL           0.0522  0.0213  1362
15     95   AVG_CALL_LOCAL     0.0518  0.0213  1356
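For reference, a minimal sketch of how such a ranking can be computed for a single discrete attribute; the helper names are ours, and IGR is taken as IG divided by the attribute's own entropy. Applying this to every attribute in AllDiscrete.csv and sorting on each statistic reproduces rankings of the kind shown in Tables 1 and 2.

```python
# Information gain, gain ratio and chi-squared for one discrete attribute.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def rank_statistics(x, y):
    table = pd.crosstab(x, y)                     # attribute value vs class
    n = table.values.sum()
    h_y = entropy(table.sum(axis=0).to_numpy())   # H(Y)
    h_y_given_x = sum(                            # H(Y|X), weighted by group size
        row.sum() / n * entropy(row.to_numpy()) for _, row in table.iterrows()
    )
    ig = h_y - h_y_given_x                        # information gain
    h_x = entropy(table.sum(axis=1).to_numpy())
    igr = ig / h_x if h_x > 0 else 0.0            # gain ratio
    chi2 = chi2_contingency(table.values)[0]      # chi-squared statistic
    return ig, igr, chi2
```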

An examination of the least indicative attributes is also of interest. Note that some of the attributes that one might have thought to be indicative, such as nationality, delinquency and occupation, are amongst the least important.

Table 2: Bottom 21 attributes by IG, IGR and Chi-Squared Statistic

Rank   #    Attribute               IG      IGR     Chi-Sq
187    2    NATIONALITY             0.0008  0.0013  18
188    46   TOT_LAST_DELINQ_DIST    0.0008  0.0028  21
189    110  AVG_VAS_QTUNE           0.0008  0.0072  22
190    182  STD_VAS_QTUNE           0.0008  0.0072  22
191    45   TOT_LAST_DELINQ_DAYS    0.0007  0.0027  20
192    47   TOT_DELINQ_DAYS         0.0007  0.0028  20
193    48   TOT_PAST_DELINQ         0.0007  0.0026  20
194    57   AVG_DELINQ_DAYS         0.0007  0.0026  20
195    9    SUBPLAN_CHANGE_FLAG     0.0007  0.0025  19
196    18   REVPAY_PREV_CD          0.0007  0.0035  21
197    52   TOT_PAST_REVPAY         0.0005  0.0026  13
198    23   VAS_IB_FLAG             0.0003  0.0057  10
199    59   OD_FREQ                 0.0003  0.0019  9
200    3    OCCUP_CD                0.0003  0.0005  8
201    17   ID_CHANGE_FLAG          0.0001  0.0035  4
202    10   CONTRACT_FLAG           0.0000  0.0000  0
203    50   TOT_PAST_TOS            0.0000  0.0000  0
204    51   TOT_TOS_DAYS            0.0000  0.0000  0
205    58   OD_REL_SIZE             0.0000  0.0000  0
206    109  AVG_VAS_QG              0.0000  0.0000  0
207    181  STD_VAS_QG              0.0000  0.0000  0

It is difficult to say much about modeling switching without any information on those who have switched. However, an analysis of where the variation in the attributes occurs can at least highlight where customers most differ. The top 5 principal components for the continuous data are shown below (the five largest-magnitude loadings of each).

Component 1: -0.146*AVG_MINS - 0.144*AVG_MINS_LOCAL - 0.144*AVG_MINS_OB - 0.144*AVG_MINS_PK - 0.141*AVG_CALL_OB

Component 2: 0.241*AVG_BILL_VOICEI + 0.203*AVG_MINS_INT + 0.195*AVG_BILL_VOICE + 0.193*STD_BILL_AMT + 0.191*AVG_BILL_AMT

Component 3: 0.226*STD_T1_MINS_CON + 0.213*STD_EXTRANT1_RATIO + 0.201*STD_EXTRAN_RATIO + 0.198*STD_T1_CALL_CON + 0.190*STD_OP_CALL_RATIO

Component 4: 0.244*STD_MINS_IB + 0.238*STD_MINS_IBOP - 0.213*AVG_EXTRAN_RATIO + 0.202*AVG_MINS_IBOP + 0.201*STD_MINS_OP

Component 5: 0.298*AVG_PAST_OD_VALUE + 0.298*AVG_OD_AMT - 0.298*AVG_PAY_AMT + 0.298*STD_OD_AMT + 0.298*STD_PAY_AMT

It is difficult to interpret these without expert knowledge. However, certain summary behaviour is apparent and could be useful in the modeling of customers. The first component clearly relates to usage, whereas the second describes the amount spent (including some international contribution). Hence most of the variation between customers can be explained by the number of calls they make and the amount they spend. Given the importance of call volumes in predicting 3G, it would be worthwhile spending some time carefully modeling customers based on usage. However, the third component identifies an alternative source of variation in the standard deviation fields. These fields are beyond our understanding, but may be worthy of investigation.
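The component descriptions above are simply the largest-magnitude loadings of the fitted PCA; assuming the `pca` object and dummy-encoded `X_train` from the preparation sketch, they can be listed as follows.

```python
# Print the five largest-magnitude loadings of the first five components.
import numpy as np

feature_names = X_train.columns
for i, component in enumerate(pca.components_[:5]):
    top = np.argsort(-np.abs(component))[:5]
    terms = " ".join(f"{component[j]:+.3f}*{feature_names[j]}" for j in top)
    print(f"Component {i + 1}: {terms}")
```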

Another approach to looking for indicators of switching is to examine the model for areas of the attribute space where the classifier predicts a mixture of 2G/3G. However, this is a dangerous activity, as variation is more likely to come from incorrect modeling or natural variation. Many of the attributes may be highly influenced by whether someone is already using 3G or not (for example, there may be more games available on 3G). In our opinion a more detailed customer modeling, a removal of possibly deceptive attributes and a market segmentation using clustering may highlight more areas of the attribute space of particular interest in terms of switching users.

References

[1] Bagnall, A. J. and Janacek, G. (2005). Clustering time series with clipped data. Machine Learning, 58(2): 151-178.

[2] Bagnall, A. J., Janacek, G. and Powell, M. (2005). A likelihood ratio distance measure for the similarity between the Fourier transform of time series. In Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).

[3] Bagnall, A. J., Ratanamahatana, C., Keogh, E., Lonardi, S. and Janacek, G. J. (2006). A bit level representation for time series data mining with shape based similarity. To appear in Data Mining and Knowledge Discovery.

[4] Bagnall, A. J., Whittley, I. M., Bull, L., Pettipher, M., Studley, M. and Tekiner, F. (2006). Variance stabilizing regression ensembles for environmental models. Submitted to the IEEE Congress on Computational Intelligence.

[5] Bull, L., Studley, M., Bagnall, A. J. and Whittley, I. M. (2005). On the use of rule sharing in learning classifier system. In Proceedings of the 2005 Congress on Evolutionary Computation.

[6] Whittley, I. M., Bagnall, A. J., Bull, L., Pettipher, M., Studley, M. and Tekiner, F. (2006). Attribute selection methods for Filtered Attribute Subspace based Bagging with Injected Randomness (FASBIR). To appear in the International Workshop on Feature Selection for Data Mining, part of the 2006 SIAM Data Mining Conference.

[7] SCDM project website: http://www.mc.manchester.ac.uk/scdm/

[8] Large datasets for feature selection: http://www2.cmp.uea.ac.uk/~ajb/SCDM/AttributeSelection.html

[9] CRISP-DM: http://www.crisp-dm.org/index.htm

[10] Zhou, Z.-H. and Yu, Y. (2005). Ensembling local learners through multimodal perturbation. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(4): 725-735.

Appendix A

This appendix provides details of the data formatting conducted prior to mining. In the tables below, training-set counts are broken down by CUSTOMER_TYPE (0 = 2G, 1 = 3G), with test-set counts alongside where available.

2. MARITAL_STATUS:

Train:
MARITAL_STATUS       0      1   Total
M                 7340   1034    8374
S                 6834   1750    8584
X                  826    216    1042
Total            15000   3000   18000

Test:
MARITAL_STATUS   Total
M                 2808
S                 2822
X                  370
Total             6000

3. NATIONALITY: We transform to just 7 classes; 0 is all others, including missing.

Train:                     Test:
Value    Count             Value    Count
702      16513             702       5478
458        535             458        201
0          318             156         99
156        264             0           89
608        167             608         67
356        112             356         35
360         88             360         31

4. OCCUP_CD: Combine rare values.

Original, train:                      Original, test:
OCCUP_CD       0      1   Total       OCCUP_CD   Total
AGT            3      -       3       AGT            4
CLRC          29     12      41       CLRC          18
CLRF           1      -       1       ENG           27
ENG           57     13      70       EXEC          18
EXEC          50     16      66       GOVT           4
FAC            2      2       4       HWF            8
GOVT           6      2       8       MED            3
HWF           40      8      48       MGR           30
MED            2      -       2       OTH         1960
MGR           66     18      84       POL           27
OTH         5015    978    5993       SELF          10
POL           48     20      68       STUD          52
SELF          21      6      27       TCHR           5
SHOP           3      1       4       X           3834
STUD         115     35     150       Total       6000
TCHR          15      2      17
X           9527   1887   11414
Total      15000   3000   18000

New, train:                New, test:
CLRC          41           CLRC        18
ENG           70           ENG         27
EXEC          66           EXEC        18
HWF           48           HWF          8
MGR           84           MGR         30
OTH         6032           OTH       1976
POL           68           POL         27
SELF          27           SELF        10
STUD         150           STUD        52
X          11411           X         3834

5. COBRAND_CARD_FLAG:

Train:
COBRAND_CARD_FLAG       0      1   Total
0                   13696   2587   16283
1                    1304    413    1717
Total               15000   3000   18000

Test:
COBRAND_CARD_FLAG   Total
0                    5405
1                     595
Total                6000

6. HIGHEND_PROGRAM_FLAG:

Train:
HIGHEND_PROGRAM_FLAG       0      1   Total
0                      14499   2555   17054
1                        501    445     946
Total                  15000   3000   18000

Test:
HIGHEND_PROGRAM_FLAG   Total
0                       5679
1                        321
Total                   6000

7. CUSTOMER_CLASS:

Train:
CUSTOMER_CLASS       0      1   Total
3                12085   2392   14477
4                  398     52     450
5                  348     54     402
6                  322     26     348
7                 1577    379    1956
8                  138     34     172
9                  121     63     184
10                  11      -      11
Total            15000   3000   18000

Test (the pivot reports the sum of CUSTOMER_CLASS values rather than a count; dividing each sum by its class value gives the implied number of cases):
CUSTOMER_CLASS     Sum   Implied count
3                14424    4808
4                  628     157
5                  625     125
6                  774     129
7                 4627     661
8                  440      55
9                  567      63
10                  20       2
Total            22105    6000

8. SUBPLAN, SUBPLAN_PREVIOUS

These variables have a very large number of possible values, and there are mismatches between the test and training sets. We group the low-frequency plans as follows.

Subplan: Class 1 Class 22219   4 4 2207 4 1 72214 4 5 9 2204 3 62169   2 2 2202 1 52164 3 2 5 2197 1 1 52163 6 4 10 2196 6 1 42128 1 2 3 2187 1 1 42127 3 3 6 2168 2 32118 6 3 9 2159 1 22113 5 4 9 2158 2 2

2152 3 12130 5 12116 6 12112 1 12109 3 1

Missing2246 12244 12217 12110 1

Subplan previous

Class 1 Class 2 missing2112 1 1 2113 5 3 8 2110 12115 8 1 9 2118 4 3 7 21272116 7 7 2128 4 4 8 2168 12130 5 5 2152 3 2 5 2199 12159 1 1 2158 2 2 4 22442162 1 1 2169   1 1 22462164 7 1 8 2170 4 4 82185 1 1 2196 2 7 92186 1 1 2214 5 4 92187 1 1 2219   4 42197 1 1 2219   4 42202 1 1    2207 4 4      2215 3 3      6105 4 4      6106 1 1      

Contract Flag:

SUBPLAN_CHANGE_FLAG       0      1   Total
0                     14266   2795   17061
1                       731    205     936
Total                 14997   3000   17997

9. PAY_METD:

PAY_METD       0      1   Total
cg           603    132     735
co          1028    328    1356
cs         11294   2113   13407
cx           138     33     171
dd          1236    206    1442
X            698    188     886
Total      14997   3000   17997

PAY_METD_PREV:

PAY_METD_PREV       0      1   Total
cb                  -      1       1
cg                421    103     524
ch                  9      1      10
co                822    255    1077
cs              11946   2262   14208
cx                 73     21      94
dd               1028    169    1197
X                 698    188     886
Total           14997   3000   17997

LUCKY_NO_FLAG:

LUCKY_NO_FLAG       0      1   Total
0               14296   2694   16990
1                 701    306    1007
Total           14997   3000   17997

BLACK_LIST_FLAG:

BLACK_LIST_FLAG       0      1   Total
0                 13883   2921   16804
1                  1114     79    1193
Total             14997   3000   17997

ID_CHANGE_FLAG:

ID_CHANGE_FLAG       0      1   Total
0                14940   2981   17921
1                   57     19      76
Total            14997   3000   17997

REVPAY_PREV: All values below 602 are converted to 0.

Train:                                    Test:
Value       0      1   Total              Value   Total
-732       13      8      21              -732        3
-722      120     42     162              -722       52
-702       95     28     123              -702       34
-602      111     19     130              -602       40
0       14658   2903   17561              0        5871

COUNTRY: For all three, we retain only the following select list of values.

Train:                                    Test:
Value       0      1   Total              Value   Total
0         407    136     543              0         175
29         77     31     108              29         34
35        155     43     198              35         60
39         54     22      76              39         32
48         66      7      73              48         28
50         63     21      84              50         28
52         24     21      45              52         12
65         62     26      88              65         28
69         34     18      52              69         12
70        204     68     272              70         76
80         49     10      59              80         12
101       135     33     168              101        44
102        63     17      80              102        25
103       253     76     329              103       125
105      1449    531    1980              105       701
236       306    112     418              236       121
237       116     44     160              237        53
238        66     25      91              238        26
239        76     11      87              239        32
240        61     23      84              240        20
241        37     26      63              241        14
242       163     47     210              242        78
248       278    101     379              248       131
254       160     46     206              254        83
258       132     71     203              258        64
260        42     12      54              260        12
NONE    10465   1422   11887              NONE     3973
Total                  17997

VAS_CND_FLAG: Note the big class imbalance between train and test.

Train:
VAS_CND_FLAG       0      1   Total
0                  0     33      33
1                  0   2967    2967
Total              0   3000    3000

Test:
VAS_CND_FLAG   Total
0                464
1               5536
Total           6000

VAS_CNND_FLAG:

Train:
VAS_CNND_FLAG       0      1   Total
0               14499   2710   17209
1                 498    290     788
Total           14997   3000   17997

Test:
VAS_CNND_FLAG   Total
0                5763
1                 237
Total            6000

VAS_DRIVE_FLAG: Remove.

Train:
VAS_DRIVE_FLAG       0      1   Total
0                14992   3000   17992
1                    5      -       5
Total            14997   3000   17997

VAS_FF_FLAG: Remove.

Train:
VAS_FF_FLAG       0      1   Total
0             14832   2934   17766
1               165     66     231
Total         14997   3000   17997

Test:
VAS_FF_FLAG   Total
0              5909
1                91
Total          6000

VAS_IB_FLAG:

Train:
VAS_IB_FLAG       0      1   Total
0             14905   2966   17871
1                92     34     126
Total         14997   3000   17997

Test:
VAS_IB_FLAG   Total
0              5940
1                60
Total          6000

VAS_NR_FLAG:

Train:
VAS_NR_FLAG       0      1   Total
0             14746   2892   17638
1               251    108     359
Total         14997   3000   17997

Test:
VAS_NR_FLAG   Total
0              5879
1               121
Total          6000

VAS_VM_FLAG:

Train:
VAS_VM_FLAG       0      1   Total
0              4456   1057    5513
1             10541   1943   12484
Total         14997   3000   17997

Test:
VAS_VM_FLAG   Total
0              1868
1              4132
Total          6000

VAS_VMN_FLAG: Remove. VAS_VMP_FLAG: Remove.

Train:
VAS_VMP_FLAG       0      1   Total
0              14996   2999   17995
1                  1      1       2
Total          14997   3000   17997

Test:
VAS_VMP_FLAG   Total
0               5998
1                  2
Total           6000

VAS_SN_FLAG: Delete. VAS_GPRS_FLAG:

Train:
VAS_GPRS_FLAG       0      1   Total
0                 631     20     651
1               14366   2980   17346
Total           14997   3000   17997

Test:
VAS_GPRS_FLAG   Total
0                 204
1                5796
Total            6000

VAS_CSMS_FLAG: Remove. VAS_IEM_FLAG: Remove.

VAS_AR_FLAG:

Train:
VAS_AR_FLAG       0      1   Total
0             10574   1269   11843
1              4423   1731    6154
Total         14997   3000   17997

Test:
VAS_AR_FLAG   Total
0              3993
1              2007
Total          6000

TELE_CHANGE_FLAG: Remove