Comparison of Machine Learning Algorithms

in Market Segmentation Analysis

Zhaohua Huang

Dec 12, 2005

Abstract

This project compares four machine learning methods, bagging, random forests (RFA), artificial neural network (ANN) and support vector machine (SVM), using the sales data of an orthopedic equipment company. All four methods show similarly unsatisfactory prediction performance on this data set, with misclassification rates of roughly 47% to 51% on the testing set. Although each method has its own advantages in predicting specific categories, ANN is relatively the best based on the misclassification rates.

1. Introduction

Bagging, random forests (RFA), artificial neural network (ANN) and support vector machine (SVM) are four useful machine learning methods that can be used to improve classification accuracy. Bagging produces replicated training sets by sampling with replacement from the training set and fits one classifier to each replicate. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The artificial neural network used here is a single-hidden-layer feed-forward network, in which the inputs are connected to the outputs through one layer of hidden units via a series of weights (an 8-20-4 network, see Appendix 7). The support vector machine for classification creates a hyperplane that separates the data into two classes with the maximum margin. In theory, random forests should yield lower error rates than bagging and be more robust to noise.
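The resampling step behind bagging can be sketched in a few lines. This is a minimal Python illustration of the general idea only, not the R routine used in this project; the helper names `bootstrap_sample` and `majority_vote`, the seed and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (one bagging replicate)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(labels):
    """Combine the class labels predicted by the individual classifiers."""
    return Counter(labels).most_common(1)[0][0]

rng = random.Random(0)       # fixed seed, chosen only for reproducibility
train = list(range(10))      # toy stand-in for the training observations
replicates = [bootstrap_sample(train, rng) for _ in range(5)]
```

Each replicate has the same size as the training set but usually repeats some observations and omits others; one classifier would be fitted per replicate and their predictions combined by voting (or by averaging for numeric predictions).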

In this project, we use a real data set to compare their classification performance empirically. The data set contains the sales of a company’s orthopedic equipment in 4703 hospitals and 13 feature variables that may explain the differences in sales among these hospitals. The four classification algorithms yield predicted probabilities for four sales categories, ordered from low to high: “no sale”, “low”, “high” and “very high”. These probabilities are then passed to linear discriminant analysis (LDA) for classification, and the misclassification rates are compared.

2. Analysis and Results

2.1 Overview

The procedure of analysis is summarized in the following diagram. Since RFA directly reports the classification result instead of probabilities, LDA is not applied to it.

2.2 Data Manipulation and PCA

The goal and method of data transformation are the same as in the previous project; the details are in Appendices 0 to 3. Five variables are closely related and highly correlated: “knee 95”, “knee 96”, “hip 95”, “hip 96” and “femur 96”, the numbers of knee, hip and femur operations in 1995 and 1996. Using them together as predictors would add unnecessary noise to the prediction. Therefore, after transforming the data, we apply principal component analysis (PCA) to these five variables (Appendix 4). Since the first component explains 91.4% of the variance and the second only about 4.6%, we keep only the first principal component, “V1”, which is a linear combination of the five highly correlated variables:

V1 = -0.456 hip95 - 0.445 knee95 - 0.458 hip96 - 0.445 knee96 - 0.432 femur96.

The other predictors do not show high correlation after transformation.
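The share of variance attributed to each component follows directly from the component standard deviations reported in Appendix 4: each variance is the squared standard deviation, divided by the total. The short Python check below is only a sketch of that arithmetic (the analysis itself was run in R); the variable names are chosen for this example.

```python
# Standard deviations of the five principal components, copied from Appendix 4.
sdev = [2.1373123, 0.4777930, 0.3386012, 0.2303204, 0.1866775]

# Variance of each component is the square of its standard deviation.
variances = [s * s for s in sdev]
total = sum(variances)

# Proportion of total variance explained by each component.
proportion = [v / total for v in variances]

# Running total, matching the "Cumulative Proportion" row of the R output.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)
```

The first entry of `proportion` is about 0.914 and the second about 0.046, which is the basis for keeping only the first component.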

[Diagram of the analysis procedure: Data Transformation, PCA, then Bagging / Random Forests / ANN / SVM, then LDA]

In addition, the variable “rbeds” is transformed into a binary categorical variable, which then carries the same information as the variable “rehab”, so we drop “rbeds”. We also try cross-validation to find a good combination of the predictors. The results change only slightly, and in most cases dropping any one variable weakens the prediction power. Hence, the final predictors are “V1”, “beds”, “outv”, “adm”, “sir”, “th”, “trauma” and “rehab”. These 8 predictors are described in the appendix. To examine the out-of-sample performance of the four methods, we randomly split the whole data set into two subsets: a training set with 4203 observations and a testing set with 500 observations.
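A random split of this kind can be sketched as follows. This is a Python illustration under stated assumptions, not the code used in the project: the function name `split_indices` and the seed are hypothetical, and only the sizes (4703 hospitals, 500 held out) come from the text.

```python
import random

def split_indices(n_total, n_test, seed):
    """Shuffle the row indices and hold out the first n_test of them for testing."""
    rng = random.Random(seed)
    idx = list(range(n_total))
    rng.shuffle(idx)
    return idx[n_test:], idx[:n_test]   # (training indices, testing indices)

# 4703 hospitals in total, 500 held out for testing; the seed is arbitrary.
train_idx, test_idx = split_indices(4703, 500, seed=2005)
```

Because the split is random, a different seed gives a different testing set, which is one more source of variation in the reported rates.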

2.3 Results of Bagging, RFA, ANN and SVM

The results of bagging, RFA, ANN and SVM are shown in Appendices 5 to 8, respectively. All four methods involve some randomness, so a given run may turn out well or badly. Hence we run each method several times and report only its best performance.

Specifically, we use the default settings for bagging; ntree = 1000 for RFA; size = 20 and maxit = 1000 for ANN; and kernel = "polynomial", degree = 6, gamma = 0.5, cross = 10 for SVM.

Bagging offers few options in R. We test different numbers of bags, from 5 to 100. The results differ slightly, but only because of the randomness of the method, so we keep the default setting. For RFA, we test numbers of trees from 100 to 1000. Although more trees do not improve the result significantly, the best result comes with 1000 trees.

ANN is very unstable. Sometimes training stops after only 5 iterations and sometimes it runs for 800 or more. In our runs, the more iterations, the better the result. The default maximum number of iterations is relatively small, so we set it to 1000, although training never runs that long. Our best result comes from a run that converges after 740 iterations.

SVM has many options. We try sigmoid, radial and polynomial kernels and, for the polynomial kernel, different degrees. In general, a polynomial kernel of high degree dominates the sigmoid and radial kernels. It fits the training set fairly well, with a misclassification rate as low as 42%, but no combination does well on the testing set. Also, the default gamma coefficient is 1 over the number of predictors, which is 1/8 = 0.125 in our study. However, the testing set performs better if we increase gamma to 0.6. Therefore, we suspect an overfitting issue with SVM.
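To make the roles of gamma and degree concrete, the sketch below evaluates a polynomial kernel of the form (gamma * u.v + coef0)^degree, the form used by libsvm-based implementations. It is a Python illustration only; the function name and the toy vectors are assumptions, and coef0 = 0 matches the value shown in Appendix 8.

```python
def polynomial_kernel(u, v, gamma, degree, coef0=0.0):
    """Polynomial kernel: (gamma * <u, v> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(u, v))
    return (gamma * dot + coef0) ** degree

# Default gamma: 1 over the number of predictors (8 in this study).
n_predictors = 8
default_gamma = 1.0 / n_predictors

# Toy vectors (not from the data) to show how gamma and degree scale the kernel.
u = [1.0, 2.0]
v = [3.0, 4.0]
k_default = polynomial_kernel(u, v, gamma=default_gamma, degree=6)
k_used = polynomial_kernel(u, v, gamma=0.5, degree=6)
```

With degree 6, raising gamma from 0.125 to 0.5 multiplies every kernel value by 4^6 = 4096 when coef0 is 0, so the choice of gamma interacts strongly with the cost parameter and with how closely the model follows the training set.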

In Appendix 9, we compare the misclassification rates to evaluate the prediction accuracy of the four classifiers. First, we look at their performance on the training and testing sets. On the training set, SVM has the lowest misclassification rate, 45.3%. However, its misclassification rate on the testing set is 50.8%. Compared with the relatively consistent performance of the other three methods, this could be a sign of overfitting. The result of bagging is also not good. Its misclassification rate on the training set is 48.5%, close to ANN’s 48.0%, but the 49.8% rate on the testing set means that bagging gets almost half of its predictions wrong. RFA performs stably, as expected: its misclassification rate on the training set (49.8%) is close to the one on the testing set (48.6%). But the result is still not satisfactory. Surprisingly, ANN dominates the other three methods. Its 48.0% rate on the training set is the second best, behind only the possibly overfitted SVM, and its 47.4% rate on the testing set is definitely the best. The problem with ANN is its randomness and inconsistency: one has to run it many times to get the best result, and there is no way to know whether that result is really the best one can get.
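The misclassification rates quoted here are one minus the share of observations on the diagonal of the confusion matrix. As a worked check, the Python sketch below applies this to the ANN testing-set matrix from Appendix 7; the function name is chosen for this example, and the project itself obtained the figures in R.

```python
# ANN confusion matrix on the testing set, copied from Appendix 7.
# Rows are the true categories (n, l, h, v); columns are the predictions.
ann_test = [
    [165, 0, 35, 6],    # true n
    [59, 12, 28, 15],   # true l
    [48, 3, 64, 15],    # true h
    [7, 0, 21, 22],     # true v
]

def misclassification_rate(cm):
    """One minus the fraction of observations on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return 1 - correct / total

ann_rate = misclassification_rate(ann_test)
```

Here 263 of the 500 testing observations lie on the diagonal, so the rate rounds to 0.474, the value listed for ANN in Appendix 9.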

As far as the specific categories are concerned, the performances of the four methods are close but not identical. In general, all four methods classify the “n” group well: the accuracy is about 80% or higher for RFA, ANN and SVM and above 70% for bagging. But they do extremely badly in the “l” group, where the misclassification rate is above 80%. They cannot reliably classify the “h” and “v” groups either: correct and incorrect predictions are roughly half and half. One reason could be that the categorization is not good: the difference between selling 10 and 50 pieces of equipment can be very subtle, so distinguishing low, high and very high sales is very difficult. If the number of categories were reduced to two, these methods could perform much better.

The four methods differ in their prediction abilities for the different categories of the response variable. In general, bagging does worse in “n” but fairly well in “v”. RFA has its weakness in “h” and “v”. SVM does very badly in “l” and “v”, but fairly well in “n” and “h”. ANN is acceptable for “l” and “h” and fairly good for “n” and “v”. Since the “n” group is the largest, the method that performs best in that group is likely to be the best one overall. In our case, on the testing set, it is ANN.

2.5 Something more

We also try the data mining tree (DMT) method, but it does not produce a result on our computer within two hours. Furthermore, the goal of DMT is to find interesting groups, which is somewhat different from the goal of this project. Therefore, we regretfully give it up for now and might try it again after upgrading the computer. We plan to try different categorizations to further test the prediction abilities, since the result with four categories is far from satisfactory. Besides, the ANN we use is the single-hidden-layer neural net provided by R; a more complicated ANN might perform better. However, due to the time limit, we have to leave all of this to future study.

3. Conclusion

The four methods do not yield ideal results. The artificial neural network performs acceptably and random forests show the expected robustness. The prediction abilities for the different categories of the response variable differ among the methods. The category “n” is relatively the easiest to predict, while none of the four methods can successfully identify the category “l”. There are several ways to improve the performance, such as reducing the number of categories or adding a region or state factor to the analysis. In summary, ANN relatively dominates the other three methods in the analysis of this orthopedic equipment data.

Appendix 0 The Notations of Variables

Response:

SALES12 : SALES OF REHAB. EQUIP. FOR THE LAST 12 MO

Features (predictors):

BEDS : NUMBER OF HOSPITAL BEDS

RBEDS : NUMBER OF REHAB BEDS

OUT-V : NUMBER OF OUTPATIENT VISITS

ADM : ADMINISTRATIVE COST(In $1000's per year)

SIR : REVENUE FROM INPATIENT

HIP95 : NUMBER OF HIP OPERATIONS FOR 1995

KNEE95 : NUMBER OF KNEE OPERATIONS FOR 1995

TH : TEACHING HOSPITAL? 0, 1

TRAUMA : DO THEY HAVE A TRAUMA UNIT? 0, 1

REHAB : DO THEY HAVE A REHAB UNIT? 0, 1

HIP96 : NUMBER HIP OPERATIONS FOR 1996

KNEE96 : NUMBER KNEE OPERATIONS FOR 1996

FEMUR96 : NUMBER FEMUR OPERATIONS FOR 1996

Appendix 1 Transformations of Selected Variables

beds = log(beds+1)

rbeds = 1 if rbeds >= 1

outv = 15*log(outv+215)

adm = 0.0001*log(adm+425)

sir <- log(0.1*sir+42)

hip95 <- log(3*hip95+11)

knee95 <- sqrt(log(3*knee95+15))

hip96 <- log(25*hip96+150)

knee96 <- log(5+10*knee96)

femur96 <- log(20*femur96+60)
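For readers who want to reproduce the transformed predictors outside R, the list above translates line by line into the following Python sketch. It assumes natural logarithms (as in R's `log`) and a record stored as a dictionary keyed by the lower-case variable names, both choices made for this example; the binary recoding of rbeds is left out because that variable is dropped from the final model.

```python
import math

def transform(rec):
    """Apply the Appendix 1 transformations to one hospital record (a dict)."""
    return {
        "beds": math.log(rec["beds"] + 1),
        "outv": 15 * math.log(rec["outv"] + 215),
        "adm": 0.0001 * math.log(rec["adm"] + 425),
        "sir": math.log(0.1 * rec["sir"] + 42),
        "hip95": math.log(3 * rec["hip95"] + 11),
        "knee95": math.sqrt(math.log(3 * rec["knee95"] + 15)),
        "hip96": math.log(25 * rec["hip96"] + 150),
        "knee96": math.log(5 + 10 * rec["knee96"]),
        "femur96": math.log(20 * rec["femur96"] + 60),
    }

# Hypothetical record with all counts equal to zero, used only as a smoke test.
names = ["beds", "outv", "adm", "sir", "hip95", "knee95", "hip96", "knee96", "femur96"]
example = transform({name: 0 for name in names})
```

The positive offsets inside each logarithm keep the argument above zero, so hospitals with zero counts remain valid inputs.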

Appendix 2 Distributions before Transformation

Appendix 3 Distributions after Transformation

Appendix 4 Result of PCA:

Call: princomp(x = check)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
2.1373123 0.4777930 0.3386012 0.2303204 0.1866775

5 variables and 4703 observations.

Importance of components:
                          Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
Standard deviation     2.1373123 0.47779303 0.33860116 0.23032037 0.18667749
Proportion of Variance 0.9138151 0.04566695 0.02293503 0.01061175 0.00697118
Cumulative Proportion  0.9138151 0.95948204 0.98241707 0.99302882 1.00000000

Loadings:
     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
[1,] -0.456        -0.529  0.162  0.697
[2,] -0.445 -0.548 -0.214 -0.612 -0.286
[3,] -0.458  0.126 -0.227  0.587 -0.615
[4,] -0.445 -0.344  0.749  0.261  0.233
[5,] -0.432  0.751  0.248 -0.432

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings       1.0    1.0    1.0    1.0    1.0
Proportion Var    0.2    0.2    0.2    0.2    0.2
Cumulative Var    0.2    0.4    0.6    0.8    1.0

Appendix 5 Result of Bagging:

       Length Class      Mode
y      4203   -none-     numeric
X         8   data.frame list
Mtrees   10   -none-     list
OOB       1   -none-     logical
comb      1   -none-     logical
call      4   -none-     call

Bagging regression trees with 20 bootstrap replications

Call: bagging.data.frame(formula = yy_ ~ v1 + beds + outv + adm + sir + th + trauma + rehab, data = hos.train, nbagg = 20, coob = T)

Training set:
        pred
y        h    l    n    v
  h    473   14  324  261
  l    257   91  401  102
  n    387    2 1314   87
  v    156    1   46  287
Misclassification rate = 0.485

Testing set (lda.predict):
        pred
y        h    l    n    v
  h     66    3   41   20
  l     30   12   52   20
  n     51    2  145    8
  v     18    0    4   28
Misclassification rate = 0.498

Appendix 6 Result of Random Forests:

Call: randomForest(x = xx.train, y = yf.train, xtest = xx.test, ytest = yf.test, ntree = 1000)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 2

OOB estimate of error rate: 49.75%
Confusion matrix:
        Pre
y        1    2    3    4  class.error
  1   1424   38  285   43    0.2044693
  2    448  107  242   54    0.8742656
  3    439   65  421  147    0.6072761
  4     88   12  230  160    0.6734694

Test set error rate: 48.6%
Confusion matrix:
        Pre
y        1    2    3    4  class.error
  1    164    2   35    5    0.2038835
  2     58   14   35    7    0.8771930
  3     51    5   61   13    0.5307692
  4      7    0   25   18    0.6400000

Appendix 7 Results of ANN

a 8-20-4 network with 264 weights
options were -
# weights: 264
initial value 3855.019060
iter 10 value 2583.006140
iter 20 value 2409.884200
iter 30 value 2357.671846
iter 40 value 2332.398369
......
iter 740 value 2198.189341
final value 2198.189123
converged

LDA result (training set):
        Pred
y        n    l    h    v
  n   1440    0  292   58
  l    461   92  219   79
  h    423   13  407  229
  v     64    1  178  247

Misclassification rate = 0.480

LDA result (testing set):
        Pred
y        n    l    h    v
  n    165    0   35    6
  l     59   12   28   15
  h     48    3   64   15
  v      7    0   21   22
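The weight count reported for the 8-20-4 network can be checked by hand. The Python sketch below assumes one bias term per hidden unit and per output unit and no direct input-to-output (skip-layer) connections; the function name is chosen for this example.

```python
def count_weights(n_in, n_hidden, n_out):
    """Weights of a single-hidden-layer network with bias terms and no skip connections."""
    input_to_hidden = (n_in + 1) * n_hidden     # +1 for the bias feeding each hidden unit
    hidden_to_output = (n_hidden + 1) * n_out   # +1 for the bias feeding each output unit
    return input_to_hidden + hidden_to_output

# 8 predictors, 20 hidden units (size = 20), 4 sales categories.
n_weights = count_weights(8, 20, 4)
```

This gives 9 * 20 + 21 * 4 = 264, in agreement with the network summary, which also confirms that the fitted model has one hidden layer of 20 units.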


Appendix 8 Results of SVM

Call: svm.default(x = xx.train, y = yf.train, kernel = "polynomial", degree = 6, gamma = 0.5, cross = 10)

Parameters:
   SVM-Type: C-classification
 SVM-Kernel: polynomial
       cost: 1
     degree: 6
      gamma: 0.5
     coef.0: 0

Number of Support Vectors: 3251 ( 779 1139 932 401 )

Number of Classes: 4
Levels: 1 2 3 4

10-fold cross-validation on training data:

Total Accuracy: 45.12557
Single Accuracies: 51.35135 51.08108 40.81081 40.70081 38.91892 37.02703 44.74394 42.43243 48.64865 55.52561

LDA result (training set):
        pred
y        1    2    3    4
  1   1500    9  257   24
  2    513   62  244   32
  3    420    9  579   64
  4     68    3  259  160

LDA result (testing set):
        pred
y        1    2    3    4
  1    163    4   38    1
  2     70    5   33    6
  3     56    1   65    8
  4      6    0   31   13


Appendix 9 Comparison of 4 Methods:

Misclassification rates

                           Training set                         Testing set
Method                     Overall  N      L      H      V      Overall  N      L      H      V
Bagging                    0.485    0.266  0.893  0.559  0.414  0.498    0.296  0.895  0.492  0.440
Random Forests             0.498    0.204  0.874  0.607  0.673  0.486    0.204  0.877  0.531  0.640
Artificial Neural Network  0.480    0.196  0.892  0.620  0.496  0.474    0.199  0.895  0.508  0.560
Support Vector Machine     0.453    0.162  0.927  0.460  0.673  0.508    0.209  0.956  0.500  0.740
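The overall rates in this table are the per-category rates weighted by the number of observations in each category. The Python sketch below checks this for the testing set, using the category sizes given by the row sums of the testing-set confusion matrices in Appendices 5 to 8 (206, 114, 130 and 50); the dictionary and function names are chosen for this example.

```python
# Testing-set category sizes (row sums of the confusion matrices): n, l, h, v.
sizes = [206, 114, 130, 50]

# Per-category testing-set misclassification rates from the table (n, l, h, v).
per_class = {
    "Bagging": [0.296, 0.895, 0.492, 0.440],
    "Random Forests": [0.204, 0.877, 0.531, 0.640],
    "Artificial Neural Network": [0.199, 0.895, 0.508, 0.560],
    "Support Vector Machine": [0.209, 0.956, 0.500, 0.740],
}

def overall_rate(rates, counts):
    """Average of the per-category rates, weighted by category size."""
    return sum(r * c for r, c in zip(rates, counts)) / sum(counts)

overall = {name: overall_rate(rates, sizes) for name, rates in per_class.items()}
```

Because the “n” category holds 206 of the 500 testing observations, a small difference in its error rate moves the overall rate more than a larger difference in the 50-observation “v” category.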
