Comparison of Machine Learning Algorithms
Post on 14-Jun-2015
Comparison of Machine Learning Algorithms in Market Segmentation Analysis
Zhaohua Huang
Dec 12, 2005

Abstract

This project compares four machine learning methods, bagging, random forests (RFA), artificial neural network (ANN) and support vector machine (SVM), using the sales data of an orthopedic equipment company. The results show that all four methods give similarly unsatisfactory prediction performance on this dataset. Although each method has its own advantages in predicting some specific categories, ANN is relatively the best in terms of misclassification rates.

1. Introduction

Bagging, random forests (RFA), artificial neural network (ANN) and support vector machine (SVM) are four useful machine learning methods that can be used to improve classification accuracy. Bagging produces replicated training sets by sampling with replacement from the training set and fits a classifier to each. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The artificial neural network used here is a single-layer network, which consists of only a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. The support vector machine for classification creates a hyperplane that separates the data into two classes with the maximum margin. Theoretically, random forests yield better error rates, at least compared with bagging, and are more robust to noise.

In this project, we use a real data set to compare their classification performance empirically. The data set contains the sales of a company's orthopedic equipment in 4703 hospitals and 13 feature variables that could potentially explain the differences in sales among these hospitals. The four classification algorithms yield predicted probabilities for four sales categories, from low to high: no sale, low, high, and very high. These probabilities are then fed into LDA for classification, and the misclassification rates are compared.

2. Analysis and Results

2.1 Overview

The procedure of analysis is summarized in the following diagram. Since RFA directly reports the classification result instead of probabilities, LDA is not applied to it.

[Diagram: Data Transformation; Bagging, Random Forests, ANN, SVM; LDA]

2.2 Data Manipulation and PCA

The goal and method of data transformation are the same as in the previous project, and the PCA details are in Appendix 0 to 3. There are 5 closely related and highly correlated variables: knee 95, knee 96, hip 95, hip 96 and femur 96. They are the numbers of knee, hip and femur operations in 1995 and 1996. Simply using them together as predictors would add unnecessary noise to the prediction. Therefore, we apply principal component analysis to these 5 variables after transforming the data. Since the first component explains 0.9138151 of the variance and the second only 0.05, we use only the first principal component V1, which is a linear combination of the five highly correlated variables:

V1 = -0.456*hip95 - 0.445*knee95 - 0.458*hip96 - 0.445*knee96 - 0.432*femur96

The other predictors do not show high correlation after transformation.

In addition, the variable rbeds is transformed into a binary categorical variable, which makes it the same as the variable rehab, so we drop rbeds. We also try cross validation to find a good combination of predictors. The change in the result is subtle, and in most cases dropping any one variable weakens the prediction power. Hence, the final predictors are V1, beds, outv, adm, sir, th, trauma and rehab. The description of these 8 predictors is in the appendix. To examine the out-of-sample performance of the four methods, we randomly split the whole data set into two subsets: a training set with 4203 observations and a testing set with 500 observations.

2.3 Results of bagging, RFA, ANN and SVM

The results of bagging, RFA, ANN and SVM are shown in Appendix 5 to 8, respectively. All four methods involve some randomness: sometimes the result is good and sometimes it is bad. Hence we run each of them several times and report only the best performance. Specifically, we choose the default settings for bagging (R offers few options for it), ntree=1000 for RFA, size=20, maxit=1000 for ANN, and kernel = "polynomial", degree = 6, gamma = 0.5, cross = 10 for SVM.
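
The first principal component V1 from Section 2.2 can be evaluated by hand from the reported loadings. The following is a minimal Python sketch, not the R code used in the project: the loadings are the ones quoted in the text, while the function names and the example values are hypothetical. The report does not say whether the variables were centered or scaled before PCA, so the sketch applies the loadings directly to the transformed values.

```python
import math

# Loadings of the first principal component as reported in Section 2.2.
LOADINGS = {
    "hip95": -0.456,
    "knee95": -0.445,
    "hip96": -0.458,
    "knee96": -0.445,
    "femur96": -0.432,
}


def first_component_score(row):
    """Return V1 for one hospital, given its five (already transformed) counts."""
    missing = set(LOADINGS) - set(row)
    if missing:
        raise KeyError(f"missing variables: {sorted(missing)}")
    return sum(weight * row[name] for name, weight in LOADINGS.items())


def loading_norm():
    """Euclidean length of the loading vector; close to 1 for a principal component."""
    return math.sqrt(sum(w * w for w in LOADINGS.values()))


# Hypothetical transformed values for one hospital (illustration only).
example_row = {"hip95": 1.0, "knee95": 1.0, "hip96": 1.0, "knee96": 1.0, "femur96": 1.0}
v1 = first_component_score(example_row)

print(f"V1 = {v1:.3f}")
print(f"norm of loadings = {loading_norm():.4f}")
```

The five loadings are nearly equal and their vector has length close to 1, so V1 behaves roughly like a (negatively signed) average of the five operation counts, which is what one would expect when the variables are highly correlated.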

We test different numbers of bags, from 5 to 100. The results differ slightly, but only because of the randomness of the method, so we use the default setting. The number of trees in RFA is also tested from 100 to 1000. Although increasing it does not improve the result significantly, we get the best result with 1000 trees.

ANN is very unstable. Sometimes training stops after only 5 iterations and sometimes it runs to 800 and above. In general, the higher the iteration number, the better the result. The default maximum number of iterations is relatively small, so we set it to 1000, though training never runs that long. Our best result comes from a run of 740 iterations.

There are many options for SVM. We try different kernels: sigmoid, radial and polynomial. For the polynomial kernel, we also try different degrees. In general, a polynomial kernel of high degree dominates sigmoid and radial. It fits the training set fairly well, with a misclassification rate as low as 42%, but none of the combinations does well on the testing set. Also, the gamma coefficient should be 1 over the number of parameters, which is 0.125 in our study with 8 predictors. However, the testing set performs better if we increase gamma to 0.6. Therefore, we suspect there is an overfitting issue with SVM.

In Appendix 9, we compare the misclassification rates to evaluate the prediction accuracy of the 4 classifiers. First, we focus on their performance on the training and testing sets. On the training set, SVM has the lowest misclassification rate, 45.3%. However, its testing-set misclassification rate is 50.8%. Compared with the relatively consistent performance of the other 3 methods, this could be a sign of overfitting. The result of bagging is also not good. Its training-set misclassification rate is 0.485, close to ANN's 0.480, but its 0.498 rate on the testing set means that almost half of bagging's predictions are incorrect. RFA performs quite stably, as expected.
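
The overfitting argument above rests on the gap between training-set and testing-set misclassification rates. The following Python sketch recomputes that gap from the SVM and bagging rates quoted in the text; it is an illustration, not the project's R code. The function names are hypothetical, and the toy labels are made up to show how a misclassification rate is obtained from predictions.

```python
def misclassification_rate(actual, predicted):
    """Share of observations whose predicted label differs from the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


def generalization_gap(train_rate, test_rate):
    """Testing-set rate minus training-set rate; a large positive gap hints at overfitting."""
    return test_rate - train_rate


# Rates quoted in the text, written as fractions.
reported = {
    "SVM": {"train": 0.453, "test": 0.508},
    "bagging": {"train": 0.485, "test": 0.498},
}
gaps = {name: generalization_gap(r["train"], r["test"]) for name, r in reported.items()}

# Toy labels (hypothetical, not from the data set) using the categories n, l, h, v.
toy_actual = ["n", "n", "l", "h", "v", "n"]
toy_predicted = ["n", "n", "n", "h", "h", "n"]
toy_rate = misclassification_rate(toy_actual, toy_predicted)

for name, gap in gaps.items():
    print(f"{name}: gap = {gap * 100:.1f} percentage points")
print(f"toy misclassification rate = {toy_rate:.3f}")
```

With the quoted figures, the SVM gap is about 5.5 percentage points against about 1.3 for bagging, which is the contrast the text reads as a sign of overfitting.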
For RFA, the misclassification rate on the training set is always close to the one on the testing set, but the result is still not satisfactory. Surprisingly, ANN dominates the other three methods. Its 48% rate on the training set is the second best, behind the possibly overfitted SVM, and its 47.4% rate on the testing set is definitely the best. The problem with ANN is its randomness and inconsistency: one has to run ANN many times to get the best result, and there is no way to know whether that result is really the best one can get.

As far as specific categories are concerned, the performances of the four methods are close but slightly different. In general, all four methods classify the n group very well, with accuracy rates all above 80%. But they do extremely badly in the l group, where the misclassification rate is above 80%. They also cannot reliably classify the h and v groups, where correct and incorrect predictions are about half and half. The reason could be that the categorization is not good: the difference between selling 10 and 50 pieces of equipment can be very subtle. Hence the classification among low, high and very high sales is very difficult. If the number of categories were reduced to 2, these methods could perform very well.

The four methods differ in their prediction abilities across the categories of the response variable. In general, bagging does worse in n but pretty well in v. RFA has its weakness in h and v. SVM does very badly in l and v, but pretty well in n and h. ANN is ok for l and h, and pretty good for n and v. Since the n group contains most of the observations, the method that performs best in that group is very likely to be the best overall. In our case, it is ANN.

2.5 Something more

We also try the data mining tree (DMT) method, but the result does not come out on our computer within two hours. Furthermore, the goal of DMT is to find interesting groups, which is slightly different from the goal of this project.
Therefore, we regretfully give it up for now, though we might try it again after upgrading the computer. We plan to try different categorizations to further test the prediction abilities, since the result with 4 categories is far from satisfactory. Besides, the ANN we use is the single-layer neural net provided by R; a more complicated ANN might perform better. However, due to the time limit, we have to leave all of this to future study.

3. Conclusion

The four methods do not yield ideal results. The artificial neural network is ok and random forests show the expected robustness. The prediction abilities for the different categories of the response variable differ among the methods. Category n is relatively the easiest to predict, while none of the four methods can successfully identify category l. There are several ways to improve the performance, such as reducing the number of categories or adding a region or state factor to the analysis. In summary, ANN relatively dominates the other three methods in the analysis of this orthopedic equipment data.

Reference:

Cabrera, J. and McDougall, A. (2001). Statistical Consulting, Springer-Verlag, New York.
Agresti, Alan (1996). An Introduction to Categorical Data Analysis, John Wiley & Sons, Canada.
Hastie, Tibshirani, and Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics.
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S, Springer-Verlag, New York.
Breiman, Leo (2001). Random Forests, Machine Learning, 45, 5-32.

Appendix 0 The Notations of Variables

Response:
SALES12 : SALES OF REHAB. EQUIP. FOR THE LAST 12 MO

Features (predictors):
BEDS : NUMBER OF HOSPITAL BEDS
RBEDS : NUMBER OF REHAB BEDS
OUT-V : NUMBER OF OUTPATIENT VISITS
ADM : ADMINISTRATIVE COST (In $1000's per year)
SIR : REVENUE FROM INPATIENT
HIP95 : NUMBER OF HIP OPERATIONS FOR 1995
KNEE95 : NUMBER OF KNEE OPERATIONS FOR 1995
TH : TEACHING HOSPITAL? 0, 1
TRAUMA : DO THEY HAVE A TRAUMA UNIT? 0, 1
REHAB : DO THEY HAVE A REHAB UNIT? 0, 1
HIP96 : NUMBER HIP OPERATIONS FOR 1996
KNEE96 : NUMBER KNEE OPERATIONS FOR 1996
FEMUR96 : NUMBER FEMUR OPERATIONS FOR 1996

Appendix 1 Transformations of Selected Variables

beds = log(beds+1)
rbeds = 1 if rbeds 1
outv = 15*log(outv+215)
adm = 0.0001*log(adm+425)
sir