
[IEEE 2010 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT 2010) - Chengdu, China (2010.07.9-2010.07.11)]

Notice of Retraction

After careful and considered review of the content of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE's Publication Principles. We hereby retract the content of this paper. Reasonable effort should be made to remove all past references to this paper.

The presenting author of this paper has the option to appeal this decision by contacting [email protected].


Diagnosis of Tuberculosis using Ensemble methods

Asha. T1, Dr. S. Natarajan2, Dr. K.N.B. Murthy3

Department of Computer Science, P.E.S.I.T., Bangalore, INDIA

[email protected], [email protected], [email protected]

Abstract--- Classification of medical data is an important task in the prediction of any disease, and it helps doctors in their diagnosis decisions. An ensemble classifier generates a set of classifiers instead of a single classifier for the classification of a new object, in the hope that combining the answers of multiple classifiers results in better performance. Tuberculosis (TB) is a disease caused by the bacterium Mycobacterium tuberculosis. It usually spreads through the air and attacks individuals with low immunity; HIV patients are more likely to be attacked by TB. It is also an important health problem in India, and the diagnosis of pulmonary tuberculosis has always been a problem. The main task carried out in this paper is the comparison of classification techniques for TB based on two categories, namely pulmonary tuberculosis (PTB) and retroviral PTB, using ensemble classifiers such as Bagging, AdaBoost and Random forest trees.

Key words- tuberculosis; classification; ensemble classifiers; PTB; retroviral PTB

I. INTRODUCTION

Data classification using knowledge obtained from known historical data has been one of the most intensively studied subjects in statistics, decision science and computer science. Data mining techniques have been applied to medical services in several areas, including prediction of the effectiveness of surgical procedures, medical tests and medication, and the discovery of relationships among clinical and diagnostic data. To help clinicians diagnose the type of disease, computerized data mining and decision support tools are used; these help clinicians process the huge amount of data available from previously solved cases and suggest a probable diagnosis based on the values of several important attributes [22].

An ensemble of classifiers has proved to be a very effective way to improve classification accuracy, because uncorrelated errors made by individual classifiers can be removed by voting. A classifier that uses a single minimal set of classification rules to classify future examples may make mistakes. An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples. Many research results have shown that such multiple classifiers, if appropriately combined during classification, can improve classification accuracy.

978-1-4244-5540-9/10/$26.00 ©2010 IEEE


India has the world's highest burden of tuberculosis (TB), with millions of estimated incident cases per year. It also ranks among the countries with the world's highest HIV burden, with an estimated 2.3 million persons living with HIV/AIDS. Tuberculosis is much more likely to be a fatal disease among HIV-infected persons than among persons without HIV infection [21]. It is a disease caused by a mycobacterium which can affect virtually all organs, not sparing even relatively inaccessible sites. The microorganisms usually enter the body by inhalation through the lungs, and spread from the initial location in the lungs to other parts of the body via the bloodstream. They present a diagnostic dilemma even for physicians with a great deal of experience of this disease.

Previous work [4] diagnosed tuberculosis using artificial neural networks (ANNs), with a multilayer NN and a General Regression NN. In our paper, a study on tuberculosis is carried out that classifies TB data into two categories: pulmonary tuberculosis (PTB) and retroviral PTB. Ensemble classifiers such as Bagging, AdaBoost and Random forest trees are used for classification, and their sensitivity, specificity and accuracy are compared. Results show that the Bagging method performs well compared to AdaBoost and Random forest trees.

The paper is organized as follows: Section II presents related work, Section III describes the medical data, and Sections IV and V describe ensemble classification and the Bagging, AdaBoost and Random forest tree algorithms. Section VI explains the different performance measures, and Section VII presents the experimental results, followed by the conclusions.

II. LITERATURE SURVEY

Minou Rabiei et al. [1] use tree-based ensemble classifiers for the diagnosis of excess water production. Their results demonstrate the applicability of this technique for the successful diagnosis of water production problems. Hongqi Li, Haifeng Guo and team [2] present a comprehensive comparative study on petroleum exploration and production using five feature selection methods, including expert judgment, CFS, LVF, Relief-F and SVM-RFE, and fourteen algorithms from five distinct kinds of classification methods, including decision trees, artificial neural networks, support vector machines (SVM), Bayesian networks and ensemble learning. Zhenzheng Ouyang, Min Zhou, Tao Wang and Quanyuan Wu [3] propose a method, called WEAP-I, which trains a weighted ensemble classifier


on the most recent n data chunks and trains an averaging ensemble classifier on the most recent data chunk. All the base classifiers are combined to form the WEAP-I ensemble classifier.

Orhan Er and Temurtas [4] present a study on tuberculosis diagnosis carried out with the help of multilayer neural networks (MLNNs). For this purpose, an MLNN with two hidden layers and a genetic algorithm as the training algorithm was used. A data mining approach was adopted to classify genotypes of Mycobacterium tuberculosis using the C4.5 algorithm [5]. An evaluation of the performance of two decision tree procedures and four Bayesian network classifiers as potential decision support systems in the cytodiagnosis of breast cancer was carried out in [6].

The paper "Mining Several Data Bases with an Ensemble of Classifiers" [7] analyzes two types of conflicts: one created by data inconsistency within the area of intersection of the databases, and the second created when the meta-method selects different data mining methods with inconsistent competence maps for the objects of the intersected part and their combinations; it also suggests ways to handle them. The referenced paper [8] studies medical data classification methods, comparing decision trees and system reconstruction analysis as applied to heart disease medical data mining. Under many circumstances, single classifiers such as neural networks, support vector machines and decision trees exhibit worse performance; to further enhance performance, a combination of these methods in a multi-level combination scheme was proposed that improves efficiency [9]. Paper [10] demonstrates the use of abductive network classifier committees trained on different features for improving classification accuracy in medical diagnosis.

The paper "MReC4.5: C4.5 ensemble classification with MapReduce" [11] takes advantage of C4.5, ensemble learning and the MapReduce computing model, and proposes a new method, MReC4.5, for parallel and distributed ensemble classification. Seppo Puuronen and team [12] propose a similarity evaluation technique that uses a training set consisting of predicates that define relationships within three sets: the set of instances, the set of classes, and the set of classifiers. Lei Chen and Mohamed S. Kamel [13] propose the scheme of Multiple Input Representation-Adaptive Ensemble Generation and Aggregation (MIR-AEGA) for the classification of time series data. Kai Jiang et al. [14] propose a neural network ensemble model for the classification of incomplete data. In this method, the incomplete dataset is divided into a group of complete sub-datasets, which are then used as the training sets for the neural networks.

III. MEDICAL DATA SET DESCRIPTION

The medical dataset we are classifying includes 250 real records of patients suffering from TB, obtained from a state hospital. The entire dataset is put in one file containing all the records, where each record corresponds to the most relevant information of one patient. In total there are 11 attributes (symptoms) and one class attribute. The symptoms of each patient, namely age, chroniccough(weeks), weightloss, intermittentfever(days), nightsweats, sputum, bloodcough, chestpain, HIV, radiographicfindings and wheezing, together with the class, are considered as attributes.

Table I shows the names of the 12 attributes considered, along with their data type (DT). Type N indicates numerical and C categorical.

TABLE I. LIST OF ATTRIBUTES AND THEIR DATA TYPES

No   Name                       DT
1    age                        N
2    chroniccough(weeks)        N
3    weightloss                 C
4    intermittentfever(days)    N
5    nightsweats                C
6    bloodcough                 C
7    chestpain                  C
8    HIV                        C
9    radiographicfindings       C
10   sputum                     C
11   wheezing                   C
12   class                      C
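To make the record layout concrete, a single patient record with the attributes of Table I can be represented as a simple key-value structure. The values below are invented for illustration and are not taken from the hospital data:

```python
# One hypothetical patient record using the 12 attributes of Table I.
# Every value here is invented for illustration (N = numerical, C = categorical).
record = {
    "age": 34,                           # N
    "chroniccough(weeks)": 3,            # N
    "weightloss": "yes",                 # C
    "intermittentfever(days)": 10,       # N
    "nightsweats": "yes",                # C
    "bloodcough": "no",                  # C
    "chestpain": "yes",                  # C
    "HIV": "negative",                   # C
    "radiographicfindings": "abnormal",  # C
    "sputum": "positive",                # C
    "wheezing": "no",                    # C
    "class": "PTB",                      # class attribute (PTB or retroviral PTB)
}
```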

IV. ENSEMBLE CLASSIFICATION

Basic classification techniques predict the class labels of unknown examples using a single classifier induced from training data. We can improve classification accuracy by aggregating the predictions of multiple classifiers. These techniques are known as ensemble or classifier combination methods. An ensemble method constructs a set of base classifiers from the training data and performs classification by taking a vote on the predictions made by each base classifier. Each method combines a series of k learned models M1, M2, ..., Mk with the aim of creating an improved composite model M*. The following are examples of ensemble classifiers.

In the case of Bagging, votes are collected from each classifier and the decision is made based on the majority vote. Boosting works by assigning a weight to each classifier's decision, and the final decision is made based on the combination of weighted decisions. Random Forest combines the predictions made by multiple decision trees, where each tree is generated based on the values of an independent set of random vectors.
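The three methods can be sketched in code. The paper's experiments used WEKA; the fragment below instead uses scikit-learn's implementations on a synthetic dataset, purely as an illustrative stand-in for the hospital data:

```python
# Illustrative sketch only: synthetic data stands in for the TB records,
# and scikit-learn stands in for the WEKA setup used in the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

# 250 samples and 11 features mirror the shape of the TB dataset.
X, y = make_classification(n_samples=250, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Bagging": BaggingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "accuracy:", round(model.score(X_te, y_te), 3))
```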

V. ALGORITHMS

AdaBoost


Given (x1, y1), ..., (xm, ym), where xi ∈ X and yi ∈ Y = {-1, +1}.

Initialise D1(i) = 1/m.

For t = 1, ..., T:

• Find the classifier ht : X → {-1, +1} that minimizes the error with respect to the distribution Dt:

  ht = argmin_{hj} εj, where εj = Σ_{i=1}^{m} Dt(i) [yi ≠ hj(xi)]

• Prerequisite: εt < 0.5, otherwise stop.

• Choose αt ∈ R. Typically αt = (1/2) ln((1 − εt) / εt), where εt is the weighted error rate of classifier ht.

• Update

  Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt

  where Zt is a normalisation factor (chosen so that Dt+1 will be a distribution).

Output the final classifier:

  H(x) = sign(Σ_{t=1}^{T} αt ht(x))

Bagging

1. For m = 1 to M:   // M: number of iterations
   a) Draw (with replacement) a bootstrap sample Sm of the data.
   b) Learn a classifier Cm from Sm.
2. For each test example:
   a) Try all classifiers Cm.
   b) Predict the class that receives the highest number of votes.

Random Forest

1. Choose T, the number of trees to grow.
2. Choose m, the number of variables used to split each node. m should be much less than M, where M is the number of input variables; m is held constant while growing the forest.
3. Grow T trees. When growing each tree, do the following:
   (a) Construct a bootstrap sample of size n sampled from Sn with replacement, and grow a tree from this bootstrap sample.
   (b) When growing the tree, at each node select m variables at random and use them to find the best split.
   (c) Grow the tree to its maximal extent. There is no pruning.
4. To classify a point X, collect votes from every tree in the forest and then use majority voting to decide on the class label.
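The AdaBoost procedure listed above translates directly into code. The sketch below uses one-level decision stumps as the base classifiers ht; it follows the pseudocode's weight-update rule but is our own illustration, not the paper's implementation:

```python
# Minimal from-scratch AdaBoost with decision stumps as base classifiers.
# Illustrative only; labels y must be in {-1, +1} as in the pseudocode.
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Predict +1/-1 by thresholding a single feature."""
    pred = np.ones(len(X))
    if polarity == 1:
        pred[X[:, feature] < threshold] = -1
    else:
        pred[X[:, feature] >= threshold] = -1
    return pred

def fit_adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                  # D1(i) = 1/m
    ensemble = []
    for _ in range(T):
        best, best_err = None, np.inf
        # find the stump h_t minimizing the weighted error under D_t
        for f in range(X.shape[1]):
            for thr in np.unique(X[:, f]):
                for pol in (1, -1):
                    pred = stump_predict(X, f, thr, pol)
                    err = D[pred != y].sum()
                    if err < best_err:
                        best_err, best = err, (f, thr, pol)
        if best_err >= 0.5:                  # prerequisite eps_t < 0.5, else stop
            break
        eps = max(best_err, 1e-10)           # guard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = 1/2 ln((1-eps)/eps)
        pred = stump_predict(X, *best)
        D = D * np.exp(-alpha * y * pred)    # up-weight misclassified examples
        D /= D.sum()                         # Z_t normalisation
        ensemble.append((alpha, best))
    return ensemble

def predict_adaboost(ensemble, X):
    # H(x) = sign(sum_t alpha_t h_t(x))
    score = sum(a * stump_predict(X, f, thr, pol)
                for a, (f, thr, pol) in ensemble)
    return np.sign(score)
```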

VI. PERFORMANCE MEASURES

Supervised machine learning (ML) has several ways of evaluating the performance of learning algorithms and the classifiers they produce. Measures of the quality of classification are built from a confusion matrix, which records correctly and incorrectly recognized examples for each class. Table II presents a confusion matrix for binary classification, where TP, FP, FN and TN are the true positive, false positive, false negative and true negative counts, and Table III shows the various measures.

TABLE II. CONFUSION MATRIX

                              Predicted Label
                         Positive              Negative
Known     Positive       True Positive (TP)    False Negative (FN)
Label     Negative       False Positive (FP)   True Negative (TN)

TABLE III. DIFFERENT PERFORMANCE MEASURES

Measure              Formula                                Intuitive Meaning
Precision            TP / (TP + FP)                         The percentage of positive predictions that are correct.
Recall/Sensitivity   TP / (TP + FN)                         The percentage of positive-labeled instances that were predicted as positive.
Specificity          TN / (TN + FP)                         The percentage of negative-labeled instances that were predicted as negative.
Accuracy             (TP + TN) / (TP + TN + FP + FN)        The percentage of predictions that are correct.
F-measure            2 × Recall × Precision / (Recall + Precision)   Harmonic mean of precision and recall.
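The formulas in Table III translate directly into code. The counts used in the example call below are made up purely to show the calculation:

```python
# Compute the Table III measures from the four confusion-matrix counts.
def measures(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, specificity, accuracy, f_measure

# Made-up counts, not the paper's data.
precision, recall, specificity, accuracy, f_measure = measures(tp=40, fp=5, fn=10, tn=45)
```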

VII. EXPERIMENTAL RESULTS

In bioinformatics, and in machine learning in general, there is large variation in the measures used to evaluate prediction systems. Often in biological applications the majority of the examples are negative, and there specificity and accuracy will always be high as long as the classifier is not predicting too many positives. Here we use evaluation measures such as sensitivity (recall), specificity and accuracy to compare the results. The WEKA software (Waikato Environment for Knowledge Analysis) was used for simulation purposes. Table IV provides the results of all three classifiers. Experimental results show that Bagging and AdaBoost perform well compared to the Random Forest trees algorithm. The high recall for retroviral PTB in all classifiers ensures that very few positive examples are misclassified as the negative class.
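One detail behind these numbers: in a two-class problem, the recall of one class coincides with the specificity of the other, since the negatives of one class are exactly the positives of the other. A small check with invented labels:

```python
# With two classes, recall of class A equals specificity of class B:
# B's correctly rejected negatives are exactly A's correctly accepted positives.
def recall_for(labels, preds, positive):
    """Fraction of examples labeled `positive` that were predicted correctly."""
    relevant = [(l, p) for l, p in zip(labels, preds) if l == positive]
    return sum(1 for l, p in relevant if p == l) / len(relevant)

# Invented labels, not the paper's patient records.
y_true = ["PTB", "PTB", "PTB", "RPTB", "RPTB"]
y_pred = ["PTB", "PTB", "RPTB", "RPTB", "RPTB"]

recall_ptb = recall_for(y_true, y_pred, "PTB")    # sensitivity for PTB
recall_rptb = recall_for(y_true, y_pred, "RPTB")  # = specificity for PTB
```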

TABLE IV. COMPARISON OF DIFFERENT CLASSIFIER ACCURACIES


Classifier            Class            Sensitivity (Recall)   Specificity   Accuracy
AdaBoost              PTB              80%                    100%          96%
                      Retroviral PTB   100%                   80%
Bagging               PTB              84%                    100%          97%
                      Retroviral PTB   100%                   84%
Random Forest trees   PTB              68%                    98%           93%
                      Retroviral PTB   98%                    68%

The following Figure 1 shows the comparison of all three classifier accuracies.

Figure 1. Graph indicating the performance of all the classifiers

VIII. CONCLUSIONS

Tuberculosis is an important health concern, as it is also associated with AIDS. Retrospective studies of tuberculosis suggest that active tuberculosis accelerates the progression of HIV infection. Recently, intelligent methods such as ANNs have been used intensively for classification tasks. In this article we applied the ensemble classification techniques AdaBoost, Bagging and Random Forest to classifying tuberculosis (TB). The data were obtained from a state hospital and mainly include twelve preliminary symptoms (attributes). The data are classified into two categories, namely pulmonary tuberculosis (PTB) and retroviral PTB, i.e. TB along with AIDS. Evaluation measures such as sensitivity, specificity and accuracy are used for comparison. Random Forest is found to be weak, with 93% accuracy, against 97% for Bagging and 96% for AdaBoost.

REFERENCES

[1] Minou Rabiei and Ritu Gupta, "Excess Water Production Diagnosis in Oil Fields Using Ensemble Classifiers," 2009, IEEE.

[2] Hongqi Li, Haifeng Guo, Haimin Guo and Zhaoxu Meng, "Data Mining Techniques for Complex Formation Evaluation in Petroleum Exploration and Production: A Comparison of Feature Selection and Classification Methods," in Proc. 2008 IEEE Pacific-Asia Workshop on Computational Intelligence and Industrial Application, Volume 01, pp. 37-43.

[3] Zhenzheng Ouyang, Min Zhou, Tao Wang and Quanyuan Wu, "Mining Concept-Drifting and Noisy Data Streams Using Ensemble Classifiers," 2009 International Conference on Artificial Intelligence and Computational Intelligence, Nov. 2009, pp. 360-364.

[4] Orhan Er, Feyzullah Temurtas and A. C. Tanrikulu, "Tuberculosis disease diagnosis using Artificial Neural Networks," Journal of Medical Systems, Springer, 2008, DOI 10.1007/s10916-008-9241-x, online.

[5] M. Sebban, I. Mokrousov, N. Rastogi and C. Sola, "A data-mining approach to spacer oligonucleotide typing of Mycobacterium tuberculosis," Bioinformatics, Oxford University Press, Vol. 18, issue 2, 2002, pp. 235-243.

[6] Nicandro Cruz-Ramírez, Héctor-Gabriel Acosta-Mesa, Humberto Carrillo-Calvet and Rocío-Erandi Barrientos-Martínez, "Discovering interobserver variability in the cytodiagnosis of breast cancer using decision trees and Bayesian networks," Applied Soft Computing, Elsevier, Volume 9, issue 4, September 2009, pp. 1331-1342.

[7] Seppo Puuronen, Vagan Terziyan and Alexander Logvinovsky, "Mining Several Data Bases with an Ensemble of Classifiers," in Proc. 10th International Conference on Database and Expert Systems Applications, Vol. 1677, pp. 882-891, 1999.

[8] Tzung-I Tang, Gang Zheng, Yalou Huang and Guangfu Shu, "A Comparative Study of Medical Data Classification Methods Based on Decision Tree and System Reconstruction Analysis," IEMS, Vol. 4, issue 1, June 2005, pp. 102-108.

[9] G. L. Tsirogiannis, D. Frossyniotis, J. Stoitsis, S. Golemati, A. Stafylopatis and K. S. Nikita, "Classification of medical data with a robust multi-level combination scheme," in Proc. 2004 IEEE International Joint Conference on Neural Networks, 25-29 July 2004, Volume 3, pp. 2483-2487.

[10] R. E. Abdel-Aal, "Improved classification of medical data using abductive network committees trained on different feature subsets," Computer Methods and Programs in Biomedicine, Volume 80, Issue 2, 2005, pp. 141-153.

[11] Gongqing Wu, Haiguang Li, Xuegang Hu, Yuanjun Bi, Jing Zhang and Xindong Wu, "MReC4.5: C4.5 ensemble classification with MapReduce," in Proc. 2009 Fourth ChinaGrid Annual Conference, pp. 249-255, 2009.

[12] Seppo Puuronen and Vagan Terziyan, "A Similarity Evaluation Technique for Data Mining with an Ensemble of Classifiers," Cooperative Information Agents III, Third International Workshop, CIA 1999, pp. 163-174.

[13] Lei Chen and Mohamed S. Kamel, "New Design of Multiple Classifier System and its Application to the Time Series Data," IEEE International Conference on Systems, Man and Cybernetics, 2007, pp. 385-391.

[14] Kai Jiang, Haixia Chen and Senmiao Yuan, "Classification for Incomplete Data Using Classifier Ensembles," Neural Networks and Brain, 2005.

[15] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. McLachlan, A. Ng, B. Liu, P. Yu, Z.-H. Zhou, M. Steinbach, D. Hand and D. Steinberg, "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37, January 2008.

[16] R. J. Quinlan, "Bagging, boosting, and C4.5," in AAAI/IAAI: Proceedings of the 13th National Conference on Artificial Intelligence and 8th Innovative Applications of Artificial Intelligence Conference, Portland, Oregon, AAAI Press / The MIT Press, Vol. 1, 1996, pp. 725-730.

[17] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[18] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann, 2005.