

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 7, July 2011

Effective Classification Algorithms to Predict the Accuracy of Tuberculosis - A Machine Learning Approach

Asha. T
Dept. of Info. Science & Engg.
Bangalore Institute of Technology
Bangalore, INDIA

S. Natarajan
Dept. of Info. Science & Engg.
P.E.S. Institute of Technology
Bangalore, INDIA

K. N. B. Murthy
Dept. of Info. Science & Engg.
P.E.S. Institute of Technology
Bangalore, INDIA

Abstract — Tuberculosis is a disease caused by mycobacteria which can affect virtually all organs, not sparing even the relatively inaccessible sites. India has the world's highest burden of tuberculosis (TB), with millions of estimated incident cases per year. Studies suggest that active tuberculosis accelerates the progression of Human Immunodeficiency Virus (HIV) infection, and tuberculosis is much more likely to be fatal among HIV-infected persons than among persons without HIV infection. Diagnosis of pulmonary tuberculosis has always been a problem. Classification of medical data is an important task in the prediction of any disease, and it can also support doctors in their diagnostic decisions. In this paper we propose a machine learning approach to compare the performance of basic learning classifiers and ensembles of classifiers on tuberculosis data. The classification models were trained on real data collected from a city hospital. The trained models were then used to predict tuberculosis in two categories: Pulmonary Tuberculosis (PTB) and Retroviral PTB (RPTB), i.e. TB together with Acquired Immune Deficiency Syndrome (AIDS). The prediction accuracy of the classifiers was evaluated using 10-fold cross-validation and the results were compared to identify the best classifier. The results indicate that the Support Vector Machine (SVM) performs best among the basic learning classifiers and Random Forest among the ensembles, both with an accuracy of 99.14%. Various other measures such as specificity, sensitivity, F-measure and ROC area have also been used in the comparison.

Keywords — Machine learning; Tuberculosis; Classification; PTB; Retroviral PTB

I. INTRODUCTION

There is an explosive growth of bio-medical data, ranging from those collected in pharmaceutical studies and cancer therapy investigations to those identified in genomics and proteomics research. The rapid progress in data mining research has led to the development of efficient and scalable methods to discover knowledge from these data. Medical data mining is an active research area under data mining, since medical databases have accumulated large quantities of information about patients and their clinical conditions. Relationships and patterns hidden in this data can provide new medical knowledge, as has been proved in a number of medical data mining applications.

The process of classifying data using knowledge obtained from known historical data has been one of the most intensively studied subjects in statistics, decision science and computer science. Data mining techniques have been applied to medical services in several areas, including prediction of the effectiveness of surgical procedures, medical tests and medication, and the discovery of relationships among clinical and diagnostic data. To help clinicians diagnose the type of disease, computerized data mining and decision support tools are used; these tools can process the large amount of data available from previously solved cases and suggest a probable diagnosis based on the values of several important attributes. There have been numerous comparisons of the different classification and prediction methods, and the matter remains a research topic: no single method has been found to be superior over all others for all data sets.

India has the world's highest burden of tuberculosis (TB), with millions of estimated incident cases per year. It also ranks [20] among the countries with the highest HIV burden, with an estimated 2.3 million persons living with HIV/AIDS. Tuberculosis is much more likely to be fatal among HIV-infected persons than among persons without HIV infection. It is a disease caused by mycobacteria which can affect virtually all organs, not sparing even the relatively inaccessible sites. The microorganisms usually enter the body by inhalation through the lungs and spread from the initial location in the lungs to other parts of the body via the blood stream. They present a diagnostic dilemma even for physicians with a great deal of experience with this disease.

II. RELATED WORK

Orhan Er and Temurtas [1] present a study on tuberculosis diagnosis carried out with the help of Multilayer Neural Networks (MLNNs). For this purpose, an MLNN with two hidden layers, trained with a genetic algorithm, was used. A data mining approach was adopted to classify genotypes of Mycobacterium tuberculosis using the C4.5 algorithm [2]. Rethabile Khutlang et al. present methods for the

http://sites.google.com/site/ijcsis/ (ISSN 1947-5500)


automated identification of Mycobacterium tuberculosis in images of Ziehl–Neelsen (ZN) stained sputum smears obtained using a bright-field microscope. They segment candidate bacillus objects using a combination of two-class pixel classifiers [3].

Sejong Yoon and Saejoon Kim [4] propose a mutual information-based Support Vector Machine Recursive Feature Elimination (SVM-RFE) as a classification method with feature selection. Diagnosis of breast cancer using different classification techniques has been carried out [5, 6, 7, 8]. A new constrained-syntax genetic programming algorithm [9] was developed to discover classification rules for diagnosing certain pathologies. Kwokleung Chan et al. [10] used several machine learning and traditional classifiers for the classification of glaucoma disease and compared their performance using ROC curves. Various classification algorithms based on statistical and neural network methods have been presented and tested for quantitative tissue characterization of diffuse liver disease from ultrasound images [11], and classifiers have been compared for sleep apnea detection [18]. Ranjit Abraham et al. [19] propose a new feature selection algorithm, CHI-WSS, to improve the classification accuracy of Naive Bayes on medical datasets.

Minou Rabiei et al. [12] use tree-based ensemble classifiers for the diagnosis of excess water production; their results demonstrate the applicability of this technique to the successful diagnosis of water production problems. Hongqi Li, Haifeng Guo et al. [13] present a comprehensive comparative study on petroleum exploration and production using five feature selection methods, including expert judgment, CFS, LVF, Relief-F and SVM-RFE, and fourteen algorithms from five distinct kinds of classification methods, including decision trees, artificial neural networks, support vector machines (SVM),

Bayesian networks and ensemble learning. The paper "Mining Several Data Bases with an Ensemble of Classifiers" [14] analyzes two types of conflicts, one created by data inconsistency within the area of the intersection of the databases, and the other created when the meta-method selects different data mining methods with inconsistent competence maps for the objects of the intersected part, and suggests ways to handle them. Reference [15] studies medical data classification methods, comparing decision trees and system reconstruction analysis as applied to heart disease medical data mining. Under most circumstances single classifiers, such as neural networks, support vector machines and decision trees, exhibit worse performance; in order to further enhance performance, a multi-level combination scheme of these methods was proposed that improves efficiency [16]. Paper [17] demonstrates the use of abductive network classifier committees trained on different feature subsets for improving classification accuracy in medical diagnosis.

III. DATA SOURCE

The medical dataset we are classifying includes 700 real records of patients suffering from TB, obtained from a city hospital. The entire dataset is stored in one file with many records, each corresponding to the most relevant information of one patient. The doctor's initial queries about symptoms and some required test details of the patients have been taken as the main attributes. In total there are 11 symptom attributes and one class attribute: age, chronic cough (weeks), loss of weight, intermittent fever (days), night sweats, sputum, blood cough, chest pain, HIV, radiographic findings and wheezing, plus the class.

Table I shows the names of the 12 attributes along with their data types (DT), where N indicates numerical and C categorical.

Table I. List of Attributes and their Data Types

No  Name                   DT
1   Age                    N
2   Chroniccough(weeks)    N
3   WeightLoss             C
4   Intermittentfever      N
5   Nightsweats            C
6   Bloodcough             C
7   Chestpain              C
8   HIV                    C
9   Radiographicfindings   C
10  Sputum                 C
11  Wheezing               C
12  Class                  C
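The mixed numeric/categorical schema of Table I has to be encoded numerically before most learners can consume it. The Python sketch below shows one plausible encoding; the record values and the yes/no convention are illustrative assumptions, not taken from the actual hospital data.

```python
# Minimal sketch: encoding one hypothetical patient record following the
# 12-attribute schema of Table I (values are invented for illustration).
NUMERIC = {"Age", "Chroniccough(weeks)", "Intermittentfever"}

def encode(record):
    """Map a raw record (dict) to a numeric feature vector.

    Categorical yes/no symptoms become 1/0; numeric fields pass through.
    The "Class" attribute is the target, so it is excluded.
    """
    features = []
    for name, value in record.items():
        if name == "Class":
            continue  # target label, not a feature
        if name in NUMERIC:
            features.append(float(value))
        else:
            features.append(1.0 if value == "yes" else 0.0)
    return features

sample = {
    "Age": 42, "Chroniccough(weeks)": 3, "WeightLoss": "yes",
    "Intermittentfever": 10, "Nightsweats": "no", "Bloodcough": "no",
    "Chestpain": "yes", "HIV": "no", "Radiographicfindings": "yes",
    "Sputum": "yes", "Wheezing": "no", "Class": "PTB",
}
vector = encode(sample)
print(vector)  # 11 numeric features
```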

IV. CLASSIFICATION ALGORITHMS

SVM (SMO)
The original SVM algorithm was invented by Vladimir Vapnik. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input is a member of, which makes the SVM a non-probabilistic binary linear classifier. A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.
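The maximum-margin idea can be sketched on toy data. The example below uses scikit-learn's `SVC` with a linear kernel as a stand-in for Weka's SMO implementation; the data points are invented for illustration.

```python
# A linear SVM separating two well-separated toy clusters in 2-D.
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [0, 1], [8, 8], [9, 9], [8, 9]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")  # fits the maximum-margin hyperplane
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [8.5, 8.5]]))  # one point near each cluster
```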

K-Nearest Neighbors (IBK)
The k-nearest neighbors algorithm (k-NN) is a method [22] for classifying objects based on the closest training examples in the


feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small).
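The majority-vote rule described above can be written from scratch in a few lines; the points and labels below are invented, and Weka's IBK adds refinements (e.g. optional distance weighting) not shown here.

```python
# Minimal from-scratch k-NN: classify a query point by majority vote
# among its k nearest training points (Euclidean distance).
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k nearest neighbours of query."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["PTB", "PTB", "PTB", "RPTB", "RPTB", "RPTB"]
print(knn_predict(train_X, train_y, [0.5, 0.5]))  # "PTB"
```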

Naive Bayesian Classifier (Naive Bayes)
The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions [23]. In probability theory, Bayes' theorem shows how one conditional probability (such as the probability of a hypothesis given observed evidence) depends on its inverse (in this case, the probability of that evidence given the hypothesis). In more technical terms, the theorem expresses the posterior probability (i.e. after evidence E is observed) of a hypothesis H in terms of the prior probabilities of H and E, and the probability of E given H. It implies that evidence has a stronger confirming effect if it was more unlikely before being observed.
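A small numerical illustration of Bayes' theorem as stated above, computing the posterior P(H|E) from the prior and the likelihoods; all probabilities here are invented for illustration, not drawn from the paper's data.

```python
# Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E), with P(E) expanded by
# the law of total probability over H and not-H.
p_h = 0.01              # prior: P(hypothesis), e.g. P(disease)
p_e_given_h = 0.95      # likelihood: P(evidence | hypothesis)
p_e_given_not_h = 0.10  # P(evidence | no hypothesis)

p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)  # total probability
posterior = p_e_given_h * p_h / p_e
print(round(posterior, 4))  # unlikely evidence confirms strongly
```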

C4.5 Decision Tree (J48 in Weka)
The C4.5 algorithm, developed by Quinlan, is perhaps the most popular tree classifier [21]. It is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. The Weka classifier package has its own version of C4.5 known as J48, an optimized implementation of C4.5 revision 8.
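As a rough stand-in for J48 (scikit-learn implements CART rather than C4.5, but with `criterion="entropy"` it likewise splits on information gain), a toy decision tree might look like the following; the features and labels are invented.

```python
# Entropy-based decision tree on toy [age, HIV-status] records.
from sklearn.tree import DecisionTreeClassifier

X = [[25, 0], [30, 0], [60, 1], [55, 1]]  # [age, HIV-positive?]
y = ["PTB", "PTB", "RPTB", "RPTB"]

tree = DecisionTreeClassifier(criterion="entropy")  # information-gain splits
tree.fit(X, y)
print(tree.predict([[28, 0], [58, 1]]))
```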

Bagging
Bagging (bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve classification by combining the classifications of randomly generated training sets. The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining: it combines the predicted classifications from multiple models, or from the same type of model trained on different learning data. It is a technique that generates multiple training sets by sampling with replacement from the available training data and assigns a vote to each classification.
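The resample-and-vote procedure can be sketched with scikit-learn's `BaggingClassifier` (whose default base learner is a decision tree), used here only as a stand-in for Weka's bagging meta-classifier; the data are invented.

```python
# Bagging sketch: 10 trees, each trained on a bootstrap resample (sampling
# with replacement) of the toy training data; labels decided by majority vote.
from sklearn.ensemble import BaggingClassifier

X = [[0], [1], [2], [3], [10], [11], [12], [13]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

bag = BaggingClassifier(n_estimators=10, random_state=0)
bag.fit(X, y)
print(bag.predict([[1], [12]]))  # majority vote over the 10 trees
```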

Adaboost (AdaboostM1)
AdaBoost is an algorithm for constructing a "strong" classifier as a linear combination of "simple" "weak" classifiers. Instead of resampling, each training sample is given a weight that determines its probability of being selected for a training set. The final classification is based on a weighted vote of the weak classifiers. AdaBoost is sensitive to noisy data and outliers; however, in some problems it can be less susceptible to overfitting than most learning algorithms.
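The weighting-and-weighted-vote scheme can be sketched with scikit-learn's `AdaBoostClassifier`, standing in for Weka's AdaBoostM1; the toy data are invented.

```python
# AdaBoost sketch: weak learners (decision stumps by default) are added
# one at a time, each fitted to a reweighted training set that emphasizes
# previously misclassified samples; predictions are a weighted vote.
from sklearn.ensemble import AdaBoostClassifier

X = [[0], [1], [2], [3], [10], [11], [12], [13]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

boost = AdaBoostClassifier(n_estimators=10, random_state=0)
boost.fit(X, y)
print(boost.predict([[1], [12]]))
```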

Random Forest (or Random Forests)
The algorithm for inducing a random forest was developed by Leo Breiman [25]. The term came from "random decision forests", first proposed by Tin Kam Ho of Bell Labs in 1995. It is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. It is a popular algorithm which builds a randomized decision tree in each iteration of the bagging algorithm and often produces excellent predictors.
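A minimal scikit-learn sketch of the idea (bagged, randomized trees whose modal vote is the prediction); the data are invented and this is not the paper's Weka configuration.

```python
# Random forest sketch: an ensemble of randomized decision trees, each
# grown on a bootstrap sample; the predicted class is the trees' mode.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 1], [1, 0], [2, 1], [10, 0], [11, 1], [12, 0]]
y = [0, 0, 0, 1, 1, 1]

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X, y)
print(forest.predict([[1, 1], [11, 0]]))
```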

V. EXPERIMENTAL SETUP

The open source tool Weka was used in different phases of the experiment. Weka is a collection of state-of-the-art machine learning algorithms [26] for a wide range of data mining tasks such as data preprocessing, attribute selection, clustering, and classification. Weka has been used in prior research both in the field of clinical data mining and in bioinformatics.

Weka has four main graphical user interfaces (GUIs), the principal ones being the Explorer and the Experimenter. Our experiment was run under both. In the Explorer we can flip back and forth between the results we have obtained, evaluate the models that have been built on different datasets, and visualize graphically both the models and the datasets themselves, including any classification errors the models make. The Experimenter, on the other hand, allows us to automate the process by making it easy to run classifiers and filters with different parameter settings on a corpus of datasets, collect performance statistics, and perform significance tests. Advanced users can employ the Experimenter to distribute the computing load across multiple machines using Java Remote Method Invocation.

A. Cross-Validation

Cross-validation with 10 folds has been used for evaluating the classifier models. Cross-Validation (CV) is the standard data mining method for evaluating the performance of classification algorithms, mainly to estimate the error rate of a learning technique. In CV a dataset is partitioned into n folds, where each is used for testing while the remainder is used for training. The procedure of testing and training is repeated n times so that each partition or fold is used once for testing. The standard way of predicting the error rate of a learning technique given a single, fixed sample of data is to use stratified 10-fold cross-validation. Stratification means making sure that when sampling is done, each class is properly represented in both training and test datasets; this is achieved by randomly sampling the dataset when making the n-fold partitions. In a stratified 10-fold cross-validation the data is divided randomly into 10 parts, in each of which the class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme trained on the remaining nine-tenths; its error rate is then calculated on the holdout set. The learning procedure is thus executed a total of 10 times on different training sets, and finally the 10 error rates are averaged to yield an overall error estimate. When seeking an accurate error estimate, it is standard procedure to repeat the CV process 10 times, which means invoking the learning algorithm 100 times. Given two models M1 and M2 with different accuracies tested on different instances of a data set,
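The stratified folds described above can be illustrated with scikit-learn's `StratifiedKFold` (a stand-in for Weka's cross-validation); the 70/30 class split below is an invented example.

```python
# Stratified 10-fold CV: each test fold preserves the class proportions
# of the full data set (here 70% class 0, 30% class 1).
from collections import Counter
from sklearn.model_selection import StratifiedKFold

y = [0] * 70 + [1] * 30       # imbalanced two-class labels
X = [[i] for i in range(100)]  # dummy single-feature data

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for _, test_idx in skf.split(X, y):
    fold = Counter(y[i] for i in test_idx)
    assert fold == Counter({0: 7, 1: 3})  # proportions preserved per fold
print("each of the 10 folds holds 7 class-0 and 3 class-1 examples")
```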


Table III. Performance comparison of various classifiers

Category   Model                  TPR/Sensitivity  FPR            Specificity    Prediction
                                  (PTB / RPTB)     (PTB / RPTB)   (PTB / RPTB)   accuracy
Basic      SVM (SMO)              98.9% / 99.6%    0.004 / 0.011  99.6% / 98.9%  99.14%
Basic      K-NN (IBK)             99.1% / 96.9%    0.03 / 0.008   96.9% / 99.1%  98.4%
Basic      Naive Bayes            96.4% / 96.5%    0.035 / 0.037  96.5% / 96.4%  96.4%
Basic      C4.5 (J48)             98.5% / 100%     0 / 0.015      100% / 98.5%   99%
Ensemble   Bagging                98.5% / 99.6%    0.004 / 0.015  99.6% / 98.5%  98.85%
Ensemble   Adaboost (AdaboostM1)  98.5% / 100%     0 / 0.015      100% / 98.5%   99%
Ensemble   Random Forest          98.9% / 99.6%    0.004 / 0.011  99.6% / 98.9%  99.14%

REFERENCES

[1] Orhan Er, Feyzullah Temurtas and A. C. Tanrikulu, "Tuberculosis disease diagnosis using Artificial Neural Networks", Journal of Medical Systems, Springer, DOI 10.1007/s10916-008-9241-x, online, 2008.

[2] M. Sebban, I. Mokrousov, N. Rastogi and C. Sola, "A data-mining approach to spacer oligonucleotide typing of Mycobacterium tuberculosis", Bioinformatics, Oxford University Press, Vol. 18, issue 2, pp. 235-243, 2002.

[3] Rethabile Khutlang, Sriram Krishnan, Ronald Dendere, Andrew Whitelaw, Konstantinos Veropoulos, Genevieve Learmonth, and Tania S. Douglas, "Classification of Mycobacterium tuberculosis in images of ZN-stained sputum smears", IEEE Transactions on Information Technology in Biomedicine, Vol. 14, No. 4, July 2010.

[4] Sejong Yoon and Saejoon Kim, "Mutual information-based SVM-RFE for diagnostic classification of digitized mammograms", Pattern Recognition Letters, Elsevier, Vol. 30, issue 16, pp. 1489-1495, December 2009.


[5] Nicandro Cruz-Ramirez, Hector-Gabriel Acosta-Mesa, Humberto Carrillo-Calvet and Rocio-Erandi Barrientos-Martinez, "Discovering interobserver variability in the cytodiagnosis of breast cancer using decision trees and Bayesian networks", Applied Soft Computing, Elsevier, Vol. 9, issue 4, pp. 1331-1342, September 2009.

[6] Liyang Wei, Yongyi Yang and Robert M. Nishikawa, "Microcalcification classification assisted by content-based image retrieval for breast cancer diagnosis", Pattern Recognition, Elsevier, Vol. 42, issue 6, pp. 1126-1132, June 2009.

[7] Abdelghani Bellaachia and Erhan Guven, "Predicting breast cancer survivability using data mining techniques", Artificial Intelligence in Medicine, Elsevier, Vol. 34, issue 2, pp. 113-127, June 2005.

[8] Maria-Luiza Antonie, Osmar R. Zaiane and Alexandru Coman, "Application of data mining techniques for medical image classification", in Proceedings of the Second International Workshop on Multimedia Data Mining (MDM/KDD'2001) in conjunction with the Seventh ACM SIGKDD, pp. 94-101, 2001.

[9] Celia C. Bojarczuk, Heitor S. Lopes and Alex A. Freitas, "Data mining with constrained-syntax genetic programming: applications in medical data sets", Artificial Intelligence in Medicine, Elsevier, Vol. 30, issue 1, pp. 27-48, 2004.

[10] Kwokleung Chan, Te-Won Lee, Pamela A. Sample, Michael H. Goldbaum, Robert N. Weinreb, and Terrence J. Sejnowski, "Comparison of machine learning and traditional classifiers in glaucoma diagnosis", IEEE Transactions on Biomedical Engineering, Vol. 49, No. 9, September 2002.

[11] Yasser M. Kadah, Aly A. Farag, Jacek M. Zurada, Ahmed M. Badawi, and Abou-Bakr M. Youssef, "Classification algorithms for quantitative tissue characterization of diffuse liver disease from ultrasound images", IEEE Transactions on Medical Imaging, Vol. 15, No. 4, August 1996.

[12] Minou Rabiei and Ritu Gupta, "Excess water production diagnosis in oil fields using ensemble classifiers", in Proc. International Conference on Computational Intelligence and Software Engineering, IEEE, pp. 1-4, 2009.

[13] Hongqi Li, Haifeng Guo, Haimin Guo and Zhaoxu Meng, "Data mining techniques for complex formation evaluation in petroleum exploration and production: a comparison of feature selection and classification methods", in Proc. 2008 IEEE Pacific-Asia Workshop on Computational Intelligence and Industrial Application, Vol. 1, pp. 37-43, 2008.

[14] Seppo Puuronen, Vagan Terziyan and Alexander Logvinovsky, "Mining several data bases with an ensemble of classifiers", in Proc. 10th International Conference on Database and Expert Systems Applications, Vol. 1677, pp. 882-891, 1999.

[15] Tzung-I Tang, Gang Zheng, Yalou Huang and Guangfu Shu, "A comparative study of medical data classification methods based on decision tree and system reconstruction analysis", IEMS, Vol. 4, issue 1, pp. 102-108, June 2005.

[16] G. L. Tsirogiannis, D. Frossyniotis, J. Stoitsis, S. Golemati, A. Stafylopatis and K. S. Nikita, "Classification of medical data with a robust multi-level combination scheme", in Proc. 2004 IEEE International Joint Conference on Neural Networks, Vol. 3, pp. 2483-2487, 25-29 July 2004.

[17] R. E. Abdel-Aal, "Improved classification of medical data using abductive network committees trained on different feature subsets", Computer Methods and Programs in Biomedicine, Vol. 80, issue 2, pp. 141-153, 2005.

[18] Kemal Polat, Sebnem Yosunkaya and Salih Gunes, "Comparison of different classifier algorithms on the automated detection of obstructive sleep apnea syndrome", Journal of Medical Systems, Vol. 32, issue 3, June 2008.

[19] Ranjit Abraham, Jay B. Simha and S. S. Iyengar, "Medical datamining with a new algorithm for feature selection and Naive Bayesian classifier", in Proc. 10th International Conference on Information Technology, IEEE, pp. 44-49, 2007.

[20] HIV Sentinel Surveillance and HIV Estimation, 2006. New Delhi, India: National AIDS Control Organization, Ministry of Health and Family Welfare, Government of India. http://www.nacoonline.org/Quick_Links/HIV_Data/ Accessed 06 February 2008.

[21] J. R. Quinlan, "Induction of decision trees", Machine Learning 1, Kluwer Academic Publishers, Boston, pp. 81-106, 1986.

[22] Thomas M. Cover and Peter E. Hart, "Nearest neighbor pattern classification", IEEE Transactions on Information Theory, Vol. 13, issue 1, pp. 21-27, 1967.

[23] Irina Rish, "An empirical study of the naive Bayes classifier", IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001 (available online).

[24] J. R. Quinlan, "Bagging, boosting, and C4.5", in AAAI/IAAI: Proceedings of the 13th National Conference on Artificial Intelligence and 8th Innovative Applications of Artificial Intelligence Conference, Portland, Oregon, AAAI Press / The MIT Press, Vol. 1, pp. 725-730, 1996.

[25] Leo Breiman, "Random Forests", Machine Learning 45(1): 5-32, DOI 10.1023/A:1010933404324, 2001.

[26] Weka – Data Mining Machine Learning Software, http://www.cs.waikato.ac.nz/ml/.

[27] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2006.

[28] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann Publishers, 2005.

AUTHORS PROFILE

Mrs. Asha T. obtained her Bachelors and Masters in Engineering from Bangalore University, Karnataka, India. She is pursuing her research leading to a Ph.D. at Visvesvaraya Technological University under the guidance of Dr. S. Natarajan and Dr. K. N. B. Murthy. She has over 16 years of teaching experience and is currently working as Assistant Professor in the Dept. of Information Science & Engg., B.I.T., Karnataka, India. Her research interests are in Data Mining, Medical Applications, Pattern Recognition, and Artificial Intelligence.

Dr. S. Natarajan holds a Ph.D. (Remote Sensing) from JNTU Hyderabad, India. His experience spans 33 years in R&D and 10 years in teaching. He worked in the Defence Research and Development Laboratory (DRDL), Hyderabad, India for five years and later worked for twenty-eight years in the National Remote Sensing Agency, Hyderabad, India. He has over 50 publications in peer-reviewed conferences and journals. His areas of interest are Soft Computing, Data Mining and Geographical Information Systems.

Dr. K. N. B. Murthy holds a Bachelors in Engineering from the University of Mysore, a Masters from IISc, Bangalore, and a Ph.D. from IIT, Chennai, India. He has over 30 years of experience in teaching, training, industry, administration, and research. He has authored over 60 papers in national and international journals and conferences, is a peer reviewer for journal and conference papers of national and international repute, and has authored a book. He is a member of several academic committees: Executive Council, Academic Senate, University Publication Committee, BOE & BOS, Local Inquiry Committee of VTU, Governing Body Member of BITES, and Founding Member of the Creativity and Innovation Platform of Karnataka. Currently he is the Principal & Director of P.E.S. Institute of Technology, Bangalore, India. His research interests include Parallel Computing, Computer Networks and Artificial Intelligence.
