
Naïve Bayes Classification of DRDO Tender Documents

Sumit Goswami Directorate of P&C

DRDO New Delhi, India

[email protected]

Prakriti Bhardwaj Dept. of Information Technology

JSSATE NOIDA, India

[email protected]

Sunaina Kapoor Dept. of Computer Science & Engg.

IGDTUW (Formerly IGIT) Delhi, India

[email protected]

Abstract—We propose a technique for the automatic classification of DRDO tender documents into predefined technological categories. The dataset comprised 698 tender files obtained from the DRDO website. The dataset was processed and fed into the Weka toolkit, and experiments were conducted using the Naïve Bayes classifier with 10-fold cross-validation. The documents were classified with 75.21 percent accuracy by technology area and 68.62 percent by lab name when the dataset included lab names somewhere in the document. Accuracies of 74.36 percent and 68.34 percent were attained for the technology area and the lab name, respectively, when all occurrences of lab names were removed from the dataset. This experiment is a step towards training auto-classification of the tender documents available on the Internet for Selective Dissemination of Information to the concerned vendors. It is also an attempt to gauge the area of work, ongoing tasks and the exact work being carried out by an organisation based on the tenders it publishes.

Keywords—Machine Learning; Naïve Bayes; Selective Dissemination of Information (SDI); Tender Document Classification; Text Mining

I. INTRODUCTION

The steady increase in the number of categories for tender documents has created a need for methods that can automatically classify new documents into such categories. In this paper we attempt to automate the categorization process using a machine learning approach, i.e., to learn from historic data and use the learned knowledge to classify a new document [1].

We commence this paper with an overview of DRDO and then introduce a few terminologies relevant to this paper. The Defence Research & Development Organisation (DRDO) works under the Department of Defence Research and Development of the Ministry of Defence in India [2]. A tender is a document that a purchasing agent publishes to announce its request for certain goods or services [3]. The tendering system is similar to a traditional one: a selling agent submits a bid to the purchasing organisation against the tender enquiry, the purchasing organisation evaluates the quoting vendors on technical specifications as well as financial terms and conditions, and then announces the successful bidder [3].

We carried out the classification on DRDO tenders available on its website (http://drdo.gov.in/drdo/tenders/liveTenders.jsp), under eight technology categories and thirty-four labs. The DRDO website (www.drdo.gov.in) divides the labs into eight technology areas, and the same technology areas and clusters of labs have been used in this paper. The tender classification was done using Information Extraction and Data Mining techniques. The dataset was prepared as a term-document matrix [4] of words and tender names using the Vector Space Model [5] concept. The dataset was then fed into the Weka toolkit (version 3.7.4), in which the Naïve Bayes classifier was chosen.
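As an illustration of this vector space representation, the following is a minimal Python sketch of how a binary term-document matrix over tender texts might be built. It is not the tooling used by the authors; the file names, sample texts and the simple tokenizer are hypothetical.

```python
# Sketch: binary (presence/absence) term-document matrix for tender texts.
# File names, sample texts and the whitespace tokenizer are illustrative assumptions.
import re

def tokenize(text):
    """Lower-case the text and split on non-letters; a crude stand-in for a word-list tool."""
    return {w for w in re.split(r"[^a-z]+", text.lower()) if w}

documents = {
    "tender_001.txt": "supply of ruggedized servers and storage for data centre",
    "tender_002.txt": "fabrication of composite radome panels for airborne platform",
}

# Vocabulary = union of words across all documents.
vocabulary = sorted(set().union(*(tokenize(t) for t in documents.values())))

# One binary row per tender: 1 if the word occurs in the document, 0 otherwise.
matrix = {}
for name, text in documents.items():
    present = tokenize(text)
    matrix[name] = [1 if w in present else 0 for w in vocabulary]
```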

II. RELATED WORK

In reference [7] the classification of tender documents was done with the non-dictionary words retained in the dataset. When a tender document is published, it contains plenty of technical terms as well as abbreviations related to an organization and its field of work. Though these may be non-dictionary words, they might exist in the technical thesaurus related to that organization. Besides, a tender document always contains the name of the tendering organization at a few places in the document. In [7] these technical abbreviations, technology-cluster-specific terminologies and the name of the organization were not removed from the dataset. It can reasonably be assumed that, as these words are lab- and technology-specific, they give a direct indication of the name of the lab and the technology area of the tender. So, in this paper an attempt was made to experiment with variations by removing such technology- and lab-specific words.

In [10] the Bayes formula was utilized to vectorize a document according to a probability distribution based on keywords reflecting the probable categories that the document may belong to. Fabrizio Sebastiani [9] termed Machine Learning as “a general inductive process to automatically build the classifier by learning from a set of pre-classified documents, the characteristics of the categories”. References [11, 12] reported Naïve Bayes as an accurate means of categorizing documents. Finally, [11, 13, 14, 15, 16] present a few applications of text categorization, such as the classification of blogs, web documents, and scientific articles from medical repositories, among others [7].


III. DATASET

A. Overview

DRDO has a network of 52 laboratories engaged in technologies covering various fields, namely aeronautics, armaments, electronics and computer sciences, human resource development, life sciences, materials, missiles, combat engineering and naval research and development [6]. Each lab belongs to a specific technology cluster and is involved in its own projects and related technical activities in accordance with its mission, vision and core competence. As the area of work is different for each lab, there is generally no centralized procurement in DRDO; instead, each lab procures items based on their usage in the lab and the requirements of its projects. Hence most of the labs publish their tender documents in newspapers and on the DRDO website. In this paper we built the corpus by downloading the lab-specific tender documents from the DRDO website. The dataset was prepared from a total of 698 tender files published between August 2010 and July 2011. Fig. 1 shows the distribution of these downloaded tenders across technology areas.

Fig. 1. Distribution of lab tenders among fields

B. Pre-processing

As the downloaded documents were in PDF format, the first step was converting them into text documents using a PDF-to-text converter. The files were then passed through a stop-word filter to remove common words such as “a”, “of”, “the”, etc. After this, a word list was prepared for each file using the Word List Expert version 3.2 software, and words which occurred only once in a document were removed. The next step involved merging the word lists of all the documents and generating a list of 8,549 unique words after removing duplicates from this collective list. Further, the non-dictionary words were removed. The experiment was carried out twice: once on the dataset containing the names of the DRDO labs (a total of 6,426 words), and then on the refined dataset after removing the 34 lab names (a total of 6,392 words). Finally, the documents were represented in ARFF (Attribute-Relation File Format) and supplied to the WEKA toolkit.
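The pipeline above can be approximated by the following Python sketch. It is an illustrative stand-in, not the authors' actual toolchain (they used a PDF-to-text converter, Word List Expert 3.2 and Weka 3.7.4); the stop-word list, the pdftotext call, the paths and the labels are assumptions.

```python
# Sketch of the pre-processing pipeline described above (illustrative, not the authors' tools).
import re
import subprocess
from collections import Counter
from pathlib import Path

STOP_WORDS = {"a", "an", "of", "the", "and", "to", "in", "for"}  # assumed, not the paper's list

def pdf_to_words(pdf_path: Path) -> Counter:
    """Convert one tender PDF to text (here via the pdftotext CLI) and count its words."""
    text = subprocess.run(["pdftotext", str(pdf_path), "-"],
                          capture_output=True, text=True, check=True).stdout
    words = [w for w in re.split(r"[^a-z]+", text.lower()) if w and w not in STOP_WORDS]
    counts = Counter(words)
    # Drop words that occur only once in this document, as described in the paper.
    return Counter({w: c for w, c in counts.items() if c > 1})

def write_arff(word_lists: dict, labels: dict, out_path: Path) -> None:
    """Write a binary bag-of-words ARFF file (one {0,1} attribute per word, plus the class)."""
    vocab = sorted(set().union(*word_lists.values()))
    classes = sorted(set(labels.values()))
    with out_path.open("w") as f:
        f.write("@RELATION tenders\n")
        for w in vocab:
            f.write(f"@ATTRIBUTE {w} {{0,1}}\n")
        f.write("@ATTRIBUTE class {" + ",".join(classes) + "}\n@DATA\n")
        for doc, counts in word_lists.items():
            row = ["1" if w in counts else "0" for w in vocab]
            f.write(",".join(row + [labels[doc]]) + "\n")

# Hypothetical usage with assumed paths and labels:
# word_lists = {p.name: pdf_to_words(p) for p in Path("tenders").glob("*.pdf")}
# write_arff(word_lists, labels={...}, out_path=Path("tenders.arff"))
```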

IV. CLASSIFICATION ALGORITHM

For predicting the lab and the technology area a tender document belongs to, we used the Naïve Bayes classifier. The “bag-of-words” approach was used to represent the documents. The ARFF document supplied to the WEKA toolkit was based on the multivariate Bernoulli event model of Naïve Bayes [12], wherein the value of a feature can be 1 (if the feature is present in the document) or 0 (if it is absent). The WEKA toolkit was used to train the Naïve Bayes classifier using the 10-fold cross-validation technique. Under this technique, the dataset is partitioned into 10 sets; 9 of these sets are used for training and the remaining one for testing. The procedure is repeated 10 times and the mean accuracy is calculated [7].
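For readers without Weka, a rough scikit-learn analogue of this setup is sketched below: BernoulliNB implements the multivariate Bernoulli event model and cross_val_score performs the 10-fold evaluation. The random data is only a placeholder for the real 698-document binary matrix, so it will not reproduce the paper's accuracies.

```python
# Rough scikit-learn analogue of the Weka setup described above:
# multivariate Bernoulli Naive Bayes evaluated with 10-fold cross-validation.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(698, 500))   # 698 tenders, 500 binary word features (toy data)
y = rng.integers(0, 8, size=698)          # 8 technology-area labels (toy data)

scores = cross_val_score(BernoulliNB(), X, y, cv=10)
print(f"mean 10-fold accuracy: {scores.mean():.4f}")
```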

“The Bayes formula gives a range of probabilities to which the document can be assigned according to a predetermined set of topics (categories)” [10]. The Naïve Bayes classifier uses the concept of conditional probabilities; it is built on the Bayes formula.
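For completeness, the decision rule of the multivariate Bernoulli model can be written as follows; this is a standard textbook formulation rather than an equation taken from the paper. A document with binary word indicators x_i over vocabulary V is assigned to the category that maximizes the posterior:

```latex
% Standard multivariate Bernoulli Naive Bayes decision rule (not quoted from the paper):
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{|V|} P(w_i \mid c)^{x_i} \, \bigl(1 - P(w_i \mid c)\bigr)^{1 - x_i}
```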

V. RESULTS AND DISCUSSIONS

For the classification based on the technology field name, the Naïve Bayes classifier yielded an accuracy of 75.21% when lab names were included and 74.36% when they were excluded. Table I shows the confusion matrix for technology field classification with lab names in the dataset; Table II shows the corresponding confusion matrix without lab names.

The confusion matrices in Table III and Table IV illustrate the experimental results for the classification based on lab name, including and excluding lab names in the dataset respectively. An accuracy of 68.62% (with lab names) and 68.34% (without lab names) was achieved for categorization based on the lab name.
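The reported 75.21% can be recovered directly from the Table I confusion matrix as the ratio of the diagonal (correctly classified tenders) to the total number of tenders, as the following short check shows:

```python
# Reproducing the reported technology-area accuracy from the Table I confusion matrix:
# accuracy = correctly classified tenders (diagonal sum) / total tenders.
confusion = [
    [42,  21, 13,  4,  0,  0, 0, 0],   # aeronautics
    [ 1, 225, 14,  5,  0,  0, 0, 0],   # electronics
    [ 0,  18, 71,  0,  5,  0, 0, 0],   # armaments
    [ 0,  17, 17, 18,  0,  2, 1, 3],   # combat-engg
    [ 0,  12,  5,  1, 80,  0, 1, 0],   # life-science
    [ 0,  15,  2,  0,  0, 79, 1, 0],   # materials
    [ 0,   8,  0,  0,  0,  0, 2, 0],   # missiles
    [ 0,   4,  0,  1,  0,  2, 0, 8],   # naval
]
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
print(f"accuracy = {correct}/{total} = {correct / total:.2%}")   # 525/698 = 75.21%
```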

VI. CONCLUSION AND FUTURE WORK

This experimental paper reported accuracies of 68.62% (with lab names in the dataset) and 68.34% (without lab names in the dataset) when classifying by lab name. A better accuracy of 75.21% (with lab names in the dataset) and 74.36% (without lab names in the dataset) was attained for the technology field name.

One reason the accuracy is not higher is that certain common items are tendered by many labs and across different technology clusters. Some of these common items are procurement of computers, workstations, storage and servers; AMCs for computers, LAN and peripherals; common software; transport services; procurement of books and journals; Expressions of Interest for vendor registration; and environmental control and hygiene. As these items are common tender items across all the labs and technology clusters, they can be one of the reasons for the slight reduction in accuracy, which can be confirmed only after further experimentation.


TABLE I. CONFUSION MATRIX FOR TECHNOLOGY FIELD CLASSIFICATION (WITH LAB NAMES IN THE DATASET)

TABLE II. CONFUSION MATRIX FOR TECHNOLOGY FIELD CLASSIFICATION (WITHOUT LAB NAMES IN THE DATASET)

In [7] the non-dictionary words were not removed from the dataset, while the same classification approach was used in both [7] and the present experiment. The accuracy by field name saw a slight decrease in the present experiment, while an improved accuracy was obtained for classification by lab name.

This paper is a step towards future experimentation on detecting, through machine learning on published tenders, the project being undertaken by an organisation and its stage of development. Generally, the details of a project being undertaken by a research organisation and its stage of development are considered classified information. At the same time, in a Government-funded research organization, the tenders which can point towards this classified information have to be published publicly, which is contradictory.

REFERENCES

[1] Y. Wang, H. Zang, B. Spencer and Y. Yan, “A Text Categorization Approach for Match-Making in Online Business Tendering”, Journal of Business and Technology, Vol. 1, No. 1, Canada, 2005.

[2] Defence Research & Development Organisation, http://www.drdo.nic.in.

[3] Y. Wang, H. Zang, B. Spencer and Y. Yan, “Text Categorization for an Online Tendering System”, Proceedings of BASeWEB'04, 2004.

[4] L. Elden, Matrix Methods in Data Mining and Pattern Recognition (Fundamentals of Algorithms), First Edition, Linkoping University, Linkoping, Sweden, SIAM, 2007.

[5] G. Salton and M. E. Lesk, “Computer evaluation of indexing and text processing”, Journal of the ACM, January 1968, pp. 8-36.

[6] Defence Research & Development Organisation, http://www.en.wikipedia.org/wiki/Defence_Research_and_Development_Organisation.

[7] S. Goswami, S. Kapoor and P. Bhardwaj, “Machine Learning for Automated Tender Classification”, IEEE India Conference, INDICON, Hyderabad, India, 2011.

[8] S. Goswami, S. Sarkar and M. Rustagi, “Stylometric Analysis of Blogger's Age and Gender”, Proceedings of the Third International ICWSM, The AAAI Press, 2009, pp. 214–217.

[9] F. Sebastiani, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, Vol. 34, pp. 1-47, 2002.

[10] D. Isa, L. H. Lee, V. P. Kallimani and R. RajKumar, “Text document pre-processing with the bayes formula for classification using the support vector machine”, IEEE Transactions on Knowledge and Data Engineering, Vol.20, No.9, pp. 1264-1272, 2008.

[11] Y. Li and A. K. Jain, “Classification of Text Documents”, The Computer Journal, Vol. 41, No. 8, 1998.

[12] Y. Wang, J. Hodges and B. Tang, “Classification of web documents using a naïve bayes method”, 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2003.

[13] R. Bai, X. Wang and J. Liao, “Folksonomy for the blogosphere: blog identification and classification”, Shandong Univ. of Technology Library Zibo, CSIE, IEEE, China, 2009, pp. 631-635.

[14] R. Rak, L. Kurgan and M. Reformat, “Multi-label associative classification of medical documents from MEDLINE”, Proceedings of the Fourth International Conference on Machine Learning and Applications (ICMLA), 2005.

Table I (confusion matrix data, technology field classification with lab names in the dataset):

     a    b    c    d    e    f    g    h   <-- classified as
    42   21   13    4    0    0    0    0   a = aeronautics
     1  225   14    5    0    0    0    0   b = electronics
     0   18   71    0    5    0    0    0   c = armaments
     0   17   17   18    0    2    1    3   d = combat-engg
     0   12    5    1   80    0    1    0   e = life-science
     0   15    2    0    0   79    1    0   f = materials
     0    8    0    0    0    0    2    0   g = missiles
     0    4    0    1    0    2    0    8   h = naval

Table II (confusion matrix data, technology field classification without lab names in the dataset):

     a    b    c    d    e    f    g    h   <-- classified as
    42   21   13    4    0    0    0    0   a = aeronautics
     1  225   14    5    0    0    0    0   b = electronics
     0   19   70    0    5    0    0    0   c = armaments
     0   17   17   18    0    2    1    3   d = combat-engg
     0   13    5    1   79    0    1    0   e = life-science
     0   19    2    0    0   75    1    0   f = materials
     0    8    0    0    0    0    2    0   g = missiles
     0    4    0    1    0    2    0    8   h = naval


[15] M. Rustagi, R. Rajendra Prasath, S. Goswami and S. Sarkar, “Learning Age and Gender of Blogger from Stylistic Variation”, PReMI '09 Proceedings of the 3rd International Conference on Pattern Recognition and Machine Intelligence, 2009, pp. 205-212.

[16] M. I. Devi, R. Rajaram and K. Selvakuberan, “Machine learning techniques for automated web page classification using URL features”, International Conference on Computational Intelligence and Multimedia Applications, 2007, pp. 116-118.

TABLE III. CONFUSION MATRIX FOR LAB CLASSIFICATION (WITH LAB NAMES IN THE DATASET)


TABLE IV. CONFUSION MATRIX FOR LAB CLASSIFICATION (WITHOUT LAB NAMES IN THE DATASET)
