
Page 1: A report of the work done in this project is available here

Transductive Support Vector Classification for RNA Related Biological Abstracts

by

Blake Adams

Submitted to the Department of Computer Science

as a requirement for graduation

from the Graduate School for a

Master of Science Degree

at

University of West Georgia

CS 6900 – Project

December 8, 2005

Approved:

______________________
Faculty Advisor


Table of Contents

1. Introduction....................................................................................3

2. Text Classification Background.......................................................3

3. Support Vector Machine Background.............................................3

4. Transductive Learning Background................................................4

5. Our Contribution..............................................................................4

6. Implementation...............................................................................5

7. Results............................................................................................6

8. Future Work....................................................................................7

9. Conclusions.....................................................................................8

10. Appendix......................................................................9

Equation A. Term Frequency/Inverse Document Frequency.........................9
Table A. Categorical Breakdown of Corpus......................................9
Table B. Results..............................................................9
Figure A. SVM-Light File Format..............................................10
Figure B. Tokenized Word Path................................................10
Chart A. Results.............................................................11
Source Code..................................................................12

11. Bibliography..............................................................19


1. Introduction

With the dawning of the Information Age, the world has reached a point at which the movement of information is much faster than physical movement. The internet has played a tremendous part in creating a global environment wherein information is always close at hand. With this proliferation of information, we have reached a point at which the classification of text-documents has become unmanageable. In short, it is not possible to manually classify all of the new information that becomes available on a daily basis. Developing methods to successfully classify text via machine learning has become critical to managing information. In this project we examine a transductive learning system [1] and develop a method to convert standard biological abstracts into training and testing files that can be used for transductive learning via Support Vector Machines.

2. Text Classification Background

Text classification is a classic problem in the realm of information retrieval. Applied correctly, it eliminates the time-consuming task of reviewing and sorting documents into established categories by hand. The formal approach to a programmatic implementation of text classification consists of determining whether or not a document from a corpus belongs in a predefined category, based on what the system has learned from a previously reviewed set of training data consisting of positive and negative examples [2]. A number of learning techniques have been applied to this problem. Some of the more successful applications have involved Naïve Bayes [3], Support Vector Machines [4], and Nearest Neighbor [5]. A comparison of the prevailing techniques is available in An Evaluation of Statistical Approaches to Text Categorization [5].

3. Support Vector Machine Background

The Support Vector Machine (SVM) is a learning technique based on Structural Risk Minimization [6]. The goal of Structural Risk Minimization is to reduce the risk of misclassification by finding the hypothesis with the lowest possible chance of true error. The term true error refers to the odds that the system will misclassify an example based on what it has learned from prior training examples. The main idea is to automatically learn a separating hyper-plane from a set of training examples, which splits classified entities into two subsets according to a certain classification property. This concept is referred to as linear separability, and can be summarized as follows:


Given two sets of points with integer coordinates P and Q, recognize whether there exists a hyper-plane H that separates the sets P and Q [7].
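This separability test can be made concrete for a fixed candidate hyper-plane: a point lies on one side or the other according to the sign of w·x + b. The following sketch uses made-up weights and points purely for illustration:

```java
public class SeparabilityDemo {
    // Returns +1 or -1 according to which side of the hyper-plane
    // w1*x1 + w2*x2 + b = 0 the point (x1, x2) falls on.
    static int side(double w1, double w2, double b, double x1, double x2) {
        return (w1 * x1 + w2 * x2 + b) >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        // The hyper-plane x1 + x2 - 3 = 0 separates (1,1) from (3,3).
        System.out.println(side(1, 1, -3, 1, 1)); // prints -1
        System.out.println(side(1, 1, -3, 3, 3)); // prints 1
    }
}
```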

Support vector machines have been shown to outperform alternative learning methods, especially with text classification [5]. This high level of performance can be attributed to several factors; key amongst them are SVM’s ability to handle a high number of features, as well as the fact that most text categorization problems are linearly separable [4].

4. Transductive Learning Background

Transductive learning is a concept that was introduced by Vladimir Vapnik [8]. Traditionally, Support Vector Machines learn on the inductive learning principle. This premise follows the traditional learning method of passing the system several examples during a training phase, allowing the system to establish a model that can be used to make predictions about unseen examples [4]. One problem with this method is that, because no inference can be made about unseen examples, many examples must be provided to reduce the likelihood of error. Transductive learning differs from the inductive approach by passing the system not only training examples during the learning phase, but testing examples as well. By learning what it can from the training examples, the system can place the testing examples and choose the hyper-plane of separation that minimizes error based on the system's predictions for the unmarked testing examples [1].

The key benefit of this method is that far fewer training examples have to be passed to the learner for it to return effective results. Since the system already knows where the remaining testing examples fall, it simply needs to classify the examples in a fashion that will minimize true error for the established hypothesis.
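In SVM-Light's file format this distinction appears directly in the training file: labeled examples carry a +1 or -1 label, while the unmarked testing examples carry the label 0, which tells the transductive learner to predict them. A minimal sketch of building such lines (the feature ids and scores here are invented for illustration):

```java
import java.util.Locale;

public class TransductiveLine {
    // Builds one SVM-Light example line; label 0 marks an unlabeled
    // example to be transduced, +1/-1 mark labeled training examples.
    static String line(int label, int[] ids, double[] scores) {
        StringBuilder sb = new StringBuilder(label > 0 ? "+1" : (label < 0 ? "-1" : "0"));
        for (int i = 0; i < ids.length; i++) {
            sb.append(String.format(Locale.US, " %d:%.4f", ids[i], scores[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(line(1, new int[]{1, 9}, new double[]{2.8473, 5.423})); // labeled positive
        System.out.println(line(0, new int[]{2}, new double[]{3.8324}));           // unlabeled
    }
}
```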

5. Our Contribution

The goal of this work was to implement transductive learning via the support vector machine technique on RNA-related biological abstracts. The motivation behind the project was to explore how keyword searches within large online databases can be improved to return better results. A successful implementation should allow articles returned by a keyword search to be categorized based on information provided in each document's abstract. By focusing on abstracts rather than entire article texts, the researchers hope to produce successful results while processing far less text than full-text documents would require.

The corpus of abstracts was collected from the Entrez Pubmed database. Entrez Pubmed is a service of the National Library of Medicine that includes over 15 million citations from MEDLINE and other life science journals for biomedical articles dating back to the 1950s. The database includes links to abstracts, as well as to some full-text articles, and new citations are added daily. A keyword search on the expressions "RNA" and "Ribosomal Nucleic Acid" within this database returned 452,864 articles. Of those articles, an examination of the first 50 abstracts revealed that only 19 were specific to RNA-related research (38 percent). Passing the results of such a keyword search through a transductive support vector machine should remove articles that are returned by the keyword search but do not relate to the keyword subject.

In this work, we have focused on abstracts that relate specifically to RNA, messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), and small nuclear RNA (snRNA). The first test is to sort RNA-related from non-RNA-related abstracts that contain the term RNA. The second is to sort abstracts that include the expressions messenger RNA, ribosomal RNA, transfer RNA, or small nuclear RNA. It is our belief that such a search will return improved results from a keyword-search standpoint, but will be more difficult to sort from a machine learning standpoint. Finally, positive training examples of mRNA, tRNA, rRNA, and snRNA abstracts will be combined, and the system will be asked to classify based on a specific RNA type (such as tRNA). It is believed that this will be the most challenging test, since all four subjects share many common terms.

6. Implementation

The subject of SVM has been widely covered in the field of computer science and has been implemented programmatically several times. One of the top packages developed to implement SVM is SVM-Light by Thorsten Joachims. This package implements not only inductive SVM learning but transductive learning as well, making it the logical choice for the project. The responsibility of the researchers was to develop a system that could efficiently convert abstracts from Pubmed into feature vectors that could be easily read by SVM-Light.

The corpus of abstracts collected from Pubmed is broken down in the appendix (Table A). An equal balance of positive and negative examples was collected for each category, and we worked with 80 abstracts for each classification project. All abstracts were manually classified by the researchers prior to project implementation.

In order to use SVM-Light for the project, the researchers had to choose a feature selection process and a scoring method. Much work has been done on how proper feature selection can impact the success of text categorization. Past work specific to texts in the biological field has established that extensive preprocessing of documents to eliminate stop words and so forth has an insignificant impact on the outcome of categorization; thus heavy pre-processing is not necessary. It has also been established that selecting terms for the construction of feature vectors, rather than using the bag-of-words approach, shows no significant decrease in system performance and adds the benefit of reduced processing time [9]. Thus the decision was made to use a term-based approach for feature selection. Terms were selected based on their relevance to the type of RNA research being queried, and each term set contained at least 100 terms. Pre-processing was limited to the removal of special characters such as quotation marks, dashes, commas, and periods. This processing eliminated the chance of such characters causing a term to be misidentified. Features were scored using the classic term frequency/inverse document frequency (TFIDF) technique. See the appendix (Equation A) for details.
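The special-character removal step might be sketched as below; the exact character set is an assumption for illustration (the project's actual tokenization appears in the Source Code appendix):

```java
public class Preprocess {
    // Strips quotation marks, dashes, commas, periods, and similar
    // punctuation so that e.g. "RNA," and "RNA." map to the same term.
    static String clean(String token) {
        return token.replaceAll("[\"'\\-,.;:()]", "").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(clean("\"mRNA,\"")); // prints mrna
        System.out.println(clean("RNA."));      // prints rna
    }
}
```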

The program used to convert the abstracts into feature vectors was developed in Java. The program reads in a set of abstracts and a dictionary of terms from file. As it scans each abstract, it compares each word to the dictionary in search of terms that can be converted into feature vectors. Every time a term is found in an abstract, it is given an id that it shares with every other occurrence of that term in the corpus of documents. Once the term has been converted into an id, the program determines whether it needs to increment the term's term frequency score, document frequency score, or both. Once the entire body of abstracts has been converted into feature vectors, their individual scores are calculated using the TFIDF scoring system. Finally, the feature vectors are printed out ordered by document and then by feature id, to satisfy the formatting requirements of SVM-Light. The SVM-Light file format is detailed in the appendix (Figure A). The potential paths for any tokenized word in a given abstract are diagrammed in the appendix (Figure B).

Once the abstracts were converted to feature vectors, training and testing files could be created. The training files consisted of the full set of 80 abstracts converted into feature vectors. For each experiment, the system was passed 5 positive training examples, 5 negative training examples, 35 positive unmarked examples, and 35 negative unmarked examples in training mode. These files were used by SVM-Light to generate a model that minimized true error. To generate the test file used to check the system's accuracy, the 70 unmarked examples were marked as positive or negative and passed through the system again in classify mode.
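The label assignment just described appears in the TermMap source (Source Code appendix); in isolation it can be sketched as:

```java
public class LabelScheme {
    // Training-mode labels: documents 1-5 are the positive examples,
    // documents 41-45 the negative examples, and everything else is
    // left unmarked (label 0) for the transductive learner to predict.
    static String trainLabel(int docId) {
        if (docId <= 5) return "1";
        if (docId >= 41 && docId <= 45) return "-1";
        return "0";
    }

    public static void main(String[] args) {
        System.out.println(trainLabel(3));  // prints 1
        System.out.println(trainLabel(42)); // prints -1
        System.out.println(trainLabel(20)); // prints 0
    }
}
```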

7. Results

The researchers expected to see good results from the system, with standard RNA classification being the highest performer and classification from RNA-specific positive examples being the poorest. The outcomes of the testing and training met or exceeded the researchers' expectations in every experiment, achieving an accuracy of better than 80 percent in every category.

When classifying RNA-related abstracts from non-RNA-related abstracts, the system was able to classify 75 of 80 abstracts successfully. This represents an accuracy of 93.75 percent. While these results were positive, the researchers had expected slightly better accuracy. Review of the missed documents revealed three documents that had been misclassified by hand. Changing these classifications and re-running the system returned an accuracy of 97.5 percent (78 of 80 correct), more in line with the researchers' expected outcomes. This figure far outpaces the 38 percent accuracy achieved by keyword search alone.

Classification in the next four experiments sought to extract specific types of RNA research (mRNA, tRNA, rRNA, snRNA). A preliminary run of the system in each category yielded accuracy returns of 85 percent (mRNA), 87.5 percent (tRNA), 85 percent (rRNA), and 88.75 percent (snRNA). These outcomes exceeded the researchers' expectations and called for further examination of the missed examples. In each case, the system had caught abstracts that had been misclassified by hand. Correction of these misclassifications led to accuracy results of 88.75 percent (mRNA), 92.5 percent (tRNA), 95 percent (rRNA), and 95 percent (snRNA). Measuring these against the accuracy of keyword search alone (mRNA: 42 percent, tRNA: 53 percent, rRNA: 40 percent, snRNA: 42.5 percent) again shows a tremendous improvement.

The final set of experiments focused on gleaning specific types of RNA research from corpora consisting entirely of positive RNA abstracts drawn from the four major RNA types addressed in this research. Here the researchers were hoping for as much as 70 percent accuracy, and again expectations were exceeded in every case. The system identified mRNA examples from this set with 78.75 percent accuracy, tRNA with 81.25 percent, rRNA with 75.75 percent, and snRNA with 82.5 percent. Misclassified articles were again examined, revealing further hand-classification errors. Correcting these errors returned accuracies of 83.75 percent for mRNA, 91.25 percent for tRNA, 82.75 percent for rRNA, and 87.5 percent for snRNA.

A complete table and chart of the results from this project are available in the appendix (Table B, Chart A). The Java Code used in implementation of this project is also available in the appendix (Source Code).

8. Future Work

With the encouraging results of this project, the researchers feel that additional work lies ahead, both in terms of improvement of accuracy and in development of the system.

The first key area that can be addressed is term development. The term dictionary developed and used for this project yielded good results, but those results could likely be improved upon, because the dictionary was developed by individuals with minimal experience and exposure to RNA-related work. To improve the accuracy of the system, the researchers must collaborate with researchers in the biological field who can aid in the development of term dictionaries that more accurately reflect the key-term content of RNA-related abstracts. Additionally, such collaboration would lend itself to more robust term dictionaries and could possibly double or triple the size of the dictionaries used in this project.

The second key area to be addressed is overall system development. The results of this project were successful enough that the system is worth incorporating into an online site supporting a two-step process of keyword querying followed by SVM classification. Under such a system, a user would conduct a keyword query against an online database such as Pubmed, then read the first 10 or so abstracts, marking positive and negative examples. Once a satisfactory number of examples are collected, the system could use its transductive learner to identify the articles that best fit the user's needs. Additionally, allowing the user to adjust the term dictionary by adding and removing terms specific to his or her needs would potentially increase classification accuracy.

9. Conclusions

The researchers have found that transductive SVM classification is an effective tool for the classification of biological abstracts. We were able to yield highly effective results with minimal pre-processing and the incorporation of term-based feature selection.

Transductive SVM performs best when classifying in general terms, such as classifying abstracts about RNA research from those that are not. Transductive SVM is also highly effective in the classification of more specific terms, such as “messenger RNA,” rather than just “RNA.” Results demonstrate that as keyword searches become more specific, the ability of the system to classify correctly erodes, but at a slower rate than the researchers anticipated.

Further work in this direction should focus on the quality of the terms used as features, and incorporation of this work into a fully functional online tool should prove a worthwhile endeavor.


10. Appendix

Equation A. Term Frequency/Inverse Document Frequency:

TFIDF = TF * log(N/DF)

where:
TF = total number of times the term occurs in a single document
N  = total number of documents in the corpus
DF = total number of documents in the corpus that contain the term
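A worked illustration of Equation A, mirroring the Term.tfidf method in the Source Code appendix (the counts below are hypothetical, not drawn from the corpus):

```java
public class TfidfExample {
    // TFIDF = TF * log(N / DF), as in Equation A.
    static double tfidf(int tf, int df, int n) {
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in one abstract and appearing in
        // 20 of the 80 abstracts of a corpus: 3 * ln(80/20) ≈ 4.1589
        System.out.println(tfidf(3, 20, 80));
    }
}
```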

Table A. Categorical Breakdown of Corpus

Subject   Amount   Training/Testing Type
RNA       40       Positive
RNA       40       Negative
mRNA      40       Positive
mRNA      40       Negative
tRNA      40       Positive
tRNA      40       Negative
rRNA      40       Positive
rRNA      40       Negative
snRNA     40       Positive
snRNA     40       Negative
Total     400      200 Positive / 200 Negative

Table B. Results

Category   Keyword Search   1st Run   2nd Run
RNA        38%              93.75%    97.5%
mRNA       42%              85%       88.75%
tRNA       53%              87.5%     92.5%
rRNA       40%              85%       95%
snRNA      42.5%            88.75%    95%
mRNA2      N/A              78.75%    83.75%
tRNA2      N/A              81.25%    91.25%
rRNA2      N/A              75.75%    82.75%
snRNA2     N/A              82.5%     87.5%


Figure A. SVM-Light File Format

<expected outcome> <feature>:<score> <feature>:<score> <feature>:<score>…..

Example:
+1 1:2.8473 2:3.8324 9:5.423 19:1.003
-1 1:8.1574 2:1.1001 14:5.225 20:2.202

Feature ids must be listed from lowest to highest.

Figure B. Tokenized Word Path

[Flowchart: the decision path for each tokenized word.
- Is the word in the keyword list? If no, discard it.
- If yes, is it already in the Term Map for the current document? If yes, increment its termFreq.
- If the word is in the TermDocFreqMap but was last seen in an earlier document, record the new lastDocId and increment its docFreq; otherwise do nothing.
- If the word is not yet in the Term Map, assign an id and docId and set its termFreq to 1; if it is also absent from the TermDocFreqMap, assign a featureId, set its docFreq to 1, and record the lastDocId.]


Chart A. Results


Source Code

TERM MAP

import java.io.*;
import java.util.*;

public class TermMap {

    static class CountCompare implements Comparator {
        public int compare(Object obj1, Object obj2) {
            Term data1 = (Term) obj1;
            Term data2 = (Term) obj2;
            return data1.getDocId() - data2.getDocId();
        }
    }

    static class CountCompare2 implements Comparator {
        public int compare(Object obj1, Object obj2) {
            Term data1 = (Term) obj1;
            Term data2 = (Term) obj2;
            return data1.getId() - data2.getId();
        }
    }

    public static void main(String args[]) {
        // command line arguments: mode flag, file to be converted, term dictionary
        if (args.length == 3) {
            try {
                String check;
                FileOutputStream out;
                FileInputStream fstream = new FileInputStream(args[1]);  // abstracts file
                FileInputStream fstream2 = new FileInputStream(args[2]); // term dictionary

                if ((args[0]).equals("-C") || (args[0]).equals("-c")) {
                    check = "TRUE";
                    out = new FileOutputStream("check.txt"); // initialize a file output object
                } else {
                    check = "FALSE";
                    out = new FileOutputStream("train.txt"); // initialize a file output object
                }

                PrintStream p = new PrintStream(out); // initialize a print stream object
                BufferedReader in = new BufferedReader(new InputStreamReader(fstream));
                BufferedReader in2 = new BufferedReader(new InputStreamReader(fstream2));

                String line = in.readLine();
                String line2 = in2.readLine();
                // Map holds every term in every document
                TreeMap<String, Term> termMap = new TreeMap<String, Term>();
                // Map holds document frequency and feature id of every term
                TreeMap<String, TermDF> termDFMap = new TreeMap<String, TermDF>();

                int idCount = 1;   // incremented every time a new term is added to termMap
                int featureId = 1; // incremented every time a new term is added to termDFMap
                int docCount = 0;  // incremented every time a new document is encountered

                List keywords = new ArrayList(35);
                while (line2 != null) { // build list of keywords from the user's term dictionary
                    StringTokenizer st2 = new StringTokenizer(line2);
                    while (st2.hasMoreTokens()) {
                        String keyword = st2.nextToken();
                        keywords.add(keyword);
                    }
                    line2 = in2.readLine();
                }

                while (line != null) { // continue while there are still lines to read
                    StringTokenizer st = new StringTokenizer(line);
                    while (st.hasMoreTokens()) {
                        String word = (st.nextToken()).toLowerCase();
                        if (word.equals("pmid")) {
                            docCount++;
                        }
                        String wordDocId = word + "-" + String.valueOf(docCount); // build termMap key
                        if (termMap.containsKey(wordDocId) && !(word.equals("pmid"))) { // word is in term map
                            Term incTF = termMap.get(wordDocId);
                            incTF.incrementTermFreq();
                        }
                        if (termDFMap.containsKey(word) && !(word.equals("pmid"))) { // word is in termDFMap
                            TermDF incDF = termDFMap.get(word);
                            if (incDF.getDocNum() != docCount) {
                                incDF.setDocNum(docCount);
                                incDF.incrementDocFreq();
                            }
                        }
                        if (!(termMap.containsKey(wordDocId)) && !(word.equals("pmid"))
                                && keywords.contains(word)) { // word is not in termMap
                            Term newTerm = new Term(word, idCount, docCount, 1, 1);
                            termMap.put(wordDocId, newTerm);
                            idCount++;
                            if (!(termDFMap.containsKey(word))) { // word is also not in termDFMap
                                TermDF newTermDF = new TermDF(word, featureId, 1, docCount);
                                termDFMap.put(word, newTermDF);
                                featureId++;
                            }
                        }
                    }
                    line = in.readLine();
                }
                in.close();

                Iterator itr = termMap.keySet().iterator();
                while (itr.hasNext()) { // assign document frequency and feature id from termDFMap
                    Term setDFId = termMap.get(itr.next());
                    TermDF getDFId = termDFMap.get(setDFId.getWord());
                    int dFVal = getDFId.getDocF();
                    int iDVal = getDFId.getId();
                    setDFId.setDocF(dFVal);
                    setDFId.setId(iDVal);
                }

                List<Term> list = new ArrayList<Term>(termMap.values());
                Collections.sort(list, new CountCompare2());
                Collections.sort(list, new CountCompare());
                int currentDocId = 1;
                int lastDocId = 1;

                p.print("1 ");
                if (check.equals("FALSE")) {
                    for (int i = 0; i < list.size(); i++) {
                        Term buildData = list.get(i);
                        double score = Term.tfidf(buildData.getTermF(), buildData.getDocF(), docCount);
                        currentDocId = buildData.getDocId();
                        if (currentDocId > lastDocId) {
                            p.println("");
                            if (currentDocId <= 5) {
                                p.print("1 ");
                            } else if (currentDocId >= 41 && currentDocId <= 45) {
                                p.print("-1 ");
                            } else {
                                p.print("0 ");
                            }
                        }
                        p.print(buildData.getId() + ":" + score + " ");
                        lastDocId = buildData.getDocId();
                    }
                } else {
                    for (int i = 0; i < list.size(); i++) {
                        Term buildData = list.get(i);
                        double score = Term.tfidf(buildData.getTermF(), buildData.getDocF(), docCount);
                        currentDocId = buildData.getDocId();
                        if (currentDocId > lastDocId) {
                            p.println("");
                            if (currentDocId <= 40) {
                                p.print("1 ");
                            } else {
                                p.print("-1 ");
                            }
                        }
                        p.print(buildData.getId() + ":" + score + " ");
                        lastDocId = buildData.getDocId();
                    }
                }
                p.close();
            } catch (Exception e) {
                System.err.println("File input error");
            }
        } else
            System.out.println("Invalid parameters");
    }
}

TERM

public class Term {

    private String word;
    private int id;
    private int docId;
    private int termFreq;
    private int docFreq;

    // Constructor
    public Term(String w, int i, int j, int t, int d) {
        word = w;
        id = i;
        docId = j;
        termFreq = t;
        docFreq = d;
    }

    // Accessors
    public String getWord() { return word; }
    public int getId() { return id; }
    public int getDocId() { return docId; }
    public int getTermF() { return termFreq; }
    public int getDocF() { return docFreq; }

    public void incrementTermFreq() { termFreq++; }
    public void setDocF(int a) { docFreq = a; }
    public void setId(int b) { id = b; }

    public static double tfidf(int w, int x, int y) {
        int TF = w;
        double DF = (double) x;
        int N = y;
        return TF * Math.log(N / DF);
    }
}

TERM DF

public class TermDF {

    private String word;
    private int id;
    private int docFreq;
    private int docNum;

    // Constructor
    public TermDF(String w, int i, int d, int n) {
        word = w;
        id = i;
        docFreq = d;
        docNum = n;
    }

    // Accessors
    public String getWord() { return word; }
    public int getId() { return id; }
    public int getDocF() { return docFreq; }
    public int getDocNum() { return docNum; }

    public void incrementDocFreq() { docFreq++; }
    public void setDocNum(int z) { docNum = z; }
}


11. Bibliography

1. Joachims, Thorsten (1999) Transductive inference for text classification using support vector machines, Proceedings of the Sixteenth International Conference on Machine Learning, 200-209.

2. Sebastiani, F. (2002) Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.

3. Joachims, Thorsten (1997) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Proceedings of ICML-97, 14th International Conference on Machine Learning. 143-151.

4. Joachims, Thorsten (1998) Text categorization with support vector machines: learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning. 1398, 137-142.

5. Yang, Y. (1999) An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1/2), 66-99.

6. Vapnik, Vladimir. (1995) The Nature of Statistical Learning Theory. Springer, New York.

7. Megiddo, N. (1988) On the complexity of polyhedral separability. Discrete and Computational Geometry, 3, 325-337.

8. Vapnik, Vladimir. (1998) Statistical Learning Theory. Wiley.

9. Nenadic, Goran. (2003) Selecting Text Features for Gene Name Classification: from Documents to Terms. Proceedings of the ACL 2003 Workshop on NLP in Biomedicine, ACL, 121-128.