[IEEE 2009 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Nashville, TN, USA, March 30 - April 2, 2009]

Comparison of Feature Selection Techniques for Viral DNA Replication Origin Prediction

Raul Cruz-Cano, Member, IEEE, and Ming-Ying Leung

978-1-4244-2756-7/09/$25.00 ©2009 IEEE



Abstract— As the replication of their DNA genomes is a central step in the reproduction of many viruses, procedures to find replication origins, the initiation sites of the DNA replication process, are of great importance for controlling the growth and spread of such viruses. Existing computational methods for viral replication origin prediction have mostly been designed to use only the composition of a region of viral DNA to predict whether that region is a replication origin (ORI). This paper applies several feature selection techniques to help find the most significant features of the replication origins in herpesviruses. The results suggest that features based on the relative positions of the origin-containing regions in the genomes, together with information about the subfamily of the virus, can be highly useful additions to computational tools for viral replication origin prediction.

I. INTRODUCTION

Replication of their genomes is the central step of reproduction in many DNA viruses. Understanding the viral replication mechanism is, therefore, of great importance in developing strategies to control the growth and spread of viruses ([14], [20], [39]) for various reasons related to health and the economy. Since replication origins, which are initiation sites of the DNA replication process, are regarded as major sites for regulating genome replication, computational methods for predicting the likely locations of replication origins in viral genomes have been developed with the aim of reducing the amount of time and resources spent on labor-intensive laboratory procedures to search for replication origins (e.g., [16], [29], [43]).

Early studies on the genomic DNA sequences of herpesviruses have suggested that replication origins often lie around regions with an unusually high concentration of palindromes ([27], [32], [40]), where a palindrome is a stretch of DNA bases followed immediately by its reverse complement. Based on these observations, Leung et al. [26] suggested a computational method using scan statistics to locate significant clusters of palindromes and predict the likely locations of replication origins. Chew et al. [10] further developed palindrome-based scoring schemes for predicting replication origins in complete herpesvirus genomes. Their approach is to slide a window of size about 0.5% of the genome length over the sequence. As the window moves along, a score reflecting the concentration of palindromes in the window is calculated. The top-scoring windows are then selected as predicted likely locations of replication origins.

R. Cruz-Cano is with Texas A&M University-Texarkana, Texarkana, TX 75503 USA (phone: 903-334-6656; fax: 903-334-6656; e-mail: raul.cruz-cano@tamut.edu).
M.Y. Leung is with The University of Texas at El Paso, El Paso, TX 79968 USA (phone: 915-747-6836; fax: 915-747-6502; e-mail: mleung@utep.edu).

In [12] we proposed a machine learning approach, namely artificial neural networks (ANN), for replication origin prediction. Noting that machine learning approaches have the advantage of allowing multiple sequence features and relevant knowledge about other members of the viral family to be incorporated into the prediction scheme, we further explored a replication origin prediction scheme based on support vector machines (SVM) in [13]. The results indicate that SVMs with adequate prediction accuracy can be constructed.

Moreover, the SVM method has certain additional advantages over ANN. For example, SVM also allows recursive feature elimination to be conducted to suggest which sequence features are more important for identifying viral replication origins.

In this study, we use several feature selection approaches to determine which input variables contain information about the known replication origin locations and other characteristics of the genome sequence. We briefly describe the different feature selection techniques in the next section. Their application to the problem of replication origins in herpesviruses and the prediction accuracy of the related methods are presented in Section III. A few concluding remarks are given in Section IV.

II. FEATURE SELECTION TECHNIQUES

It is not always easy to determine which input variables should be included in the training and test data sets to obtain a suitable classifier. The issue is important because, by including only useful information, a smaller classification system can be used to solve the problem. Simpler systems require fewer computational resources and usually lead to better performance in classifying the examples in the data set. The process of reducing the number of input variables is known as feature selection.

The problem of feature selection has been studied extensively in machine learning. A good review can be found in [34], which also lists the Internet addresses of feature selection programs. Recently, feature selection methods have been widely studied for gene selection in microarray data. These methods can be divided into two broad classes: filter methods and wrapper methods.

For all the methods mentioned in this paper, the available data are transformed into a numerical representation and then expressed as a matrix X and a vector Y. The rows of the matrix X store the input vectors X_i. The element y_i of the vector Y represents the desired output for the input vector X_i. X is an N by n matrix, where n is the dimension of the input vectors and N is the number of input-output instances in the data set. Naturally, the length of the vector Y is N.

A. Filter Methods

These methods select features based on discriminating criteria that are relatively independent of the classifier. Several methods use simple correlation coefficients. These earlier filter-based methods consider features in isolation, assuming independence, and do not account for correlations between

TABLE I
HERPESVIRUSES USED IN THIS STUDY AND THEIR KNOWN REPLICATION ORIGINS

| Virus | Abbrev. | Accession | Known Replication Origins | Genome Length | Window Length | Subfamily |
| Bovine herpesvirus 1 | BoHV1 | NC_001847 | 111080-111300 (oriS); 126918-127138 (oriS) | 135301 | 300 | α |
| Bovine herpesvirus 4 | BoHV4 | NC_002665 | 97143-98850 (oriLyt) | 108873 | 250 | γ |
| Bovine herpesvirus 5 | BoHV5 | NC_005261 | 113206-113418 (oriLyt); 129595-129807 (oriLyt) | 138390 | 300 | α |
| Cercopithecine herpesvirus 1 | CeHV1 | NC_004812 | 61592-61789 (oriL1); 61795-61992 (oriL2); 132795-132796 (oriS1); 132998-132999 (oriS2); 149425-149426 (oriS2); 149628-149629 (oriS1) | 156789 | 350 | α |
| Cercopithecine herpesvirus 2 | CeHV2 | NC_006560 | 61445-61542 (oriL); 129452-129623 (oriS); 144386-144557 (oriS) | 150715 | 350 | α |
| Cercopithecine herpesvirus 9 | CeHV7 | NC_002686 | 109627-109646; 118613-118632 | 124138 | 300 | α |
| Cercopithecine herpesvirus 16 | CeHV16 | NC_007653 | 62892-63070 (oriL); 133380-133578 (oriS); 149725-149923 (oriS) | 156487 | 700 | α |
| Human herpesvirus 4 | EBV | NC_001345 | 7315-9312 (oriP); 52589-53581 (oriLyt) | 172281 | 400 | γ |
| Equid herpesvirus 1 | EHV1 | NC_001491 | 126187-126338 | 150224 | 350 | α |
| Equid herpesvirus 4 | EHV4 | NC_001844 | 73900-73919 (oriL); 119462-119481 (oriS); 138568-138587 (oriS) | 145597 | 350 | α |
| Gallid herpesvirus 1 | GaHV1 | NC_006623 | 24738-25005 (oriL) | 148687 | 350 | α |
| Human herpesvirus 5 strain AD169 | HCMV | NC_001347 | 93201-94646 (oriLyt) | 230287 | 550 | β |
| Human herpesvirus 6 | HHV6 | NC_001664 | 67617-67993 (oriLyt) | 159321 | 350 | β |
| Human herpesvirus 6B | HHV6B | NC_000898 | 68740-69581 (oriLyt) | 162114 | 400 | β |
| Human herpesvirus 7 | HHV7 | NC_001716 | 66685-67298 | 153080 | 350 | β |
| Human herpesvirus 1 | HSV1 | NC_001806 | 62475 (oriL); 131999 (oriS); 146235 (oriS) | 152261 | 350 | α |
| Human herpesvirus 2 | HSV2 | NC_001798 | 62930 (oriL); 132760 (oriS); 148981 (oriS) | 154746 | 350 | α |
| Murid herpesvirus 2 | RCMV | NC_002512 | 75666-78970 (oriLyt) | 230138 | 550 | β |
| Suid herpesvirus 1 | SHV1 | NC_006151 | 63848-63908 (oriL); 114393-115009 (oriS); 129593-130209 (oriS) | 143461 | 350 | α |
| Human herpesvirus 3 | VZV | NC_001348 | 110087-110350; 119547-119810 | 124884 | 300 | α |


features. Some examples of filter methods are:

1) Pearson correlation coefficient: The absolute value of the Pearson correlation coefficient (PCC) is a widely used ranking criterion. The PCC is proportional to the dot product between X_j and Y after variable standardization (subtracting the mean and dividing by the standard deviation). The PCC is a measure of linear dependency between variables. Although irrelevant variables should have a PCC value near zero, a PCC close to zero does not imply that the variable is irrelevant: a non-linear dependency may exist that is not captured by the PCC.
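As a concrete illustration, PCC-based ranking can be sketched in a few lines of NumPy. The toy data and variable names below are ours, not from the paper's data set:

```python
import numpy as np

def pcc_ranking(X, y):
    """Rank features by |Pearson correlation| with the class vector y.
    X is the N-by-n input matrix; y holds the desired outputs (+1/-1).
    Returns column indices ordered from most to least relevant."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature
    ys = (y - y.mean()) / y.std()
    pcc = np.abs(Xs.T @ ys) / len(y)           # |correlation| per feature
    return np.argsort(-pcc)

# Toy data: feature 0 tracks the labels, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
X = np.column_stack([y + 0.1 * rng.standard_normal(8),
                     rng.standard_normal(8)])
ranking = pcc_ranking(X, y)  # feature 0 should rank first
```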

2) Golub: The features are ranked by their degree of correlation to the output (class) vector Y. To establish if these correlations are stronger than would be expected by chance, idealized patterns corresponding to a feature that is uniformly high in one class and uniformly low in the other are created. One tests whether there is an unusually high density of features similar to this idealized pattern, as compared to equivalent random patterns [18].

3) Euclidian: In this naïve method, the relevance of the features is determined by their Euclidian distance to the class vector Y. The data set needs to be standardized before applying this technique. The Euclidian distance can also be used to create a simple classifier [7]. Each class is represented by a vector of size n, and unseen cases are assigned to the class whose vector is closest to them. These representative vectors are composed of the means of each variable over all the examples in each class. This naïve learner is extremely fast and can be easily implemented in Matlab.
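That naïve nearest-mean classifier can be sketched as follows; the paper mentions a Matlab implementation, so treat this NumPy version with its own toy data as an illustrative translation:

```python
import numpy as np

def nearest_mean_fit(X, y):
    """Represent each class by the mean of its examples (one vector of size n)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def nearest_mean_predict(X, classes, means):
    """Assign each unseen case to the class whose mean vector is closest
    in Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])
classes, means = nearest_mean_fit(X, y)
pred = nearest_mean_predict(np.array([[0.1, 0.0], [1.0, 0.9]]), classes, means)
```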

Information about how to obtain computer programs which implement these three methods can be found in [3].

Other filter methods are: Fisher’s discriminant criterion [31], statistical tests, such as the t-test [28] and mutual information [15].

B. Wrapper Methods

In wrapper methods, the relevance of each subset of features is estimated using a classifier. This approach requires a search for the optimal values of the parameters of the classifier. Some examples of wrapper methods are: the Genetic Feature Selection with Inconsistency Criterion [42], the wrapper approach [23], the wrapper version of the Euclidian method [7] and the fast wrapper method [33].

The wrapper methods utilized in this manuscript are based on SVM [38]. SVM feature selection methods were studied first in [19] and have been used in various biological applications such as breast cancer diagnosis [19], microarray data [17], prediction of insurgence of human genetic disease [6], prediction of single nucleotide polymorphisms [24] and prediction of protein stability [9].

The modification of the parameters of the SVM in order to obtain the desired results is known as training. Usually, the desired behavior of an SVM is obtained by providing it with examples of inputs and the corresponding observed outputs.

A portion of the data set is reserved and used to demonstrate the accuracy of the SVM on unseen cases. These instances are presented to the system during a procedure called testing. For SVM, the classification of the examples in both the training and test sets is performed by a decision function

D(X) = W^T G(X) + b \qquad (1)

where G is a vector of non-linear functions of size L, with L >> n (L may even be infinite). The elements of the vector W = (w_1, w_2, \ldots, w_L) and the bias term b are real numbers.

The decision function is considered to have correctly classified the input vector X_i if:

D(X_i) \begin{cases} \geq 1 & \text{for } y_i = 1 \\ \leq -1 & \text{for } y_i = -1 \end{cases} \qquad (2)

An advantage of SVM is that the equations needed to optimize these values allow one to work with the products G(X_i)^T G(X_j) = H(X_i, X_j) instead of G(X) directly; this is known as the kernel trick. The functions H(X, X') are called kernel functions. The most popular kernels are the Linear, Radial Basis Function (RBF) and Artificial Neural Network kernels. All the functions H(X, X_j) have the same form; only the values of their parameters differ depending on which support vectors (SVs) X_j they are associated with. When using the kernel trick, only M parameters have to be found, where M is the number of support vectors and N \geq M. The Linear and RBF kernels were selected for our research due to their ability to produce adequate results in many different fields of research. The corresponding equations are, for the Linear kernel:

H(X, X_j) = X^T X_j

and for the RBF kernel:

H(X, X_j) = \exp(-\lambda \|X - X_j\|^2).

Here we choose \lambda to be 1. Since, during preliminary experiments, the algorithm proved to be robust over a wide range of values for \lambda, it was determined that an exhaustive search for its most desirable value is not necessary. This phenomenon has been observed previously in [1].
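Both kernels reduce to simple matrix operations. This NumPy sketch computes both Gram matrices for a small illustrative input matrix, with λ = 1 as in the paper:

```python
import numpy as np

def linear_kernel(X1, X2):
    # H(X, X') = X^T X' for every pair of rows
    return X1 @ X2.T

def rbf_kernel(X1, X2, lam=1.0):
    # H(X, X') = exp(-lam * ||X - X'||^2) for every pair of rows
    sq = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-lam * sq)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
K_lin = linear_kernel(X, X)   # [[1, 0], [0, 1]]
K_rbf = rbf_kernel(X, X)      # diagonal 1, off-diagonal exp(-2)
```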

The generalization capabilities of the decision function can be maximized by increasing the margin, i.e. the distance between a decision function and the input vector nearest to it. It can be proved that the margin is optimized by minimizing:

Q(W) = \frac{1}{2}\|W\|^2, \quad W = (w_1, w_2, \ldots, w_L) \qquad (3)

subject to the constraints in (2) for all input-output examples.

The hard-margin support vector machines are decision functions that provide the maximum margin. Hard-margin SVMs may not always exist in real-life problems. One can get around this situation by introducing slack terms, represented by \xi_i, into (3) as follows:


Q(W) = \frac{1}{2}\|W\|^2 + C \sum_{i=1}^{N} \xi_i^2 \qquad (4)

The resulting systems are known as soft-margin SVMs. It is also necessary to find the M support vectors for the kernel functions. For this project it was decided to treat all the vectors in the data set as SVs. In some problems this might limit the generalization capabilities of the SVM. If this is the case, a strategy proposed in [36] and [37] can be applied to iteratively eliminate SVs with little influence on the decision function described in (1), i.e., SVs with small w_i's.

Also, for (4), we set C = (# of negative cases)/(# of positive cases) for the positive cases, i.e., those where y_i = 1. For the negative cases, i.e., those where y_i = -1, C is set equal to 1. This formula was proposed for imbalanced data in [25].

The change in the generalization capability of an SVM caused by removing a variable can be accurately estimated [1]. The process of repeatedly eliminating from the data set the variables that produce the smallest change in the generalization capability of the SVM is called recursive feature elimination (RFE) [19]. For information about the programs implementing the SVM-RFE method, see [5].
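The RFE loop itself is simple to state: train, rank features by the magnitude of their weights in the decision function, drop the weakest, and repeat. The sketch below substitutes a regularized linear least-squares fit for the SVM training step; that stand-in and the toy data are our assumptions, not the paper's implementation:

```python
import numpy as np

def rfe_ranking(X, y, fit_weights):
    """Recursive feature elimination: repeatedly drop the feature whose weight
    has the smallest magnitude in a freshly trained linear decision function.
    Returns feature indices in elimination order (last = most important)."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > 1:
        w = fit_weights(X[:, remaining], y)
        worst = int(np.argmin(np.abs(w)))
        eliminated.append(remaining.pop(worst))
    eliminated.append(remaining[0])   # last survivor = most important feature
    return eliminated

def ridge_weights(X, y, C=1.0):
    # Regularized linear fit standing in for the SVM training step.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + np.eye(n) / C, X.T @ y)

# Feature 0 carries the labels, feature 1 is near-useless noise,
# feature 2 is a weaker copy of feature 0.
y = np.array([1.0, 1.0, -1.0, -1.0])
X = np.column_stack([y, 0.01 * np.array([1.0, -1.0, 1.0, -1.0]), 0.5 * y])
order = rfe_ranking(X, y, ridge_weights)
```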

The solution to minimizing (4) subject to the constraints in (2), with the inequalities holding as equalities, is called a least-squares support vector machine (LS-SVM). LS-SVMs have performed well in various applications, including classification of calcium channel antagonists [41], quantification of bacteria [2] and a proteometric study [4]. An attractive characteristic of LS-SVMs is that the optimal solutions for the vector W and the number b can be found by solving a system of linear equations [37]. RFE works the same way as for the SVM.
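That linear-system property can be sketched directly. The fragment below trains a small RBF LS-SVM by solving a single linear system; the particular dual formulation and the toy data follow standard LS-SVM expositions and are our assumptions, not this paper's code:

```python
import numpy as np

def lssvm_train(K, y, C=10.0):
    """Train an LS-SVM by solving one linear system (a standard dual form):
        [ 0    1^T       ] [b]     [0]
        [ 1    K + I/C   ] [alpha] [y]
    Returns (alpha, b) so that D(x) = sum_i alpha_i * H(x, X_i) + b."""
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(N) / C
    rhs = np.concatenate([[0.0], y])
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]

def rbf(X1, X2, lam=1.0):
    sq = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-lam * sq)

X = np.array([[0.0], [0.2], [1.0], [1.2]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha, b = lssvm_train(rbf(X, X), y)
pred = np.sign(rbf(np.array([[0.1], [1.1]]), X) @ alpha + b)
```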

III. APPLICATION TO HERPESVIRUSES

Table I shows the herpesviruses used in this study and their known replication origins, documented in the annotations of the GenBank files.

The prediction strategy used for this research considers the viral genome as a set of equal-size overlapping windows, with each window being a small DNA segment about 0.5% of the genome length. This window length is chosen because it is around the average length of the known replication origins reported in Table I.
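The windowing step can be sketched as follows. The paper specifies equal-size overlapping windows of about 0.5% of the genome length; the 50% overlap, the rounding, and the toy genome below are our assumptions:

```python
def genome_windows(seq, frac=0.005, overlap=0.5):
    """Cut a genome into equal-size overlapping windows.
    Window length is about frac (0.5%) of the genome length; the step size
    is derived from the assumed overlap. Returns (start, end, window) tuples."""
    w = max(1, round(len(seq) * frac))
    step = max(1, int(w * (1 - overlap)))
    return [(i, i + w, seq[i:i + w]) for i in range(0, len(seq) - w + 1, step)]

toy_genome = "ACGT" * 5000          # 20,000-base stand-in for a viral genome
windows = genome_windows(toy_genome)  # 100-base windows, 50-base step
```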

The features used for the construction of the LS-SVM, for both herpesviruses and caudoviruses, are described below.

1) Subfamily/Family: All members of our herpesvirus data set belong to the Herpesviridae family, and they are classified into the α, β and γ subfamilies according to their biological properties, such as the range of hosts and types of infected cells.

2) Palindrome scores: Because of the documented observation that replication origins often lie around regions with an unusually high concentration of palindromes, two palindrome scores, namely the palindrome length score (PLS) and the base-weighted score (BWS1), described in [10], are included as features of the SVM. Basically, PLS scores a palindrome in proportion to its length, whereas BWS1 scores a palindrome according to how rarely it would be observed in a random nucleotide sequence generated by a first-order Markov chain. Regardless of the scoring scheme, the total score of a window is the sum of the scores of all palindromes whose centers lie within the window.
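A minimal sketch of PLS-style scoring is shown below; the exhaustive center-by-center scan and the minimum palindrome length are simplifications of the schemes in [10], not a reimplementation of them:

```python
def revcomp(s):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[c] for c in reversed(s))

def palindrome_centers(seq, min_len=4):
    """Find DNA palindromes (a stretch followed immediately by its reverse
    complement) of at least min_len bases; return (center, length) pairs,
    keeping only the maximal extension at each center."""
    hits = []
    n = len(seq)
    for c in range(1, n):             # center lies between positions c-1 and c
        half = 0
        while (c - half - 1 >= 0 and c + half < n
               and seq[c + half] == revcomp(seq[c - half - 1])):
            half += 1
        if 2 * half >= min_len:
            hits.append((c, 2 * half))
    return hits

def pls_window_score(seq, start, end, min_len=4):
    """PLS of a window: sum of the lengths of all palindromes whose
    centers lie within [start, end)."""
    return sum(length for center, length in palindrome_centers(seq, min_len)
               if start <= center < end)

seq = "TTTGAATTCTTT"   # contains the 6-base palindrome GAATTC
score = pls_window_score(seq, 0, len(seq))
```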

3) A+T content: The A+T content of a window refers to the percentage of A and T bases in the window. DNA replication typically requires the binding of an assembly of enzymes (e.g., helicases) to locally unwind the DNA helical structure and pull apart the two complementary strands. Higher A+T content around the origins makes the two complementary DNA strands bond less strongly to each other, making it easier to pull the strands apart and initiate the replication process. As other studies ([10], [11], [35]) have reported the use of A+T content to locate DNA replication origins, we include it as a feature of our SVM.

4) Standardized window number: This is the feature that enables information about the location of the known replication origins to be fed into the SVM. First described in [12], the standardized window number is the window number divided by the total number of windows in the virus; hence the window number is normalized to a real number between 0 and 1. For example, if a virus has a total of 500 windows, then the standardized window number for the 455th window is 455/500 = 0.91. The idea of including this variable as an input initially came from the observation that replication origins are located in very similar parts of the genome in groups of viruses, especially within the herpesvirus family. Figure 1 gives a schematic representation of the genomes as vertical bars, where the black colored regions are those windows close to known replication origins.

Fig. 1. Replication Origins of Herpesviruses

5) Dinucleotide scores: A dinucleotide is a word made up of any two nucleotide bases; the 16 possible dinucleotides in DNA are AA, AC, ..., TT. In the past, measures of dinucleotide content have been used as genomic signatures for different bacteria ([21], [22], [30]). In our research, the dinucleotide scores [12] are 16 variables consisting of the natural logarithm of the proportion of each possible dinucleotide in a window divided by the product of the percentages (Pct.) of the two constituent single nucleotide bases in the whole DNA sequence of the virus. The score for a dinucleotide ab in a window w of virus v is:

\text{score}_{ab}(w) = \log\left(\frac{(\text{times } ab \text{ appears in } w)/(\text{length of } w)}{\text{Pct. of } a \text{ in virus } v \times \text{Pct. of } b \text{ in virus } v}\right)
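The score above can be sketched as follows; the pseudocount guard against log(0) for dinucleotides absent from a window is our addition, not part of the paper's formula:

```python
import math

def dinucleotide_scores(window, genome):
    """Compute the 16 dinucleotide scores for one window: log of the
    dinucleotide frequency in the window over the product of the whole-genome
    percentages of its two constituent bases."""
    bases = "ACGT"
    pct = {b: genome.count(b) / len(genome) for b in bases}
    n_pairs = len(window) - 1
    scores = {}
    for a in bases:
        for b in bases:
            ab = a + b
            count = sum(window[i:i + 2] == ab for i in range(n_pairs))
            freq = max(count, 0.5) / n_pairs   # pseudocount to avoid log(0)
            scores[ab] = math.log(freq / (pct[a] * pct[b]))
    return scores

genome = "ACGT" * 100
scores = dinucleotide_scores(genome[:40], genome)
# AC is over-represented in this window, AA never occurs.
```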

The features above are represented by a total of 23 input variables: 3 for family/subfamily classification, 2 for palindromes, 1 for A+T content, 1 for the standardized window number, and 16 for dinucleotide scores. Only four of these 23 features, namely the three subfamily classification variables and the standardized window number, are not based on the contents of the window. Other features, e.g., mononucleotide and trinucleotide scores, were included during preliminary trials, but they did not seem to offer any significant advantage over the features mentioned above.

IV. FEATURE SELECTION RESULTS

For our wrapper methods, due to the time necessary to perform all the required computational work, we carry out the RFE process using a random selection of 50% of the positive examples and 10% of the negative examples. The random selection leads to different rankings of the variables each time the RFE algorithm is executed. Table II presents the variables ordered according to their average relative rankings over 12 executions of RFE for each of the six different methods (with the highest and lowest rankings for each variable excluded from the calculations).

There are several similarities among the different wrapper methods, the most remarkable being the ranking of the standardized window number as the most relevant feature of all. This is important because it supports the argument that there are features not based on the contents of the DNA windows that contain important information about the location of replication origins. Also notice that a subfamily classification feature (α or β) appears in the top 10 of each wrapper method, exceeding the importance of the majority of the content-based features. The subfamily γ feature, with a rank of 18 or worse, provides very little information; this is probably because only 10% of the viruses with known replication origins in our data set belong to the γ subfamily (see Fig. 1). The relative importance of this variable might change once more origins are found in members of this subfamily.

Also, in all three wrapper methods, BWS1 and PLS are consistently considered the variables containing the least information. It is important to understand that this does not imply that these palindrome scores do not contain information useful for predicting replication origins. The reason for their early elimination is that during RFE, data are assumed to come from a deterministic objective function. In other words, due to their random nature and multiple contradictions, palindrome scores provide little information for wrapper methods.

TABLE II
FEATURE SELECTION RESULTS

| Rank | Pearson | Golub | Euclidian | SVM RBF | SVM Linear | LS-SVM RBF |
| 1 | TC | TC | GA | Std Window | Std Window | Std Window |
| 2 | CG | CG | AC | α | TG | GC |
| 3 | GC | CC | CA | AT | AT | TG |
| 4 | GT | GC | TC | TG | GG | GG |
| 5 | CC | GG | AG | CA | A+T content | TC |
| 6 | GG | GT | TG | GG | AG | AG |
| 7 | CT | CT | CT | TC | CT | GT |
| 8 | AG | TT | GG | GA | CA | CT |
| 9 | TT | AG | CC | AG | β | GA |
| 10 | TA | Std Window | TT | CC | TT | α |
| 11 | TG | TA | GC | β | GA | AT |
| 12 | Std Window | TG | GT | CT | TC | CG |
| 13 | AC | A+T content | CG | TT | AC | AC |
| 14 | AA | AA | AA | AC | GT | CA |
| 15 | A+T content | AT | A+T content | CG | GC | β |
| 16 | AT | α | AT | GT | TA | CC |
| 17 | CA | AC | TA | TA | PLS | A+T content |
| 18 | α | β | Std Window | γ | γ | TT |
| 19 | β | CA | α | GC | CG | γ |
| 20 | BWS1 | BWS1 | β | A+T content | CC | TA |
| 21 | GA | GA | BWS1 | AA | α | AA |
| 22 | PLS | PLS | PLS | PLS | AA | PLS |
| 23 | γ | γ | γ | BWS1 | BWS1 | BWS1 |

The rankings of the dinucleotide scores and the A+T content differ across the wrapper methods. Among the few consistencies: TG is ranked among the top four features and AA among the worst three.

The simplicity of the filter methods allowed the use of all the examples in the data set at once. Since the Golub and PCC methods rely on the correlation, it is not surprising that they produce very similar results.

These methods tend to regard certain dinucleotides, e.g., TC, as the best features, while regarding non-composition features as rather irrelevant; e.g., the subfamily γ classification is ranked as the worst feature by all three filter methods. Not even the standardized window number, consistently considered the best feature by the wrapper methods, is ranked better than 10th by any of the filter methods.

This discrepancy is an important issue that should be addressed. Note that the assumptions of the filter methods make them appropriate only for simple problems in which the variables are independent. They are designed to detect only linear relationships between the features and the desired classification. The wrapper methods, in contrast, do not require such simplifying assumptions, and are therefore expected to give more reliable results for more complex problems where the variables are correlated.

In Table III, we compare the performance of a couple of the SVM methods against the Euclidian method in terms of their average percentage of correct classifications. We are able to compare these few methods now because they are the only ones that can be used for feature selection and classification at the same time with available software (e.g., [3], [8]).

We also examine the performance of a couple of naïve Bayes classifiers. A naïve Bayes classifier is a simple probabilistic classifier that applies Bayes' theorem and, like the filter methods, uses strong (naïve) independence assumptions. The Diagonal Linear method fits a multivariate normal density to each group with a pooled diagonal covariance matrix estimate, while Diagonal Quadratic fits multivariate normal densities with diagonal covariance matrix estimates stratified by group. Both methods are provided by the Matlab Statistics Toolbox.
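The Diagonal Linear variant can be sketched in NumPy as follows; the toy data are ours, and equal class priors are assumed, so this mirrors Matlab's 'diaglinear' option only under those assumptions:

```python
import numpy as np

def diag_linear_fit(X, y):
    """'Diagonal linear' naive Bayes: per-class means with a single pooled
    diagonal covariance estimate shared by all classes."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    pooled_var = np.mean([X[y == c].var(axis=0) for c in classes], axis=0)
    return classes, means, pooled_var

def diag_linear_predict(X, classes, means, var):
    # Gaussian log-likelihood with diagonal covariance; equal priors assumed.
    ll = -0.5 * np.sum((X[:, None, :] - means[None, :, :])**2 / var, axis=2)
    return classes[np.argmax(ll, axis=1)]

X = np.array([[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [1.1, 1.0]])
y = np.array([-1, -1, 1, 1])
classes, means, var = diag_linear_fit(X, y)
pred = diag_linear_predict(np.array([[0.0, 0.0], [1.0, 1.0]]),
                           classes, means, var)
```

The Diagonal Quadratic variant would differ only in keeping a separate diagonal variance per class instead of pooling.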

The same 12 data sets used for the creation of the wrapper methods were used as training sets for the SVM classifiers, while the rest of the data were used as test sets. The results for the test sets are shown in Table III. The performance of the SVM classifiers on the herpesviruses suggests that the wrapper methods are the more appropriate model for our problem, attaining an almost perfect score on the test set. This is confirmed by the fact that an LS-SVM can accurately predict replication origins in a herpesvirus based only on information from other herpesviruses [13].

V. CONCLUSION

The comparison results from this study suggest that wrapper methods are better than filter methods at determining which set of features is the most relevant for predicting the location of replication origins. They also indicate that features such as the standardized window number and the subfamily classification, which are not directly related to the local content of the nucleotide sequence, can provide valuable information about the location of replication origins in herpesviruses.

Due to the limited availability of the computer programs corresponding to different filter and wrapper methods, we are able to compare only a few examples of them in this paper. More extensive comparisons are planned for future investigations.

ACKNOWLEDGMENT

We thank Drs. Kwok-Pui Choi and David S.H. Chew for their help with data collection and their invaluable advice. We also thank Mr. Ivan Ramirez for his assistance with the installation of the computer programs for the filter methods.

This research is supported in part by the Texas Higher Education Coordinating Board ARP grant 0036661-0008-2007, the National Institutes of Health grants S06GM08012-37 and 5G12RR008124-11, and the National Science Foundation Grant DMS-0800266.

REFERENCES

[1] S. Abe, Support Vector Machines for Pattern Classification (Advances in Pattern Recognition), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.

[2] A. Borin, M. F. Ferro, C. Mello, L. Cordi, L. C. M. Pataca, N. Durán and R. J. Poppi, “Quantification of lactobacillus in fermented milk by multivariate image analysis with least-squares support-vector machines”, Anal Bioanal Chem, vol. 387, pp. 1105–1112, 2007.

[3] L. J. Buturovic, “PCP: a program for supervised classification of gene expression profiles”, Bioinformatics, vol.22, issue 2, pp. 245-247, 2006.

[4] J. Caballero, L. Fernandez, M. Garriga, J. I. Abreu, S. Collina and M. Fernandez, “Proteometric study of ghrelin receptor function variations upon mutations using amino acid sequence autocorrelation vectors and genetic algorithm-based least square support vector machines”, J Mol Graph Model, vol. 26, pp. 166–178, 2007.

[5] S. Canu, Y. Grandvalet, V. Guigue and A. Rakotomamonjy, “SVM and Kernel Methods Matlab Toolbox”, Perception Systèmes et Information, INSA de Rouen, Rouen, France, 2003. Software available at http://asi.insa-rouen.fr/enseignants/~arakotom/toolbox/

[6] E. Capriotti, R. Calabrese and R. Casadio, “Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information”, Bioinformatics, vol. 22, pp. 2729–2734, 2006.

TABLE III
CLASSIFICATION METHODS PERFORMANCE

Method                  Average Pct. of Correct Classifications
SVM Classifiers
  LS-SVM RBF            98.10
  SVM Linear            97.46
Naive Classifiers
  Diagonal Quadratic    86.01
  Diagonal Linear       79.67
  Euclidean             73.11

[7] T.Y. T. Chan, “A Quick and Naive Euclidean Learner for Supervised Feature Selection”, Proceedings of The 6th IEEE International Conference on Electronics, Circuits and Systems, pp.587--590, IEEE Computer Society, 1999.

[8] C.C. Chang and C.J. Lin, “LIBSVM: a library for support vector machines”, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

[9] J. Cheng, A. Randall and P. Baldi, “Prediction of protein stability changes for single-site mutations using support vector machines”, Proteins, vol. 62, pp. 1125–1132, 2006.

[10] D. S. H. Chew, K. P. Choi, and M.-Y. Leung, “Scoring schemes of palindrome clusters for more sensitive prediction of replication origins in herpesviruses”, Nucleic Acids Res, vol. 33, e134, 2005.

[11] D. S. H. Chew, M.-Y. Leung and K. P. Choi, “AT excursion: a new approach to predict replication origins in viral genomes by locating AT-rich regions”, BMC Bioinformatics, vol.8, issue 163, 2007.

[12] R. Cruz-Cano, D. Chandran and M.-Y. Leung, “Computational prediction of replication origins in herpesviruses”, 2007 IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology, pp. 283–290, 2007.

[13] R. Cruz-Cano, D. Chew, K. P. Choi and M.-Y. Leung, “Least-Squares Support Vector Machine Approach to Viral Replication Origin Prediction”, submitted to the Journal of Computing, 2008.

[14] H.J. Delecluse, W. Hammerschmidt, “The genetic approach to the epstein-barr virus: from basic virology to gene therapy”, Mol Pathol, vol. 53, pp. 270–279, 2000.

[15] C. Ding and H. Peng, “Minimum redundancy feature selection from microarray gene expression data”, J. Bioinform. Comput. Biol., vol. 3, pp. 185–205, 2005.

[16] H. Deng, J. T. Chu, N.-H. Park, R. Sun, “Identification of cis sequences required for lytic DNA replication and packaging of murine gammaherpesvirus 68”, J Virol, vol. 78, pp. 9123–9131, 2004.

[17] M. Doran, D. S. Raicu, J. D. Furst, R. Settimi, M. Schipma and D. P. Chandler, “Oligonucleotide microarray identification of bacillus anthracis strains using support vector machines”, Bioinformatics, vol. 23, pp. 487–492, 2007.

[18] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield and E. S. Lander, “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring”, Science, vol. 286, pp. 531-537, 1999.

[19] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, “Gene selection for cancer classification using support vector machines”, Mach. Learn., vol. 46, pp. 389–422, 2002.

[20] C. B. Hartline, E. A. Harden, S. L. Williams-Aziz, N. L. Kushner, R. J. Brideau and E. R. Kern, “Inhibition of herpesvirus replication by a series of 4-oxo-dihydroquinolines with viral polymerase activity”, Antiviral Res, vol. 65, pp. 97–105, 2005.

[21] R. W. Jernigan and R. H. Baran, “Pervasive properties of the genomic signature”, BMC Genomics, vol. 3: 23, 2002.

[22] S. Karlin and C. Burge, “Dinucleotide relative abundance extremes: a genomic signature”, Trends Genet, vol. 11, pp. 283–290, 1995.

[23] R. Kohavi and G. H. John, “Wrappers for feature subset selection”, Artif. Intell., vol. 97, no. 1-2, pp. 273-324, 1997.

[24] W. Kong and K. W. Choo, “Predicting single nucleotide polymorphisms (snp) from dna sequence by support vector machine”, Front Biosci, vol. 12, pp. 1610–1614, 2007.

[25] K. Lee, S. Gunn, C. Harris and P. Reed, “Classification of unbalanced data with transparent kernels”, Proceedings of the International Joint Conference on Neural Networks (IJCNN '01), vol. 4, pp. 2410–2415, 2001.

[26] M.-Y. Leung, K. P. Choi, A. Xia and L. H. Y. Chen, “Nonrandom clusters of palindromes in herpesvirus genomes”, J Comput Biol, vol. 12, pp. 331–354, 2005.

[27] M.J. Masse, S. Karlin, G. A. Schachtel and E. S. Mocarski, “Human cytomegalovirus origin of DNA replication (oriLyt) resides within a highly complex repetitive region”, Proc Natl Acad Sci USA, vol. 89, pp. 5246–5250, 1992.

[28] F. Model, P. Adorján, A. Olek and C. Piepenbrock, “Feature selection for dna methylation based cancer classification”, Bioinformatics, vol. 17 (Suppl 1), pp. S157-S164, 2001.

[29] C. S. Newlon and J. F. Theis, “DNA replication joins the revolution: whole-genome views of DNA replication in budding yeast”, Bioessays, vol. 24, pp. 300–304, 2002.

[30] M. W. J. van Passel, E. E. Kuramae, A. C. M. Luyf, A. Bart and T. Boekhout, “The reach of the genome signature in prokaryotes”, BMC Evol Biol, vol. 6:84, 2006.

[31] P. Pavlidis, J. Cai, J.Weston and W.N. Grundy, “Gene functional classification from heterogeneous data”, Proceedings of the 5th International Conference on Computational Molecular Biology, pp. 249–255, 2001.

[32] D. Reisman, J. Yates and B. Sugden, “A putative origin of replication of plasmids derived from Epstein-Barr virus is composed of two cis-acting components”, Mol Cell Biol, vol. 5, pp. 1822– 1832, 1985.

[33] G. Richards, K. Brazier and W. Wang, “A Fast Wrapper Method for Feature Subset Selection”, Proceeding of the Artificial Intelligence and Applications, pp. 54-59, 2005.

[34] Y. Saeys, I. Inza and P. Larrañaga, “A review of feature selection techniques in bioinformatics”, Bioinformatics, vol. 23, pp. 2507–2517, 2007.

[35] M. Segurado, A. de Luis and F. Antequera, “Genome-wide distribution of DNA replication origins at A+T-rich islands in Schizosaccharomyces pombe”, EMBO Rep, vol. 4, pp. 1048–1053, 2003.

[36] J. Suykens, “Least squares support vector machines for classification and nonlinear modeling”, Neural Network World, vol. 10, pp. 29–47, 2000.

[37] J. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor and J. Vandewalle, Least Squares Support Vector Machines, World Scientific Pub. Co., 2002.

[38] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., New York, NY, USA, 1995.

[39] E. C. Villarreal, “Current and potential therapies for the treatment of herpes-virus infections”, Prog Drug Res, vol. 60, pp. 263–307, 2003.

[40] S. K. Weller, A. Spadaro, J. E. Schaffer, A. W. Murray, A. M. Maxam and P. A. Schaffer, “Cloning, sequencing, and functional analysis of oriL, a herpes simplex virus type 1 origin of DNA synthesis”, Mol Cell Biol, vol. 5, pp. 930–942, 1985.

[41] X. Yao, H. Liu, R. Zhang, M. Liu, Z. Hu, A. Panaye, J. P. Doucet and B. Fan, “Qsar and classification study of 1,4-dihydropyridine calcium channel antagonists based on least squares support vector machines”, Mol Pharm, vol. 2, pp. 348–356, 2005.

[42] H. Yuan, S.-S. Tseng, W. Ganshan and Z. Fuyan, “A Two-phase Feature Selection Method Using both Filter and Wrapper”, IEEE SMC '99 Conference on Systems, Man, and Cybernetics, vol. 2, pp. 132 -136, 1999.

[43] Y. Zhu, L. Huang and D. G. Anders, “Human cytomegalovirus orilyt sequence requirements”, J Virol, vol. 72, pp. 4989–4996, 1998.