Comparative evaluation of word composition distances for the recognition of SCOP relationships

BIOINFORMATICS Vol. 20 no. 2 2004, pages 206–215
DOI: 10.1093/bioinformatics/btg392
Published by Oxford University Press

Susana Vinga 1, Rodrigo Gouveia-Oliveira 1 and Jonas S. Almeida 1,2,*

1 Biomathematics Group, ITQB, Universidade Nova de Lisboa, Rua da Quinta Grande, n. 6, 2780-156 Oeiras, Portugal and 2 Department Biometry and Epidemiology, Medical University South Carolina, 135 Cannon Street, Suite 303, P.O. Box 250835, Charleston, SC 29425, USA

* To whom correspondence should be addressed.

Received on April 23, 2003; revised on June 27, 2003; accepted on July 11, 2003

ABSTRACT
Motivation: Alignment-free metrics were recently reviewed by the authors, but have not until now been the object of a comparative study. This paper compares the classification accuracy of the word composition metrics reviewed therein. It also presents a new distance definition between protein sequences, the W-metric, which bridges alignment metrics, such as scores produced by the Smith–Waterman algorithm, and methods based solely on L-tuple composition, such as Euclidean distance and Information content.
Results: The comparative study reported here used the SCOP/ASTRAL protein structure hierarchical database and assessed the discriminant value of alternative sequence dissimilarity measures by calculating areas under the Receiver Operating Characteristic curves. Although alignment methods resulted in very good classification accuracy at family and superfamily levels, alignment-free distances, in particular Standard Euclidean Distance, are as good as alignment algorithms when sequence similarity is smaller, such as for recognition of fold or class relationships. This observation justifies their advantageous use to pre-filter homologous proteins, since word statistics techniques are computed much faster than the alignment methods.
Availability: All MATLAB code used to generate the data is available upon request from the authors. Additional material is available at http://bioinformatics.musc.edu/wmetric
Contact: svinga@itqb.unl.pt; almeidaj@musc.edu

INTRODUCTION
Bioinformatics applications rely heavily on sequence comparison techniques, from searching a database with a query DNA sequence to the classification of protein domains.

In most cases, alignments are performed between the target sequences and the resulting alignment scores are used to calculate a measure of dissimilarity. In protein comparison, the scoring methods depend on amino acid mutation rate information, represented as scoring matrices, and find optimal alignments between sequences by dynamic programming techniques. Alignment scores are particularly useful when sequences are known to be closely homologous, since the more conserved regions are automatically detected. However, for remote homologues this approach tends to fail: proteins with <20% identity, a region sometimes referred to as the 'twilight zone', are neither satisfactorily aligned nor is their similarity detected (Pearson, 2000). It is also noteworthy that dynamic programming is computationally intensive and consequently impractical for querying large datasets, which forces the use of heuristics to reduce the running times, as exemplified by BLAST.

In a recent paper (Vinga and Almeida, 2003), the authors reviewed alignment-free methods for sequence comparison but did not compare them quantitatively. In that review, metrics based on L-tuple composition, the focus of this report, emerged as the alignment-free technique most often proposed by other researchers. In these algorithms each sequence is mapped on to an n-dimensional vector according to its word composition. Linear Algebra theory is further employed to define distances between sequences represented in those vector spaces, namely by using Euclidean (eu) distance and Information content (see Review for a full description and related references).
This report also presents a novel distance function between protein sequences, the W-metric (Wm), which combines L-tuple composition methods with alignment-based techniques. This is accomplished by defining a function that includes both one-tuple composition information, more specifically the differences in amino acid content between two proteins, and weights from the scoring matrices used in alignment methods. Although these two concepts are not new, their combination constitutes the novel aspect of this metric. The weights correspond to estimates of log-likelihood ratios between probabilities of symbols that best describe mutation rates in known homologous proteins, thus providing evolutionary information.

The usefulness of the L-tuple composition approach is associated with its light computational load, which makes it well suited to pre-filtering relevant sequences before alignment algorithms are used to refine the searches. This type of heuristic approach is already used in programs like BLAST (Altschul et al., 1990) and FASTA (Pearson and Lipman, 1988). Although the solution may not be optimal, it drastically shortens processing time, to the point that the method can be used to query large databases. However, a comparative study of the effectiveness of alignment-free sequence dissimilarity measures is, to the authors' best knowledge, absent from the literature. Consequently, it is difficult to decide at what similarity level alignment methods are required. Such a comparative study of how these different metrics perform is reported here, and it is the main motivation for the present work, where alignment-free linear algebra type methods are comparatively assessed. Some previous studies have reported comparative assessments of various methods (Brenner et al., 1998; Lindahl and Elofsson, 2000; Pearson, 1991, 1995), but not consistently for the same reference dataset. These studies showed, however, the importance of following an extensive protocol, involving as many examples as possible, in the assessment of any classification procedure. Only then is it possible to improve some heuristics commonly applied in sequence similarity searches and identify the best algorithmic choice for each problem category.

We compared L-tuple metrics with the Smith–Waterman (SW) algorithm by means of Receiver Operating Characteristic (ROC) curves, applying the algorithms to a subset of the Structural Classification Of Proteins (SCOP)/ASTRAL database. This database constitutes the reference gold standard for protein secondary structure classification, which makes it a commonly used benchmark for protein structure prediction algorithms, a crucial field in Computational Biology applications. In addition, it has a hierarchical organization that can be browsed to assess classification accuracy for each of its levels.

SYSTEMS AND METHODS
In the section below, the W-metric, a novel word-statistic distance between protein sequences, is presented, as well as additional background on alignment-free algorithms. In the subsequent sections, the reference protein datasets and the methods used to compare the distance measures are described. Finally, the last two sections describe the algorithms and protocol used and their implementation.

Word statistics
There is a large body of literature on Word Statistics (Reinert et al., 2000), where sequences are interpreted as a succession of symbols and are further analysed by first representing the frequencies of their small segments (L-tuples or words). This approach does not take into account any of the physico-chemical or structural properties of the amino acids or nucleotides. There is also an increasing number of studies focusing on distance definition in the frequency space of L-tuples. These definitions are a fundamental step for the subsequent application of exploratory analysis methods, such as cluster analysis and dimensionality reduction techniques. In a recent review (Vinga and Almeida, 2003), the authors overviewed these metrics and their application to biological sequences, both DNA and proteins. That review will be used as the main reference for the description of the L-tuple distances and alignment-free algorithms that are tested here. A protein X of length n is a sequence of symbols from the alphabet of all possible amino acids, $X = s_1 \ldots s_n$, $s_i \in \mathcal{A} = \{A, R, N, D, \ldots, V\}$. The mapping of X into the Euclidean space can be defined by representing X by its amino acid composition in counts $c^X$ and frequencies $f^X$ [Equation (1)].

$$
c^X = (c^X_{A}, c^X_{R}, c^X_{N}, c^X_{D}, \ldots, c^X_{V}), \qquad f^X = \frac{c^X}{n} \qquad (1)
$$

For example, the peptide X = AARNNDAA is mapped on to the vectors $c^X = (4, 1, 2, 1, 0, \ldots, 0)$ and $f^X = (0.5, 0.125, 0.25, 0.125, 0, \ldots, 0)$. Instead of single amino acid frequencies, longer fragments of length L could be considered (L-tuples), resulting in a vector of frequencies of length $20^L$. One can further define a distance or dissimilarity measure between two proteins X and Y, d(X, Y), based on their corresponding vectors $f^X$ and $f^Y$.
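The mapping of Equation (1) can be sketched in a few lines of Python. This is only an illustration, not the authors' MATLAB code: the function name `composition` is hypothetical, and the full ordering of the 20 amino acids is an assumption, since the text shows only the first four symbols and the last one.

```python
from collections import Counter

# Assumed ordering of the 20 amino acids (A, R, N, D, ..., V); the text only
# shows the first four symbols and the last one, so the middle is an assumption.
ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def composition(sequence, alphabet=ALPHABET):
    """Return (counts, frequencies) of single amino acids in `sequence`."""
    tally = Counter(sequence)
    counts = [tally.get(symbol, 0) for symbol in alphabet]
    n = len(sequence)
    freqs = [c / n for c in counts]
    return counts, freqs

counts, freqs = composition("AARNNDAA")
print(counts[:6])  # [4, 1, 2, 1, 0, 0]
print(freqs[:6])   # [0.5, 0.125, 0.25, 0.125, 0.0, 0.0]
```

The printed prefixes reproduce the peptide example given in the text.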

W-metric definition
The novel Wm proposed here to complement existing word composition methods is based on the quadratic form defined in Equation (2). The distance between two proteins X and Y, $d_W(X, Y)$, is defined by their corresponding one-tuple frequencies $f^X$ and $f^Y$, weighted by the matrix W described below.

$$
d_W(X, Y) = (f^X - f^Y)^T \cdot W \cdot (f^X - f^Y) = \sum_{i \in \mathcal{A}} \sum_{j \in \mathcal{A}} (f^X_i - f^Y_i) \cdot (f^X_j - f^Y_j) \cdot w_{ij} \qquad (2)
$$

These quadratic forms play an important role in major theoretical and applied disciplines and scientific fields, from Linear Algebra to Econometrics. In Statistics they are used, e.g. in parameter estimation and statistical tests (Schott, 1997). They represent a scoring between conveniently weighted vectors of observations. It is noteworthy that other L-tuple distances are also based on quadratic forms [Equation (2)]: when W is the inverse of the covariance matrix of the data, the form represents the Mahalanobis (ma) distance between the corresponding vectors, and the standard Euclidean (se) distance is obtained when only the diagonal of the covariance matrix is taken. The distance reduces to the squared Euclidean distance when W is the identity matrix.
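The special cases just mentioned can be checked with a small sketch of the quadratic form in Equation (2). The helper names and the toy three-letter frequency vectors are hypothetical and chosen only for illustration. The standard Euclidean case is assumed here to use the reciprocal of each coordinate's sample variance on the diagonal of W.

```python
def quadratic_distance(fx, fy, w):
    """Quadratic form (fx - fy)^T W (fx - fy) of Equation (2)."""
    diff = [a - b for a, b in zip(fx, fy)]
    size = len(diff)
    return sum(diff[i] * w[i][j] * diff[j]
               for i in range(size) for j in range(size))

def identity(k):
    """k x k identity matrix: gives the squared Euclidean distance."""
    return [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]

def inverse_variance(vectors):
    """Diagonal W holding 1 / sample variance of each coordinate (assumption)."""
    k, m = len(vectors[0]), len(vectors)
    w = [[0.0] * k for _ in range(k)]
    for i in range(k):
        column = [v[i] for v in vectors]
        mean = sum(column) / m
        variance = sum((x - mean) ** 2 for x in column) / (m - 1)
        w[i][i] = 1.0 / variance
    return w

# Toy frequency vectors over a three-letter alphabet (hypothetical data).
data = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5], [0.3, 0.5, 0.2]]
fx, fy = data[0], data[1]
d_eu2 = quadratic_distance(fx, fy, identity(3))            # squared Euclidean
d_se = quadratic_distance(fx, fy, inverse_variance(data))  # standard Euclidean
print(d_eu2, d_se)
```

Passing a substitution matrix as `w` instead would give the Wm of this report. Only the weight matrix changes between the distances, which is why they share the same low computational cost.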

The weight matrices W chosen in Equation (2) can be rationalized as being scoring or amino acid substitution matrices, instead of covariance-based weights as in other distances. These matrices, such as Point Accepted Mutation (PAM) (Dayhoff et al., 1978) and BLOcks SUbstitution Matrices (BLOSUM) (Henikoff and Henikoff, 1992), are used in alignment-based methods and estimate the log-likelihood ratios between probabilities of symbols that best describe mutation rates in known homologous proteins. In particular, the BLOSUMX matrix is estimated with ungapped aligned blocks of proteins sharing less than X% identity. PAMn matrices account for evolutionary change in protein sequences, and their estimation is based on the construction of phylogenetic trees, which are subsequently used to create a Markov Chain n-step transition matrix. This matrix is further transformed and normalized for conditional probabilities. For an extensive description of these substitution matrices and some estimation examples, see Ewens and Grant (2001, Section 6.5).

The key idea of Wm is to weight the amino acid composition differences between two sequences, $f^X_i - f^Y_i$, according to the relative conservation of each amino acid in proteins known to be homologous. The overall distance between two proteins is the sum of these weighted factors. For example, if an amino acid is highly conserved in known homologous sequences (high $w_{ii}$), two proteins with a very different frequency of this amino acid should be less similar than if the amino acids are 'closer' to each other in that statistical sense. If the opposite occurs, i.e. if an amino acid is known to have a high mutational rate (low $w_{ii}$), the differences between its compositions in the two sequences being compared should be attenuated in the overall distance calculation. The same scheme applies to the off-diagonal elements $w_{ij}$ ($i \neq j$): if there is a high mutation rate between two amino acids, $w_{ij}$ is higher than the corresponding weight for two very different amino acids, so this component should be weighted more. The main idea is thus to weight amino acid differences according to their similarity, as given by known evolutionary information. The weighted metric hence includes both amino acid composition information, like other alignment-free techniques, and conserved homology information, as used to score the conventional alignment algorithms.

Some variations of this metric were also tested, namely using several normalization procedures. The low computational load associated with the calculation expressed in Equation (2) is appealing. It is not proven here, however, that the W matrix associated with mutation information is the best at discriminating classification levels. This question could be pursued further by using Artificial Neural Networks (ANN) or other algorithms to optimize classification accuracy by finding a 'better' W weighting matrix.

ROC curve definition
The methods used here to assess and compare the accuracy of classification schemes and prediction algorithms are based on the analysis of ROC curves. This method goes back to signal detection and classification problems and is now widely applied in medical diagnosis studies and psychometric analysis (Egan, 1975). This approach is employed in binary classification of continuous data, usually categorized as positive (1) or negative (0) cases. The classification accuracy can be measured by plotting, for different threshold values, the number of true positives (TP), also named sensitivity or coverage, versus false positives (FP), or (1 − specificity), encountered for each threshold, properly normalized [Equation (3)].

$$
\begin{aligned}
\text{sensitivity} &= \frac{\text{TruePositives}}{\text{Positives}} = \frac{TP}{TP + FN} \\
\text{specificity} &= \frac{\text{TrueNegatives}}{\text{Negatives}} = \frac{TN}{TN + FP} \qquad (3) \\
1 - \text{specificity} &= \frac{FP}{TN + FP}
\end{aligned}
$$

A ROC curve is simply the plot of sensitivity versus (1 − specificity) for different threshold values. The area under a ROC curve (AUC) is a widely employed parameter to quantify the quality of a classifier because it is a threshold-independent performance measure and is closely related to the Wilcoxon rank-sum test (Bradley, 1997). For a perfect classifier the AUC is 1, and for a random classifier the AUC is 0.5. For additional results and a comprehensive discussion of the AUC measure, see Bradley (1997). Baldi et al. (2000), Brenner et al. (1998) and Green and Brenner (2002) describe other possible classification accuracy measures not employed in this study.
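To make the definitions of Equation (3) concrete, the following sketch builds the ROC points by sweeping a threshold over a list of similarity scores and integrates them with the trapezoidal rule. The function names and the six toy scores are hypothetical. The trapezoidal rule is one common way to obtain the AUC and is not necessarily the procedure used by the authors.

```python
def roc_curve(scores, labels):
    """Return ROC points (1 - specificity, sensitivity), one per threshold.

    `scores` are similarities (higher means more likely positive) and
    `labels` are 1 for positive and 0 for negative cases.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item of a group of tied scores.
        is_last = index + 1 == len(ranked)
        if is_last or ranked[index + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example (hypothetical scores): three positives and three negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_curve(scores, labels))
print(round(area, 4))  # 0.8889, i.e. 8 of the 9 positive/negative pairs are ordered correctly
```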

Protein test datasets: SCOP/ASTRAL classification
The sequences used to perform the tests and compare different metrics are proteins from the SCOP database (Lo Conte et al., 2002; Murzin et al., 1995). This database consists of Protein Data Bank (PDB) entries and provides a detailed and reliable description of protein structure relationships and homology. The three-dimensional (3D) structure analysis allows the detection of more remote homologues, since structure is typically more conserved than sequence. The fundamental unit of classification is the protein domain, which is the basic element of protein structure and evolution. The ASTRAL compendium provides additional tools and datasets (Brenner et al., 2000; Chandonia et al., 2002), namely the possibility to filter sequence sets so that any two different proteins have less than a chosen percentage identity to each other. This classification is a hierarchical description of proteins (Fig. 1). The first two levels, family (fa) and superfamily (sf), describe evolutionary relationships; the third one, fold (cf), describes geometrical relationships or major structural similarity; and the fourth one represents protein structural class (cl). This will allow the study of each classifier for different levels of similarity.

[Figure 1: tree diagram of the SCOP/ASTRAL hierarchy for one example domain. SCOP/ASTRAL db Root; Class (cl): α, β, α/β, α+β; Fold (cf): Immunoglobulin-like beta-sandwich; Superfamily (sf): Immunoglobulin; Family (fa): I set domains; Domain: Fibroblast growth factor receptor FGFR2; Ref. PDB: 1ev2, 1djs, 1e0o.]

Fig. 1. SCOP/ASTRAL db: hierarchical classification of proteins. Example of Fibroblast growth factor receptor (FGFR2) classification in each of the four levels.

Two different datasets were tested in order to assess the accuracy of each metric. The basic protein set, PDB40-B, was extracted directly from the ASTRAL web site and corresponds to SCOP database release 1.61 (November 2002). This subset includes all the sequences that share <40% identity to each other and has become a benchmark test set in the evaluation of methods to detect remote protein homologies (Brenner et al., 1998; Dubchak et al., 1999; Karwath and King, 2002; Lindahl and Elofsson, 2000; Luo et al., 2002; Park et al., 1997; Webb et al., 2002). This dataset was subsequently trimmed to exclude sequences with unknown amino acids and those belonging to families with <5 elements, thus obtaining the protein group named PDB40-v (Table 1). For example, there are 232 families with only one sequence, which is not informative regarding intra-family dissimilarity and makes these domains insufficiently representative of a family. The effect of trimming the dataset was in this way also studied. Only the four major classes were included, namely: the all-α class, constituted mainly by proteins with α-helices; the all-β class, essentially formed by β-sheet structures; the α/β class, proteins with mixtures of α-helices and β-strands; and the α+β class, those where α-helices and β-strands are largely segregated. Other SCOP classes include multi-domain proteins, small proteins, theoretical models and other types, and were not included in this study. See Chothia et al. (1997) and the SCOP documentation for a description of protein folds and classification.

This study also considered separately another protein set from an outdated release of the SCOP database (1.35), the PDB40-b, due to the large amount of literature already published with those sequences (Luo et al., 2002 and corresponding references). Table 1 summarizes all the sequence sets examined in this paper.

Protocol for comparative assessment
The comparative test procedure followed in this report was based on a binary classification of each protein pair, where 1 corresponds to the two proteins sharing the same group in the SCOP database and 0 otherwise. The group can be defined at one of the four different levels of the database, family (fa), superfamily (sf), fold (cf) or class (cl), exploring the hierarchical organization of the proteins in that structure. Therefore, each protein pair is associated with four binary classifications, one for each level.
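The four binary labels of a protein pair can be derived directly from the hierarchical position of each domain. The sketch below assumes each domain carries a dotted SCOP classification string of the form class.fold.superfamily.family (for example `b.1.1.4`). The domain names and classification strings used here are hypothetical examples, not data from this study.

```python
from itertools import combinations

# Levels ordered from the top of the hierarchy downwards.
LEVELS = ("cl", "cf", "sf", "fa")

def pair_labels(sccs_a, sccs_b):
    """Return {level: 0 or 1} for two domains given their dotted strings."""
    parts_a, parts_b = sccs_a.split("."), sccs_b.split(".")
    labels = {}
    for depth, level in enumerate(LEVELS, start=1):
        # Two domains share a group only if every level above it matches too.
        labels[level] = int(parts_a[:depth] == parts_b[:depth])
    return labels

def all_pair_labels(domains):
    """Label every unordered pair of domains in a {name: string} mapping."""
    return {(a, b): pair_labels(domains[a], domains[b])
            for a, b in combinations(sorted(domains), 2)}

# Hypothetical domains used only to illustrate the labelling.
domains = {"dom1": "b.1.1.4", "dom2": "b.1.1.1", "dom3": "b.1.2.1", "dom4": "a.1.1.1"}
labels = all_pair_labels(domains)
print(labels[("dom1", "dom2")])  # {'cl': 1, 'cf': 1, 'sf': 1, 'fa': 0}
```

Comparing prefixes rather than single fields keeps the labels consistent with the hierarchy: a pair can be positive at family level only if it is also positive at superfamily, fold and class levels.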

In order to compute the ROC curves, we calculated the distances between all possible protein pairs according to the different metrics, which are briefly described below.

The alignment-based similarity measure tested was the Smith–Waterman raw score, with no correction for statistical significance, using the score matrix BLOSUM50 and a linear gap penalty scheme with a gap penalty of 8. The distances based on L-tuple composition evaluated were W-metric (Wm), Euclidean (eu), standard Euclidean (se), Kullback–Leibler (ku) discrepancy, Cosine (co) and Mahalanobis (ma). For the corresponding complete definitions and properties, see Vinga and Almeida (2003). In Wm calculations, some alternative weighting matrices W [Equation (2)] were used; these included the scoring matrices BLOSUM50, BLOSUM40, BLOSUM62 and PAM250. The following normalization procedures were also applied: take only the diagonal of W, set all its negative values to zero, use the exponential function of the original matrix, and normalize by minimum and range. However, in this printed report only the results obtained with BLOSUM50 are presented. The variations described are documented in the online annex.
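The four normalization procedures can be sketched as simple matrix transformations. The text does not give their exact formulas, so two readings are assumptions made here: the exponential is applied element-wise, and "minimum and range" is read as (w − min) / (max − min) over all entries. The 3 × 3 matrix is a toy example, not a full substitution matrix.

```python
import math

def diagonal_only(w):
    """Keep the diagonal of W and zero every off-diagonal weight."""
    size = len(w)
    return [[w[i][j] if i == j else 0.0 for j in range(size)] for i in range(size)]

def clip_negatives(w):
    """Set all negative weights to zero."""
    return [[max(x, 0.0) for x in row] for row in w]

def exponential(w):
    """Element-wise exponential of W (assumed reading of the text)."""
    return [[math.exp(x) for x in row] for row in w]

def min_range(w):
    """Rescale W to [0, 1] as (w - min) / (max - min) (assumed reading)."""
    flat = [x for row in w for x in row]
    lo, hi = min(flat), max(flat)
    return [[(x - lo) / (hi - lo) for x in row] for row in w]

# Toy symmetric weight matrix over three symbols (illustrative values only).
w = [[4.0, -1.0, -2.0],
     [-1.0, 5.0, 0.0],
     [-2.0, 0.0, 6.0]]
for transform in (diagonal_only, clip_negatives, exponential, min_range):
    print(transform.__name__, transform(w)[0])
```

Each transformed matrix can be used as W in Equation (2) without any other change to the distance calculation.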

For each metric, the distances between all protein pairs were subsequently sorted from maximum to minimum similarity, i.e. from the closest to the farthest pair. A perfect metric would completely separate negative from positive relationships, i.e. the maximum similarity would always correspond to the same group, and the binary classification obtained after this distance sorting would be the vector (1, 1, ..., 1, 0, 0, ..., 0). Of course, this does not happen in practice and the classes are interspersed. The ROC curves make it possible to assess the level of accuracy of this separation without choosing any distance threshold for the separation point. In particular, the AUC gives a single number for the relative accuracy of each metric and level according to the SCOP classification scheme. We also tested each of the four classes separately with the same procedure, to evaluate hypothetical differences between the structural classes.
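The sorting step and its link to the AUC can be illustrated with a rank-based sketch: the AUC equals the fraction of (positive, negative) pairs in which the positive pair has the smaller distance, with ties counted as one half. The function names and the five toy distances are hypothetical. The pairwise loop is quadratic and is meant only to show the idea, not to scale to all protein pairs of a dataset.

```python
def sorted_labels(distances, labels):
    """Binary labels ordered from the closest to the farthest pair."""
    order = sorted(range(len(distances)), key=lambda k: distances[k])
    return [labels[k] for k in order]

def auc_from_distances(distances, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [d for d, y in zip(distances, labels) if y == 1]
    neg = [d for d, y in zip(distances, labels) if y == 0]
    wins = sum(1.0 if p < q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy distances for five protein pairs (hypothetical values).
distances = [0.4, 0.1, 0.5, 0.3, 0.2]
labels = [1, 1, 0, 0, 1]
print(sorted_labels(distances, labels))       # [1, 1, 0, 1, 0]
print(auc_from_distances(distances, labels))  # 5 of 6 pairs ordered correctly
```

A perfectly separating metric would give the sorted vector (1, 1, 1, 0, 0) and an AUC of 1 in this toy case.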

Computation
All the algorithms were implemented in the MATLAB language (version 6, release 13). The code is available upon request from the authors.


Table 1. Protein datasets used in this study

Datasets          All-α                 All-β                 α/β                   α+β                   Total
                  do    fa   sf   cf    do    fa   sf   cf    do    fa   sf   cf    do    fa   sf   cf
PDB40-B (1.61)    867   409  257  151   1051  362  213  111   1237  467  190  117   1065  487  307  212   4220
PDB40-v (1.61)    285   35   28   27    517   43   30   24    542   58   40   31    339   39   37   33    1683
PDB40-b (1.35)    220   128  97   73    309   150  115  54    285   154  98   66    240   147  115  80    1054

For each protein set: number of sequences or domains (do), families (fa), superfamilies (sf) and folds (cf) in each class. PDB40-B: sequences that share <40% identity to each other, current release (1.61) of SCOP/ASTRAL (not tested). PDB40-v: set derived from PDB40-B (1.61) by excluding sequences with unknown amino acids and families with <5 domains. PDB40-b: sequence dataset used by Luo et al. (2002), corresponds to a previous release (1.35) of the same database.

RESULTS AND DISCUSSION
In the following sections we present some of the results obtained. For extensive and additional results regarding all metrics and datasets, see also the web page http://bioinformatics.musc.edu/wmetric, where the complete graphs and tables are shown (data not shown due to space limitations).

Complete dataset

ROC curves and AUC values. The ROC curves obtained for the complete dataset (Table 1) are presented in Figures 2 (PDB40-v) and 3 (PDB40-b). As overviewed in the Systems and Methods section, a random classifier would have identical values of sensitivity and (1 − specificity) for any threshold value considered (dashed diagonal).

Figures 4 and 5 provide graphs with the areas under the ROC curves (AUC) obtained for both datasets and each SCOP level. The AUC values are typically used as a measure of overall discrimination accuracy.

As would be expected, Figures 4 and 5 show that the AUC decreases from family to class level for both datasets. The sequence similarity between proteins sharing the same family is still well recognized. Consequently, all the distances achieve their best discrimination accuracy at this level. At class level, classification relationships reflect similar structures, which can have completely different sequences and amino acid compositions. This underlies the observation that sequence similarity is lost, regardless of the metric, from family to class. The comparative discriminant value of the different metrics (Figs 4 and 5) shows two clear trends. First, at family level, alignment has a clear advantage, with AUC values of 0.86 and 0.81 (PDB40-b and PDB40-v sets), whereas all word-statistics metrics perform at or under 0.75 and 0.68, respectively. The most discriminant word-statistics metric at family level is the novel Wm introduced by this report (see Systems and Methods section), reflecting the value of weighting the quadratic form [Equation (2)] by evolutionary rather than statistical criteria. At the superfamily level the advantage of alignment remains, but statistical weighting performs just as well as the Wm. Interestingly, the unweighted Euclidean metric, the covariance-weighted Mahalanobis and the information-based Kullback–Leibler lag behind. The main surprise of this analysis is observed at the next level, the fold, where the standard Euclidean metric performs as well as alignment scores in both versions of SCOP, especially in the low specificity/high sensitivity range (which corresponds to many FP relationships). In fact, standard Euclidean is clearly more discriminant than SW for (1 − specificity) values around 0.75. Finally, at the class level, the absence of conserved segments in fact turns alignment into a computationally expensive procedure to score amino acid composition differences. At this point, most alignment-free metrics outperform it. Inspection of the ROC curves themselves (Figs 2 and 3) further documents this comparison between metrics. The results obtained are slightly less discriminant for the more recent version of the protein dataset (PDB40-v) for all the levels except class, where higher values of AUC are obtained. However, there are no significant changes in their relative ordering. It is noteworthy that there is also a dependency between levels as regards classification accuracy. Hits at a lower level may be argued to bias for more populated groupings at upper levels. However, it should be noted that this study is of an exploratory rather than discriminant nature, which places any pairwise comparison, regardless of the SCOP classification level, on an equal standing.

Variations in the Wm definition. The Wm AUC values in the previous graphics were obtained using the scoring matrix BLOSUM50. The results using BLOSUM40, BLOSUM62 and PAM250 are virtually the same and are omitted. Nevertheless, those results were compiled and are made available at the support web page (see Availability). It is interesting to note that, although they define a different score for each domain pair, the different matrices W produce the same score ordering. Similarly, none of the normalization procedures led to improved discrimination; they produced worse classification results, which are still made available on the same web page.


[Figure 2: four ROC panels, one per SCOP level (fa, sf, cf, cl). x-axis: 1 − specificity (1-spe), y-axis: sensitivity (sen), both from 0 to 1. Curves: SW, Wm, se, co, ku, eu, ma.]

Fig. 2. ROC curves for the PDB40-v dataset. Sensitivity (sen) versus 1 − specificity (spe). SCOP levels: family (fa), superfamily (sf), fold (cf) and class (cl). Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). A random classifier would generate equal proportions of FP and TP classifications, which corresponds to the ROC diagonal (dashed line). Correspondingly, the better classification schemes have plots with higher values of sensitivity for equal values of specificity, resulting in higher values for the areas under the curve (AUC, see Text). SW is the best at family and superfamily levels. Wm and se outperform other alignment-free metrics. Standard Euclidean is the best at fold level for high sensitivity/low specificity values. For class level, all metrics have similar results, slightly above random guessing.

Higher order tuples. We also tested higher order word composition metrics, calculating 2- and 3-tuple distances between the domains for eu, se, ku and co. Somewhat intriguingly, discrimination worsened for all levels of classification (see web page). However, it should be noted that the high dimension of the frequency vectors in these cases (400 and 8000, respectively) and the relatively short length of the sequences themselves (mean values around 175 amino acids) caused the frequency vector f to be very sparse. An additional problem arising from this increased dimensionality is the need to increase the sample size in order to maintain accuracy, which goes along with the 'curse of dimensionality' (Donoho, 2000). Consequently, only the results obtained for one-tuples are presented in this report. The weighting proposed, as observed before for the one-tuple scenario, might not be the best for the recognition of the relationships. One idea worth exploring would be to extract some effective higher order tuples by adequate selection of the weights, thus optimizing the classification accuracy and, hopefully, avoiding the dimensionality problem. However, this would lead to discriminatory and optimization procedures that are outside the scope of this exploratory study.
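The sparsity argument can be illustrated with a minimal Python sketch (not the authors' MATLAB code). It assumes overlapping words, a 20-letter alphabet in an arbitrary order, and a randomly generated sequence of 175 residues standing in for a typical domain; the function name `tuple_frequencies` is hypothetical.

```python
import random
from itertools import product

ALPHABET = "ARNDCQEGHILKMFPSTWYV"  # 20 standard amino acids (order is an assumption)

def tuple_frequencies(seq, L, alphabet=ALPHABET):
    """Frequency vector of overlapping L-tuples; its length is len(alphabet)**L."""
    words = ["".join(p) for p in product(alphabet, repeat=L)]
    index = {w: k for k, w in enumerate(words)}
    counts = [0] * len(words)
    n_words = len(seq) - L + 1          # number of overlapping words in seq
    for i in range(n_words):
        counts[index[seq[i:i + L]]] += 1
    return [c / n_words for c in counts]

# A random 175-residue sequence stands in for an average-length domain.
rng = random.Random(0)
seq = "".join(rng.choices(ALPHABET, k=175))

for L in (1, 2, 3):
    f = tuple_frequencies(seq, L)
    nonzero = sum(1 for v in f if v > 0)
    # At most len(seq) - L + 1 entries can be nonzero, whatever the sequence.
    print(L, len(f), nonzero)
```

For L = 3 a 175-residue sequence contains only 173 overlapping words, so at most 173 of the 8000 entries (about 2%) can be nonzero, whatever the sequence.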


[Figure 3: four ROC panels, one per SCOP level (fa, sf, cf, cl); x-axis: 1-spe (0 to 1); y-axis: sen (0 to 1); curves: SW, Wm, se, co, ku, eu, ma]

Fig. 3. ROC curves for PDB40-b dataset. Sensitivity (sen) versus 1 − specificity (spe). SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). The classification accuracies for this dataset are slightly better than for the PDB40-v dataset (Fig. 2). The qualitative relation between the metrics is maintained.

Computational performance. It is noteworthy that the SW algorithm is computationally intensive. Its running times can be 1000-fold longer than those of the other metrics compared here. For example, for the PDB40-v dataset SW took approximately 80 h and Wm just 5 min, using a 700 MHz Pentium III with 1 GB total memory. The other word composition metrics themselves vary in computational efficiency (Vinga and Almeida, 2003).

Stratified analysis by class

AUC values. In order to compare the metrics, we also conducted additional studies for each of the four classes (all-α, all-β, α/β and α+β) separately. The AUC values are represented in Figure 6 for SW alignment scores and se distance, the two metrics that emerged as the most discriminant in the previous analysis (Figs 2–5) (see web page for a similar analysis of the other metrics).

It is easier to recognize family relationships by alignment (Fig. 6, black symbols) for proteins belonging to class all-α, where values are above the overall accuracy (AUC values ranging from 0.70 to 0.87), and for the α+β class (AUC from 0.70 to 0.91). The class where these relationships seem most difficult to detect was all-β, where we obtained the lowest AUC values for this level (0.60–0.77). At superfamily level, class α+β enables a surprising accuracy for both metrics (AUC from 0.70 to 0.90), as opposed to class all-β, where the superfamily relationships are still harder to detect by sequence inspection alone (AUC between 0.55 and 0.64).


[Figure 4: AUC (y-axis, 0.5 to 0.9) versus SCOP level (x-axis: fa, sf, cf, cl); curves: SW, Wm, se, co, ku, eu, ma]

Fig. 4. AUC values for PDB40-v dataset for each hierarchical level. SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). Areas under ROC curves of Figure 2. Higher AUC values correspond to better classification schemes. All the distances achieve their best discrimination accuracy at family level. This figure illustrates the loss of discrimination as the target of classification moves up in the SCOP level, from family to class.

[Figure 5: AUC (y-axis, 0.5 to 0.9) versus SCOP level (x-axis: fa, sf, cf, cl); curves: SW, Wm, se, co, ku, eu, ma]

Fig. 5. AUC values for PDB40-b dataset for each hierarchical level. SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). Areas under ROC curves of Figure 3. The results are slightly more discriminant for this dataset than for PDB40-v (Fig. 4), but with no significant changes in the metrics' relative ordering.

At fold level, the all-α class retains the higher AUC values for both metrics (0.69–0.81). The graph obtained for PDB40-b is qualitatively the same (see web page), with one difference: the AUC values for fold level are much lower for the all-α and α+β classes for both metrics.

PDB40 version datasets comparison. There is a significant improvement of discrimination accuracy for the α+β class in the PDB40-v dataset. The difference in AUC values is consistently positive for different metrics and levels, reaching a value as high as 0.21 at fold level with the SW alignment scores. It seems that the trimming procedure applied when obtaining the PDB40-v set (see Systems and Methods) particularly affected the all-α and α+β classes. These quantitative differences between the two datasets are noteworthy.

[Figure 6: AUC (y-axis, 0.55 to 0.95) versus SCOP level (x-axis: fa, sf, cf, cl); curves: total, all-α, all-β, α/β, α+β]

Fig. 6. Stratified analysis by class in PDB40-v dataset. AUC values for SW algorithm (black) and se distance (gray) for each class: total set, all-α, all-β, α/β and α+β. Smith–Waterman is generally a better classification scheme (higher AUC values). At family level, the best results are for proteins belonging to classes all-α and α+β; the lowest AUC values were obtained for class all-β. At superfamily level, class α+β enables a surprising accuracy for both metrics, as opposed to class all-β, which has the worst results. At fold level, the all-α class retains the higher AUC values for both metrics.

The α-helix and β-sheet content. Judging from published reports, protein class classification is controversial. Some studies base class classification on the percentages of α-helix and β-sheet content of each chain. In a recent report, a schematic table was presented with different definitions (Eisenhaber et al., 1996). As noted in that study, there are some regions of the space defined by those percentages that are not clearly classifiable. It is in this context of uncertainty that SCOP offers a classification that is a global measure and takes into account all the structural information of all chains in a protein.

In order to assess the correct assignment to classes and avoid arbitrary classification, we extracted the α and β content for each SCOP domain tested from the PDB web page (http://www.rcsb.org/pdb). In Figure 7 we present the α and β percentages for each domain, grouped by the corresponding SCOP class classification, obtained for the PDB40-b dataset.

From Figure 7 it is apparent that some domains have arguable classifications. For example, the protein with PDB identification 1HYM, Trypsin inhibitor V [species: pumpkin (Cucurbita maxima)], has two chains that correspond to two SCOP domains. Domain 1HYMA has 24.44% of α-helix and 0% of β-sheet (labelled * symbol close to the x-axis in Fig. 7) and domain 1HYMB has 0% of α-helix and 33.33% of β-sheet (labelled * symbol close to the y-axis in Fig. 7). Nevertheless, the whole protein was classified in the α+β class, in spite of the fact that each of its chains, taken individually, would be classified in other classes. The SCOP classification is global in the sense that it looks at the whole protein rather than at a particular domain; therefore, classifying the chains of 1HYM as α+β is formally correct. Interestingly, a multivariate analysis of variance (MANOVA) of the amino acid composition in the four classes leads to similar results (see web page), showing that class α+β is clearly intermixed with the others in terms of α and β content.

[Figure 7: scatter plot of β-sheet content (y-axis, 0 to 70%) versus α-helix content (x-axis, 0 to 100%); one symbol type per SCOP class; domains 1HYMA and 1HYMB are labelled]

Fig. 7. The α-helix and β-sheet content (%) for each domain in PDB40-b dataset, grouped by SCOP class. The classes are interspersed. Protein 1HYM, Trypsin inhibitor V [species: pumpkin (Cucurbita maxima)], is globally classified in the α+β class, but its two chains, 1HYMA and 1HYMB, have contrasting α-helix and β-sheet content.

CONCLUSION

In this report, we quantitatively compared several protein dissimilarity measures based on L-tuple composition with alignment scores obtained with the Smith–Waterman algorithm. A new metric, the W-metric, which combines both approaches by including word-statistics information weighted by scoring matrices, is described.

The accuracy of each metric in detecting protein relationships was assessed through the four hierarchical levels of the SCOP/ASTRAL database. The comparative protocol employed the AUCs, which are a good measure of the overall accuracy of a classification scheme.

The SW alignment score was shown to be the most discriminant at family and superfamily levels. At family level, the Wm is clearly more discriminant than the other L-tuple distances for sensitivity values between 0.5 and 0.8. From superfamily to class levels, all metrics lose discriminant power and converge to similar AUC values, which makes it counterproductive to use computationally intensive alignment algorithms to detect those relationships. At fold level, standard Euclidean distance outperforms most of the metrics, achieving an unexpected accuracy in the high sensitivity/low specificity range. This important result anticipates its use in providing a conservative pre-screening procedure for this problem category. In fact, since L-tuple methods are computationally much lighter, they can be useful to pre-select similar proteins before applying the alignment algorithms, thus combining the powerful aspects of each technique and greatly improving heuristic methods in sequence similarity searches.

The graph showing α-helix and β-sheet content for each domain shows that class classification cannot be inferred directly from that information, at least for mixed classes. Therefore, it might be advantageous in some applications to reconsider the protein class classification of each domain by exploring the distribution of sequence distances with unsupervised learning algorithms.

ACKNOWLEDGEMENTS

The authors thank John Schwacke of the Medical University of South Carolina for providing streamlined MATLAB code for Smith–Waterman alignment, and Steven Brenner of the University of California at Berkeley for precious advice on the use of the PDB40-B set. The authors thankfully acknowledge the financial support by grants SFRH/BD/3134/2000 to S.V. and SAPIENS/34794/99 from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese Ministério da Ciência e do Ensino Superior. R.G.-O. thankfully acknowledges grant QLK2-CT-2000-01020 (EURIS) from the European Commission. This work was also supported in part by the NHLBI Proteomics Initiative through contract N01-HV-28181 and a Cancer Center grant from the Department of Energy (C.E. Reed, PI).

REFERENCES

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic Local Alignment Search Tool. J. Mol. Biol., 215, 403–410.

Baldi,P., Brunak,S., Chauvin,Y., Andersen,C.A. and Nielsen,H. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–424.

Bradley,A.P. (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog., 30, 1145–1159.

Brenner,S.E., Chothia,C. and Hubbard,T.J. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.

Brenner,S.E., Koehl,P. and Levitt,M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res., 28, 254–256.


Chandonia,J.M., Walker,N.S., Lo Conte,L., Koehl,P., Levitt,M. and Brenner,S.E. (2002) ASTRAL compendium enhancements. Nucleic Acids Res., 30, 260–263.

Chothia,C., Hubbard,T., Brenner,S., Barns,H. and Murzin,A. (1997) Protein folds in the all-beta and all-alpha classes. Annu. Rev. Biophys. Biomol. Struct., 26, 597–627.

Dayhoff,M.O., Schwartz,R. and Orcutt,B. (1978) A model of evolutionary change in proteins. In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, Vol. 5 (Suppl. 3), pp. 345–352.

Donoho,D.L. (2000) Aide-Memoire. High-dimensional data analysis: the curses and blessings of dimensionality. Department of Statistics, Stanford University.

Dubchak,I., Muchnik,I., Mayor,C., Dralyuk,I. and Kim,S.H. (1999) Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins, 35, 401–407.

Egan,J.P. (1975) Signal Detection Theory and ROC-Analysis. Academic Press, New York.

Eisenhaber,F., Frommel,C. and Argos,P. (1996) Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins, 25, 169–179.

Ewens,W.J. and Grant,G.R. (2001) Statistical Methods in Bioinformatics: An Introduction. Springer, New York.

Green,R.E. and Brenner,S.E. (2002) Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison. Proc. IEEE, 90, 1834–1847.

Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.

Karwath,A. and King,R.D. (2002) Homology induction: the use of machine learning to improve sequence similarity searches. BMC Bioinformatics, 3, 11.

Lindahl,E. and Elofsson,A. (2000) Identification of related proteins on family, superfamily and fold level. J. Mol. Biol., 295, 613–625.

Lo Conte,L., Brenner,S.E., Hubbard,T.J., Chothia,C. and Murzin,A.G. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.

Luo,R.Y., Feng,Z.P. and Liu,J.K. (2002) Prediction of protein structural class by amino acid and polypeptide composition. Eur. J. Biochem., 269, 4219–4225.

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol., 273, 349–354.

Pearson,W.R. (1991) Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith–Waterman and FASTA algorithms. Genomics, 11, 635–650.

Pearson,W.R. (1995) Comparison of methods for searching protein sequence databases. Protein Sci., 4, 1145–1160.

Pearson,W.R. (2000) Protein sequence comparison and Protein evolution. Tutorial, ISMB2000.

Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

Reinert,G., Schbath,S. and Waterman,M.S. (2000) Probabilistic and statistical properties of words: an overview. J. Comput. Biol., 7, 1–46.

Schott,J.R. (1997) Matrix Analysis for Statistics. John Wiley, New York.

Vinga,S. and Almeida,J.S. (2003) Alignment-free sequence comparison - a review. Bioinformatics, 19, 513–523.

Webb,B.-J.M., Liu,J.S. and Lawrence,C.E. (2002) BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res., 30, 1268–1277.


constitutes the novelty aspect of this metric. The weights correspond to the estimation of log-likelihood ratios between probabilities of symbols that best describe mutation rates in known homologous proteins, thus providing evolutionary information.

The usefulness of the L-tuple composition approach is associated with its light computational load, which makes it very useful for pre-filtering relevant sequences before using alignment algorithms to refine the searches. This type of heuristic approach is already used in programs like BLAST (Altschul et al., 1990) and FASTA (Pearson and Lipman, 1988). Although the solution may not be optimal, it drastically shortens processing time, to the point that the method can be used to query large databases. However, a comparative study of the effectiveness of alignment-free sequence dissimilarity measures is, to the authors' best knowledge, absent from the literature. Consequently, it is difficult to decide at what similarity level alignment methods are required. Such a comparative study of how these different metrics perform is reported here. This is the main motivation for the present work, where alignment-free, linear algebra type methods are comparatively assessed. Some previous studies have reported comparative assessments of various methods (Brenner et al., 1998; Lindahl and Elofsson, 2000; Pearson, 1991, 1995), but not consistently for the same reference dataset. These studies showed, however, the importance of following an extensive protocol involving as many examples as possible in the assessment of any classification procedure. Only then is it possible to improve some heuristics commonly applied in sequence similarity searches and identify the best algorithmic choice for each problem category.

We compared L-tuple metrics with the Smith–Waterman (SW) algorithm by Receiver Operating Characteristic (ROC) curves, applying the algorithms to a subset of the Structural Classification Of Proteins (SCOP)/ASTRAL database. This database constitutes the reference gold standard for protein secondary structure classification, which makes it a commonly used benchmark for protein structure prediction algorithms, a crucial field in Computational Biology applications. In addition, it has a hierarchical organization that can be browsed to assess classification accuracy for each of its levels.

SYSTEMS AND METHODS

In the section below, the W-metric, a novel word-statistic distance between protein sequences, is presented, as well as additional background on alignment-free algorithms. In the subsequent sections, the reference protein datasets and the methods used to compare the distance measures are described. Finally, the last two sections describe the algorithms and protocol used and their implementation.

Word statistics

There is a large body of literature on Word Statistics (Reinert et al., 2000), where sequences are interpreted as a succession of symbols and are further analysed by first representing the frequencies of their small segments (L-tuples or words). This approach does not take into account any of the physico-chemical or structural properties of the amino acids or nucleotides. There is also an increasing number of studies focusing on distance definition in the frequency space of L-tuples. These definitions are a fundamental step for the subsequent application of exploratory analysis methods, such as cluster analysis and dimensionality reduction techniques. In a recent review (Vinga and Almeida, 2003), the authors overviewed these metrics and their application to biological sequences, both DNA and proteins. That review will be used as the main reference for the description of the L-tuple distances and alignment-free algorithms tested here. A protein X of length n is a sequence of symbols from the alphabet of all possible amino acids, $X = s_1 \ldots s_n$, $s_i \in A = \{A, R, N, D, \ldots, V\}$. The mapping of X into the Euclidean space can be defined by representing X by its amino acid composition in counts $c^X$ and frequencies $f^X$ [Equation (1)].

$$c^X = \left(c^X_{A},\; c^X_{R},\; c^X_{N},\; c^X_{D},\; \ldots,\; c^X_{V}\right), \qquad f^X = \frac{c^X}{n} \tag{1}$$

For example, the peptide X = AARNNDAA is mapped on to the vectors $c^X = (4, 1, 2, 1, 0, \ldots, 0)$ and $f^X = (0.5, 0.125, 0.25, 0.125, 0, \ldots, 0)$. Instead of single amino acid frequencies, longer fragments of length L could be considered (L-tuples), resulting in a vector of frequencies of length $20^L$. One can further define a distance or dissimilarity measure between two proteins X and Y, d(X, Y), based on their corresponding vectors $f^X$ and $f^Y$.
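The mapping in Equation (1) can be sketched in a few lines of Python (an illustration, not the authors' MATLAB implementation). The alphabet ordering beyond A, R, N, D and the function name `composition` are assumptions made for this example.

```python
from collections import Counter

# 20 standard amino acids; only the leading A, R, N, D and trailing V follow
# the text, the rest of the ordering is an assumption for illustration.
ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def composition(seq, alphabet=ALPHABET):
    """Return (counts c, frequencies f = c / n) of single residues in seq."""
    tally = Counter(seq)
    counts = [tally.get(a, 0) for a in alphabet]
    n = len(seq)
    freqs = [c / n for c in counts]
    return counts, freqs

counts, freqs = composition("AARNNDAA")
print(counts[:6])  # [4, 1, 2, 1, 0, 0]
print(freqs[:6])   # [0.5, 0.125, 0.25, 0.125, 0.0, 0.0]
```

The printed prefixes reproduce the worked peptide example: four A, one R, two N and one D in a sequence of length 8.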

W-metric definition

The novel Wm, hereby proposed to complement existing word composition methods, is based on the quadratic form defined in Equation (2). The distance between two proteins X and Y, $d^W(X, Y)$, is defined by their corresponding one-tuple frequencies $f^X$ and $f^Y$, weighted by the matrix W described below.

$$d^W(X, Y) = (f^X - f^Y)^T \cdot W \cdot (f^X - f^Y) = \sum_{i \in A} \sum_{j \in A} (f^X_i - f^Y_i) \cdot (f^X_j - f^Y_j) \cdot w_{ij} \tag{2}$$

These quadratic forms play an important role in major theoretical and applied disciplines and scientific fields, from Linear Algebra to Econometrics. In Statistics they are used, e.g. in parameter estimation and statistical tests (Schott, 1997). They represent a scoring between conveniently weighted vectors of observations. It is noteworthy that other L-tuple distances are also based on quadratic forms [Equation (2)]: e.g. when W is the inverse of the covariance matrix of the data, it represents the Mahalanobis (ma) distance between the corresponding vectors, and the standard Euclidean (se) distance is obtained when only the diagonal of the covariance matrix is taken (so that W holds the inverse variances). The distance reduces to the squared Euclidean distance when W is the identity matrix.
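As an illustration of the quadratic form in Equation (2), the following NumPy sketch (not the authors' MATLAB code) evaluates it on a toy three-letter alphabet. The frequency vectors, the weight matrix and the function name `w_metric` are hypothetical values chosen only to show the mechanics.

```python
import numpy as np

def w_metric(fx, fy, W):
    """Quadratic form (fx - fy)^T W (fx - fy) of Equation (2)."""
    d = np.asarray(fx, dtype=float) - np.asarray(fy, dtype=float)
    return float(d @ np.asarray(W, dtype=float) @ d)

# Toy frequency vectors over a hypothetical 3-letter alphabet.
fx = [0.5, 0.3, 0.2]
fy = [0.2, 0.3, 0.5]

# Identity weights: the form reduces to the squared Euclidean distance.
d_identity = w_metric(fx, fy, np.eye(3))

# Hypothetical symmetric weight matrix (NOT a real substitution matrix).
W = np.array([[4.0, -1.0, 0.0],
              [-1.0, 5.0, -2.0],
              [0.0, -2.0, 6.0]])
d_weighted = w_metric(fx, fy, W)

# Same value from the explicit double sum over i and j.
d_double_sum = sum((fx[i] - fy[i]) * (fx[j] - fy[j]) * W[i, j]
                   for i in range(3) for j in range(3))
print(d_identity, d_weighted, d_double_sum)
```

Passing a diagonal matrix of inverse variances, or a full inverse covariance matrix, as `W` gives the squared standard Euclidean and Mahalanobis forms, respectively.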

The weight matrices W chosen in Equation (2) can be rationalized as being scoring or amino acid substitution matrices, instead of covariance-based weights as in other distances. These matrices, such as Point Accepted Mutation (PAM) (Dayhoff et al., 1978) and BLOcks SUbstitution Matrices (BLOSUM) (Henikoff and Henikoff, 1992), are used in alignment-based methods and estimate the log-likelihood ratios between probabilities of symbols that best describe mutation rates in known homologous proteins. In particular, the BLOSUMX matrix is estimated from ungapped aligned blocks of proteins sharing less than X% identity. PAMn matrices account for evolutionary change in protein sequences; their estimation is based on the construction of phylogenetic trees, which are subsequently used to create a Markov Chain n-step transition matrix. This matrix is further transformed and normalized for conditional probabilities. For an extensive description of these substitution matrices and some estimation examples, see Ewens and Grant (2001, section 6.5).

The key idea of Wm is to weight amino acid composition differences between two sequences, $f^X_i - f^Y_i$, according to the relative conservation of each amino acid in proteins known to be homologous. The overall distance between two proteins will be the sum of these weighted factors. For example, if an amino acid is highly conserved in known homologous sequences (high $w_{ii}$), two proteins with a very different frequency of this amino acid should be less similar than if the amino acids are 'closer' to each other in that statistical sense. If the opposite occurs, i.e. if an amino acid is known to have high mutational rates (low $w_{ii}$), the differences between its compositions in the two sequences being compared should be attenuated in the overall distance calculation. The same scheme applies to off-diagonal elements $w_{ij}$ ($i \neq j$): if there is a high mutation rate between these two amino acids, $w_{ij}$ is higher than the corresponding weight of two very different amino acids, so this component should be weighted more. The main idea is thus to weight amino acid differences according to their similarity, given by known evolutionary information. The weighted metric hence includes both amino acid composition information, like other alignment-free techniques, and conserved homology information, as used to score the conventional alignment algorithms.

Some variations of this metric were also tested, namely using several normalization procedures. The low computational load associated with the calculation expressed in Equation (2) is appealing. It is not proven here, however, that the W matrix associated with mutation information is the best at discriminating classification levels. This could be further pursued by using Artificial Neural Networks (ANN) or other algorithms to optimize classification accuracy by finding a 'better' W weighting matrix.

ROC curve definition

The methods used here to assess and compare the accuracy of classification schemes and prediction algorithms are based on the analysis of ROC curves. This method goes back to signal detection and classification problems and is now widely applied in medical diagnosis studies and psychometric analysis (Egan, 1975). This approach is employed in binary classification of continuous data, usually categorized as positive (1) or negative (0) cases. The classification accuracy can be measured by plotting, for different threshold values, the number of true positives (TP), also named sensitivity or coverage, versus false positives (FP), or (1 − specificity), encountered for each threshold, properly normalized [Equation (3)].

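$$\begin{aligned}
\text{sensitivity} &= \frac{\text{True Positives}}{\text{Positives}} = \frac{TP}{TP + FN} \\
\text{specificity} &= \frac{\text{True Negatives}}{\text{Negatives}} = \frac{TN}{TN + FP} \\
1 - \text{specificity} &= \frac{FP}{TN + FP}
\end{aligned} \tag{3}$$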

A ROC curve is simply the plot of sensitivity versus (1 − specificity) for different threshold values. The area under a ROC curve (AUC) is a widely employed parameter to quantify the quality of a classifier because it is a threshold-independent performance measure and is closely related to the Wilcoxon rank-sum (Mann–Whitney) statistic (Bradley, 1997). For a perfect classifier the AUC is 1, and for a random classifier the AUC is 0.5. For additional results and a comprehensive discussion of the AUC measure, see Bradley (1997). Baldi et al. (2000), Brenner et al. (1998) and Green and Brenner (2002) describe other possible classification accuracy measures not employed in this study.
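A minimal NumPy sketch of this construction follows (an illustration, not the authors' implementation). It assumes that larger scores mean greater similarity, that both positive and negative pairs are present, and that each pair acts as its own threshold, so ties are not treated specially. The function names and the toy scores are hypothetical.

```python
import numpy as np

def roc_curve(similarity, labels):
    """Sweep the threshold from the most to the least similar pair.

    Returns (1 - specificity, sensitivity) as arrays starting at (0, 0).
    """
    order = np.argsort(-np.asarray(similarity, dtype=float), kind="stable")
    y = np.asarray(labels, dtype=int)[order]
    tp = np.concatenate(([0], np.cumsum(y)))       # true positives so far
    fp = np.concatenate(([0], np.cumsum(1 - y)))   # false positives so far
    sensitivity = tp / tp[-1]                      # TP / (TP + FN)
    one_minus_spec = fp / fp[-1]                   # FP / (TN + FP)
    return one_minus_spec, sensitivity

def auc(x, y):
    """Area under the curve by the trapezoidal rule."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

# Toy example: six pairs, three related (1) and three unrelated (0).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
x, y = roc_curve(scores, labels)
print(auc(x, y))  # 8/9: eight of the nine positive/negative pairs are ranked correctly
```

For a distance rather than a similarity, the same functions can be used after negating the values, so that the closest pairs are visited first.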

Protein test datasets: SCOP/ASTRAL classification

The sequences used to perform the tests and compare different metrics are proteins from the SCOP database (Lo Conte et al., 2002; Murzin et al., 1995). This database consists of Protein Data Bank (PDB) entries and provides a detailed and reliable description of protein structure relationships and homology. The three-dimensional (3D) structure analysis allows the detection of more remote homologues, since structure is typically more conserved than sequence. The fundamental unit of classification is the protein domain, which is the basic element of protein structure and evolution. The ASTRAL compendium provides additional tools and datasets (Brenner et al., 2000; Chandonia et al., 2002), namely the possibility to filter sequence sets so that any two different proteins have less than a chosen percentage identity to each other. This classification is a hierarchical description of proteins (Fig. 1). The first two levels, family (fa) and superfamily (sf), describe evolutionary relationships; the third one, fold (cf), describes geometrical relationships or major structural similarity; and the fourth one represents protein structural class (cl). This allows the study of each classifier for different levels of similarity.

[Figure 1: hierarchy diagram. SCOP/ASTRAL db root → Class (cl): α, β, α/β, α+β → Fold (cf): Immunoglobulin-like beta-sandwich → Superfamily (sf): Immunoglobulin → Family (fa): I set domains → Domain: Fibroblast growth factor receptor FGFR2 → Ref. PDB: 1ev2, 1djs, 1e0o]

Fig. 1. SCOP/ASTRAL db: hierarchical classification of proteins. Example of Fibroblast growth factor receptor (FGFR2) classification in each of the four levels.

Two different datasets were tested in order to assess the accuracy of each metric. The basic protein set, PDB40-B, was extracted directly from the ASTRAL web site and corresponds to SCOP database release 1.61 (November 2002). This subset includes all the sequences that share <40% identity to each other and has become a benchmark test set in the evaluation of methods to detect remote protein homologies (Brenner et al., 1998; Dubchak et al., 1999; Karwath and King, 2002; Lindahl and Elofsson, 2000; Luo et al., 2002; Park et al., 1997; Webb et al., 2002). This dataset was subsequently trimmed to exclude sequences with unknown amino acids and those belonging to families with <5 elements, thus obtaining the protein group named PDB40-v (Table 1). For example, there are 232 families with only one sequence, which is not informative regarding intra-family dissimilarity and makes these domains insufficiently representative of a family. The effect of trimming the dataset was in this way also studied. Only the four major classes were included, namely: all-α class, constituted mainly by proteins with α-helices; all-β class, essentially formed by β-sheet structures; α/β class, proteins with mixtures of α-helices and β-strands; and α+β class, those where α-helices and β-strands are largely segregated. Other SCOP classes include multi-domain proteins, small proteins, theoretical models and other types, and were not included in this study. See Chothia et al. (1997) and the SCOP documentation for a description of protein folds and classification.

This study also considered separately another protein set from an outdated release of the SCOP database (1.35), the PDB40-b, due to the large amount of literature already published with those sequences (Luo et al., 2002 and corresponding references). Table 1 summarizes all the sequence sets examined in this paper.

Protocol for comparative assessment

The comparative test procedure followed in this report was based on a binary classification of each protein pair, where 1 corresponds to the two proteins sharing the same group in the SCOP database and 0 otherwise. The group can be defined at one of the four different levels of the database, family (fa), superfamily (sf), fold (cf) or class (cl), exploring the hierarchical organization of the proteins in that structure. Therefore, each protein pair is associated with four binary classifications, one for each level.
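The four binary labels per pair can be sketched as follows (an illustration only). The sketch assumes each domain carries a SCOP-style dotted identifier of the form class.fold.superfamily.family; the domain names `d1` to `d3`, their identifiers and the function name `pair_labels` are hypothetical.

```python
from itertools import combinations

LEVELS = ("cl", "cf", "sf", "fa")  # class, fold, superfamily, family

def pair_labels(id_a, id_b):
    """Return {level: 1 or 0}: 1 if both domains share the group at that level.

    Groups are nested, so two domains share a fold only if they also share
    the class, and so on down to the family.
    """
    a, b = id_a.split("."), id_b.split(".")
    return {level: int(a[:depth] == b[:depth])
            for depth, level in enumerate(LEVELS, start=1)}

# Hypothetical domains with dotted class.fold.superfamily.family identifiers.
domains = {"d1": "b.1.1.4", "d2": "b.1.1.1", "d3": "a.1.1.2"}

for (name_a, id_a), (name_b, id_b) in combinations(domains.items(), 2):
    print(name_a, name_b, pair_labels(id_a, id_b))
```

Comparing identifier prefixes rather than single fields keeps the labels consistent with the hierarchy: a pair labelled 1 at family level is automatically labelled 1 at the three levels above it.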

In order to compute the ROC curves, we calculated the distances between all possible protein pairs according to the different metrics, which are briefly described below.

The alignment-based similarity measure tested was the Smith–Waterman raw score, with no correction for statistical significance, using the score matrix BLOSUM50 and a linear gap penalty scheme with a gap penalty of 8. The distances based on L-tuple composition evaluated were W-metric (Wm), Euclidean (eu), standard Euclidean (se), Kullback–Leibler (ku) discrepancy, cosine (co) and Mahalanobis (ma). For the corresponding complete definitions and properties, see Vinga and Almeida (2003). In Wm calculations, some alternative weighting matrices W [Equation (2)] were used; these included the scoring matrices BLOSUM50, BLOSUM40, BLOSUM62 and PAM250. The following normalization procedures were also applied: take only the diagonal of W, set all its negative values to zero, use the exponential function of the original matrix, and normalize by minimum and range. However, in this printed report only the results obtained with BLOSUM50 are presented. The variations described are documented in the online annex.
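One possible reading of the four normalization procedures is sketched below in NumPy. The text does not give their exact definitions, so the details are assumptions: the exponential is taken elementwise (a matrix exponential would be another reading), and 'minimum and range' is read as (W − min) / (max − min). The toy matrix is hypothetical and is not a real substitution matrix.

```python
import numpy as np

def w_variants(W):
    """Four candidate transformations of a weight matrix W (assumed definitions)."""
    W = np.asarray(W, dtype=float)
    return {
        "diagonal": np.diag(np.diag(W)),            # keep only the diagonal of W
        "nonnegative": np.where(W < 0, 0.0, W),     # negative entries set to zero
        "exponential": np.exp(W),                   # elementwise exp (assumption)
        "min_range": (W - W.min()) / (W.max() - W.min()),  # rescale to [0, 1]
    }

# Hypothetical 3x3 score matrix used only to show the transformations.
W = np.array([[5.0, -2.0, -1.0],
              [-2.0, 7.0, 0.0],
              [-1.0, 0.0, 6.0]])

for name, V in w_variants(W).items():
    print(name)
    print(V)
```

The `min_range` variant is undefined for a constant matrix, since the range in the denominator would be zero.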

For each metric, the distances between all protein pairs were subsequently sorted from maximum to minimum similarity, i.e. from the closest to the farthest pair. A perfect metric would completely separate negative from positive relationships, i.e. the maximum similarity would always correspond to the same group, and the binary classification obtained after this distance sorting would be the vector (1, 1, ..., 1, 0, 0, ..., 0). Of course, this does not happen in practice and the classes are interspersed. The ROC curves make it possible to assess the accuracy of this separation without choosing any distance threshold for the separation point. In particular, the AUC gives a single number for the relative accuracy of each metric and level according to the SCOP classification scheme. We also tested each of the four classes separately with the same procedure to evaluate hypothetical differences between the structural classes.

Computation

All the algorithms were implemented in the MATLAB language (version 6, release 13). The code is available upon request from the authors.


Table 1. Protein datasets used in this study

                  All-α                   All-β                   α/β                     α+β
Dataset           do    fa    sf    cf    do    fa    sf    cf    do    fa    sf    cf    do    fa    sf    cf    Total
PDB40-B (1.61)    867   409   257   151   1051  362   213   111   1237  467   190   117   1065  487   307   212   4220
PDB40-v (1.61)    285   35    28    27    517   43    30    24    542   58    40    31    339   39    37    33    1683
PDB40-b (1.35)    220   128   97    73    309   150   115   54    285   154   98    66    240   147   115   80    1054

For each protein set: number of sequences or domains (do), families (fa), superfamilies (sf) and folds (cf) in each class. PDB40-B: sequences that share <40% identity to each other, current release (1.61) of SCOP/ASTRAL (not tested). PDB40-v: set derived from PDB40-B (1.61) by excluding sequences with unknown amino acids and families with <5 domains. PDB40-b: sequence dataset used by Luo et al. (2002), corresponds to a previous release (1.35) of the same database.

RESULTS AND DISCUSSION

In the following sections we present some of the results obtained. For extensive and additional results regarding all metrics and datasets, see also the web page http://bioinformatics.musc.edu/wmetric, where the complete graphs and tables are shown (data not shown due to space limitations).

Complete dataset

ROC curves and AUC values. The ROC curves obtained for the complete dataset (Table 1) are presented in Figures 2 (PDB40-v) and 3 (PDB40-b). As outlined in the Systems and Methods section, a random classifier would have identical values of sensitivity and (1 − specificity) for any threshold value considered (dashed diagonal).

Figures 4 and 5 provide graphs with the areas under the ROC curves (AUC) obtained for both datasets and each SCOP level. The AUC values are typically used as a measure of overall discrimination accuracy.

As would be expected, Figures 4 and 5 show that the AUC decreases from family to class level for both datasets. The sequence similarity between proteins sharing the same family is still well recognized. Consequently, all the distances achieve their best discrimination accuracy at this level. At class level, classification relationships reflect similar structures, which can have completely different sequences and amino acid compositions. This underlies the observation that sequence similarity is lost, regardless of the metric, from family to class. The comparative discriminant value of the different metrics (Figs 4 and 5) shows two clear trends. First, at family level, alignment has a clear advantage, with AUC values of 0.86 and 0.81 (PDB40-b and PDB40-v sets), whereas all word-statistics metrics perform at or under 0.75 and 0.68, respectively. The most discriminant word-statistics metric at family level is the novel Wm introduced by this report (see Systems and Methods section), reflecting the value of weighting the quadratic form [Equation (2)] by evolutionary rather than statistical criteria. At the superfamily level, the advantage of alignment remains, but statistical weighting performs just as well as the Wm. Interestingly, the unweighted Euclidean metric, the covariance-weighted Mahalanobis and the information-based Kullback–Leibler lag behind. The main surprise of this analysis is observed at the next level, the fold, where the standard Euclidean metric performs as well as alignment scores in both versions of SCOP, especially in the low specificity/high sensitivity range (which corresponds to many FP relationships). In fact, standard Euclidean is clearly more discriminant than SW for 1 − specificity values around 0.75. Finally, at the class level, the absence of conserved segments in effect turns alignment into a computationally expensive procedure to score amino acid composition differences. At this point, most alignment-free metrics outperform it. Inspection of the ROC curves themselves (Figs 2 and 3) further documents this comparison between metrics. The results obtained are slightly less discriminant for the more recent version of the protein dataset (PDB40-v) at all levels except class, where higher values of AUC are obtained. However, there are no significant changes in their relative ordering. It is noteworthy that there is also a dependency between levels as regards classification accuracy. Hits at a lower level may be argued to bias towards more populated groupings at upper levels. However, it should be noted that this study is of exploratory rather than discriminant nature, which places any pairwise comparison, regardless of the SCOP classification level, on an equal standing.

Variations in the Wm definition. The Wm AUC values in the previous graphs were obtained using the scoring matrix BLOSUM50. The results using BLOSUM40, BLOSUM62 and PAM250 are virtually the same and are omitted. Nevertheless, those results were compiled and are made available at the support web page (see Availability). It is interesting to note that, although each defines a different score for each domain pair, the different matrices W produce the same score ordering. Similarly, none of the normalization procedures improved discrimination; they produced worse classification results, which are nevertheless available on the same web page.


[Figure 2: four ROC panels, one per SCOP level (fa, sf, cf, cl). x-axis: 1-spe (0 to 1); y-axis: sen (0 to 1); curves: SW, Wm, se, co, ku, eu, ma.]

Fig. 2. ROC curves for the PDB40-v dataset. Sensitivity (sen) versus 1 − specificity (spe). SCOP levels: family (fa), superfamily (sf), fold (cf) and class (cl). Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). A random classifier would generate equal proportions of FP and TP classifications, which corresponds to the ROC diagonal (dashed line). Correspondingly, the better classification schemes have plots with higher values of sensitivity for equal values of specificity, resulting in higher values for the areas under the curve (AUC, see text). SW is the best at family and superfamily levels. Wm and se outperform the other alignment-free metrics. Standard Euclidean is the best at fold level for high sensitivity/low specificity values. At class level, all metrics have similar results, slightly above random guessing.

Higher order tuples. We also tested higher order word composition metrics, calculating 2- and 3-tuple distances between the domains for eu, se, ku and co. Somewhat intriguingly, discrimination worsened for all levels of classification (see web page). However, it should be noted that the high dimension of the frequency vectors in these cases (400 and 8000, respectively) and the relatively short length of the sequences themselves (mean values around 175 amino acids) caused the frequency vector f to be very sparse. An additional problem arising from this increased dimensionality of the data is the need to increase the sample size in order to maintain accuracy, which goes along with the 'curse of dimensionality' (Donoho, 2000). Consequently, only the results obtained for one-tuples are presented in this report. The weighting proposed, as observed before for the one-tuple scenario, might not be the best for the recognition of the relationships. One idea worth exploring would be to extract some effective higher order tuples by adequate selection of the weights, thus optimizing the classification accuracy and, hopefully, avoiding the dimensionality problem. However, this would lead to discriminatory and optimization procedures which are out of the scope of this exploratory study.
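The sparsity argument can be made concrete with a small sketch. It assumes overlapping L-tuples over the 20 standard amino acids; the function names are hypothetical. A sequence of length n contains at most n − L + 1 distinct words, so for n = 175 at most 174 of the 400 two-tuple entries and 173 of the 8000 three-tuple entries can be non-zero.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tuple_frequencies(seq, L):
    """Relative frequencies of overlapping L-tuples over all 20**L words.

    Assumes seq contains only the 20 standard amino acids (a KeyError is
    raised otherwise) and that len(seq) >= L.
    """
    words = ["".join(p) for p in product(AMINO_ACIDS, repeat=L)]
    counts = dict.fromkeys(words, 0)
    total = len(seq) - L + 1
    for k in range(total):
        counts[seq[k:k + L]] += 1
    return [counts[w] / total for w in words]


def max_nonzero_fraction(seq_len, L):
    """Upper bound on the fraction of non-zero entries in the frequency vector."""
    return min(seq_len - L + 1, 20 ** L) / 20 ** L


if __name__ == "__main__":
    for L in (1, 2, 3):
        print(L, 20 ** L, max_nonzero_fraction(175, L))
```

For L = 3 the bound is 173/8000, i.e. roughly 2% of the vector, which is consistent with the sparsity described in the text.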


[Figure 3: four ROC panels, one per SCOP level (fa, sf, cf, cl). x-axis: 1-spe (0 to 1); y-axis: sen (0 to 1); curves: SW, Wm, se, co, ku, eu, ma.]

Fig. 3. ROC curves for the PDB40-b dataset. Sensitivity (sen) versus 1 − specificity (spe). SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). The classification accuracies for this dataset are slightly better than for the PDB40-v dataset (Fig. 2). The qualitative relation between the metrics is maintained.

Computational performance. It is noteworthy that the SW algorithm is computationally intensive. Its running times can be 1000-fold longer than those of the other metrics compared here. For example, for the PDB40-v dataset, SW took approximately 80 h and Wm just 5 min, using a 700 MHz Pentium III with 1 GB total memory. The other word composition metrics have varied computational efficiencies (Vinga and Almeida, 2003).
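The scale of this computation follows from simple arithmetic, sketched below. It assumes that the reported running times cover all unordered pairs of the 1683 PDB40-v domains (Table 1); the function names are hypothetical.

```python
def n_pairs(n_domains):
    """Number of unordered pairs among n_domains sequences."""
    return n_domains * (n_domains - 1) // 2


def fold_speedup(slow_seconds, fast_seconds):
    """Ratio between two running times."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    pairs = n_pairs(1683)
    sw_seconds = 80 * 3600   # approximately 80 h reported for SW
    wm_seconds = 5 * 60      # 5 min reported for Wm
    print(pairs)                                 # 1415403
    print(fold_speedup(sw_seconds, wm_seconds))  # 960.0
    print(sw_seconds / pairs, wm_seconds / pairs)
```

Under this assumption the reported times correspond to roughly 0.2 s per pair for SW, and the 960-fold ratio agrees with the order of magnitude quoted in the text.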

Stratified analysis by class

AUC values. In order to compare the metrics, we also conducted additional studies for each of the four classes (all-α, all-β, α/β and α+β) separately. The AUC values are represented in Figure 6 for SW alignment scores and se distance, the two metrics that emerged as the most discriminant in the previous analysis (Figs 2–5) (see web page for a similar analysis of the other metrics).

It is easier to recognize family relationships by alignment (Fig. 6, black symbols) for proteins belonging to the all-α class, where values are above the overall accuracy (AUC values ranging from 0.70 to 0.87), and for the α+β class (AUC from 0.70 to 0.91). The class where these relationships seem most difficult to detect was the all-β class, where we obtained the lowest AUC values for this level (0.60–0.77). At the superfamily level, the α+β class enables a surprising accuracy for both metrics (AUC from 0.70 to 0.90), as opposed to the all-β class, where the superfamily relationships are still harder to detect by sequence inspection alone (AUC between 0.55 and 0.64).


[Figure 4: AUC (0.5 to 0.9) versus SCOP level (fa, sf, cf, cl); one line per metric: SW, Wm, se, co, ku, eu, ma.]

Fig. 4. AUC values for the PDB40-v dataset for each hierarchical level. SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). Areas under the ROC curves of Figure 2. Higher AUC values correspond to better classification schemes. All the distances achieve their best discrimination accuracy at family level. This figure illustrates the loss of discrimination as the target of classification moves up in the SCOP level from family to class.

[Figure 5: AUC (0.5 to 0.9) versus SCOP level (fa, sf, cf, cl); one line per metric: SW, Wm, se, co, ku, eu, ma.]

Fig. 5. AUC values for the PDB40-b dataset for each hierarchical level. SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). Areas under the ROC curves of Figure 3. The results are slightly more discriminant for this dataset than for PDB40-v (Fig. 4), but with no significant changes in the metrics' relative ordering.

At fold level, the all-α class retains the higher AUC values for both metrics (0.69–0.81). The graph obtained for PDB40-b is qualitatively the same (see web page), with one difference: the AUC values at fold level are much lower for the all-α and α+β classes for both metrics.

PDB40 version datasets comparison. There is a significant improvement of discrimination accuracy for the α+β class in the PDB40-v dataset. The difference in AUC values is consistently positive for the different metrics and levels, reaching a value as high as 0.21 at fold level with the SW alignment scores. It seems that the trimming procedure applied when obtaining the PDB40-v set (see Systems and Methods) particularly affected the all-α and α+β classes. These quantitative differences between the two datasets are noteworthy.

[Figure 6: AUC (0.55 to 0.95) versus SCOP level (fa, sf, cf, cl); one line per set: total, all-α, all-β, α/β, α+β.]

Fig. 6. Stratified analysis by class in the PDB40-v dataset. AUC values for the SW algorithm (black) and se distance (gray) for each class: total set, all-α, all-β, α/β and α+β. Smith–Waterman is generally a better classification scheme (higher AUC values). At family level, the best results are for proteins belonging to the all-α and α+β classes; the lowest AUC values were obtained for the all-β class. At superfamily level, the α+β class enables a surprising accuracy for both metrics, as opposed to the all-β class, which has the worst results. At fold level, the all-α class retains the higher AUC values for both metrics.

The α-helix and β-sheet content. Judging from published reports, protein class classification is controversial. Some studies based class classification on the percentages of α-helix and β-sheet content of each chain. In a recent report, a schematic table was presented with different definitions (Eisenhaber et al., 1996). As noted in that study, there are some regions of the space defined by those percentages that are not clearly classifiable. It is in this context of uncertainty that SCOP offers a classification that is a global measure and takes into account all the structural information of all chains in a protein.

In order to assess the correct assignment to classes and avoid arbitrary classification, we extracted the α and β content for each SCOP domain tested from the PDB web page (http://www.rcsb.org/pdb). In Figure 7 we present the α and β percentages for each domain, grouped by the corresponding SCOP class classification, obtained for the PDB40-b dataset.

From Figure 7 it is apparent that some domains have arguable classifications. For example, the protein with PDB identification 1HYM, Trypsin inhibitor V [species: pumpkin (Cucurbita maxima)], has two chains that correspond to two SCOP domains. Domain 1HYMA has 24.44% α-helix and 0% β-sheet (labelled with the ∗ symbol close to the x-axis in Fig. 7) and domain 1HYMB has 0% α-helix and 33.33% β-sheet (labelled with the ∗ symbol close to the y-axis in Fig. 7). Nevertheless, the whole protein was classified in the α+β class, in spite of the fact that each of its chains, taken individually, would be classified in other classes. The SCOP classification is global in the sense that it looks at the whole protein rather than at a particular domain; therefore, classifying the chains of 1HYM as α+β is formally correct. Interestingly, a multivariate analysis of variance (MANOVA) of the amino acid composition in the four classes leads to similar results (see web page), showing that the α+β class is clearly intermixed with the others in terms of α and β content.

[Figure 7: scatter plot of β-sheet content (y-axis, 0 to 70%) versus α-helix content (x-axis, 0 to 100%); one symbol per SCOP class: α, β, α/β, α+β; domains 1HYMA and 1HYMB are labelled.]

Fig. 7. The α-helix and β-sheet content (%) for each domain in the PDB40-b dataset, grouped by SCOP class. The classes are interspersed. Protein 1HYM, Trypsin inhibitor V [species: pumpkin (Cucurbita maxima)], is globally classified in the α+β class, but its two chains, 1HYMA and 1HYMB, have contrasting α-helix and β-sheet content.

CONCLUSION

In this report we quantitatively compared several protein dissimilarity measures based on L-tuple composition with alignment scores obtained with the Smith–Waterman algorithm. A new metric, the W-metric, which combines both approaches by including word-statistics information weighted by scoring matrices, is described.

The accuracy of each metric in detecting protein relationships was assessed through the four hierarchical levels of the SCOP/ASTRAL database. The comparative protocol employed the AUCs, which are a good measure of the overall accuracy of a classification scheme.

The SW alignment score was shown to be the most discriminant at family and superfamily levels. At family level, the Wm is clearly more discriminant than the other L-tuple distances for sensitivity values between 0.5 and 0.8. From superfamily to class levels, all metrics lose discriminant power and converge to similar AUC values, which makes it counterproductive to use computationally intensive alignment algorithms to detect those relationships. At fold level, standard Euclidean distance outperforms most of the metrics, achieving an unexpected accuracy in the high sensitivity/low specificity range. This important result anticipates its use in providing a conservative pre-screening procedure for this problem category. In fact, since L-tuple methods are computationally much lighter, they can be useful to pre-select similar proteins before applying the alignment algorithms, thus combining the powerful aspects of each technique and greatly improving heuristic methods in sequence similarity searches.

The graph showing α-helix and β-sheet content for each domain shows that class classification cannot be inferred directly from that information, at least for mixed classes. Therefore, it might be advantageous in some applications to reconsider the protein class classification of each domain by exploring the distribution of sequence distances with unsupervised learning algorithms.

ACKNOWLEDGEMENTS

The authors thank John Schwacke of the Medical University of South Carolina for providing streamlined MATLAB code for Smith–Waterman alignment and Steven Brenner of the University of California at Berkeley for valuable advice on the use of the PDB40-B set. The authors thankfully acknowledge the financial support by grants SFRH/BD/3134/2000 to S.V. and SAPIENS/34794/99 from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese Ministério da Ciência e do Ensino Superior. R.G.-O. thankfully acknowledges grant QLK2-CT-2000-01020 (EURIS) from the European Commission. This work was also supported in part by the NHLBI Proteomics Initiative through contract N01-HV-28181 and a Cancer Center grant from the Department of Energy (C.E. Reed, PI).

REFERENCES

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic Local Alignment Search Tool. J. Mol. Biol., 215, 403–410.

Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A. and Nielsen, H. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–424.

Bradley, A.P. (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog., 30, 1145–1159.

Brenner, S.E., Chothia, C. and Hubbard, T.J. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.

Brenner, S.E., Koehl, P. and Levitt, M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res., 28, 254–256.


Chandonia, J.M., Walker, N.S., Lo Conte, L., Koehl, P., Levitt, M. and Brenner, S.E. (2002) ASTRAL compendium enhancements. Nucleic Acids Res., 30, 260–263.

Chothia, C., Hubbard, T., Brenner, S., Barns, H. and Murzin, A. (1997) Protein folds in the all-beta and all-alpha classes. Annu. Rev. Biophys. Biomol. Struct., 26, 597–627.

Dayhoff, M.O., Schwartz, R. and Orcutt, B. (1978) A model of evolutionary change in proteins. In Dayhoff, M.O. (ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, Vol. 5 (Suppl. 3), pp. 345–352.

Donoho, D.L. (2000) Aide-Memoire. High-dimensional data analysis: the curses and blessings of dimensionality. Department of Statistics, Stanford University.

Dubchak, I., Muchnik, I., Mayor, C., Dralyuk, I. and Kim, S.H. (1999) Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins, 35, 401–407.

Egan, J.P. (1975) Signal Detection Theory and ROC-Analysis. Academic Press, New York.

Eisenhaber, F., Frommel, C. and Argos, P. (1996) Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins, 25, 169–179.

Ewens, W.J. and Grant, G.R. (2001) Statistical Methods in Bioinformatics: An Introduction. Springer, New York.

Green, R.E. and Brenner, S.E. (2002) Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison. Proc. IEEE, 90, 1834–1847.

Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.

Karwath, A. and King, R.D. (2002) Homology induction: the use of machine learning to improve sequence similarity searches. BMC Bioinformatics, 3, 11.

Lindahl, E. and Elofsson, A. (2000) Identification of related proteins on family, superfamily and fold level. J. Mol. Biol., 295, 613–625.

Lo Conte, L., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.

Luo, R.Y., Feng, Z.P. and Liu, J.K. (2002) Prediction of protein structural class by amino acid and polypeptide composition. Eur. J. Biochem., 269, 4219–4225.

Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

Park, J., Teichmann, S.A., Hubbard, T. and Chothia, C. (1997) Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol., 273, 349–354.

Pearson, W.R. (1991) Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith–Waterman and FASTA algorithms. Genomics, 11, 635–650.

Pearson, W.R. (1995) Comparison of methods for searching protein sequence databases. Protein Sci., 4, 1145–1160.

Pearson, W.R. (2000) Protein sequence comparison and Protein evolution. Tutorial, ISMB2000.

Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

Reinert, G., Schbath, S. and Waterman, M.S. (2000) Probabilistic and statistical properties of words: an overview. J. Comput. Biol., 7, 1–46.

Schott, J.R. (1997) Matrix Analysis for Statistics. John Wiley, New York.

Vinga, S. and Almeida, J.S. (2003) Alignment-free sequence comparison - a review. Bioinformatics, 19, 513–523.

Webb, B.-J.M., Liu, J.S. and Lawrence, C.E. (2002) BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res., 30, 1268–1277.


the standard Euclidean (se) distance is obtained when taking only the covariance matrix diagonal. The distance reduces to the squared Euclidean distance when W is the identity matrix.

The weight matrices W chosen in Equation (2) can be rationalized as being scoring or amino acid substitution matrices, instead of covariance-based weights as in other distances. These matrices, such as Point Accepted Mutation (PAM) (Dayhoff et al., 1978) and BLOcks SUbstitution Matrices (BLOSUM) (Henikoff and Henikoff, 1992), are used in alignment-based methods and estimate the log-likelihood ratios between probabilities of symbols that best describe mutation rates in known homologous proteins. In particular, the BLOSUMX matrix is estimated with ungapped aligned blocks of proteins sharing less than X% identity. PAMn matrices account for evolutionary change in protein sequences, and their estimation is based on the construction of phylogenetic trees, which are subsequently used to create a Markov chain n-step transition matrix. This matrix is further transformed and normalized for conditional probabilities. For an extensive description of these substitution matrices and some estimation examples, see Ewens and Grant (2001, Section 6.5).

The key idea of Wm is to weight the amino acid composition differences between two sequences, $f_i^X - f_i^Y$, according to their relative conservation in proteins known to be homologous. The overall distance between two proteins is the sum of these weighted factors. For example, if an amino acid is highly conserved in known homologous sequences (high $w_{ii}$), two proteins with a very different frequency of this amino acid should be less similar than if the amino acids are 'closer' to each other in that statistical sense. If the opposite occurs, i.e. if an amino acid is known to have high mutational rates (low $w_{ii}$), the differences between its compositions in the two sequences being compared should be attenuated in the overall distance calculation. The same scheme applies to the off-diagonal elements $w_{ij}$ ($i \neq j$): if there is a high mutation rate between these two amino acids, $w_{ij}$ is higher than the corresponding weight of two very different amino acids, so this component should be weighted more. The main idea is thus to weight amino acid differences according to their similarity, given by known evolutionary information. The weighted metric hence includes both amino acid composition information, like other alignment-free techniques, and conserved homology information, as used to score the conventional alignment algorithms.
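The weighting idea can be sketched in a few lines. Equation (2) is not reproduced in this part of the text, so the quadratic form below, $(f^X - f^Y)^T W (f^X - f^Y)$, is an assumption chosen to agree with the statement that the distance reduces to the squared Euclidean distance when W is the identity matrix. The three-letter alphabet and the function names are hypothetical and serve only to keep the example small; a real calculation would use the 20 amino acids and a scoring matrix such as BLOSUM50.

```python
def composition(seq, alphabet):
    """Relative frequency of each symbol of the alphabet in seq (1-tuples)."""
    return [seq.count(a) / len(seq) for a in alphabet]


def w_metric(fx, fy, W):
    """Quadratic form sum_i sum_j (fx_i - fy_i) * W[i][j] * (fx_j - fy_j)."""
    d = [x - y for x, y in zip(fx, fy)]
    n = len(d)
    return sum(d[i] * W[i][j] * d[j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    alphabet = "ABC"                       # toy alphabet, not amino acids
    fx = composition("AAB", alphabet)
    fy = composition("ABC", alphabet)
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    # With W = identity the value is the squared Euclidean distance.
    print(w_metric(fx, fy, identity))
```

Replacing `identity` by a substitution matrix changes only the weights; the composition vectors and the loop stay the same, which is why the calculation remains cheap compared with alignment.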

Some variations of this metric were also tested, namely using several normalization procedures. The low computational load associated with the calculation expressed in Equation (2) is appealing. It is not proven here, however, that the W matrix associated with mutation information is the best at discriminating classification levels. This can be pursued further by using Artificial Neural Networks (ANN) or other algorithms to optimize classification accuracy by finding a 'better' W weighting matrix.

ROC curve definition

The methods that will be used here to assess and compare the accuracy of classification schemes and prediction algorithms are based on the analysis of ROC curves. This method goes back to signal detection and classification problems and is now widely applied in medical diagnosis studies and psychometric analysis (Egan, 1975). This approach is employed in binary classification of continuous data, usually categorized as positive (1) or negative (0) cases. The classification accuracy can be measured by plotting, for different threshold values, the number of true positives (TP), also named sensitivity or coverage, versus false positives (FP), or (1 − specificity), encountered for each threshold, properly normalized [Equation (3)].

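$$
\begin{aligned}
\text{sensitivity} &= \frac{\text{True Positives}}{\text{Positives}} = \frac{TP}{TP + FN} \\
\text{specificity} &= \frac{\text{True Negatives}}{\text{Negatives}} = \frac{TN}{TN + FP} \\
1 - \text{specificity} &= \frac{FP}{TN + FP}
\end{aligned}
\tag{3}
$$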

A ROC curve is simply the plot of sensitivity versus (1 − specificity) for different threshold values. The area under a ROC curve (AUC) is a widely employed parameter to quantify the quality of a classifier because it is a threshold-independent performance measure and is closely related to the Wilcoxon rank-sum (Mann–Whitney) test (Bradley, 1997). For a perfect classifier the AUC is 1, and for a random classifier the AUC is 0.5. For additional results and a comprehensive discussion of the AUC measure, see Bradley (1997). Baldi et al. (2000), Brenner et al. (1998) and Green and Brenner (2002) describe other possible classification accuracy measures not employed in this study.
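The definitions in Equation (3) translate directly into code. The sketch below is illustrative only: the function names are hypothetical, pairs with similarity at or above the threshold are called positive, and the input is assumed to contain at least one positive and one negative pair. The area is obtained with the trapezoidal rule over the ROC points.

```python
def confusion_rates(similarities, labels, threshold):
    """Sensitivity and specificity at one threshold, following Equation (3)."""
    tp = sum(1 for s, y in zip(similarities, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(similarities, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(similarities, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(similarities, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(similarities, labels):
    """(1 - specificity, sensitivity) for every observed threshold."""
    points = [(0.0, 0.0)]
    for t in sorted(set(similarities), reverse=True):
        sen, spe = confusion_rates(similarities, labels, t)
        points.append((1.0 - spe, sen))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    sims = [0.9, 0.8, 0.7, 0.6]
    labels = [1, 0, 1, 0]
    print(roc_points(sims, labels))
    print(trapezoid_auc(roc_points(sims, labels)))
```

Recomputing the counts at every threshold is quadratic in the number of pairs; it is written this way for clarity, and a single pass over the sorted pairs gives the same points.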

Protein test datasets: SCOP/ASTRAL classification

The sequences used to perform the tests and compare different metrics are proteins from the SCOP database (Lo Conte et al., 2002; Murzin et al., 1995). This database consists of Protein Data Bank (PDB) entries and provides a detailed and reliable description of protein structure relationships and homology. The three-dimensional (3D) structure analysis allows the detection of more remote homologues, since structure is typically more conserved than sequence. The fundamental unit of classification is the protein domain, which is the basic element of protein structure and evolution. The ASTRAL compendium provides additional tools and datasets (Brenner et al., 2000; Chandonia et al., 2002), namely the possibility to filter sequence sets so that any two different proteins have less than a chosen percentage identity to each other. This classification is a hierarchical description of proteins (Fig. 1). The first two levels, family (fa) and superfamily (sf), describe evolutionary relationships; the third one, fold (cf), describes geometrical relationships or major structural similarity; and the fourth one represents protein structural class (cl). This will allow the study of each classifier for different levels of similarity.

[Figure 1: hierarchy diagram. SCOP/ASTRAL db root > Class (cl): α, β, α/β, α+β > Fold (cf): Immunoglobulin-like beta-sandwich > Superfamily (sf): Immunoglobulin > Family (fa): I set domains > Domain: Fibroblast growth factor receptor FGFR2 (ref. PDB: 1ev2, 1djs, 1e0o).]

Fig. 1. SCOP/ASTRAL db: hierarchical classification of proteins. Example of Fibroblast growth factor receptor (FGFR2) classification in each of the four levels.

Two different datasets were tested in order to assess the accuracy of each metric. The basic protein set, PDB40-B, was extracted directly from the ASTRAL web site and corresponds to SCOP database release 1.61 (November 2002). This subset includes all the sequences that share <40% identity to each other and has become a benchmark test set in the evaluation of methods to detect remote protein homologies (Brenner et al., 1998; Dubchak et al., 1999; Karwath and King, 2002; Lindahl and Elofsson, 2000; Luo et al., 2002; Park et al., 1997; Webb et al., 2002). This dataset was subsequently trimmed to exclude sequences with unknown amino acids and those belonging to families with <5 elements, thus obtaining the protein group named PDB40-v (Table 1). For example, there are 232 families with only one sequence, which is not informative regarding intra-family dissimilarity and makes these domains insufficiently representative of a family. The effect of trimming the dataset was in this way also studied. Only the four major classes were included, namely: the all-α class, constituted mainly by proteins with α-helices; the all-β class, essentially formed by β-sheet structures; the α/β class, proteins with mixtures of α-helices and β-strands; and the α+β class, those where α-helices and β-strands are largely segregated. Other SCOP classes include multi-domain proteins, small proteins, theoretical models and other types, and were not included in this study. See Chothia et al. (1997) and the SCOP documentation for a description of protein folds and classification.
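The trimming step can be sketched as a two-stage filter. The order of the two stages is an assumption (sequences with unknown residues are dropped first, then family sizes are counted), since the text does not specify it; the tuple layout, identifiers and function name are hypothetical.

```python
from collections import Counter

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def trim_dataset(domains, min_family_size=5):
    """Keep domains with only standard residues whose family has enough members.

    domains: list of (domain_id, family_id, sequence) tuples.
    """
    # Stage 1: drop sequences containing unknown amino acids (e.g. 'X').
    known = [d for d in domains
             if d[2] and set(d[2].upper()) <= STANDARD_AMINO_ACIDS]
    # Stage 2: drop families left with fewer than min_family_size domains.
    family_sizes = Counter(family for _, family, _ in known)
    return [d for d in known if family_sizes[d[1]] >= min_family_size]


if __name__ == "__main__":
    toy = ([("a%d" % k, "famA", "ACDEFG") for k in range(5)]
           + [("x1", "famB", "ACXE")]
           + [("c%d" % k, "famC", "MKV") for k in range(4)])
    print([d[0] for d in trim_dataset(toy)])
```

In the toy input only the five `famA` domains survive: `famB` is removed because of the unknown residue and `famC` because it has fewer than five members.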

This study also considered separately another protein set from an outdated release of the SCOP database (1.35), the PDB40-b, due to the large amount of literature already published with those sequences (Luo et al., 2002, and corresponding references). Table 1 summarizes all the sequence sets examined in this paper.

Protocol for comparative assessment

The comparative test procedure followed in this report was based on a binary classification of each protein pair, where 1 corresponds to the two proteins sharing the same group in the SCOP database and 0 otherwise. The group can be defined at one of the four different levels of the database, family (fa), superfamily (sf), fold (cf) or class (cl), exploring the hierarchical organization of the proteins in that structure. Therefore, each protein pair is associated with four binary classifications, one for each level.
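The four binary classifications of a pair can be derived from the SCOP concise classification string (sccs), which lists class, fold, superfamily and family separated by dots, e.g. b.1.1.4. The sketch below assumes that each domain is annotated with such a string; the function name is hypothetical. Because the hierarchy is nested, two domains share a level only if they also share every level above it.

```python
LEVELS = ("cl", "cf", "sf", "fa")  # class, fold, superfamily, family


def pair_labels(sccs_a, sccs_b):
    """Binary label per SCOP level: 1 if both domains fall in the same group."""
    a = sccs_a.split(".")
    b = sccs_b.split(".")
    labels = {}
    same_so_far = True
    for depth, level in enumerate(LEVELS):
        same_so_far = same_so_far and a[depth] == b[depth]
        labels[level] = int(same_so_far)
    return labels


if __name__ == "__main__":
    print(pair_labels("b.1.1.4", "b.1.1.1"))
    print(pair_labels("b.1.1.4", "b.1.2.4"))
    print(pair_labels("b.1.1.4", "a.1.1.4"))
```

Applying this function to every pair of domains gives, for each level, the 0/1 vector against which the sorted distances are evaluated.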

In order to compute the ROC curves we calculated thedistances between all possible protein pairs according to thedifferent metrics referred and briefly described below

The similarity measure based on alignment tested wasthe SmithndashWaterman raw score with no correction for stat-istical significance using score matrix BLOSUM50 and alinear gaping penalty scheme with a gap penalty of 8The distances based in L-tuple composition evaluated wereW-metric Euclidean standard Euclidean KullbackndashLeibler(ku) discrepancy Cosine (co) and (Mahalanobis) For the cor-responding complete definitions and properties see Vinga andAlmeida (2003) In Wm calculations some alternative weight-ing matrices W [Equation (2)] were used these included thescoring matrices BLOSUM50 BLOSUM40 BLOSUM62and PAM250 The following normalization procedures werealso applied take only the diagonal of W pass all its negativevalues to zero use the exponential function of the originalmatrix and normalize by minimum and range However inthis printed report only the results obtained with BLOSUM50will be presented The variations described are documentedon the online annex

For each metric, the distances between all protein pairs were subsequently sorted from maximum to minimum similarity, i.e. from the closest to the farthest pair. A perfect metric would completely separate negative from positive relationships, i.e. the maximum similarity would always correspond to the same group, and the binary classification obtained after this distance sorting would be the vector (1, 1, ..., 1, 0, ..., 0, 0). Of course, this does not happen in practice, and the classes are interspersed. The ROC curves make it possible to assess the accuracy of this separation without choosing any distance threshold for the separation point. In particular, the AUC gives a single number summarizing the relative accuracy of each metric at each level, according to the SCOP classification scheme. We also tested each of the four classes separately with the same procedure, to evaluate hypothetical differences between the structural classes.

Computation
All the algorithms were implemented in MATLAB (version 6, release 13). The code is available upon request from the authors.
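The sorting step and the threshold-free summary can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually coded by the authors: the AUC is computed through its rank interpretation (the fraction of positive/negative pair combinations in which the positive pair has the smaller distance, ties counting one half), and the distances and labels are made up.

```python
def sorted_label_vector(distances, labels):
    """Labels reordered from the closest pair to the farthest pair."""
    return [y for _, y in sorted(zip(distances, labels))]


def auc_by_rank(distances, labels):
    """AUC when a small distance should indicate label 1.

    Quadratic in the number of pairs, so only suitable for small examples.
    """
    pos = [d for d, y in zip(distances, labels) if y == 1]
    neg = [d for d, y in zip(distances, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical distances: a perfect metric and an interspersed one.
perfect = auc_by_rank([0.1, 0.2, 0.3, 0.8, 0.9, 1.0], [1, 1, 1, 0, 0, 0])
mixed = auc_by_rank([1, 2, 3, 4, 5, 6], [1, 0, 1, 0, 1, 0])
print(sorted_label_vector([0.9, 0.1, 0.5], [0, 1, 1]), perfect, mixed)
```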


Table 1. Protein datasets used in this study

Dataset          Class    do     fa    sf    cf
PDB40-B (1.61)   All-α    867    409   257   151
                 All-β    1051   362   213   111
                 α/β      1237   467   190   117
                 α + β    1065   487   307   212
                 Total    4220
PDB40-v (1.61)   All-α    285    35    28    27
                 All-β    517    43    30    24
                 α/β      542    58    40    31
                 α + β    339    39    37    33
                 Total    1683
PDB40-b (1.35)   All-α    220    128   97    73
                 All-β    309    150   115   54
                 α/β      285    154   98    66
                 α + β    240    147   115   80
                 Total    1054

For each protein set: number of sequences or domains (do), families (fa), superfamilies (sf) and folds (cf) in each class. PDB40-B: sequences that share <40% identity with each other, current release (1.61) of SCOP/ASTRAL (not tested). PDB40-v: set derived from PDB40-B (1.61) by excluding sequences with unknown amino acids and families with <5 domains. PDB40-b: sequence dataset used by Luo et al. (2002); corresponds to a previous release (1.35) of the same database.

RESULTS AND DISCUSSION
In the following sections we present some of the results obtained. For extensive and additional results regarding all metrics and datasets, see also the web page http://bioinformatics.musc.edu/wmetric, where the complete graphs and tables are shown (data not shown due to space limitations).

Complete dataset

ROC curves and AUC values. The ROC curves obtained for the complete dataset (Table 1) are presented in Figures 2 (PDB40-v) and 3 (PDB40-b). As outlined in the Systems and Methods section, a random classifier would have identical values of sensitivity and (1 − specificity) for any threshold value considered (dashed diagonal).

Figures 4 and 5 show the areas under the ROC curves (AUC) obtained for both datasets and each SCOP level. The AUC values are typically used as a measure of overall discrimination accuracy.
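To make the curve itself concrete, the sketch below traces ROC points from a label vector that has already been sorted from the most similar to the least similar pair, and integrates them with the trapezoid rule. It is an illustration only: tied distances are not grouped (each pair acts as its own threshold), and the label vectors are invented.

```python
def roc_points(sorted_labels):
    """List of (1 - specificity, sensitivity) points, starting at (0, 0)."""
    n_pos = sum(sorted_labels)
    n_neg = len(sorted_labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    tp = fp = 0
    points = [(0.0, 0.0)]
    for label in sorted_labels:
        if label == 1:
            tp += 1      # one more true positive above the threshold
        else:
            fp += 1      # one more false positive above the threshold
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


for labels in ([1, 1, 1, 0, 0, 0], [1, 0, 1, 0, 1, 0], [0, 0, 0, 1, 1, 1]):
    print(labels, trapezoid_area(roc_points(labels)))
```

A curve that hugs the diagonal has an area near 0.5, which is the reference value for the random classifier mentioned above.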

As would be expected, Figures 4 and 5 show that the AUC decreases from family to class level for both datasets. The sequence similarity between proteins sharing the same family is still well recognized. Consequently, all the distances achieve their best discrimination accuracy at this level. At class level, classification relationships reflect similar structures, which can have completely different sequences and amino acid compositions. This underlies the observation that sequence similarity is lost, regardless of the metric, from family to class. The comparative discriminant value of the different metrics (Figs 4 and 5) shows two clear trends. First, at family level, alignment has a clear advantage, with AUC values of 0.86 and 0.81 (PDB40-b and PDB40-v sets), whereas all word-statistics metrics perform at or under 0.75 and 0.68, respectively. The most discriminant word-statistics metric at family level is the novel Wm introduced by this report (see Systems and Methods section), reflecting the value of weighting the quadratic form [Equation (2)] by evolutionary rather than statistical criteria. At the superfamily level, the advantage of alignment remains, but statistical weighting performs just as well as the Wm. Interestingly, the unweighted Euclidean metric, the covariance-weighted Mahalanobis and the information-based Kullback–Leibler lag behind. The main surprise of this analysis is observed at the next level, the fold, where the standard Euclidean metric performs as well as alignment scores in both versions of SCOP, especially for the low specificity/high sensitivity range (which corresponds to many FP relationships). In fact, standard Euclidean is clearly more discriminant than SW for 1 − specificity values around 0.75. Finally, at the class level, the absence of conserved segments in fact turns alignment into a computationally expensive procedure to score amino acid composition differences. At this point, most alignment-free metrics outperform it. Inspection of the ROC curves themselves (Figs 2 and 3) further documents this comparison between metrics. The results obtained are slightly less discriminant for the more recent version of the protein dataset (PDB40-v) for all the levels except class, where higher values of AUC are obtained. However, there are no significant changes in their relative ordering. It is noteworthy that there is also a dependency between levels as regards classification accuracy. Hits at a lower level may be argued to bias for more populated groupings at upper levels. However, it should be noted that this study is of exploratory rather than discriminant nature, which places any pairwise comparison, regardless of the SCOP classification level, on an equal standing.

Variations in the Wm definition. The Wm AUC values in the previous graphics were obtained using the scoring matrix BLOSUM50. The results using BLOSUM40, BLOSUM62 and PAM250 are virtually the same and will be omitted. Nevertheless, those results were compiled and are made available at the support web page (see Availability). It is interesting to note that, although defining a different score for each domain pair, the different matrices W produce the same score ordering. Similarly, none of the normalization procedures improved discrimination; they produced worse classification results, but are still made available on the same web page.


[Figure 2: four ROC panels, one per SCOP level (fa, sf, cf, cl); x-axis 1 − specificity (1-spe), y-axis sensitivity (sen), both from 0 to 1; curves for SW, Wm, se, co, ku, eu and ma.]

Fig. 2. ROC curves for the PDB40-v dataset: sensitivity (sen) versus 1 − specificity (spe). SCOP levels: family (fa), superfamily (sf), class fold (cf) and class (cl). Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). A random classifier would generate equal proportions of FP and TP classifications, which corresponds to the ROC diagonal (dashed line). Correspondingly, the better classification schemes have plots with higher values of sensitivity for equal values of specificity, resulting in higher values for the areas under the curve (AUC, see text). SW is the best at family and superfamily levels. Wm and se outperform the other alignment-free metrics. Standard Euclidean is the best at fold level for high sensitivity/low specificity values. For class level, all metrics have similar results, slightly above random guessing.

Higher order tuples. We also tested higher order word composition metrics, calculating 2- and 3-tuple distances between the domains for eu, se, ku and co. Somewhat intriguingly, discrimination worsened for all levels of classification (see web page). However, it should be noted that the high dimension of the frequency vectors in these cases (400 and 8000, respectively) and the relatively short length of the sequences themselves (mean values around 175 amino acids) caused the frequency vector f to be very sparse. An additional problem arising from this increased dimensionality of the data is the need to increase the sample size in order to maintain accuracy, which goes along with the 'curse of dimensionality' (Donoho, 2000). Consequently, only the results obtained for one-tuples were presented in this report. The weighting proposed, as observed before for the one-tuple scenario, might not be the best for the recognition of the relationships. One idea worth exploring would be to extract some effective higher order tuples by adequate selection of the weights, thus optimizing the classification accuracy and, hopefully, avoiding the dimensionality problem. However, this would lead to discriminatory and optimization procedures which are out of the scope of this exploratory study.
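The sparsity argument above follows from simple counting: a sequence of length n contains n − L + 1 overlapping L-tuples, while the frequency vector has 20^L entries. The sketch below illustrates this bound; the function names are hypothetical and the length of 175 is the approximate mean quoted in the text.

```python
from collections import Counter


def tuple_counts(seq, L):
    """Counts of overlapping words of length L in seq."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))


def max_fill(seq_len, L, alphabet_size=20):
    """Upper bound on the fraction of frequency-vector entries that can be non-zero."""
    n_words = seq_len - L + 1
    return min(1.0, n_words / alphabet_size ** L)


for L in (1, 2, 3):
    print(L, 20 ** L, max_fill(175, L))
```

Even in the best case, where every 3-tuple in the sequence is distinct, only a few percent of the 8000 entries can be non-zero, so most coordinates contribute nothing but noise to a distance between two such vectors.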


[Figure 3: four ROC panels, one per SCOP level (fa, sf, cf, cl); x-axis 1 − specificity (1-spe), y-axis sensitivity (sen), both from 0 to 1; curves for SW, Wm, se, co, ku, eu and ma.]

Fig. 3. ROC curves for the PDB40-b dataset: sensitivity (sen) versus 1 − specificity (spe). SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). The classification accuracies for this dataset are slightly better than for the PDB40-v dataset (Fig. 2). The qualitative relation between the metrics is maintained.

Computational performance. It is noteworthy that the SW algorithm is computationally intensive. Its running times can be 1000-fold longer than those of the other metrics compared here. For example, for the PDB40-v dataset, SW took ~80 h and Wm just 5 min, using a 700 MHz Pentium III with 1 GB total memory. The other word composition metrics themselves have varied computational implementation efficiencies (Vinga and Almeida, 2003).
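As a rough check of the order of magnitude quoted above, the reported timings can be turned into a ratio and a per-pair cost. This sketch assumes that each timing covers every unordered pair of the 1683 PDB40-v domains exactly once, which the text does not state explicitly.

```python
from math import comb

n_domains = 1683                 # PDB40-v size, from Table 1
n_pairs = comb(n_domains, 2)     # all unordered protein pairs

sw_seconds = 80 * 3600           # reported as roughly 80 h
wm_seconds = 5 * 60              # reported as 5 min

speedup = sw_seconds / wm_seconds
sw_per_pair = sw_seconds / n_pairs   # seconds per pair under the assumption above
wm_per_pair = wm_seconds / n_pairs
print(n_pairs, speedup, sw_per_pair, wm_per_pair)
```

The ratio of the two reported timings is consistent with the "1000-fold" figure given in the text.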

Stratified analysis by class

AUC values. In order to compare the metrics, we also conducted additional studies for each of the four classes (all-α, all-β, α/β and α + β) separately. The AUC values are represented in Figure 6 for SW alignment scores and se distance, the two metrics that emerged as the most discriminant in the previous analysis (Figs 2–5) (see web page for a similar analysis of the other metrics).

It is easier to recognize family relationships by alignment (Fig. 6, black symbols) for proteins belonging to class all-α, where values are above the overall accuracy (AUC values ranging from 0.70 to 0.87), and for the α + β class (AUC from 0.70 to 0.91). The class where these relationships seem more difficult to detect was all-β, where we obtained the lowest AUC values for this level (0.60–0.77). At superfamily level, class α + β enables a surprising accuracy for both metrics (AUC from 0.70 to 0.90), as opposed to class all-β, where the superfamily relationships are still harder to detect only by sequence inspection (AUC between 0.55 and 0.64).


[Figure 4: AUC (y-axis, 0.5 to 0.9) at each SCOP level (x-axis: fa, sf, cf, cl) for SW, Wm, se, co, ku, eu and ma.]

Fig. 4. AUC values for the PDB40-v dataset for each hierarchical level. SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). Areas under the ROC curves of Figure 2. Higher AUC values correspond to better classification schemes. All the distances achieve their best discrimination accuracy at family level. This figure illustrates the loss of discrimination as the target of classification moves up in the SCOP level, from family to class.

[Figure 5: AUC (y-axis, 0.5 to 0.9) at each SCOP level (x-axis: fa, sf, cf, cl) for SW, Wm, se, co, ku, eu and ma.]

Fig. 5. AUC values for the PDB40-b dataset for each hierarchical level. SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). Areas under the ROC curves of Figure 3. The results are slightly more discriminant for this dataset than for PDB40-v (Fig. 4), but with no significant changes in the metrics' relative ordering.

At fold level, the all-α class retains the higher AUC values for both metrics (0.69–0.81). The graph obtained for PDB40-b is qualitatively the same (see web page), with one difference: the AUC values for fold level are much lower for the all-α and α + β classes for both metrics.

[Figure 6: AUC (y-axis, 0.55 to 0.95) at each SCOP level (x-axis: fa, sf, cf, cl) for the total set and for classes all-α, all-β, α/β and α + β.]

Fig. 6. Stratified analysis by class in the PDB40-v dataset. AUC values for the SW algorithm (black) and se distance (gray) for each class: total set, all-α, all-β, α/β and α + β. Smith–Waterman is generally a better classification scheme (higher AUC values). At family level, the best results are for proteins belonging to classes all-α and α + β; the lowest AUC values were obtained for class all-β. At superfamily level, class α + β enables a surprising accuracy for both metrics, as opposed to class all-β, which has the worst results. At fold level, the all-α class retains the higher AUC values for both metrics.

PDB40 version datasets comparison. There is a significant improvement of discrimination accuracy for the α + β class in the PDB40-v dataset. The difference in AUC values is consistently positive for different metrics and levels, reaching a value as high as 0.21 at fold level with the SW alignment scores. It seems that the trimming procedure used to obtain the PDB40-v set (see Systems and Methods) particularly affected the all-α and α + β classes. These quantitative differences between the two datasets are noteworthy.

The α-helix and β-sheet content. Judging from published reports, protein class classification is controversial. Some studies based class classification on the percentages of α-helix and β-sheet content of each chain. In a recent report, a schematic table was presented with different definitions (Eisenhaber et al., 1996). As noted in that study, there are some regions of the space defined by those percentages that are not clearly classifiable. It is in this context of uncertainty that SCOP offers a classification that is a global measure and takes into account all the structural information of all chains in a protein.

In order to assess the correct assignment to classes and avoid arbitrary classification, we extracted the α and β content for each SCOP domain tested from the PDB web page (http://www.rcsb.org/pdb). In Figure 7, we present the α and β percentages for each domain, grouped by the corresponding SCOP class classification, obtained for the PDB40-b dataset.

From Figure 7, it is apparent that some domains have arguable classifications. For example, the protein with PDB identification 1HYM, Trypsin inhibitor V [species: pumpkin (Cucurbita maxima)], has two chains that correspond to two SCOP domains. Domain 1HYMA has 24.44% of α-helix and 0% of β-sheet (labelled with a ∗ symbol close to the x-axis in Fig. 7) and domain 1HYMB has 0% of α-helix and 33.33% of β-sheet (labelled with a ∗ symbol close to the y-axis in Fig. 7). Nevertheless, the whole protein was classified in the α + β class, in spite of the fact that each of its chains, taken individually, would be classified in other classes. The SCOP classification is global in the sense that it looks at the whole protein rather than at a particular domain; therefore, classifying chains of 1HYM as α + β is formally correct. Interestingly, a multivariate analysis of variance (MANOVA) of the amino acid composition in the four classes leads to similar results (see web page), showing that class α + β is clearly intermixed with the others in terms of α and β content.

[Figure 7: scatter plot of β-sheet content (y-axis, 0 to 70%) versus α-helix content (x-axis, 0 to 100%) for each domain, with symbols for classes α, β, α/β and α + β; domains 1HYMA and 1HYMB are labelled.]

Fig. 7. The α-helix and β-sheet content (%) for each domain in the PDB40-b dataset, grouped by SCOP class. The classes are interspersed. Protein 1HYM, Trypsin inhibitor V [species: pumpkin (Cucurbita maxima)], is globally classified in the α + β class, but its two chains, 1HYMA and 1HYMB, have contrasting α-helix and β-sheet content.

CONCLUSION
In this report we quantitatively compared several protein dissimilarity measures based on L-tuple composition with alignment scores obtained with the Smith–Waterman algorithm. A new metric, the W-metric, which combines both approaches by including word-statistics information weighted by scoring matrices, is described.

The accuracy of each metric in detecting protein relationships was assessed at the four hierarchical levels of the SCOP/ASTRAL database. The comparative protocol employed the AUCs, which are a good measure of the overall accuracy of a classification scheme.

The SW alignment score was shown to be the most discriminant at family and superfamily levels. At family level, the Wm is clearly more discriminant than the other L-tuple distances for sensitivity values between 0.5 and 0.8. From superfamily to class levels, all metrics lose discriminant power and converge to similar AUC values, which makes it counterproductive to use computationally intensive alignment algorithms to detect those relationships. At fold level, standard Euclidean distance outperforms most of the metrics, achieving an unexpected accuracy for the high sensitivity/low specificity range. This important result anticipates its use in providing a conservative pre-screening procedure for this problem category. In fact, since L-tuple methods are computationally much lighter, they can be useful to pre-select similar proteins before applying the alignment algorithms, thus combining the powerful aspects of each technique and greatly improving heuristic methods in sequence similarity searches.

The graph showing α-helix and β-sheet content for each domain shows that class classification cannot be inferred directly from that information, at least for mixed classes. Therefore, it might be advantageous in some applications to reconsider the protein class classification of each domain by exploring the distribution of sequence distances with unsupervised learning algorithms.

ACKNOWLEDGEMENTS
The authors thank John Schwacke of the Medical University of South Carolina for providing streamlined MATLAB code for Smith–Waterman alignment, and Steven Brenner of the University of California at Berkeley for precious advice on the use of the PDB40-B set. The authors thankfully acknowledge the financial support by grants SFRH/BD/3134/2000 to S.V. and SAPIENS/34794/99 from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese Ministério da Ciência e do Ensino Superior. R.G.-O. thankfully acknowledges grant QLK2-CT-2000-01020 (EURIS) from the European Commission. This work was also supported in part by the NHLBI Proteomics Initiative through contract N01-HV-28181 and a Cancer Center grant from the Department of Energy (C.E. Reed, PI).

REFERENCES
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A. and Nielsen, H. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–424.

Bradley, A.P. (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog., 30, 1145–1159.

Brenner, S.E., Chothia, C. and Hubbard, T.J. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.

Brenner, S.E., Koehl, P. and Levitt, M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res., 28, 254–256.


Chandonia, J.M., Walker, N.S., Lo Conte, L., Koehl, P., Levitt, M. and Brenner, S.E. (2002) ASTRAL compendium enhancements. Nucleic Acids Res., 30, 260–263.

Chothia, C., Hubbard, T., Brenner, S., Barns, H. and Murzin, A. (1997) Protein folds in the all-beta and all-alpha classes. Annu. Rev. Biophys. Biomol. Struct., 26, 597–627.

Dayhoff, M.O., Schwartz, R. and Orcutt, B. (1978) A model of evolutionary change in proteins. In Dayhoff, M.O. (ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, Vol. 5 (Suppl. 3), pp. 345–352.

Donoho, D.L. (2000) Aide-Memoire. High-dimensional data analysis: the curses and blessings of dimensionality. Department of Statistics, Stanford University.

Dubchak, I., Muchnik, I., Mayor, C., Dralyuk, I. and Kim, S.H. (1999) Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins, 35, 401–407.

Egan, J.P. (1975) Signal Detection Theory and ROC-Analysis. Academic Press, New York.

Eisenhaber, F., Frommel, C. and Argos, P. (1996) Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins, 25, 169–179.

Ewens, W.J. and Grant, G.R. (2001) Statistical Methods in Bioinformatics: An Introduction. Springer, New York.

Green, R.E. and Brenner, S.E. (2002) Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison. Proc. IEEE, 90, 1834–1847.

Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.

Karwath, A. and King, R.D. (2002) Homology induction: the use of machine learning to improve sequence similarity searches. BMC Bioinformatics, 3, 11.

Lindahl, E. and Elofsson, A. (2000) Identification of related proteins on family, superfamily and fold level. J. Mol. Biol., 295, 613–625.

Lo Conte, L., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.

Luo, R.Y., Feng, Z.P. and Liu, J.K. (2002) Prediction of protein structural class by amino acid and polypeptide composition. Eur. J. Biochem., 269, 4219–4225.

Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

Park, J., Teichmann, S.A., Hubbard, T. and Chothia, C. (1997) Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol., 273, 349–354.

Pearson, W.R. (1991) Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith–Waterman and FASTA algorithms. Genomics, 11, 635–650.

Pearson, W.R. (1995) Comparison of methods for searching protein sequence databases. Protein Sci., 4, 1145–1160.

Pearson, W.R. (2000) Protein sequence comparison and Protein evolution. Tutorial, ISMB2000.

Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

Reinert, G., Schbath, S. and Waterman, M.S. (2000) Probabilistic and statistical properties of words: an overview. J. Comput. Biol., 7, 1–46.

Schott, J.R. (1997) Matrix Analysis for Statistics. John Wiley, New York.

Vinga, S. and Almeida, J.S. (2003) Alignment-free sequence comparison - a review. Bioinformatics, 19, 513–523.

Webb, B.-J.M., Liu, J.S. and Lawrence, C.E. (2002) BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res., 30, 1268–1277.


[Figure 1: tree diagram of the SCOP/ASTRAL hierarchy for one example domain. Root: SCOP/ASTRAL db; Class (cl): α, β, α/β, α + β; Fold (cf): Immunoglobulin-like beta-sandwich; Superfamily (sf): Immunoglobulin; Family (fa): I set domains; Domain: Fibroblast growth factor receptor FGFR2; Ref. PDB: 1ev2, 1djs, 1e0o.]

Fig. 1. SCOP/ASTRAL db: hierarchical classification of proteins. Example of Fibroblast growth factor receptor (FGFR2) classification in each of the four levels.

(cf) describes geometrical relationships or major structural similarity, and the fourth one represents protein structural class (cl). This will allow the study of each classifier for different levels of similarity.

Two different datasets were tested in order to assess theaccuracy of each metric The basic protein set PDB40-Bwas extracted directly from the ASTRAL web site and cor-responds to SCOP database release 161 (November 2002)This subset includes all the sequences that share lt40 iden-tity to each other and has become a benchmark test set inthe evaluation of methods to detect remote protein homolo-gies (Brenner et al 1998 Dubchak et al 1999 Karwath andKing 2002 Lindahl and Elofsson 2000 Luo et al 2002Park et al 1997 Webb et al 2002) This dataset was sub-sequently trimmed to exclude sequences with unknown aminoacids and those belonging to families with lt5 elements thusobtaining the protein group named PDB40-v (Table 1) Forexample there are 232 families with only one sequence whichis not informative regarding intra-family dissimilarity whichmakes these domains insufficiently representative of a familyThe effect of trimming the dataset was in this way also studiedOnly the four major classes were included namely all-α classconstituted mainly by proteins with α helix all-β class essen-tially formed by β-sheet structures αβ class proteins withmixtures of α-helices and β-strands and α + β class thosewhere α-helices and β-strands are largely segregated OtherSCOP classes include multi-domain proteins small proteinstheoretical models and other types and were not included inthis study See Chothia et al (1997) and SCOP documentationfor description of protein folds and classification

This study also considered separately another protein setfrom an outdated release of the SCOP database (135) thePDB40-b due to the large amount of literature alreadypublished with those sequences (Luo et al 2002 andcorresponding references) Table 1 summarizes all thesequences sets examined in this paper

Protocol for comparative assessmentThe comparative test procedure followed in this report wasbased on a binary classification of each protein pair where1 corresponds to the two proteins sharing the same groupin SCOP database 0 otherwise The group can be definedat one of the four different levels of the database family(fa) superfamily (sf) class fold (cf) or class (cl) exploringthe hierarchical organization of the proteins in that struc-ture Therefore each protein pair is associated to four binaryclassifications one for each level

In order to compute the ROC curves we calculated thedistances between all possible protein pairs according to thedifferent metrics referred and briefly described below

The similarity measure based on alignment tested wasthe SmithndashWaterman raw score with no correction for stat-istical significance using score matrix BLOSUM50 and alinear gaping penalty scheme with a gap penalty of 8The distances based in L-tuple composition evaluated wereW-metric Euclidean standard Euclidean KullbackndashLeibler(ku) discrepancy Cosine (co) and (Mahalanobis) For the cor-responding complete definitions and properties see Vinga andAlmeida (2003) In Wm calculations some alternative weight-ing matrices W [Equation (2)] were used these included thescoring matrices BLOSUM50 BLOSUM40 BLOSUM62and PAM250 The following normalization procedures werealso applied take only the diagonal of W pass all its negativevalues to zero use the exponential function of the originalmatrix and normalize by minimum and range However inthis printed report only the results obtained with BLOSUM50will be presented The variations described are documentedon the online annex

For each metric the distances between all proteins pairswere subsequently sorted from maximum to minimum sim-ilarity ie from the closest to the farthest pair A perfectmetric would completely separate negative from positiverelationships ie the maximum similarity would corres-pond always to the same group and the binary classificationobtained after this distance sorting would be the vector(1 1 1 0 0 0) Of course this does not happen inpractice and the classes are interspersed The ROC curvespermit to assess the level of accuracy of this separationwithout choosing any distance threshold for the separationpoint In particular the AUC will give us a unique numberof the relative accuracy of each metric and level accord-ing to the SCOP classification scheme We also tested eachof the four classes separately with the same procedureto evaluate hypothetical differences between the structuralclasses

ComputationAll the algorithms were implemented in MATLAB language(version 6 release 13) The code is available upon request tothe authors

209

at University of P

ortland on May 24 2011

bioinformaticsoxfordjournalsorg

Dow

nloaded from

SVinga et al

Table 1 Protein datasets used in this study

Datasets Classes TotalAll-α All-β αβ α + β

do fa sf cf do fa sf cf do fa sf cf do fa sf cf

PDB40-B (161) 867 409 257 151 1051 362 213 111 1237 467 190 117 1065 487 307 212 4220PDB40-v (161) 285 35 28 27 517 43 30 24 542 58 40 31 339 39 37 33 1683PDB40-b (135) 220 128 97 73 309 150 115 54 285 154 98 66 240 147 115 80 1054

For each protein set number of sequences or domains (do) families ( fa) superfamilies (sf ) and folds (cf ) in each class PDB40-B sequences that share lt40 to each othercurrent release (161) of SCOPASTRAL (not tested) PDB40-v set derived from PDB40-B (161) by excluding sequences with unknown amino acids and families with lt5 domainsPDB40-b sequence dataset used by Luo et al (2002) corresponds to previous release (135) of the same database

RESULTS AND DISCUSSIONIn the following sections we present some of the resultsobtained For extensive and additional results regardingall metrics and datasets see also the web page httpbioinformaticsmusceduwmetric where the complete graphsand tables are shown (data not shown due to space limitations)

Complete dataset

ROC curves and AUC values The ROC curves obtained forthe complete dataset (Table 1) are presented in Figures 2(PDB40-v) and 3 (PDB40-b) As overviewed in the Systemsand Methods section a random classifier would have identicalvalues of sensitivity and (1minusspecificity) for any thresholdvalue considered (dashed diagonal)

Figures 4 and 5 provide graphs with the areas under ROCcurves (AUC) obtained for both datasets and each SCOP levelThe AUC values are typically used as a measure of overalldiscrimination accuracy

As would be expected, Figures 4 and 5 show that the AUC decreases from family to class level for both datasets. The sequence similarity between proteins sharing the same family is still well recognized. Consequently, all the distances achieve their best discrimination accuracy at this level. At class level, classification relationships reflect similar structures, which can have completely different sequences and amino acid compositions. This underlies the observation that sequence similarity is lost, regardless of the metric, from family to class. The comparative discriminant value of the different metrics (Figs 4 and 5) shows two clear trends. First, at family level, alignment has a clear advantage, with AUC values of 0.86 and 0.81 (PDB40-b and PDB40-v sets), whereas all word-statistics metrics perform at or under 0.75 and 0.68, respectively. The most discriminant word-statistics metric at family level is the novel Wm introduced by this report (see Systems and Methods section), reflecting the value of weighting the quadratic form [Equation (2)] by evolutionary rather than statistical criteria. At the superfamily level the advantage of alignment remains, but statistical weighting performs just as well as the Wm. Interestingly, the unweighted Euclidean metric, the covariance-weighted Mahalanobis metric and the information-based Kullback–Leibler metric lag behind. The main surprise of this analysis is to be observed at the next level, the fold, where the standard Euclidean metric performs as well as alignment scores in both versions of SCOP, especially for the low specificity/high sensitivity range (which corresponds to many FP relationships). In fact, standard Euclidean is clearly more discriminant than SW for 1 − specificity values around 0.75. Finally, at the class level, the absence of conserved segments in fact turns alignment into a computationally expensive procedure to score amino acid composition differences. At this point, most alignment-free metrics outperform it. The inspection of the ROC curves themselves (Figs 2 and 3) further documents this comparison between metrics. The results obtained are slightly less discriminant for the more recent version of the protein dataset (PDB40-v) for all the levels except for class, where higher values of AUC are obtained. However, there are no significant changes in their relative ordering. It is noteworthy that there is also a dependency between levels as regards classification accuracy. Hits at a lower level may be argued to bias for more populated grouping at upper levels. However, it should be noted that this study is of exploratory rather than discriminant nature, which places any pairwise comparison, regardless of the SCOP classification level, on an equal standing.

Variations in the Wm definition. The Wm AUC values in the previous graphics were obtained using the scoring matrix BLOSUM50. The results using BLOSUM40, BLOSUM62 and PAM250 are virtually the same and are omitted here; they were nevertheless compiled and are available at the support web page (see Availability). It is interesting to note that, although each matrix W defines a different score for each domain pair, the different matrices produce the same score ordering. Similarly, none of the normalization procedures improved discrimination; they produced worse classification results, which are also available on the same web page.
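The role of the matrix W can be illustrated with a small sketch. Equation (2) is given in the Systems and Methods section and is not reproduced here; the code below assumes only that it has the generic quadratic form (fX − fY)ᵀ W (fX − fY) over composition vectors. The three-letter alphabet, the weight matrix `toy_w` and the function names are hypothetical and chosen for illustration; a real application would use the 20 amino acids and a scoring matrix such as BLOSUM50.

```python
def composition(seq, alphabet):
    """Relative frequency of each alphabet letter in seq (1-tuple composition)."""
    return [seq.count(a) / len(seq) for a in alphabet]


def quadratic_form_distance(fx, fy, w):
    """Evaluate (fx - fy)^T W (fx - fy) for frequency vectors fx and fy."""
    diff = [a - b for a, b in zip(fx, fy)]
    n = len(diff)
    return sum(diff[i] * w[i][j] * diff[j] for i in range(n) for j in range(n))


# Hypothetical three-letter alphabet and symmetric weight matrix (assumptions).
alphabet = "ABC"
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
toy_w = [[4, -1, -2], [-1, 5, -1], [-2, -1, 6]]

fx = composition("AABC", alphabet)
fy = composition("ABCC", alphabet)

# With W equal to the identity the form reduces to the squared Euclidean distance.
print(quadratic_form_distance(fx, fy, identity))
# With off-diagonal weights, differences in distinct letters interact.
print(quadratic_form_distance(fx, fy, toy_w))
```

Replacing `toy_w` by another matrix changes the numerical score of each pair; whether the ranking of pairs changes as well is an empirical question, which the text above answers for the four scoring matrices tested.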


[Figure 2: four ROC panels, one per SCOP level (fa, sf, cf, cl); x-axis 1 − spe, y-axis sen, both from 0 to 1; curves for SW, Wm, se, co, ku, eu and ma.]

Fig. 2. ROC curves for PDB40-v dataset. Sensitivity (sen) versus 1 − specificity (spe). SCOP levels: family (fa), superfamily (sf), fold (cf) and class (cl). Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). A random classifier would generate equal proportions of FP and TP classifications, which corresponds to the ROC diagonal (dashed line). Correspondingly, the better classification schemes have plots with higher values of sensitivity for equal values of specificity, resulting in higher values for the areas under the curve (AUC, see text). SW is the best at family and superfamily levels. Wm and se outperform other alignment-free metrics. Standard Euclidean is the best at fold level for high sensitivity/low specificity values. For class level, all metrics have similar results, slightly above random guessing.

Higher order tuples. We also tested higher order word composition metrics, calculating 2- and 3-tuple distances between the domains for eu, se, ku and co. Somewhat intriguingly, discrimination worsened for all levels of classification (see web page). However, it should be noted that the high dimension of the frequency vectors in these cases (400 and 8000, respectively) and the relatively short length of the sequences themselves (mean values around 175 amino acids) caused the frequency vector f to be very sparse. An additional problem arising from this increased dimensionality of the data is the need to increase the sample size in order to maintain accuracy, which goes along with the 'curse of dimensionality' (Donoho, 2000). Consequently, only the results obtained for one-tuples are presented in this report. The weighting proposed, as observed before for the one-tuple scenario, might not be the best for the recognition of the relationships. One idea worth exploring would be to extract some effective higher order tuples by adequate selection of the weights, thus optimizing the classification accuracy and, hopefully, avoiding the dimensionality problem. However, this would lead to discriminatory and optimization procedures that are outside the scope of this exploratory study.
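The sparsity argument can be checked with simple arithmetic. The sketch below assumes overlapping words, so that a sequence of length n contains n − L + 1 words of length L, and a 20-letter amino acid alphabet; the function names are introduced here for illustration only.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tuple_counts(seq, L):
    """Count overlapping L-tuples (words of length L) in seq."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))


def max_fill_fraction(seq_len, L, alphabet_size=len(AMINO_ACIDS)):
    """Upper bound on the fraction of non-zero entries in the L-tuple vector:
    a sequence cannot contain more distinct words than it has word positions."""
    n_words = seq_len - L + 1
    dimension = alphabet_size ** L
    return min(n_words, dimension) / dimension


# Dimension of the frequency vector and best-case fill for a 175-residue domain.
for L in (1, 2, 3):
    print(L, len(AMINO_ACIDS) ** L, max_fill_fraction(175, L))
```

For a domain of 175 residues, at most 174 of the 400 two-tuple entries and at most 173 of the 8000 three-tuple entries can be non-zero, so more than half and more than 97% of the entries, respectively, are necessarily zero.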


[Figure 3: four ROC panels, one per SCOP level (fa, sf, cf, cl); x-axis 1 − spe, y-axis sen, both from 0 to 1; curves for SW, Wm, se, co, ku, eu and ma.]

Fig. 3. ROC curves for PDB40-b dataset. Sensitivity (sen) versus 1 − specificity (spe). SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). The classification accuracies for this dataset are slightly better than for the PDB40-v dataset (Fig. 2). The qualitative relation between the metrics is maintained.

Computational performance. It is noteworthy that the SW algorithm is computationally intensive. Its running times can be 1000-fold longer than those of the other metrics compared here. For example, for the PDB40-v dataset SW took ∼80 h and Wm just 5 min, using a 700 MHz Pentium III with 1 GB total memory. The other word composition metrics themselves vary in the computational efficiency of their implementations (Vinga and Almeida, 2003).

Stratified analysis by class

AUC values. In order to compare the metrics, we also conducted additional studies for each of the four classes (all-α, all-β, α/β and α + β) separately. The AUC values are represented in Figure 6 for SW alignment scores and se distance, the two metrics that emerged as the most discriminant in the previous analysis (Figs 2–5) (see web page for similar analysis for the other metrics).

It is easier to recognize family relationships by alignment (Fig. 6, black symbols) for proteins belonging to class all-α, where values are above the overall accuracy (AUC values ranging from 0.70 to 0.87), and for the α + β class (AUC from 0.70 to 0.91). The class where these relationships seem most difficult to detect was all-β, where we obtained the lowest AUC values for this level (0.60–0.77). At superfamily level, class α + β enables a surprising accuracy for both metrics (AUC from 0.70 to 0.90), as opposed to class all-β, where the superfamily relationships are still harder to detect by sequence inspection alone (AUC between 0.55 and 0.64).
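A stratified analysis of this kind can be sketched by grouping domain pairs by class and computing one AUC per stratum. The sketch below is an illustration under stated assumptions, not the procedure used in the study: it assumes that each record describes a pair of domains from the same SCOP class, it computes the AUC as the fraction of (related, unrelated) pair combinations that are ranked correctly, counting ties as one half, and the record layout and function names are hypothetical.

```python
from collections import defaultdict


def pairwise_auc(distances, related):
    """AUC as the probability that a related pair has a smaller distance than
    an unrelated pair (ties count one half). Returns None when a stratum
    lacks either related or unrelated pairs."""
    pos = [d for d, r in zip(distances, related) if r]
    neg = [d for d, r in zip(distances, related) if not r]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p < n else (0.5 if p == n else 0.0)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auc(records):
    """records: iterable of (scop_class, distance, related) tuples.
    Returns a dict mapping each class to the AUC of its own pairs."""
    strata = defaultdict(lambda: ([], []))
    for scop_class, distance, related in records:
        strata[scop_class][0].append(distance)
        strata[scop_class][1].append(related)
    return {c: pairwise_auc(d, r) for c, (d, r) in strata.items()}


# Toy records, invented for illustration.
toy_records = [
    ("all-alpha", 0.1, True), ("all-alpha", 0.3, False),
    ("all-beta", 0.4, True), ("all-beta", 0.2, False),
]
print(stratified_auc(toy_records))
```

Pooling all classes in this toy example would give an AUC of 0.5, which hides the fact that the metric ranks the all-alpha pairs correctly and the all-beta pairs incorrectly.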


[Figure 4: AUC (y-axis, 0.5 to 0.9) versus SCOP level (x-axis: fa, sf, cf, cl); one series per metric: SW, Wm, se, co, ku, eu and ma.]

Fig. 4. AUC values for PDB40-v dataset for each hierarchical level. SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). Areas under the ROC curves of Figure 2. Higher AUC values correspond to better classification schemes. All the distances achieve their best discrimination accuracy at family level. This figure illustrates the loss of discrimination as the target of classification moves up in the SCOP level from family to class.

[Figure 5: AUC (y-axis, 0.5 to 0.9) versus SCOP level (x-axis: fa, sf, cf, cl); one series per metric: SW, Wm, se, co, ku, eu and ma.]

Fig. 5. AUC values for PDB40-b dataset for each hierarchical level. SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). Areas under the ROC curves of Figure 3. The results are slightly more discriminant for this dataset than for PDB40-v (Fig. 4), but with no significant changes in the metrics' relative ordering.

At fold level, the all-α class retains the higher AUC values for both metrics (0.69–0.81). The graph obtained for PDB40-b is qualitatively the same (see web page), with one difference: the AUC values for fold level are much lower for the all-α and α + β classes for both metrics.

[Figure 6: AUC (y-axis, 0.55 to 0.95) versus SCOP level (x-axis: fa, sf, cf, cl); series: total, all-α, all-β, α/β and α + β.]

Fig. 6. Stratified analysis by class in PDB40-v dataset. AUC values for SW algorithm (black) and se distance (gray) for each class: total set, all-α, all-β, α/β and α + β. Smith–Waterman is generally a better classification scheme (higher AUC values). At family level, the best results are for proteins belonging to classes all-α and α + β; the lowest AUC values were obtained for class all-β. At superfamily level, class α + β enables a surprising accuracy for both metrics, as opposed to class all-β, which has the worst results. At fold level, the all-α class retains the higher AUC values for both metrics.

PDB40 version datasets comparison. There is a significant improvement of discrimination accuracy for the α + β class in the PDB40-v dataset. The difference in AUC values is consistently positive for different metrics and levels, reaching a value as high as 0.21 at fold level with the SW alignment scores. It seems that the trimming procedure applied when obtaining the PDB40-v set (see Systems and Methods) particularly affected the all-α and α + β classes. These quantitative differences between the two datasets are noteworthy.

The α-helix and β-sheet content. Judging from published reports, protein class classification is controversial. Some studies based class classification on the percentages of α-helix and β-sheet content of each chain. In a recent report, a schematic table was presented with different definitions (Eisenhaber et al., 1996). As noted in that study, there are some regions of the space defined by those percentages that are not clearly classifiable. It is in this context of uncertainty that SCOP offers a classification that is a global measure and takes into account all the structural information of all chains in a protein.

In order to assess the correct assignment to classes and avoid arbitrary classification, we extracted the α and β content for each SCOP domain tested from the PDB web page (http://www.rcsb.org/pdb). In Figure 7 we present the α and β percentages for each domain, grouped by the corresponding SCOP class, for the PDB40-b dataset.

From Figure 7 it is apparent that some domains have arguable classifications. For example, the protein with PDB identification 1HYM, trypsin inhibitor V [species: pumpkin (Cucurbita maxima)], has two chains that correspond to two SCOP domains. Domain 1HYMA has 24.44% α-helix and 0% β-sheet (labelled with a ∗ symbol close to the x-axis in Fig. 7) and domain 1HYMB has 0% α-helix and 33.33% β-sheet (labelled with a ∗ symbol close to the y-axis in Fig. 7). Nevertheless, the whole protein was classified in the α + β class, in spite of the fact that each of its chains, taken individually, would be classified in other classes. The SCOP classification is global in the sense that it looks at the whole protein rather than at a particular domain; therefore, classifying the chains of 1HYM as α + β is formally correct. Interestingly, a multivariate analysis of variance (MANOVA) of the amino acid composition in the four classes leads to similar results (see web page), showing that class α + β is clearly intermixed with the others in terms of α and β content.

[Figure 7: scatter plot of β-sheet content (y-axis, 0 to 70) versus α-helix content (x-axis, 0 to 100) for each domain; symbols by SCOP class (α, β, α/β, α + β); domains 1HYMA and 1HYMB are labelled.]

Fig. 7. The α-helix and β-sheet content (%) for each domain in PDB40-b dataset, grouped by SCOP class. The classes are interspersed. Protein 1HYM, trypsin inhibitor V [species: pumpkin (Cucurbita maxima)], is globally classified in the α + β class, but its two chains, 1HYMA and 1HYMB, have contrasting α-helix and β-sheet content.

CONCLUSION

In this report we quantitatively compared several protein dissimilarity measures based on L-tuple composition with alignment scores obtained with the Smith–Waterman algorithm. A new metric, the W-metric, which combines both approaches by including word-statistics information weighted by scoring matrices, is described.

The accuracy of each metric in detecting protein relationships was assessed at the four hierarchical levels of the SCOP/ASTRAL database. The comparative protocol employed the AUCs, which are a good measure of the overall accuracy of a classification scheme.

The SW alignment score was shown to be the most discriminant at family and superfamily levels. At family level, the Wm is clearly more discriminant than the other L-tuple distances for sensitivity values between 0.5 and 0.8. From superfamily to class levels, all metrics lose discriminant power and converge to similar AUC values, which makes it counterproductive to use computationally intensive alignment algorithms to detect those relationships. At fold level, standard Euclidean distance outperforms most of the metrics, achieving an unexpected accuracy for the high sensitivity/low specificity range. This important result anticipates its use in providing a conservative pre-screening procedure for this problem category. In fact, since L-tuple methods are computationally much lighter, they can be useful to pre-select similar proteins before applying the alignment algorithms, thus combining the powerful aspects of each technique and greatly improving heuristic methods in sequence similarity searches.
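The pre-selection idea can be sketched as a two-stage search in which a cheap composition distance ranks the whole database and the expensive score is computed only for a shortlist. This is an illustrative sketch, not an implementation from the study: the function names, the `keep` parameter and the toy sequences are assumptions, plain squared Euclidean distance stands in for the composition metric, and `matching_positions` is a deliberately trivial placeholder for an alignment score, not the Smith–Waterman algorithm.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Relative amino acid frequencies of seq (1-tuple composition)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def squared_euclidean(x, y):
    """Cheap alignment-free distance between two sequences."""
    return sum((a - b) ** 2 for a, b in zip(composition(x), composition(y)))


def matching_positions(x, y):
    """Placeholder for an expensive alignment score (NOT Smith-Waterman):
    counts positions at which the two sequences carry the same residue."""
    return sum(1 for a, b in zip(x, y) if a == b)


def prefilter_then_score(query, database, expensive_score, keep):
    """Rank the database by the cheap distance, keep the closest `keep`
    sequences and apply the expensive score only to that shortlist."""
    shortlist = sorted(database, key=lambda s: squared_euclidean(query, s))[:keep]
    scored = [(expensive_score(query, s), s) for s in shortlist]
    return sorted(scored, reverse=True)


# Toy database, invented for illustration.
toy_database = ["WWWWWW", "ACDEFH", "YYYYYY", "ACDEFG"]
print(prefilter_then_score("ACDEFG", toy_database, matching_positions, keep=2))
```

The choice of `keep` controls the trade-off: a larger shortlist is more conservative, in the sense of discarding fewer true relationships, at the cost of more expensive comparisons.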

The graph showing α-helix and β-sheet content for each domain shows that class classification cannot be inferred directly from that information, at least for mixed classes. Therefore, it might be advantageous in some applications to reconsider the protein class classification of each domain by exploring the distribution of sequence distances with unsupervised learning algorithms.

ACKNOWLEDGEMENTS

The authors thank John Schwacke of the Medical University of South Carolina for providing streamlined MATLAB code for Smith–Waterman alignment and Steven Brenner of the University of California at Berkeley for precious advice in the use of the PDB40-B set. The authors thankfully acknowledge the financial support by grants SFRH/BD/3134/2000 to S.V. and SAPIENS/34794/99 from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese Ministério da Ciência e do Ensino Superior. R.G.-O. thankfully acknowledges grant QLK2-CT-2000-01020 (EURIS) from the European Commission. This work was also supported in part by the NHLBI Proteomics Initiative through contract N01-HV-28181 and a Cancer Center grant from the Department of Energy (C.E. Reed, PI).

REFERENCES

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic Local Alignment Search Tool. J. Mol. Biol., 215, 403–410.

Baldi,P., Brunak,S., Chauvin,Y., Andersen,C.A. and Nielsen,H. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–424.

Bradley,A.P. (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog., 30, 1145–1159.

Brenner,S.E., Chothia,C. and Hubbard,T.J. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.

Brenner,S.E., Koehl,P. and Levitt,M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res., 28, 254–256.


Chandonia,J.M., Walker,N.S., Lo Conte,L., Koehl,P., Levitt,M. and Brenner,S.E. (2002) ASTRAL compendium enhancements. Nucleic Acids Res., 30, 260–263.

Chothia,C., Hubbard,T., Brenner,S., Barns,H. and Murzin,A. (1997) Protein folds in the all-beta and all-alpha classes. Annu. Rev. Biophys. Biomol. Struct., 26, 597–627.

Dayhoff,M.O., Schwartz,R. and Orcutt,B. (1978) A model of evolutionary change in proteins. In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, Vol. 5 (Suppl. 3), pp. 345–352.

Donoho,D.L. (2000) Aide-Memoire. High-dimensional data analysis: the curses and blessings of dimensionality. Department of Statistics, Stanford University.

Dubchak,I., Muchnik,I., Mayor,C., Dralyuk,I. and Kim,S.H. (1999) Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins, 35, 401–407.

Egan,J.P. (1975) Signal Detection Theory and ROC-Analysis. Academic Press, New York.

Eisenhaber,F., Frommel,C. and Argos,P. (1996) Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins, 25, 169–179.

Ewens,W.J. and Grant,G.R. (2001) Statistical Methods in Bioinformatics: An Introduction. Springer, New York.

Green,R.E. and Brenner,S.E. (2002) Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison. Proc. IEEE, 90, 1834–1847.

Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.

Karwath,A. and King,R.D. (2002) Homology induction: the use of machine learning to improve sequence similarity searches. BMC Bioinformatics, 3, 11.

Lindahl,E. and Elofsson,A. (2000) Identification of related proteins on family, superfamily and fold level. J. Mol. Biol., 295, 613–625.

Lo Conte,L., Brenner,S.E., Hubbard,T.J., Chothia,C. and Murzin,A.G. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.

Luo,R.Y., Feng,Z.P. and Liu,J.K. (2002) Prediction of protein structural class by amino acid and polypeptide composition. Eur. J. Biochem., 269, 4219–4225.

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol., 273, 349–354.

Pearson,W.R. (1991) Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith–Waterman and FASTA algorithms. Genomics, 11, 635–650.

Pearson,W.R. (1995) Comparison of methods for searching protein sequence databases. Protein Sci., 4, 1145–1160.

Pearson,W.R. (2000) Protein sequence comparison and Protein evolution. Tutorial, ISMB2000.

Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

Reinert,G., Schbath,S. and Waterman,M.S. (2000) Probabilistic and statistical properties of words: an overview. J. Comput. Biol., 7, 1–46.

Schott,J.R. (1997) Matrix Analysis for Statistics. John Wiley, New York.

Vinga,S. and Almeida,J.S. (2003) Alignment-free sequence comparison – a review. Bioinformatics, 19, 513–523.

Webb,B.-J.M., Liu,J.S. and Lawrence,C.E. (2002) BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res., 30, 1268–1277.


Table 1. Protein datasets used in this study

Dataset         All-α                   All-β                   α/β                     α + β                   Total
                do    fa    sf    cf    do    fa    sf    cf    do    fa    sf    cf    do    fa    sf    cf
PDB40-B (1.61)  867   409   257   151   1051  362   213   111   1237  467   190   117   1065  487   307   212   4220
PDB40-v (1.61)  285   35    28    27    517   43    30    24    542   58    40    31    339   39    37    33    1683
PDB40-b (1.35)  220   128   97    73    309   150   115   54    285   154   98    66    240   147   115   80    1054

For each protein set: number of sequences or domains (do), families (fa), superfamilies (sf) and folds (cf) in each class; the Total column is the total number of domains. PDB40-B: sequences that share <40% identity to each other, current release (1.61) of SCOP/ASTRAL (not tested). PDB40-v: set derived from PDB40-B (1.61) by excluding sequences with unknown amino acids and families with <5 domains. PDB40-b: sequence dataset used by Luo et al. (2002), which corresponds to a previous release (1.35) of the same database.

RESULTS AND DISCUSSIONIn the following sections we present some of the resultsobtained For extensive and additional results regardingall metrics and datasets see also the web page httpbioinformaticsmusceduwmetric where the complete graphsand tables are shown (data not shown due to space limitations)

Complete dataset

ROC curves and AUC values The ROC curves obtained forthe complete dataset (Table 1) are presented in Figures 2(PDB40-v) and 3 (PDB40-b) As overviewed in the Systemsand Methods section a random classifier would have identicalvalues of sensitivity and (1minusspecificity) for any thresholdvalue considered (dashed diagonal)

Figures 4 and 5 provide graphs with the areas under ROCcurves (AUC) obtained for both datasets and each SCOP levelThe AUC values are typically used as a measure of overalldiscrimination accuracy

As would be expected Figures 4 and 5 show that the AUCdecreases from family to class level for both datasets Thesequence similarity between proteins sharing the same fam-ily is still well recognized Consequently all the distancesachieve their best discrimination accuracy at this level Atclass level classification relationships reflect similar struc-tures which can have completely different sequences andamino acid compositions This underlies the observation thatsequence similarity is lost regardless of the metric fromfamily to class The comparative discriminant value of thedifferent metrics (Figs 4 and 5) shows two clear trends Firstat family level alignment has a clear advantage with AUCvalues of 086 and 081 (PDB40b and PDB40v sets) whereasall word-statistics metrics perform at or under 075 and 068respectively The most discriminant word-statistics metric atfamily level is the novel Wm introduced by this report (seeSystems and Methods section) reflecting the value of weight-ing the quadratic form [Equation (2)] by evolutionary ratherthan statistical criteria At the superfamily level the advantageof alignment remains but statistically weighting performs just

as well as the Wm Interestingly the unweighted Euclideanmetric covariance weighting Mahalanobis and information-based KullbackndashLeibler lag behind The main surprise of thisanalysis is to be observed at the next level the fold wherethe standard Euclidean metric performs as well as align-ment scores in both versions of SCOP especially for thelow specificityhigh sensitivity range (corresponds to manyFP relationships) In fact standard Euclidean is clearly morediscriminant than SW for 1minusspecificity values around 075Finally at the class level the absence of conserved segments infact turns alignment into a computationally expensive proced-ure to score amino acid composition differences At this pointmost alignment-free metrics outperform it The inspectionof the ROC curves themselves (Figs 2 and 3) further docu-ments this comparison between metrics The results obtainedare slightly less discriminant for the more recent version ofthe protein dataset (PDB40-v) for all the levels except forclass where higher values of AUC are obtained Howeverthere are no significant changes in their relative ordering It isnoteworthy that there is also a dependency between levels asregards classification accuracy Hits at a lower level may beargued to bias for more populated grouping at upper levelsHowever it should be noted that this study is of exploratoryrather than discriminant nature which places any pairwisecomparison regardless of the SCOP classification level onan equal standing

Variations in the Wm definition The Wm AUC values inthe previous graphics were obtained using the scoring matrixBLOSUM50 The results using BLOSUM40 BLOSUM62and PAM250 are virtually the same and will be omitted Nev-ertheless those results were compiled and are made availableat the support web page (see Availability) It is interesting tonote that although defining a different score for each domainpair the different matrices W produce the same score order-ing Similarly all the normalization procedures did not leadto improved discrimination producing worse classificationresults but are still made available in the same web page

210

at University of P

ortland on May 24 2011

bioinformaticsoxfordjournalsorg

Dow

nloaded from

Comparative evaluation of word composition distances

0 01 02 03 04 05 06 07 08 09 10

01

02

03

04

05

06

07

08

09

1

1-spe

sen

fa

SWWmsecokueuma

0 01 02 03 04 05 06 07 08 09 10

01

02

03

04

05

06

07

08

09

1

1-spe

sen

sf

SWWmsecokueuma

0 01 02 03 04 05 06 07 08 09 10

01

02

03

04

05

06

07

08

09

1

1-spe

sen

cf

SWWmsecokueuma

0 01 02 03 04 05 06 07 08 09 10

01

02

03

04

05

06

07

08

09

1

1-spe

sen

cl

SWWmsecokueuma

Fig 2 ROC curves for PDB40-v dataset Sensitivity (sen) versus 1minusspecificity (spe) SCOP levels family (fa) superfamily (sf) class fold(cf) and class (cl) Metrics SmithndashWaterman (SW) W-metric (Wm) standard Euclidean (se) cosine (co) KullbackndashLeibler (ku) Euclidean(eu) and Mahalanobis (ma) A random classifier would generate equal proportions of FP and TP classifications which corresponds to theROC diagonal (dashed line) Correspondingly the better classification schemes have plots with higher values of sensitivity for equal valuesof specificity resulting in higher values for the areas under the curve (AUC see Text) SW is the best at family and superfamily levels Wmand se outperform other alignment-free metrics Standard Euclidean is the best at fold level for high sensitivitylow specificity values Forclass level all metrics have similar results slightly above random guessing

Higher order tuples We also tested higher order word com-position metrics calculating 2- and 3-tuple distances betweenthe domains for eu se ku and co Somewhat intriguing was thefact that for all levels of classification discrimination worsened(see web page) However it should be noted that the highdimension of the frequency vectors in these cases (respect-ively 400 and 8000) and the relative low dimension of thesequences length itself (mean values around 175 amino acids)caused the frequency vector f to be very sparse Additionalproblems arising from this increased dimensionality of dataare the need to increase sampling size in order to maintain

accuracy which goes along with the lsquocurse of dimensionalityrsquo(Donoho 2000) Consequently only the results obtained forone-tuples were presented in this report The weighting pro-posed as observed before for the one-tuple scenario mightnot be the best for the recognition of the relationships Oneidea worth exploring would be to extract some effective higherorder tuples by adequate selection of the weights thus optim-izing the classification accuracy and avoiding hopefully thedimensionality problem However this would lead to dis-criminatory and optimization procedures which are out ofthe scope of this exploratory study

211

at University of P

ortland on May 24 2011

bioinformaticsoxfordjournalsorg

Dow

nloaded from

SVinga et al

0 01 02 03 04 05 06 07 08 09 10

01

02

03

04

05

06

07

08

09

1

1-spe

sen

fa

SWWmsecokueuma

0 01 02 03 04 05 06 07 08 09 10

01

02

03

04

05

06

07

08

09

1

1-spe

sen

sf

SWWmsecokueuma

0 01 02 03 04 05 06 07 08 09 10

01

02

03

04

05

06

07

08

09

1

1-spe

sen

cf

SWWmsecokueuma

0 01 02 03 04 05 06 07 08 09 10

01

02

03

04

05

06

07

08

09

1

1-spe

sen

cl

SWWmsecokueuma

Fig 3 ROC curves for PDB40-b dataset Sensitivity (sen) versus 1 minus specificity (spe) SCOP levels fa sf cf and cl Metrics SmithndashWaterman(SW) W-metric (Wm) standard Euclidean (se) cosine (co) KullbackndashLeibler (ku) Euclidean (eu) and Mahalanobis (ma) The classificationaccuracies for this dataset are slightly better than for the PDB40-v dataset (Fig 2) The qualitative relation between the metrics is maintained

Computational performance It is noteworthy that the SWalgorithm is computationally intensive Its running times canbe 1000-fold longer than that of the other metrics here com-pared For example in PDB40-v dataset SW took sim80 h andWm just 5 min using a 700 MHz PentiumIII with 1 GBtotal memory The other word composition metrics themselveshave varied computation implementation efficiencies (Vingaand Almeida 2003)

Stratified analysis by class

AUC values In order to compare the metrics we also con-ducted additional studies for each of the four classes (all-αall-β αβ and α +β) separately The AUC values are repres-ented in Figure 6 for SW alignment scores and se distance

the two metrics that emerged as the most discriminant in theprevious analysis (Figs 2ndash5) (see web page for similar analysisfor the other metrics)

It is easier to recognize family relationships by alignment(Fig 6 black symbols) for proteins belonging to class all-αwhere values are above the overall accuracy (AUC values ran-ging from 070 to 087) and for α + β class (AUC from 070to 091) The class where these relationships seem more dif-ficult to detect was the class all-β where we obtained thelowest AUC values for this level (060ndash077) For superfam-ily level class α + β enables a surprising accuracy for bothmetrics (AUC from 070 to 090) as opposed to class all-βwhere the superfamily relationships are still harder to detectonly by sequence inspection (AUC between 055 and 064)

212

at University of P

ortland on May 24 2011

bioinformaticsoxfordjournalsorg

Dow

nloaded from

Comparative evaluation of word composition distances

fa sf cf cl05

055

06

065

07

075

08

085

09

level

AU

C

SWWmsecokueuma

Fig 4 AUC values for PDB40-v dataset for each hierarchical levelSCOP levels fa sf cf and cl Metrics SmithndashWaterman (SW)W-metric (Wm) standard Euclidean (se) cosine (co) KullbackndashLeibler (ku) Euclidean (eu) and Mahalanobis (ma) Areas underROC curves of Figure 2 Higher AUC values correspond to betterclassification schemes All the distances achieve their best discrim-ination accuracy at family level This figure illustrates the loss ofdiscrimination as the target of classification moves up in the SCOPlevel from family to class

fa sf cf cl05

055

06

065

07

075

08

085

09

level

AU

C

SWWmsecokueuma

Fig 5 AUC values for PDB40-b dataset for each hierarchical levelSCOP levels fa sf cf and cl Metrics SmithndashWaterman (SW)W-metric (Wm) standard Euclidean (se) cosine (co) KullbackndashLeibler (ku) Euclidean (eu) and Mahalanobis (ma) Areas underROC curves of Figure 3 The results are slightly more discriminantfor this dataset than for PDB40-v (Fig 4) but with no significantchanges in the metricsrsquo relative ordering

At fold level all-α class retains the higher AUC values forboth metrics (069ndash081) The graph obtained for PDB40-b isqualitatively the same (see web page) with a difference theAUC values for fold level are much lower for all-α and α +β

classes for both metrics

[Figure 6: plot of AUC (y-axis, 0.55–0.95) against SCOP level (x-axis: fa, sf, cf, cl); series: total, all-α, all-β, α/β and α+β.]

Fig. 6. Stratified analysis by class in PDB40-v dataset. AUC values for SW algorithm (black) and se distance (gray) for each class: total set, all-α, all-β, α/β and α+β. Smith–Waterman is generally a better classification scheme (higher AUC values). At family level, the best results are for proteins belonging to classes all-α and α+β; the lowest AUC values were obtained for class all-β. At superfamily level, class α+β enables a surprising accuracy for both metrics, as opposed to class all-β, which has the worst results. At fold level, all-α class retains the higher AUC values for both metrics.

PDB40 version datasets comparison. There is a significant improvement of discrimination accuracy for the α+β class in the PDB40-v dataset. The difference in AUC values is consistently positive across metrics and levels, reaching a value as high as 0.21 at fold level with the SW alignment scores. It seems that the trimming procedure used to obtain the PDB40-v set (see Systems and Methods) particularly affected the all-α and α+β classes. These quantitative differences between the two datasets are noteworthy.

The α-helix and β-sheet content. Judging from published reports, protein class classification is controversial. Some studies based class classification on the percentages of α-helix and β-sheet content of each chain. In a recent report, a schematic table was presented with different definitions (Eisenhaber et al., 1996). As noted in that study, there are some regions of the space defined by those percentages that are not clearly classifiable. It is in this context of uncertainty that SCOP offers a classification that is a global measure and takes into account all the structural information of all chains in a protein.

In order to assess the correct assignment to classes and avoid arbitrary classification, we extracted the α and β content for each SCOP domain tested from the PDB web page (http://www.rcsb.org/pdb). In Figure 7, we present the α and β percentages for each domain, grouped by the corresponding SCOP class classification, obtained for the PDB40-b dataset.

[Figure 7: scatter plot of β-sheet content (y-axis, 0–70%) against α-helix content (x-axis, 0–100%), one point per domain; legend: α, β, α/β, α+β; the points for 1HYMA and 1HYMB are labelled.]

Fig. 7. The α-helix and β-sheet content (%) for each domain in PDB40-b dataset, grouped by SCOP class. The classes are interspersed. Protein 1HYM–Trypsin inhibitor V [species pumpkin (Cucurbita maxima)] is globally classified in the α+β class, but its two chains, 1HYMA and 1HYMB, have contrasting α-helix and β-sheet content.

From Figure 7, it is apparent that some domains have arguable classifications. For example, the protein with PDB identification 1HYM–Trypsin inhibitor V [species pumpkin (Cucurbita maxima)] has two chains that correspond to two SCOP domains. Domain 1HYMA has 24.44% of α-helix and 0% of β-sheet (labelled * symbol close to the x-axis in Fig. 7) and domain 1HYMB has 0% of α-helix and 33.33% of β-sheet (labelled * symbol close to the y-axis in Fig. 7). Nevertheless, the whole protein was classified in the α+β class, in spite of the fact that each of its chains, taken individually, would be classified in other classes. The SCOP classification is global in the sense that it looks at the whole protein rather than at a particular domain; therefore, classifying chains of 1HYM as α+β is formally correct. Interestingly, a multivariate analysis of variance (MANOVA) of the amino acid composition in the four classes leads to similar results (see web page), showing that class α+β is clearly intermixed with the others in terms of α and β content.

CONCLUSION

In this report, we quantitatively compared several protein dissimilarity measures based on L-tuple composition with alignment scores obtained with the Smith–Waterman algorithm. A new metric, the W-metric, which combines both approaches by including word-statistics information weighted by scoring matrices, is described.
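As an illustration of what "word-statistics information weighted by scoring matrices" can mean, the sketch below treats the distance as a quadratic form over the difference of two composition vectors. This is one plausible reading only: the function name, the three-letter toy alphabet and the weight values are assumptions made for the example, and the weights are not taken from BLOSUM, PAM or any other published matrix.

```python
def weighted_composition_distance(fx, fy, weights):
    """Quadratic form (fx - fy)^T W (fx - fy) over two composition vectors.

    fx, fy:  word (here one-tuple) frequency vectors of equal length.
    weights: square matrix W; a hypothetical stand-in for a scoring matrix.
    """
    diff = [x - y for x, y in zip(fx, fy)]
    n = len(diff)
    return sum(diff[i] * weights[i][j] * diff[j]
               for i in range(n) for j in range(n))


# Toy three-letter alphabet (illustrative values only).
fx = [0.5, 0.5, 0.0]
fy = [0.0, 0.5, 0.5]

identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
# Letters 0 and 2 are declared "similar" through a positive off-diagonal weight.
similar_0_2 = [[1.0, 0.0, 0.5],
               [0.0, 1.0, 0.0],
               [0.5, 0.0, 1.0]]

d_plain = weighted_composition_distance(fx, fy, identity)        # squared Euclidean
d_weighted = weighted_composition_distance(fx, fy, similar_0_2)  # smaller distance
```

With the identity matrix the form reduces to the squared Euclidean distance; a positive off-diagonal weight lowers the distance when one sequence replaces a letter by a similar one, which is the kind of bridge between composition and substitution scores that the paragraph describes.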

The accuracy of each metric in detecting protein relationships was assessed through the four hierarchical levels of the SCOP/ASTRAL database. The comparative protocol employed the AUCs, which are a good measure of the overall accuracy of a classification scheme.
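To make the AUC criterion concrete, the following minimal sketch computes the area under the ROC curve directly from a list of pairwise distances and a parallel list of labels, using the rank (Mann–Whitney) interpretation of the AUC. The function name and the toy data are assumptions for illustration; this is not the MATLAB code used in the study.

```python
def auc_from_distances(distances, related):
    """Area under the ROC curve for a dissimilarity used as a classifier score.

    distances: one value per protein pair (smaller means more similar).
    related:   parallel booleans, True when the pair shares the SCOP level tested.
    Returns the fraction of (related, unrelated) pair combinations in which the
    related pair has the smaller distance; ties count one half.
    """
    pos = [d for d, r in zip(distances, related) if r]
    neg = [d for d, r in zip(distances, related) if not r]
    if not pos or not neg:
        raise ValueError("need at least one related and one unrelated pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two related pairs and two unrelated pairs.
toy_auc = auc_from_distances([0.1, 0.4, 0.35, 0.8], [True, True, False, False])
# toy_auc == 0.75: three of the four (related, unrelated) comparisons are ordered correctly.
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation. The double loop is quadratic in the number of pairs, so for datasets of realistic size a sort-based rank computation would be preferable.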

The SW alignment score was shown to be the most discriminant at family and superfamily levels. At family level, the Wm is clearly more discriminant than the other L-tuple distances for sensitivity values between 0.5 and 0.8. From superfamily to class levels, all metrics lose discriminant power and converge to similar AUC values, which makes it counterproductive to use computationally intensive alignment algorithms to detect those relationships. At fold level, standard Euclidean distance outperforms most of the metrics, achieving an unexpected accuracy in the high sensitivity/low specificity range. This important result anticipates its use in providing a conservative pre-screening procedure for this problem category. In fact, since L-tuple methods are computationally much lighter, they can be useful to pre-select similar proteins before applying the alignment algorithms, thus combining the powerful aspects of each technique and greatly improving heuristic methods in sequence similarity searches.
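The pre-screening idea can be sketched as follows: rank database sequences by a cheap composition distance and keep only the closest candidates for the expensive alignment step. The sketch assumes that "standard Euclidean" means a Euclidean distance in which each amino acid frequency is scaled by its variance across the database; the function names, the `keep` parameter and the toy sequences are hypothetical.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Relative frequency of each of the 20 amino acids (one-tuples)."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def standardized_euclidean(fx, fy, variances):
    """Euclidean distance with each coordinate divided by its variance.

    Coordinates with zero variance are skipped to avoid division by zero.
    """
    return math.sqrt(sum((x - y) ** 2 / v
                         for x, y, v in zip(fx, fy, variances) if v > 0))


def prescreen(query, database, keep):
    """Return the names of the `keep` database sequences closest to `query`."""
    profiles = {name: composition(seq) for name, seq in database.items()}
    columns = list(zip(*profiles.values()))
    means = [sum(col) / len(col) for col in columns]
    variances = [sum((x - m) ** 2 for x in col) / len(col)
                 for col, m in zip(columns, means)]
    fq = composition(query)
    ranked = sorted(profiles,
                    key=lambda name: standardized_euclidean(fq, profiles[name], variances))
    return ranked[:keep]


# Toy database (made-up sequences, for illustration only).
toy_db = {"near": "ACDEACDA", "far": "WWWWYYYY", "mid": "ACDWWYYY"}
shortlist = prescreen("ACDEACDE", toy_db, keep=2)
# shortlist == ["near", "mid"]; only these would be passed on to an aligner.
```

The cost of the filter is linear in sequence length per comparison, so the alignment algorithm only has to be run on the shortlist.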

The graph showing α-helix and β-sheet content for each domain shows that class classification cannot be inferred directly from that information, at least for mixed classes. Therefore, it might be advantageous in some applications to reconsider the protein class classification of each domain by exploring the distribution of sequence distances with unsupervised learning algorithms.

ACKNOWLEDGEMENTS

The authors thank John Schwacke of the Medical University of South Carolina for providing streamlined MATLAB code for Smith–Waterman alignment and Steven Brenner of the University of California at Berkeley for precious advice on the use of the PDB40-B set. The authors thankfully acknowledge the financial support by grants SFRH/BD/3134/2000 to S.V. and SAPIENS/34794/99 from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese Ministério da Ciência e do Ensino Superior. R.G.-O. thankfully acknowledges grant QLK2-CT-2000-01020 (EURIS) from the European Commission. This work was also supported in part by the NHLBI Proteomics Initiative through contract N01-HV-28181 and a Cancer Center grant from the Department of Energy (C.E. Reed, PI).

REFERENCES

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic Local Alignment Search Tool. J. Mol. Biol., 215, 403–410.

Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A. and Nielsen, H. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–424.

Bradley, A.P. (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog., 30, 1145–1159.

Brenner, S.E., Chothia, C. and Hubbard, T.J. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.

Brenner, S.E., Koehl, P. and Levitt, M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res., 28, 254–256.


Chandonia, J.M., Walker, N.S., Lo Conte, L., Koehl, P., Levitt, M. and Brenner, S.E. (2002) ASTRAL compendium enhancements. Nucleic Acids Res., 30, 260–263.

Chothia, C., Hubbard, T., Brenner, S., Barns, H. and Murzin, A. (1997) Protein folds in the all-beta and all-alpha classes. Annu. Rev. Biophys. Biomol. Struct., 26, 597–627.

Dayhoff, M.O., Schwartz, R. and Orcutt, B. (1978) A model of evolutionary change in proteins. In Dayhoff, M.O. (ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, Vol. 5 (Suppl. 3), pp. 345–352.

Donoho, D.L. (2000) Aide-Memoire. High-dimensional data analysis: the curses and blessings of dimensionality. Department of Statistics, Stanford University.

Dubchak, I., Muchnik, I., Mayor, C., Dralyuk, I. and Kim, S.H. (1999) Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins, 35, 401–407.

Egan, J.P. (1975) Signal Detection Theory and ROC-Analysis. Academic Press, New York.

Eisenhaber, F., Frommel, C. and Argos, P. (1996) Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins, 25, 169–179.

Ewens, W.J. and Grant, G.R. (2001) Statistical Methods in Bioinformatics: An Introduction. Springer, New York.

Green, R.E. and Brenner, S.E. (2002) Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison. Proc. IEEE, 90, 1834–1847.

Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.

Karwath, A. and King, R.D. (2002) Homology induction: the use of machine learning to improve sequence similarity searches. BMC Bioinformatics, 3, 11.

Lindahl, E. and Elofsson, A. (2000) Identification of related proteins on family, superfamily and fold level. J. Mol. Biol., 295, 613–625.

Lo Conte, L., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.

Luo, R.Y., Feng, Z.P. and Liu, J.K. (2002) Prediction of protein structural class by amino acid and polypeptide composition. Eur. J. Biochem., 269, 4219–4225.

Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

Park, J., Teichmann, S.A., Hubbard, T. and Chothia, C. (1997) Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol., 273, 349–354.

Pearson, W.R. (1991) Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith–Waterman and FASTA algorithms. Genomics, 11, 635–650.

Pearson, W.R. (1995) Comparison of methods for searching protein sequence databases. Protein Sci., 4, 1145–1160.

Pearson, W.R. (2000) Protein sequence comparison and Protein evolution. Tutorial, ISMB2000.

Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

Reinert, G., Schbath, S. and Waterman, M.S. (2000) Probabilistic and statistical properties of words: an overview. J. Comput. Biol., 7, 1–46.

Schott, J.R. (1997) Matrix Analysis for Statistics. John Wiley, New York.

Vinga, S. and Almeida, J.S. (2003) Alignment-free sequence comparison-a review. Bioinformatics, 19, 513–523.

Webb, B.-J.M., Liu, J.S. and Lawrence, C.E. (2002) BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res., 30, 1268–1277.


[Figure 2: four ROC panels, one per SCOP level (fa, sf, cf, cl); each plots sensitivity (sen, y-axis, 0–1) against 1 − specificity (1-spe, x-axis, 0–1) for the metrics SW, Wm, se, co, ku, eu and ma.]

Fig. 2. ROC curves for PDB40-v dataset. Sensitivity (sen) versus 1 − specificity (spe). SCOP levels: family (fa), superfamily (sf), class fold (cf) and class (cl). Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). A random classifier would generate equal proportions of FP and TP classifications, which corresponds to the ROC diagonal (dashed line). Correspondingly, the better classification schemes have plots with higher values of sensitivity for equal values of specificity, resulting in higher values for the areas under the curve (AUC, see Text). SW is the best at family and superfamily levels. Wm and se outperform other alignment-free metrics. Standard Euclidean is the best at fold level for high sensitivity/low specificity values. For class level, all metrics have similar results, slightly above random guessing.

Higher order tuples. We also tested higher order word composition metrics, calculating 2- and 3-tuple distances between the domains for eu, se, ku and co. Somewhat intriguing was the fact that, for all levels of classification, discrimination worsened (see web page). However, it should be noted that the high dimension of the frequency vectors in these cases (respectively, 400 and 8000) and the relatively short length of the sequences themselves (mean values around 175 amino acids) caused the frequency vector f to be very sparse. An additional problem arising from this increased dimensionality of the data is the need to increase sampling size in order to maintain accuracy, which goes along with the 'curse of dimensionality' (Donoho, 2000). Consequently, only the results obtained for one-tuples were presented in this report. The weighting proposed, as observed before for the one-tuple scenario, might not be the best for the recognition of the relationships. One idea worth exploring would be to extract some effective higher order tuples by adequate selection of the weights, thus optimizing the classification accuracy and, hopefully, avoiding the dimensionality problem. However, this would lead to discriminatory and optimization procedures, which are out of the scope of this exploratory study.
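The sparsity argument can be checked with simple counting. A sequence of length n contains n − L + 1 overlapping L-tuples, while the frequency vector has 20^L entries, so at most (n − L + 1)/20^L of the entries can be non-zero. The sketch below uses the dimensions and the mean length of about 175 residues quoted above; the function names are illustrative.

```python
from collections import Counter


def tuple_counts(seq, L):
    """Count the overlapping L-tuples (words of length L) in a sequence."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))


def occupancy_upper_bound(seq_len, L, alphabet=20):
    """Largest possible fraction of non-zero entries in the L-tuple vector."""
    dims = alphabet ** L
    words = max(seq_len - L + 1, 0)
    return min(words, dims) / dims


bounds = {L: occupancy_upper_bound(175, L) for L in (1, 2, 3)}
# L = 1:   20 dimensions, every entry can be non-zero
# L = 2:  400 dimensions, at most 174 non-zero entries
# L = 3: 8000 dimensions, at most 173 non-zero entries
```

For 3-tuples, roughly 98% of the frequency vector is guaranteed to be zero for a sequence of average length, which is consistent with the loss of discrimination reported for higher order tuples.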


[Figure 3: four ROC panels, one per SCOP level (fa, sf, cf, cl); each plots sensitivity (sen, y-axis, 0–1) against 1 − specificity (1-spe, x-axis, 0–1) for the metrics SW, Wm, se, co, ku, eu and ma.]

Fig. 3. ROC curves for PDB40-b dataset. Sensitivity (sen) versus 1 − specificity (spe). SCOP levels: fa, sf, cf and cl. Metrics: Smith–Waterman (SW), W-metric (Wm), standard Euclidean (se), cosine (co), Kullback–Leibler (ku), Euclidean (eu) and Mahalanobis (ma). The classification accuracies for this dataset are slightly better than for the PDB40-v dataset (Fig. 2). The qualitative relation between the metrics is maintained.

Computational performance. It is noteworthy that the SW algorithm is computationally intensive. Its running times can be 1000-fold longer than those of the other metrics compared here. For example, in the PDB40-v dataset, SW took ∼80 h and Wm just 5 min, using a 700 MHz Pentium III with 1 GB total memory. The other word composition metrics themselves have varied computational implementation efficiencies (Vinga and Almeida, 2003).
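As a quick check of the order of magnitude, the two running times quoted above can be converted to a ratio. The timings are the approximate values from the text; the variable names are illustrative.

```python
# Approximate running times quoted for the PDB40-v dataset.
sw_minutes = 80 * 60   # Smith-Waterman: about 80 hours
wm_minutes = 5         # W-metric: about 5 minutes

speed_ratio = sw_minutes / wm_minutes
# speed_ratio == 960.0, i.e. close to the 1000-fold figure
```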

Stratified analysis by class

AUC values. In order to compare the metrics, we also conducted additional studies for each of the four classes (all-α, all-β, α/β and α+β) separately. The AUC values are represented in Figure 6 for SW alignment scores and se distance, the two metrics that emerged as the most discriminant in the previous analysis (Figs 2–5) (see web page for similar analysis for the other metrics).

It is easier to recognize family relationships by alignment (Fig. 6, black symbols) for proteins belonging to class all-α, where values are above the overall accuracy (AUC values ranging from 0.70 to 0.87), and for the α+β class (AUC from 0.70 to 0.91). The class where these relationships seem more difficult to detect was the class all-β, where we obtained the lowest AUC values for this level (0.60–0.77). For superfamily level, class α+β enables a surprising accuracy for both metrics (AUC from 0.70 to 0.90), as opposed to class all-β, where the superfamily relationships are still harder to detect only by sequence inspection (AUC between 0.55 and 0.64).

At fold level all-α class retains the higher AUC values forboth metrics (069ndash081) The graph obtained for PDB40-b isqualitatively the same (see web page) with a difference theAUC values for fold level are much lower for all-α and α +β

classes for both metrics

PDB40 version datasets comparison There is a significantimprovement of discrimination accuracy for α + β class in

fa sf cf cl055

06

065

07

075

08

085

09

095

level

AU

C

totalall-αall-βαβα+β

Fig 6 Stratified analysis by class in PDB40-v dataset AUC valuesfor SW algorithm (black) and se distance (gray) for each class totalset all-α all-β αβ and α + β SmithndashWaterman is generally abetter classification schememdashhigher AUC values At family levelthe best results are for proteins belonging to classes all-α and α +βthe lowest AUC values where obtained for class all-β At superfamilylevel class α + β enables a surprising accuracy for both metrics asopposed to class all-β which has the worse results At fold levelall-α class retains the higher AUC values for both metrics

PDB40-v dataset The difference in AUC values is constantlypositive for different metrics and levels reaching a valueas high as 021 at fold level with the SW alignment scoresIt seems that the trimming procedure taken when obtainingPDB40-v set (see Systems and Methods) affected particularlyall-α and α + β classes It is noteworthy these quantitativelydifferences obtained for the two datasets

The α-helix and β-sheet content. Judging from published reports, protein class classification is controversial. Some studies based class classification on the percentages of α-helix and β-sheet content of each chain. In a recent report, a schematic table was presented with different definitions (Eisenhaber et al., 1996). As noted in that study, there are some regions of the space defined by those percentages that are not clearly classifiable. It is in this context of uncertainty that SCOP offers a classification that is a global measure and takes into account all the structural information of all chains in a protein.

In order to assess the correct assignment to classes and avoid arbitrary classification, we extracted the α and β content for each SCOP domain tested from the PDB web page (http://www.rcsb.org/pdb). In Figure 7 we present the α and β percentages for each domain, grouped by the corresponding SCOP class, for the PDB40-b dataset.

From Figure 7 it is apparent that some domains have arguable classifications. For example, the protein with PDB identification 1HYM, Trypsin inhibitor V [species: pumpkin (Cucurbita maxima)], has two chains that correspond to two SCOP domains. Domain 1HYMA has 24.44% of α-helix and 0% of β-sheet (labelled with a ∗ symbol close to the x-axis in Fig. 7) and domain 1HYMB has 0% of α-helix and 33.33% of β-sheet (labelled with a ∗ symbol close to the y-axis in Fig. 7). Nevertheless, the whole protein was classified in the α + β class, in spite of the fact that each of its chains, taken individually, would be classified in other classes. The SCOP classification is global in the sense that it looks at the whole protein rather than at a particular domain; therefore, classifying the chains of 1HYM as α + β is formally correct. Interestingly, a multivariate analysis of variance (MANOVA) of the amino acid composition in the four classes leads to similar results (see web page), showing that class α + β is clearly intermixed with the others in terms of α and β content.

Fig. 7. The α-helix and β-sheet content (%) for each domain in the PDB40-b dataset, grouped by SCOP class (scatter plot of α content on the x-axis against β content on the y-axis). The classes are interspersed. Protein 1HYM, Trypsin inhibitor V [species: pumpkin (Cucurbita maxima)], is globally classified in the α + β class, but its two chains, 1HYMA and 1HYMB, have contrasting α-helix and β-sheet content.

CONCLUSION
In this report, we quantitatively compared several protein dissimilarity measures based on L-tuple composition with alignment scores obtained with the Smith–Waterman algorithm. A new metric, the W-metric, which combines both approaches by including word-statistics information weighted by scoring matrices, is described.
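One way to read "word-statistics information weighted by scoring matrices" is as a quadratic form over differences in amino acid frequencies. The sketch below assumes that form; the function name `w_metric`, the three-letter alphabet and the weight values are hypothetical and are not taken from any published scoring matrix.

```python
def w_metric(fx, fy, W):
    """Quadratic form (fx - fy)^T W (fx - fy) over 1-tuple frequencies.

    fx, fy: frequency vectors of two sequences, in the same residue order.
    W:      square weight matrix (assumed here to stand in for a scoring matrix).
    """
    diff = [a - b for a, b in zip(fx, fy)]
    n = len(diff)
    return sum(diff[i] * W[i][j] * diff[j] for i in range(n) for j in range(n))


# Hypothetical three-letter alphabet; residues 2 and 3 are treated as similar.
fx = [0.5, 0.5, 0.0]
fy = [0.5, 0.0, 0.5]
diagonal_only = [[4, 0, 0], [0, 4, 0], [0, 0, 4]]
with_similarity = [[4, 0, 0], [0, 4, 2], [0, 2, 4]]

d_diagonal = w_metric(fx, fy, diagonal_only)      # composition difference only
d_weighted = w_metric(fx, fy, with_similarity)    # swap of similar residues costs less
```

With a diagonal matrix the measure is a scaled squared Euclidean distance; a positive off-diagonal weight lowers the distance when one residue is replaced by a similar one, which is the sense in which such a measure sits between composition and alignment methods.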

The accuracy of each metric in detecting protein relationships was assessed at the four hierarchical levels of the SCOP/ASTRAL database. The comparative protocol employed the AUCs, which are a good measure of the overall accuracy of a classification scheme.
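As an illustration of how an AUC can be obtained for a dissimilarity measure, the sketch below uses the rank (Mann-Whitney) form of the area under the ROC curve. It is not the authors' MATLAB code; the function name and the toy values are hypothetical.

```python
def auc_from_distances(distances, related):
    """Area under the ROC curve for a dissimilarity measure.

    distances: one dissimilarity per sequence pair (smaller = more similar).
    related:   True when the pair shares the SCOP level being tested.
    The AUC equals the probability that a related pair has a smaller
    distance than an unrelated pair; ties count one half.
    """
    pos = [d for d, r in zip(distances, related) if r]
    neg = [d for d, r in zip(distances, related) if not r]
    if not pos or not neg:
        raise ValueError("need at least one related and one unrelated pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four related and four unrelated pairs.
toy_distances = [0.1, 0.2, 0.4, 0.7, 0.3, 0.6, 0.8, 0.9]
toy_related = [True, True, True, True, False, False, False, False]
toy_auc = auc_from_distances(toy_distances, toy_related)
```

An AUC of 0.5 corresponds to a measure that cannot separate related from unrelated pairs, and 1.0 to perfect separation.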

The SW alignment score was shown to be the most discriminant at family and superfamily levels. At family level, the Wm is clearly more discriminant than the other L-tuple distances for sensitivity values between 0.5 and 0.8. From superfamily to class levels, all metrics lose discriminant power and converge to similar AUC values, which makes it counterproductive to use computationally intensive alignment algorithms to detect those relationships. At fold level, standard Euclidean distance outperforms most of the metrics, achieving an unexpected accuracy in the high-sensitivity/low-specificity range. This important result anticipates its use as a conservative pre-screening procedure for this problem category. In fact, since L-tuple methods are computationally much lighter, they can be used to pre-select similar proteins before applying the alignment algorithms, thus combining the strengths of each technique and greatly improving heuristic methods in sequence similarity searches.
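The pre-screening idea can be sketched as follows: rank library sequences by a standardized Euclidean distance between word-frequency vectors and pass only the closest ones to the alignment step. This is a minimal sketch, assuming that each word's variance is estimated from the library itself; the function names, the toy sequences and the `keep` cut-off are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def word_frequencies(seq, L=1):
    """Relative frequencies of overlapping L-tuples, in a fixed word order."""
    words = ["".join(w) for w in product(AMINO_ACIDS, repeat=L)]
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def word_variances(vectors):
    """Sample variance of each word frequency across the library (needs >= 2 vectors)."""
    n = len(vectors)
    variances = []
    for column in zip(*vectors):
        mean = sum(column) / n
        variances.append(sum((x - mean) ** 2 for x in column) / (n - 1))
    return variances


def standard_euclidean(fx, fy, variances):
    """Squared frequency differences, each scaled by that word's variance.

    Words with zero variance in the library are skipped (an assumption
    made here to avoid division by zero).
    """
    return sum((a - b) ** 2 / v for a, b, v in zip(fx, fy, variances) if v > 0)


def prefilter(query, library, keep=2, L=1):
    """Return the names of the `keep` library sequences closest to the query."""
    vectors = {name: word_frequencies(seq, L) for name, seq in library.items()}
    variances = word_variances(list(vectors.values()))
    fq = word_frequencies(query, L)
    ranked = sorted(vectors, key=lambda name: standard_euclidean(fq, vectors[name], variances))
    return ranked[:keep]


# Hypothetical toy library; only the returned candidates would be aligned.
library = {
    "helical_like": "AAEELLKKAAEELLKK",
    "sheet_like": "VTVTVYVTVTVYVTVT",
    "mixed": "AEVTLKVYAEVTLKVY",
}
candidates = prefilter("AAEELLKKAEELLKKA", library, keep=2)
```

Because the word counts are obtained in a single pass over each sequence, the ranking step is much cheaper than aligning the query against every library entry.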

The graph showing α-helix and β-sheet content for each domain shows that class classification cannot be inferred directly from that information, at least for mixed classes. Therefore, it might be advantageous in some applications to reconsider the protein class classification of each domain by exploring the distribution of sequence distances with unsupervised learning algorithms.
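The text does not name a particular unsupervised algorithm. As one possible illustration, the sketch below groups domains by single-linkage clustering at a fixed distance threshold; the domain names, distances and threshold are hypothetical.

```python
def single_linkage_groups(names, dist, threshold):
    """Group items connected by chains of pairwise distances <= threshold.

    names: list of item identifiers.
    dist:  dict mapping (a, b) to a distance, with a listed before b in `names`.
    """
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist[(a, b)] <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical distances between four domains.
names = ["d1", "d2", "d3", "d4"]
dist = {
    ("d1", "d2"): 0.10, ("d1", "d3"): 0.90, ("d1", "d4"): 0.80,
    ("d2", "d3"): 0.85, ("d2", "d4"): 0.90, ("d3", "d4"): 0.20,
}
groups = single_linkage_groups(names, dist, threshold=0.5)
```

The resulting groups could then be compared with the SCOP class labels to see where sequence distances and structural classes disagree.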

ACKNOWLEDGEMENTS
The authors thank John Schwacke of the Medical University of South Carolina for providing streamlined MATLAB code for Smith–Waterman alignment and Steven Brenner of the University of California at Berkeley for precious advice on the use of the PDB40-B set. The authors thankfully acknowledge the financial support by grants SFRH/BD/3134/2000 to S.V. and SAPIENS/34794/99 from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese Ministério da Ciência e do Ensino Superior. R.G.-O. thankfully acknowledges grant QLK2-CT-2000-01020 (EURIS) from the European Commission. This work was also supported in part by the NHLBI Proteomics Initiative through contract N01-HV-28181 and a Cancer Center grant from the Department of Energy (C.E. Reed, PI).

REFERENCES
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic Local Alignment Search Tool. J. Mol. Biol., 215, 403–410.

Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A. and Nielsen, H. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–424.

Bradley, A.P. (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog., 30, 1145–1159.

Brenner, S.E., Chothia, C. and Hubbard, T.J. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.

Brenner, S.E., Koehl, P. and Levitt, M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res., 28, 254–256.


Chandonia, J.M., Walker, N.S., Lo Conte, L., Koehl, P., Levitt, M. and Brenner, S.E. (2002) ASTRAL compendium enhancements. Nucleic Acids Res., 30, 260–263.

Chothia, C., Hubbard, T., Brenner, S., Barns, H. and Murzin, A. (1997) Protein folds in the all-beta and all-alpha classes. Annu. Rev. Biophys. Biomol. Struct., 26, 597–627.

Dayhoff, M.O., Schwartz, R. and Orcutt, B. (1978) A model of evolutionary change in proteins. In Dayhoff, M.O. (ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, Vol. 5 (Suppl. 3), pp. 345–352.

Donoho, D.L. (2000) Aide-Memoire. High-dimensional data analysis: the curses and blessings of dimensionality. Department of Statistics, Stanford University.

Dubchak, I., Muchnik, I., Mayor, C., Dralyuk, I. and Kim, S.H. (1999) Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins, 35, 401–407.

Egan, J.P. (1975) Signal Detection Theory and ROC-Analysis. Academic Press, New York.

Eisenhaber, F., Frommel, C. and Argos, P. (1996) Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins, 25, 169–179.

Ewens, W.J. and Grant, G.R. (2001) Statistical Methods in Bioinformatics: An Introduction. Springer, New York.

Green, R.E. and Brenner, S.E. (2002) Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison. Proc. IEEE, 90, 1834–1847.

Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.

Karwath, A. and King, R.D. (2002) Homology induction: the use of machine learning to improve sequence similarity searches. BMC Bioinformatics, 3, 11.

Lindahl, E. and Elofsson, A. (2000) Identification of related proteins on family, superfamily and fold level. J. Mol. Biol., 295, 613–625.

Lo Conte, L., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.

Luo, R.Y., Feng, Z.P. and Liu, J.K. (2002) Prediction of protein structural class by amino acid and polypeptide composition. Eur. J. Biochem., 269, 4219–4225.

Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

Park, J., Teichmann, S.A., Hubbard, T. and Chothia, C. (1997) Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol., 273, 349–354.

Pearson, W.R. (1991) Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith–Waterman and FASTA algorithms. Genomics, 11, 635–650.

Pearson, W.R. (1995) Comparison of methods for searching protein sequence databases. Protein Sci., 4, 1145–1160.

Pearson, W.R. (2000) Protein sequence comparison and Protein evolution. Tutorial, ISMB2000.

Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

Reinert, G., Schbath, S. and Waterman, M.S. (2000) Probabilistic and statistical properties of words: an overview. J. Comput. Biol., 7, 1–46.

Schott, J.R. (1997) Matrix Analysis for Statistics. John Wiley, New York.

Vinga, S. and Almeida, J.S. (2003) Alignment-free sequence comparison – a review. Bioinformatics, 19, 513–523.

Webb, B.-J.M., Liu, J.S. and Lawrence, C.E. (2002) BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res., 30, 1268–1277.



