Distinguishing between enzyme sequences & non-enzymes without using alignments
Post on 30-Sep-2016
Page 1 of 1(page number not for citation purposes)
Open AccessPoster presentationDistinguishing between enzyme sequences & non-enzymes without using alignmentsAjanthah Sangaralingam* and Andrew J Doig
Address: Department of Biomolecular Sciences, Manchester University, UK.
Email: Ajanthah Sangaralingam* - A.Sangaralingam@postgrad.manchester.ac.uk
* Corresponding author
Many protein similarity searching methods rely uponfinding sequence homology to a previously annotatedprotein within a protein database. If a protein has nosequence similarity or only weak sequence similarity to anannotated protein, however, then the task of predictingprotein function is often not possible. In this study, sup-port vector machines are used to develop a method notbased on sequence alignment to predict whether a proteinsequence is an enzyme or a non-enzyme, using featuresthat are calculated from sequences.
A large non-redundant dataset was constructed from theSWISSPROT database. Only sequences that have an ECnumber and entries in the ENZYME database wereincluded in the enzyme dataset. Sequence annotation wasfiltered carefully, when selecting the non-enzyme dataset.Sequences for which annotation was sparse, annotated asunknown, probable, homologue, hypothetical andenzyme were excluded.
Features used to describe each protein sequence were: hitsto the INTERPRO sequence motif database, sequencelength, amino acid frequencies and minimum distancesbetween amino acid pairs. Each protein sequence was rep-resented as an 11429 feature vector. Using all of the fea-tures gave a prediction accuracy of 89%. Feature selectionwas performed to analyse which features were mostinformative in discriminating between the two classes.3177 INTERPRO motifs were found in the sequences usedin this study. Of these, 1326 were found in only a singlesequence. These motifs were therefore not used as featuresin further experiments. Leaving out these motifs had littleeffect on the prediction accuracy of the model. Overall,accuracy was improved when only amino acid frequencies
and INTERPRO motifs were used as features for modelbuilding. This model was built using 1871 features. Weplan to apply this model to predict the function ofsequences that are annotated as unknown in the SWISS-PROT database.
from BioSysBio: Bioinformatics and Systems Biology ConferenceEdinburgh, UK, 1415 July 2005
Published: 21 September 2005
BMC Bioinformatics 2005, 6(Suppl 3):P24 BioSysBio: Bioinformatics and Systems Biology Conference Meeting abstracts A single PDF containing all abstracts in this Supplement is available here. http://www.biomedcentral.com/content/pdf/1471-2105-6-S3-info.pdf