
EMBL-EBI Tel. +44 (0) 1223 494 444

Wellcome Trust Genome Campus [email protected]

Hinxton, Cambridgeshire, CB10 1SD, UK www.ebi.ac.uk

ECPred: Enzyme Prediction

Using Combination of Classifiers

Ahmet Sureyya Rifaioglu1, Tunca Dogan2, Omer Sinan Sarac3, Mehmet Volkan Atalay1,

Maria Jesus Martin2 and Rengul Cetin-Atalay4

ABSTRACT

Motivation: Efficient and accurate protein function prediction methods are required to annotate proteins with unknown functions. Recent studies show that combining different methods enhances prediction accuracy. In addition, data preparation and post-processing of predictions are important factors in the functional annotation of proteins.

Results: Here we propose “ECPred”, a novel hierarchical approach that predicts Enzyme Commission (EC) numbers using a combination of three classifiers: Blast-kNN, SPMap and Pepstats-SVM. ECPred combines these methods and gives a weighted mean score for each trained EC number. ECPred uses hierarchical data preparation and evaluation steps to increase the accuracy of its predictions. ECPred is trained for 851 EC classes. Cross-validation results show that ECPred predicts enzyme functions with high performance (average F-score of 0.96).

METHODOLOGY

Blast-kNN: The k-nearest neighbor algorithm is combined with BLAST. A similarity search is run against the training set of each EC number, and the k BLAST scores from the positive and negative training datasets are combined into a single score.
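The combining formula is not reproduced in the text, so the following is only a minimal sketch of how such a score could be computed. It assumes that the k highest-scoring BLAST hits are pooled across both training sets, that Sp and Sn are the sums of the positive and negative scores among those hits, and that the two sums are combined as a normalized difference (Sp - Sn) / (Sp + Sn). The function name `blast_knn_score` and the default `k` are hypothetical.

```python
from typing import Sequence


def blast_knn_score(pos_scores: Sequence[float],
                    neg_scores: Sequence[float],
                    k: int = 5) -> float:
    """Return a score in [-1, 1] from BLAST scores against both training sets.

    ASSUMPTION: the normalized difference below is an illustrative choice,
    not a formula taken from the poster text.
    """
    # Label every hit with the training set it came from.
    hits = [(s, True) for s in pos_scores] + [(s, False) for s in neg_scores]
    # Keep the k hits with the highest BLAST scores (the "nearest" neighbors).
    nearest = sorted(hits, key=lambda h: h[0], reverse=True)[:k]
    sp = sum(s for s, is_pos in nearest if is_pos)      # Sp
    sn = sum(s for s, is_pos in nearest if not is_pos)  # Sn
    total = sp + sn
    # No hits at all: return a neutral score instead of dividing by zero.
    return 0.0 if total == 0 else (sp - sn) / total
```

With this choice a query whose nearest neighbors are all positive scores 1, and one whose nearest neighbors are all negative scores -1, which matches the range of the SVM outputs.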

SPMap: SPMap is a subsequence-based feature extraction method consisting of three main modules: (i) Subsequence Extraction Module, (ii) Clustering Module, (iii) Probabilistic Profile Construction Module.

Pepstats-SVM: Pepstats is a feature-based method that calculates statistics for proteins, including molecular weight, number of residues, charge, etc. Proteins are represented as 37-dimensional vectors in Pepstats.

Vectors obtained from Pepstats and SPMap are fed independently to an SVM classifier to obtain classification scores between -1 and 1. A weighted mean score is then calculated for each query protein and each functional class, as the confidence of the prediction, by combining the SVM scores and the Blast-kNN score.
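The combination step can be sketched as a plain weighted mean. How the weights are derived is not stated here, so the weights in this example are hypothetical placeholders, as are the function name and the dictionary keys.

```python
from typing import Mapping


def weighted_mean_score(scores: Mapping[str, float],
                        weights: Mapping[str, float]) -> float:
    """Weighted mean of per-classifier scores for one protein and one EC class.

    ASSUMPTION: weights are non-negative and supplied by the caller; the
    poster does not say how they are chosen.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical scores for one query protein and one EC number.
example_scores = {"blast_knn": 0.8, "spmap_svm": 0.6, "pepstats_svm": 0.2}
example_weights = {"blast_knn": 2.0, "spmap_svm": 1.0, "pepstats_svm": 1.0}
combined = weighted_mean_score(example_scores, example_weights)
```

Because all three inputs lie between -1 and 1, the combined score stays in the same range and can be compared directly with a per-class threshold.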

INTRODUCTION

• The volume of protein sequence data is increasing exponentially, and manual curation efforts are insufficient to annotate proteins with unknown functions.

• Effective automatic annotation methods are therefore required to overcome this problem.

• Enzymes are a special type of protein that catalyses biochemical reactions.

• The Enzyme Commission number (EC number) is a numerical classification scheme for enzymes, based on the chemical reactions they catalyse.

• Functional annotations of enzymes are crucial in several fields of bioinformatics, such as identification of diseases, drug target prediction, etc.

• Here we present “ECPred”, an enzyme prediction tool based on EC numbers.

• ECPred incorporates a novel data preparation and hierarchical evaluation method.

EVALUATION & RESULTS

Hierarchical Evaluation of Predictions: A prediction passes the evaluation if the prediction scores of the predicted EC number and of all its parents are greater than their class-specific optimal thresholds.

• EC numbers with 50 or more annotations in UniProtKB/Swiss-Prot are selected for training.

• 5-fold cross-validation is performed and an optimal decision threshold is found for each EC number. Subsequently, the hierarchical evaluation method is applied for prediction.

• 851 EC numbers are trained and the average F-score is 0.96.
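The criterion used to pick each optimal decision threshold is not spelled out. As a minimal sketch, the example below assumes the threshold is chosen from a list of candidates so as to maximize the F-score on held-out cross-validation scores for one EC number. The function names, the candidate grid and the toy data are all hypothetical.

```python
from typing import Sequence


def f_score(labels: Sequence[bool], scores: Sequence[float],
            threshold: float) -> float:
    """F-score when every score >= threshold is called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y and s >= threshold)
    fp = sum(1 for y, s in pairs if not y and s >= threshold)
    fn = sum(1 for y, s in pairs if y and s < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels: Sequence[bool], scores: Sequence[float],
                   candidates: Sequence[float]) -> float:
    """ASSUMPTION: 'optimal' is taken to mean the F-score-maximizing candidate."""
    return max(candidates, key=lambda t: f_score(labels, scores, t))


# Toy held-out predictions for a single EC number (hypothetical values).
toy_labels = [True, True, True, False, False]
toy_scores = [0.9, 0.8, 0.6, 0.7, 0.2]
chosen = best_threshold(toy_labels, toy_scores, [0.5, 0.65, 0.75, 0.85])
```

Running the search separately for each trained EC number gives one threshold per class, which is what the hierarchical evaluation step then compares against.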

CONCLUSION

• This study proposes ECPred for enzyme function prediction, combining three classification methods based on different approaches: similarity-based, subsequence-based and feature-based.

• A novel data preparation method based on the EC hierarchy is proposed for building the positive and negative training datasets.

• Individual thresholds are determined for each trained EC number.

• A hierarchical post-processing method is proposed to determine which predictions are reliable enough to be presented.

• Cross-validation on UniProtKB/Swiss-Prot enzymes revealed very high classification performance (average F-score of 0.96).

• A web server for ECPred will be ready soon, where users can query sequences to obtain enzyme function predictions.

Blast-kNN score terms:

Sp: sum of k-nearest positive BLAST scores

Sn: sum of k-nearest negative BLAST scores

EC number    Prediction score         Optimum threshold
1.-.-.-      0.75                >=   0.7
1.1.-.-      0.90                >=   0.8
1.1.1.-      0.75                >=   0.7
1.1.99.-     0.35                <    0.8
1.97.-.-     0.40                <    0.7
1.1.1.1      0.97                >=   0.95
1.1.1.2      0.60                <    0.8
1.1.1.97     0.20                <    0.7
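The hierarchical evaluation rule can be checked against the example scores and thresholds listed above. This is a minimal sketch: the helper names are hypothetical, a score equal to its threshold is assumed to pass, and every ancestor is assumed to have a score and a threshold.

```python
from typing import Dict, List

# Example values copied from the score/threshold listing.
SCORES: Dict[str, float] = {
    "1.-.-.-": 0.75, "1.1.-.-": 0.90, "1.1.1.-": 0.75, "1.1.99.-": 0.35,
    "1.97.-.-": 0.40, "1.1.1.1": 0.97, "1.1.1.2": 0.60, "1.1.1.97": 0.20,
}
THRESHOLDS: Dict[str, float] = {
    "1.-.-.-": 0.7, "1.1.-.-": 0.8, "1.1.1.-": 0.7, "1.1.99.-": 0.8,
    "1.97.-.-": 0.7, "1.1.1.1": 0.95, "1.1.1.2": 0.8, "1.1.1.97": 0.7,
}


def parents(ec: str) -> List[str]:
    """Ancestors of an EC number, nearest first (1.1.1.1 -> 1.1.1.-, ...)."""
    parts = ec.split(".")
    depth = sum(1 for p in parts if p != "-")
    return [".".join(parts[:level] + ["-"] * (4 - level))
            for level in range(depth - 1, 0, -1)]


def passes(ec: str, scores: Dict[str, float],
           thresholds: Dict[str, float]) -> bool:
    """True if the EC number and all its parents reach their thresholds.

    ASSUMPTION: ties count as passing (>=).
    """
    return all(scores[node] >= thresholds[node]
               for node in [ec] + parents(ec))


accepted = {ec for ec in SCORES if passes(ec, SCORES, THRESHOLDS)}
```

With these values only the path 1.-.-.- → 1.1.-.- → 1.1.1.- → 1.1.1.1 is accepted; 1.1.1.2 and 1.1.1.97 fail on their own scores even though all of their parents pass.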



DATA PREPARATION

• Each EC number is trained with its own training dataset, based on the level of that EC number in the hierarchy.

• Positive training data for EC number 1.1.1.-: proteins associated with 1.1.1.- or with any of its descendants.

• Negative training data for EC number 1.1.1.-: proteins associated with the siblings of 1.1.1.- or with the descendants of those siblings.

1.-.-.-
    1.1.-.-
        1.1.1.-
            1.1.1.1 … 1.1.1.97
        1.1.2.-
            1.1.2.3  1.1.2.4
        …
        1.1.99.-
            1.1.99.1 … 1.1.99.32
    …
    1.97.-.-
        1.97.1.-
            1.97.1.1 … 1.97.1.99
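The positive and negative set construction described in this section can be sketched as follows. The function names and the toy protein identifiers are hypothetical. Each protein is assumed to carry a single EC annotation, and proteins annotated only with the parent class itself are assumed to belong to neither set.

```python
from typing import Dict, List, Tuple


def specified(ec: str) -> List[str]:
    """Fields of an EC number that are not wildcards ('-')."""
    return [p for p in ec.split(".") if p != "-"]


def is_self_or_descendant(ec: str, ancestor: str) -> bool:
    """True if `ec` equals `ancestor` or lies below it in the hierarchy."""
    prefix = specified(ancestor)
    return ec.split(".")[:len(prefix)] == prefix


def parent(ec: str) -> str:
    """Parent class; level-1 classes get the artificial root '-.-.-.-'."""
    kept = specified(ec)[:-1]
    return ".".join(kept + ["-"] * (4 - len(kept)))


def split_training_data(target: str,
                        annotations: Dict[str, str]
                        ) -> Tuple[List[str], List[str]]:
    """Return (positives, negatives) for `target`.

    Positives: proteins annotated with the target or one of its descendants.
    Negatives: proteins annotated with a sibling of the target or with a
    descendant of a sibling, i.e. everything else under the same parent.
    """
    par = parent(target)
    positives, negatives = [], []
    for protein, ec in annotations.items():
        if is_self_or_descendant(ec, target):
            positives.append(protein)
        elif ec != par and is_self_or_descendant(ec, par):
            negatives.append(protein)
    return positives, negatives


# Toy annotations (hypothetical protein identifiers).
toy = {"P1": "1.1.1.1", "P2": "1.1.1.-", "P3": "1.1.2.3",
       "P4": "1.1.99.1", "P5": "1.97.1.1", "P6": "1.1.-.-"}
pos, neg = split_training_data("1.1.1.-", toy)
```

For the target 1.1.1.-, the proteins under 1.1.2.- and 1.1.99.- become negatives, while the protein under 1.97.-.- is left out because it is not below a sibling of the target.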

1 Department of Computer Engineering, Middle East Technical University, Ankara, Turkey

2 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

3 Computer Engineering Department, Istanbul Technical University, Istanbul, Turkey

4 Informatics Institute, Middle East Technical University, Ankara, Turkey

1.-.-.-
    1.1.-.-
        1.1.1.-
            1.1.1.1  1.1.1.2 … 1.1.1.97
        …
        1.1.99.-
            1.1.99.1 … 1.1.99.32
    …
    1.97.-.-
        1.97.1.-
            1.97.1.1 … 1.97.1.99

