

EMBL-EBI Tel. +44 (0) 1223 494 444

Wellcome Trust Genome Campus [email protected]

Hinxton, Cambridgeshire, CB10 1SD, UK www.ebi.ac.uk

ECPred: Enzyme Prediction Using Combination of Classifiers

Ahmet Sureyya Rifaioglu¹, Tunca Dogan², Omer Sinan Sarac³, Mehmet Volkan Atalay¹, Maria Jesus Martin² and Rengul Cetin-Atalay⁴

ABSTRACT

Motivation: Efficient and accurate protein function prediction methods are required to annotate proteins with unknown functions. Recent studies show that combining different methods enhances prediction accuracy. In addition, data preparation and post-processing of predictions are important factors in the functional annotation of proteins.

Results: Here we propose “ECPred”, a novel hierarchical approach that predicts Enzyme Commission (EC) numbers using a combination of three classifiers: Blast-kNN, SPMap and Pepstats-SVM. ECPred combines these methods and gives a weighted mean score for each trained EC number. It uses hierarchical data preparation and evaluation steps to increase the accuracy of the predictions. ECPred is trained for 851 EC classes. Cross-validation results show that ECPred predicts enzyme functions with high performance (average F-score of 0.96).

METHODOLOGY

Blast-kNN: The k-nearest neighbor algorithm is combined with BLAST. For each EC number, a similarity search is run against its training set, and the k highest BLAST scores from the positive and from the negative training datasets are combined into a single score.
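As a rough illustration, the Blast-kNN score could be computed along the following lines. The exact formula is not given in the text, so the normalised difference (S_p - S_n) / (S_p + S_n) used here is an assumption, as are the function name `blast_knn_score`, the default `k=5` and the input format (plain lists of BLAST scores for hits in the positive and negative training sets).

```python
import heapq


def blast_knn_score(pos_scores, neg_scores, k=5):
    """Combine the k highest BLAST scores from each training set.

    ASSUMPTION: the normalised difference below is illustrative only;
    the text does not state the actual formula.
    """
    s_p = sum(heapq.nlargest(k, pos_scores))  # S_p: sum of k-nearest positive scores
    s_n = sum(heapq.nlargest(k, neg_scores))  # S_n: sum of k-nearest negative scores
    total = s_p + s_n
    if total == 0:
        return 0.0  # no BLAST hits in either set
    # With non-negative BLAST scores the result lies between -1 and 1.
    return (s_p - s_n) / total


if __name__ == "__main__":
    print(blast_knn_score([100, 80, 60], [20, 10], k=2))
```

Under this assumed form, a query whose strongest hits are all in the positive set scores close to 1, and one whose strongest hits are all in the negative set scores close to -1, which matches the range of the SVM scores it is later combined with.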

SPMap: SPMap is a subsequence-based feature extraction method consisting of three main modules: (i) the Subsequence Extraction Module, (ii) the Clustering Module and (iii) the Probabilistic Profile Construction Module.

Pepstats-SVM: Pepstats is a feature-based method that calculates statistics for proteins, including molecular weight, number of residues and charge. Each protein is represented as a 37-dimensional vector.

The vectors obtained from Pepstats and SPMap are fed independently to SVM classifiers to obtain classification scores between -1 and 1. A weighted mean score is then calculated for each query protein and each functional class by combining the two SVM scores and the Blast-kNN score; this score serves as the confidence of the prediction.
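The combination step can be sketched as a plain weighted mean. The weights below are hypothetical placeholders (the text does not say how the weights are set), and the function name `weighted_mean_score` and the dictionary keys are illustrative only.

```python
def weighted_mean_score(scores, weights):
    """Weighted mean of per-method scores for one protein and one EC class.

    scores, weights: dicts keyed by method name.
    ASSUMPTION: the weights are supplied by the caller; how ECPred derives
    them is not described in the text.
    """
    total_weight = sum(weights[m] for m in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[m] * weights[m] for m in scores) / total_weight


if __name__ == "__main__":
    # Hypothetical scores in [-1, 1] and hypothetical weights.
    scores = {"blast_knn": 0.8, "spmap": 0.6, "pepstats_svm": -0.2}
    weights = {"blast_knn": 0.5, "spmap": 0.3, "pepstats_svm": 0.2}
    print(weighted_mean_score(scores, weights))
```

Because each input lies between -1 and 1 and the weights are non-negative, the combined score stays in the same range and can be compared against a per-class threshold.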

INTRODUCTION

• The volume of protein sequence data is increasing exponentially, and manual curation efforts are insufficient to annotate proteins with unknown functions.

• Therefore, effective automatic annotation methods are required to overcome this problem.

• Enzymes are a special type of protein that catalyze biochemical reactions.

• The Enzyme Commission number (EC number) is a numerical classification scheme for enzymes, based on the chemical reactions they catalyze.

• Functional annotation of enzymes is crucial in several fields of bioinformatics, such as disease identification and drug target prediction.

• Here we present “ECPred”, an enzyme prediction tool based on EC numbers.

• ECPred incorporates a novel data preparation method and a hierarchical evaluation method.

EVALUATION & RESULTS

Hierarchical Evaluation of Predictions: A prediction passes the evaluation only if the prediction scores of the predicted EC number and of all its parents are greater than their class-specific optimal thresholds.

• EC numbers having 50 or more annotations in UniProtKB/Swiss-Prot are selected for training.

• 5-fold cross-validation is performed and an optimal decision threshold is found for each EC number. Subsequently, the hierarchical evaluation method is applied for prediction.

• 851 EC numbers are trained and the average F-score is 0.96.
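The per-class decision thresholds mentioned above could be chosen, for instance, by scanning candidate thresholds and keeping the one with the highest F-score on held-out cross-validation scores. This is a minimal sketch under that assumption; the text does not state the optimisation criterion, and the names `f_score` and `best_threshold` and the candidate grid are hypothetical.

```python
def f_score(y_true, y_pred):
    """F1 score for boolean labels and boolean predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, candidates=None):
    """Return (threshold, F-score) maximising F1 for one EC class.

    ASSUMPTION: scores are held-out predictions in [-1, 1]; the default
    grid of candidates in steps of 0.01 is an arbitrary choice.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(-100, 101)]
    best_t, best_f = None, -1.0
    for t in candidates:
        preds = [s >= t for s in scores]
        f = f_score(labels, preds)
        if f > best_f:  # ties keep the lowest candidate seen first
            best_t, best_f = t, f
    return best_t, best_f


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.1]
    labels = [True, True, False, False]
    print(best_threshold(scores, labels, candidates=[0.0, 0.5, 1.0]))
```

In a 5-fold setting, the scores passed in would be the predictions collected from the five held-out folds for that EC class.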

CONCLUSION

• ECPred is proposed for enzyme function prediction, combining three classification methods based on different approaches: similarity-based, subsequence-based and feature-based.

• A novel data preparation method based on the EC hierarchy is proposed for building the positive and negative training datasets.

• Individual thresholds are determined for each trained EC number.

• A hierarchical post-processing method is proposed to determine which predictions are reliable enough to be presented.

• Cross-validation on UniProtKB/Swiss-Prot enzymes revealed very high classification performance (average F-score of 0.96).

• A web server for ECPred, where users can query sequences to obtain enzyme function predictions, will be ready soon.

Blast-kNN notation:

S_p: sum of the k-nearest positive BLAST scores

S_n: sum of the k-nearest negative BLAST scores

EC number    Prediction score    Optimum threshold
1.-.-.-      0.75                0.7
1.1.-.-      0.90                0.8
1.1.1.-      0.75                0.7
1.1.99.-     0.35                0.8
1.97.-.-     0.40                0.7
1.1.1.1      0.97                0.95
1.1.1.2      0.60                0.8
1.1.1.97     0.20                0.7
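The hierarchical evaluation rule can be sketched with the example scores and thresholds from the table above. The helper names `parents` and `passes` are hypothetical, and treating a score exactly equal to its threshold as passing is an assumption (no row in the table has a tie).

```python
def parents(ec):
    """Ancestors of an EC number, e.g. '1.1.1.1' -> ['1.-.-.-', '1.1.-.-', '1.1.1.-']."""
    specified = [p for p in ec.split(".") if p != "-"]
    return [
        ".".join(specified[:depth] + ["-"] * (4 - depth))
        for depth in range(1, len(specified))
    ]


def passes(ec, scores, thresholds):
    """True if ec and all of its parents reach their class-specific thresholds.

    ASSUMPTION: every parent class has a score and a threshold available.
    """
    return all(scores[c] >= thresholds[c] for c in parents(ec) + [ec])


# Example values copied from the table.
scores = {
    "1.-.-.-": 0.75, "1.1.-.-": 0.90, "1.1.1.-": 0.75, "1.1.99.-": 0.35,
    "1.97.-.-": 0.40, "1.1.1.1": 0.97, "1.1.1.2": 0.60, "1.1.1.97": 0.20,
}
thresholds = {
    "1.-.-.-": 0.7, "1.1.-.-": 0.8, "1.1.1.-": 0.7, "1.1.99.-": 0.8,
    "1.97.-.-": 0.7, "1.1.1.1": 0.95, "1.1.1.2": 0.8, "1.1.1.97": 0.7,
}

accepted = [ec for ec in scores if passes(ec, scores, thresholds)]

if __name__ == "__main__":
    print(accepted)
```

With these values, only 1.-.-.-, 1.1.-.-, 1.1.1.- and 1.1.1.1 pass: every class on the path from the root to 1.1.1.1 clears its own threshold, while 1.1.1.2, 1.1.1.97, 1.1.99.- and 1.97.-.- fall below theirs.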

DATA PREPARATION

• Each EC number is trained with its own training dataset, built according to the level of that EC number in the hierarchy.

• Positive training data for EC number 1.1.1.-: proteins associated with 1.1.1.- and proteins associated with the descendants of 1.1.1.-.

• Negative training data for EC number 1.1.1.-: proteins associated with the siblings of 1.1.1.- and proteins associated with the descendants of those siblings.

1.-.-.-
    1.1.-.-
        1.1.1.-
            1.1.1.1 … 1.1.1.97
        1.1.2.-
            1.1.2.3   1.1.2.4
        …
        1.1.99.-
            1.1.99.1 … 1.1.99.32
    …
    1.97.-.-
        1.97.1.-
            1.97.1.1 … 1.97.1.99
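The positive/negative split described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the input is simplified to one EC number per protein, which is an assumption rather than something the text states.

```python
def is_descendant_or_self(ec, ancestor):
    """True if ec equals ancestor or lies below it in the EC hierarchy."""
    prefix = [p for p in ancestor.split(".") if p != "-"]
    return ec.split(".")[:len(prefix)] == prefix


def parent_of(ec):
    """Parent class, e.g. '1.1.1.-' -> '1.1.-.-'."""
    specified = [p for p in ec.split(".") if p != "-"]
    kept = specified[:-1]
    return ".".join(kept + ["-"] * (4 - len(kept)))


def split_training_data(target, annotations):
    """Split proteins into positives and negatives for one EC class.

    annotations: dict mapping protein id -> EC number (simplifying
    ASSUMPTION: a single EC number per protein).
    Positives: the target class and its descendants.
    Negatives: siblings of the target and their descendants.
    """
    parent = parent_of(target)
    positives, negatives = [], []
    for protein, ec in annotations.items():
        if is_descendant_or_self(ec, target):
            positives.append(protein)
        elif ec != parent and is_descendant_or_self(ec, parent):
            negatives.append(protein)
    return positives, negatives


if __name__ == "__main__":
    # Hypothetical protein identifiers, EC numbers taken from the example tree.
    annotations = {
        "P1": "1.1.1.1", "P2": "1.1.1.-", "P3": "1.1.2.3",
        "P4": "1.1.99.1", "P5": "1.97.1.1", "P6": "1.1.-.-",
    }
    print(split_training_data("1.1.1.-", annotations))
```

In this sketch, proteins annotated only with the parent class (1.1.-.-) or with a class outside the parent's subtree (1.97.1.1) end up in neither set for 1.1.1.-.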

¹ Department of Computer Engineering, Middle East Technical University, Ankara, Turkey

² European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

³ Computer Engineering Department, Istanbul Technical University, Istanbul, Turkey

⁴ Informatics Institute, Middle East Technical University, Ankara, Turkey

1.-.-.-
    1.1.-.-
        1.1.1.-
            1.1.1.1   1.1.1.2 … 1.1.1.97
        …
        1.1.99.-
            1.1.99.1 … 1.1.99.32
    …
    1.97.-.-
        1.97.1.-
            1.97.1.1 … 1.97.1.99