Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science

Sixth Annual Joint Bioinformatics Symposium, 2006

Acknowledgements: This work is supported in part by grants from the National Science Foundation (IIS 0219699), and the National Institutes of Health (GM 066387) to Vasant Honavar.

Machine Learning Versus Profile-Based Methods for Protein Phosphorylation Site Prediction

Yasser EL-Manzalawy, Cornelia Caragea, Drena Dobbs, and Vasant Honavar

Prediction of Phosphorylation Sites: Motivation

Protein phosphorylation, performed by protein kinases, is a key process in signal transduction pathways. Predicting phosphorylation sites is an essential step towards understanding phosphorylation, which in turn is essential for understanding diseases and, ultimately, for designing drugs that can prevent or cure them.

Phospho.ELM Data Set – a resource containing 1805 proteins from different species covering 1372 Tyr, 3175 Ser and 767 Thr experimentally verified phosphorylation sites manually curated from the literature.

We constructed separate data sets for the kinase families that are well represented in the database, i.e., those known to recognize more than 50 phosphorylation sites (see Table 1).

In this study, we empirically compare a number of Machine Learning (ML) and profile-based methods for predicting kinase-specific protein phosphorylation sites.

Fig.1: Addition of a phosphate to an amino acid

Table 1: Kinase families considered in our study and the number of Ser and Thr sites known to be phosphorylated

Fig.2: Conformation changes caused by phosphorylation

We propose a method for combining PSSM profiles and ML approaches. Our proposed method yields fast and simple classifiers that consistently outperform profile-based methods for predicting kinase-specific phosphorylation sites.

Kinase  CDK  CK2  MAPK  PKA  PKB  PKC
Ser     124  188    82  222   43  215
Thr      60   38    26   20   12   47
Total   184  226   108  242   55  262

Sequence-Based Machine Learning Methods

• The features for each Ser or Thr residue are derived from a window of n amino acids (n=15) centered on that residue.
• Each window is encoded as a 20*n binary vector whose entries indicate whether a particular amino acid appears at a particular position.
• Using this binary encoding, we evaluate Support Vector Machine with Gaussian kernel (Bin(SVM)), Naïve Bayes (Bin(NB)), and Decision Tree (Bin(C4.5)) classifiers.
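The window extraction and binary encoding described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and skipping Ser/Thr residues that lie too close to a sequence end to have a full window is an assumption, since the poster does not say how such sites were handled.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def extract_windows(sequence, n=15, targets="ST"):
    """Yield (position, window) for each Ser/Thr that has a full n-residue window.

    Assumption: residues too close to either end of the sequence are skipped.
    """
    half = n // 2
    for pos, residue in enumerate(sequence):
        if residue in targets and half <= pos < len(sequence) - half:
            yield pos, sequence[pos - half : pos + half + 1]


def encode_binary(window):
    """Encode a window of n residues as a flat 20*n binary vector."""
    vector = [0] * (len(AMINO_ACIDS) * len(window))
    for i, aa in enumerate(window):
        j = AA_INDEX.get(aa)  # non-standard residues leave their 20 entries at zero
        if j is not None:
            vector[i * len(AMINO_ACIDS) + j] = 1
    return vector
```

With n=15 each window becomes a 300-dimensional vector containing exactly one 1 per position, which is the input the Bin(SVM), Bin(NB), and Bin(C4.5) classifiers would be trained on.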

PSSM-Based Representation – Our Approach (PSSMPhos)

• Combines profile-based and machine learning approaches.
• PSSM motifs are obtained for each kinase family as in the Basic PSSM method.
• Each window is encoded as an (n+1)-dimensional vector using the computed PSSM, <e1(x1),…, en(xn),Score(x)>, where ei(xi) is the PSSM emitted score of observing amino acid xi at position i and Score(x) is the sum of the n emitted PSSM scores.
• Kinase-specific classifiers (PSSMPhos(SVM), PSSMPhos(NB), PSSMPhos(C4.5)) are trained on the PSSM-based representation.
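The PSSM-based representation can be illustrated with a short sketch. The PSSM is assumed here to be a list of n dictionaries mapping amino acids to scores; that layout, the function name, and the default score for residues missing from a column are all hypothetical choices made for the example.

```python
def encode_pssm(window, pssm, default=0.0):
    """Return <e1(x1), ..., en(xn), Score(x)> for one window.

    pssm[i][aa] is the score of amino acid aa at position i (assumed layout).
    """
    if len(window) != len(pssm):
        raise ValueError("window length must match the number of PSSM positions")
    emitted = [pssm[i].get(aa, default) for i, aa in enumerate(window)]
    return emitted + [sum(emitted)]  # n emitted scores followed by their sum


# Toy 3-position PSSM with made-up scores, for illustration only.
toy_pssm = [{"A": 1.0, "S": -1.0}, {"S": 2.0}, {"P": 0.5}]
features = encode_pssm("ASP", toy_pssm)
```

For n=15 this yields 16 real-valued features per window, compared with 300 binary features under the binary encoding.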

Results

Table 2 compares the performance of ML methods against profile-based methods for predicting kinase-specific phosphorylation sites. Fig. 3 shows the ROC curves for Basic PSSM and Basic HMM.

Table 2: Prediction accuracy (%) of different methods using 5-fold cross validation test

Method                 CDK    CK2    MAPK   PKA    PKB    PKC
BasicHMM               82.61  79.65  68.22  85.54  82.73  76.34
BasicPSSM              86.96  78.76  75.23  84.30  82.73  67.75
PSSMPhos(SVM)          91.03  82.74  78.04  90.70  90.00  77.48
PSSMPhos(NB)           90.49  80.31  78.04  89.05  90.00  79.39
PSSMPhos(C4.5)         90.76  79.87  70.09  86.78  87.27  73.09
Bin(SVM)               91.03  82.96  79.44  90.70  89.09  81.87
Bin(NB)                91.03  79.65  78.97  89.05  92.73  80.92
Bin(C4.5)              91.03  77.43  74.77  86.78  85.45  79.20
Scansite (low)         -      77.21  -      84.30  -      -
Scansite (med)         -      68.14  -      71.90  -      -
Scansite (high)        -      57.08  -      59.50  -      -
KinasePhos (default)   82.88  77.43  74.07  87.60  81.82  80.92
KinasePhos (90)        83.70  76.77  72.22  89.26  81.82  81.49
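The 5-fold cross validation behind Table 2 follows a standard pattern that can be sketched as below. This is a generic illustration only: the poster does not state how folds were drawn or how negative examples were chosen, so the random, unstratified split and the pooling of correct predictions across folds are assumptions, and `train` is a hypothetical callback that returns a prediction function.

```python
import random


def k_fold_indices(n_examples, k=5, seed=0):
    """Shuffle example indices and deal them into k disjoint folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(X, y, train, k=5, seed=0):
    """Percentage of examples predicted correctly while held out.

    train(train_X, train_y) must return a function predict(x) -> label.
    """
    correct = 0
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_X = [x for i, x in enumerate(X) if i not in held]
        train_y = [t for i, t in enumerate(y) if i not in held]
        predict = train(train_X, train_y)  # fit on the other k-1 folds
        correct += sum(predict(X[i]) == y[i] for i in held_out)
    return 100.0 * correct / len(X)
```

Each example is used for testing exactly once and is never seen by the classifier that predicts it, which is what makes the reported accuracy an estimate of performance on unseen sites.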

Fig.3: Comparison of ROC curves for BasicPSSM and BasicHMM for the six kinase families considered

Conclusions

We proposed PSSMPhos, a method for combining PSSM profiles and ML methods.

Our study demonstrates the superiority of ML over profile-based methods when enough training data is available.

Our experiments suggest that ML methods and profile-based methods should complement each other to produce more efficient phosphorylation site prediction tools.

Profile-Based Approaches

• Scansite: a web service that uses 63 experimentally developed motifs, represented as PSSMs, to identify potential Ser/Thr phosphorylation sites.
• KinasePhos: another web service, which uses kinase-specific HMMs for prediction.
• Basic PSSM: our implementation of PSSM motifs using the PROFILEWEIGHT program.
• Basic HMM: our implementation of HMM motifs using the HMMER software package.
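To make the idea of a PSSM motif concrete, the sketch below builds a simple log-odds matrix from aligned positive windows and scores a candidate window against it. It is not the PROFILEWEIGHT procedure, whose weighting scheme is not described here; the uniform background, the pseudocount, and the function names are all assumptions chosen for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(windows, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length windows of known sites.

    Assumptions: uniform background frequencies and additive pseudocounts.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for i in range(len(windows[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for window in windows:
            if window[i] in counts:  # ignore non-standard residues
                counts[window[i]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: math.log2(c / total / background)
                     for aa, c in counts.items()})
    return pssm


def pssm_score(window, pssm):
    """Sum the per-position scores of a window; higher means more motif-like."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, window))
```

A profile-only predictor would label a window as a phosphorylation site when its score exceeds a chosen threshold; sweeping that threshold is what produces an ROC curve.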
