update of profeat: a web server for computing structural ... · atomic-level topological...

Update of PROFEAT: a web server for computingstructural and physicochemical features ofproteins and peptides from amino acid sequenceH. B. Rao1, F. Zhu1,2, G. B. Yang3, Z. R. Li1,4,* and Y. Z. Chen2,4

1College of Chemistry, Sichuan University, Chengdu, 610064, P. R. China, 2Department of Pharmacy,Bioinformatics and Drug Design Group, National University of Singapore, Singapore 117543, 3College ofChemical Engineering, Sichuan University, Chengdu 610064 and 4State Key Laboratory of Biotherapy,Sichuan University, Chengdu 610041, P. R. China

Received January 23, 2011; Revised March 17, 2011; Accepted April 12, 2011

ABSTRACT

Sequence-derived structural and physicochemicalfeatures have been extensively used for analyzingand predicting structural, functional, expressionand interaction profiles of proteins and peptides.PROFEAT has been developed as a web server forcomputing commonly used features of proteins andpeptides from amino acid sequence. To facilitatemore extensive studies of protein and peptides,numerous improvements and updates have beenmade to PROFEAT. We added new functions forcomputing descriptors of protein–protein andprotein–small molecule interactions, segment de-scriptors for local properties of protein sequences,topological descriptors for peptide sequences andsmall molecule structures. We also added newfeature groups for proteins and peptides (pseudo-amino acid composition, amphiphilic pseudo-aminoacid composition, total amino acid properties andatomic-level topological descriptors) as well asfor small molecules (atomic-level topological de-scriptors). Overall, PROFEAT computes 11 featuregroups of descriptors for proteins and peptides,and a feature group of more than 400 descriptorsfor small molecules plus the derived features forprotein–protein and protein–small molecule inter-actions. Our computational algorithms have beenextensively tested and used in a number of pub-lished works for predicting proteins of specificstructural or functional classes, protein–protein

interactions, peptides of specific functions andquantitative structure activity relationships of smallmolecules. PROFEAT is accessible free of charge athttp://bidd.cz3.nus.edu.sg/cgi-bin/prof/protein/profnew.cgi.

INTRODUCTION

Sequence-derived structural and physicochemical featuresare highly useful for representing and distinguishingproteins or peptides of different structural, functionaland interaction properties, and have been widely used indeveloping methods and software for predicting proteinstructural and functional classes (1–7), protein–proteininteractions (8–10), protein–ligand interactions (11,12),protein substrates (13,14), molecular binding sites onproteins (15–20), subcellular locations (21), protein crys-tallization propensity (22–24) and peptides of specificproperties (25–30). Web servers, such as PROFEAT(31) and PseAAC (http://www.csbio.sjtu.edu.cn/bioinf/PseAA/) (32), have been built to facilitate the computationof protein and peptide features.Nonetheless, some features important for studying

proteins, peptides and molecular interactions have notbeen provided in these web servers. Examples of thesefeatures include atomic-level topological descriptors thatare useful for structure–property correlations (33) and de-scriptors of total amino acid properties (TAAPs) that havebeen used for modeling protein conformational stability(34), ligand binding site structural features (35) and inter-action with small molecules (36). Moreover, the descrip-tors provided in those available web servers are notsuitable for analyzing local properties of sequence

*To whom correspondence should be addressed. Tel: 86-28-85406139; Fax: 86-28-85407797; Email: [email protected]

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Published online 23 May 2011 Nucleic Acids Research, 2011, Vol. 39, Web Server issue W385–W390doi:10.1093/nar/gkr284

� The Author(s) 2011. Published by Oxford University Press.This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

subsections, and additional works are needed to use de-scriptors to study protein–protein and protein–ligandinteractions. Therefore, it is desirable to provide segmentdescriptors for local properties of subsections of proteinsequences, and descriptors that can be straightforwardlyused for exploring protein–protein and protein–smallmolecule interactions.We updated PROFEAT by adding new functions for

computing descriptors of protein–protein and protein–small molecule interactions, segment descriptors forlocal properties of subsections of protein sequences,atomic-level topological descriptors for peptide sequencesand small molecule structures, and topological polarsurface areas of small molecules. Moreover, we addednew feature groups such as pseudo-amino acid compos-ition (PAAC), amphiphilic PAAC (APAAC), TAAPs, andatomic-level topological descriptors. The computationalalgorithms of these newly added feature groups havebeen extensively tested and used in a number of publishedworks for predicting proteins and peptides of specificproperties, protein–protein interactions, and quantitativestructure activity relationships of small molecules. A list ofpublications using features covered by PROFEAT isprovided in Supplementary Table S1 and in PROFEATonline server which can be accessed at http://bidd.cz3.nus.edu.sg/prof/part_of_publications.htm. PROFEAThomepage is shown in Figure 1. A list of features forproteins and peptides covered by this version ofPROFEAT is summarized in Table 1 and a list of thetopological descriptors for peptides and small moleculescomputed by PROFEAT is summarized in SupplementaryTable S2.

METHODS FOR NEWLY ADDED FEATURESAND FUNCTIONS

PAAC descriptors

First, three variables are derived from the originalhydrophobicity values H0

1ðiÞ, hydrophilicity values H02ði Þ

and side chain masses M0ðiÞ of 20 amino acids (i=1,2, . . . , 20) (32):

H1 ið Þ ¼

H01ði Þ �

P20i¼1

H01ði Þ

20ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP20i¼1

½H01ði Þ�

P20i¼1

H01ði Þ

20 �

2

20

s ð1Þ

H2 ið Þ ¼

H02ði Þ �

P20i¼1

H02ði Þ

20ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP20i¼1

½H02ði Þ�

P20i¼1

H02ði Þ

20 �

2

20

s ð2Þ

M ið Þ ¼

M0ði Þ �P20i¼1

M0ði Þ20ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP20

i¼1

½M0ði Þ�P20i¼1

M0 ði Þ20 �

2

20

s ð3Þ

Then, a correlation function can be computed as:

� Ri,Rj

� �¼

1

3H1 Rið Þ �H1 Rj

� �� 2n+H2 Rið Þ �H2 Rj

� �� 2+M Rið Þ �M Rj

� �� 2oð4Þ

from which, sequence order-correlated factors aredefined as:

�1 ¼1

N� 1

XN�1i¼1

�ðRi,Ri+1Þ

�2 ¼1

N� 2

XN�2i¼1

�ðRi,Ri+2Þ

�3 ¼1

N� 3

XN�3i¼1

�ðRi,Ri+3Þ

�� ¼1

N� �

XN��i¼1

�ðRi,Ri+�Þ, � < Nð Þ

ð5Þ

� < Nð Þ is a parameter. Let fi be the normalized occurrencefrequency of 20 amino acids in the protein sequence, a setof 20+� descriptors called the PAAC are defined as:

Xu ¼fuP20

i¼1

fi+wP�j¼1�j

when 1 � u � 20

Xu ¼w�u�20P20

i¼1

fi+wP�j¼1�j

when 20+1 � u � 20+�

ð6Þ

where w is the weighting factor for the sequence-ordereffect and is set to be w=0.05 as suggested by Shen (32).

APAAC

From H1ði Þ and H2ði Þ defined in Equation (1) and (2), thehydrophobicity and hydrophilicity correlation functionsare defined (32), respectively, as:

H1i, j ¼ H1ði ÞH1ðj Þ, H

2i, j ¼ H2ði ÞH2ðj Þ

from which sequence order factors can be defined as:

�1 ¼1

N� 1

XN�1i¼1

H1i, i+1, �2 ¼

1

N� 1

XN�2i¼1

H2i, i+1,

�3 ¼1

N� 2

XN�2i¼1

H1i, i+2; �4 ¼

1

N� 2

XN�2i¼1

H2i, i+2, . . . ,

�2��1 ¼1

N� �

XN��i¼1

H1i, i+�, �2� ¼

1

N� �

XN��i¼1

H2i, i+� ð� < NÞ

ð7Þ

W386 Nucleic Acids Research, 2011, Vol. 39, Web Server issue

Figure 1. PROFEAT new web page.

Nucleic Acids Research, 2011, Vol. 39, Web Server issue W387

and APAAC are defined as:

pu ¼fuP20

i¼1

fi+wP2�j¼1

�j

when 1 � u � 20

pu ¼w�uP20

i¼1

fi+wP2�j¼1

�j

when 20+1 � u � 20+�ð8Þ

where w is the weighting factor and is taken as w ¼ 0:5.

Topological descriptors at atomic level

Topological descriptors are based on graph theoryand encode information about the types of atoms andbonds in a molecule and the nature of their connections.Examples of topological descriptors include counts ofatom and bond types and indexes that encode the size,shape and types of branching in a molecule (37). Thesedescriptors can be calculated from the 2D structure of apeptide automatically generated from its sequence basedon the molecular structures of the amino acid residues inthe sequence. Supplementary Table S2 gives a list of thetopological descriptors computed by PROFEAT.

TAAP

TAAP descriptor for a specific physicochemical property iis defined as: Ptotði Þ ¼

PNj¼1 pnormj

i=N, where pnormij repre-

sents the property i of amino acid Rj that is normalizedbetween 0 and 1 using the following expression,pnorm

ij ¼ ðp

ij� pi

minÞ=ðpimax� pi

min Þ, where pij is the original

amino acid property i for residue j. pimax and pimin are, re-

spectively, the minimum and maximum values of theoriginal amino acid property i, and N is the length ofthe sequence (38–40).

Protein–protein interaction descriptors

Protein–protein interaction descriptors can be computedfrom the descriptors Va={Va(i), i=1, 2, . . . , n} andVb={Vb(i), i=1, 2, . . . , n} of individual proteins A and

B by three methods. In the first method, two protein-pairvectors Vab and Vba with dimension of 2n are constructedwith Vab=(Va, Vb) for interaction between proteins Aand B and Vba=(Vb, Va) for interaction betweenproteins B and A (8,9). In the second method, onevector V with dimension of 2n is constructed:V={Va(i)+Vb(i), Va(i)�Vb(i), i=1, 2, . . . , n} whichhas the property that V is unchanged when a and b areexchanged. In the third method, one vector V with dimen-sion of n2 is constructed by the tensor product:V={V(k)=Va(i)�Vb(j), i=1, 2, . . . , n, j=1, 2, . . . , n,k=(i � 1)� n+ j}.

Protein–ligand interaction descriptors

Protein–ligand interaction descriptor vector V can beconstructed from the protein descriptor vector Vp (Vp(i),i=1, . . . , np) and ligand descriptor vector Vl (Vl(i),i=1, . . . , nl) by two methods similar to the first andthird method for constructing protein pair descriptors.In the first method, one vector V with dimension ofnp+nl are constructed V=(Vp,Vl) for interactionbetween protein and ligand. In the second method, onevector V with dimension of np� nl is constructed by thetensor product: V={v(k)=Vp(i)�Vl(j), i=1, 2 , . . . , np,j=1, 2, . . . , nl, k=(i � 1)� np+j}.

Segmented sequence descriptors

To characterize the local feature of a protein sequence, aprotein sequence can be divided into several segments anddescriptors are calculated for each segment.

Topological descriptors for small molecules

For small molecules, topological descriptors are calculatedfrom the input 2D structures of small molecules in mol orsdf format. Names of these descriptors are thesame as those for protein segments which are listed inSupplementary Table S2.

Table 1. List of PROFEAT computed features for proteins, peptides and protein–protein interactions

Feature group Features No. of descriptors No. of descriptor values

Composition-1 Amino acid composition 1 20Composition-2 Dipeptide composition 1 400Autocorrelation 1 Normalized Moreau–Broto autocorrelation a a

Autocorrelation 2 Moran autocorrelation a a

Autocorrelation 3 Geary autocorrelation a a

Composition, Transition, Distribution Composition 7 21Transition 7 21Distribution 7 105

Quasi-sequence order descriptors Sequence order coupling number 2 90Quasi-sequence order descriptors 2 150

PAAC PAAC b b

APAAC APAAC c c

Topological descriptors Topological descriptors 405TAAPs TAP d d

aThe number depends on the choice of the number of properties of amino acid and the choice of the maximum values of the lag.bThe number depends on the choice of the number of the set of amino acid properties and the choice of the � value.cThe number depends on the choice of the � value.dThe numbers depend on the choice of the number of properties of amino acid.


REMARKS

Compared with its earlier version, the updated PROFEATis significantly enhanced in both the number of newlyadded features useful for representing various proteinproperties, and newly added functions for computingfeatures for local properties of protein segments,protein–protein interactions, protein–small moleculeinteractions and small molecules. These enhancementsare intended to provide more comprehensive features forfacilitating the analysis and prediction of proteins,peptides, small molecules of different properties and mo-lecular interactions involving proteins, peptides and smallmolecules. With continued interest in using molecular andinteraction features and developing new algorithms forrepresenting these features, new descriptors and functionssuch as those involving DNA, RNA and other nucleotidescan be integrated into PROFEAT in the near future tobetter facilitate the study of molecular and bio-molecularfunctions and interactions.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: National Natural ScienceFoundation of China (grant number 20973118).

Conflict of interest statement. None declared.

REFERENCES

1. Karchin,R., Karplus,K. and Haussler,D. (2002) ClassifyingG-protein coupled receptors with support vector machines.Bioinformatics, 18, 147–159.

2. Cai,C.Z., Han,L.Y., Ji,Z.L., Chen,X. and Chen,Y.Z. (2003)SVM-Prot: web-based support vector machine software forfunctional classification of a protein from its primary sequence.Nucleic Acids Res., 31, 3692–3697.

3. Dubchak,I., Muchnik,I., Mayor,C., Dralyuk,I. and Kim,S.H.(1999) Recognition of a protein fold in the context of theStructural Classification of Proteins (SCOP) classification.Proteins, 35, 401–407.

4. Han,L., Cui,J., Lin,H., Ji,Z., Cao,Z., Li,Y. and Chen,Y. (2006)Recent progresses in the application of machine learningapproach for predicting protein functional class independentof sequence similarity. Proteomics, 6, 4023–4037.

5. Langlois,R.E. and Lu,H. (2010) Boosting the prediction andunderstanding of DNA-binding domains from sequence.Nucleic Acids Res., 38, 3149–3158.

6. Yan,R.X., Si,J.N., Wang,C. and Zhang,Z. (2009) DescFold: aweb server for protein fold recognition. BMC Bioinformatics, 10,416.

7. Zhu,F., Han,L., Zheng,C., Xie,B., Tammi,M.T., Yang,S., Wei,Y.and Chen,Y. (2009) What are next generation innovativetherapeutic targets? Clues from genetic, structural,physicochemical, and systems profiles of successful targets.J. Pharmacol. Exp. Ther., 330, 304–315.

8. Bock,J.R. and Gough,D.A. (2001) Predicting protein–proteininteractions from primary structure. Bioinformatics, 17, 455–460.

9. Lo,S.L., Cai,C.Z., Chen,Y.Z. and Chung,M.C. (2005) Effect oftraining datasets on support vector machine prediction ofprotein-protein interactions. Proteomics, 5, 876–884.

10. Qiu,J. and Noble,W.S. (2008) Predicting co-complexed proteinpairs from heterogeneous data. PLoS Comput. Biol., 4, e1000054.

11. Yamanishi,Y., Araki,M., Gutteridge,A., Honda,W. andKanehisa,M. (2008) Prediction of drug-target interactionnetworks from the integration of chemical and genomic spaces.Bioinformatics, 24, i232–i240.

12. Xia,Z., Wu,L.Y., Zhou,X. and Wong,S.T. (2010) Semi-superviseddrug-protein interaction prediction from heterogeneous biologicalspaces. BMC Syst. Biol., 4(Suppl. 2), S6.

13. Barkan,D.T., Hostetter,D.R., Mahrus,S., Pieper,U., Wells,J.A.,Craik,C.S. and Sali,A. (2010) Prediction of protease substratesusing sequence and structure features. Bioinformatics, 26,1714–1722.

14. Rottig,M., Rausch,C. and Kohlbacher,O. (2010) Combiningstructure and sequence information allows automated predictionof substrate specificities within enzyme families. PLoS Comput.Biol., 6, e1000636.

15. Wang,L. and Brown,S.J. (2006) BindN: a web-based tool forefficient prediction of DNA and RNA binding sites in aminoacid sequences. Nucleic Acids Res., 34, W243–W248.

16. Terribilini,M., Lee,J.H., Yan,C., Jernigan,R.L., Honavar,V. andDobbs,D. (2006) Prediction of RNA binding sites in proteinsfrom amino acid sequence. RNA, 12, 1450–1462.

17. Liu,Z.P., Wu,L.Y., Wang,Y., Zhang,X.S. and Chen,L. (2010)Prediction of protein-RNA binding sites by a random forestmethod with combined features. Bioinformatics, 26, 1616–1622.

18. Carson,M.B., Langlois,R. and Lu,H. (2010) NAPS: a residue-levelnucleic acid-binding prediction server. Nucleic Acids Res., 38,W431–W435.

19. Murakami,Y., Spriggs,R.V., Nakamura,H. and Jones,S. (2010)PiRaNhA: a server for the computational prediction ofRNA-binding residues in protein sequences. Nucleic Acids Res.,38, W412–W416.

20. Chen,C.T., Yang,E.W., Hsu,H.J., Sun,Y.K., Hsu,W.L. andYang,A.S. (2008) Protease substrate site predictors derived frommachine learning on multilevel substrate phage display data.Bioinformatics, 24, 2691–2697.

21. Rastogi,S. and Rost,B. (2010) Bioinformatics predictions oflocalization and targeting. Methods Mol. Biol., 619, 285–305.

22. Overton,I.M., Padovani,G., Girolami,M.A. and Barton,G.J.(2008) ParCrys: a Parzen window density estimation approach toprotein crystallization propensity prediction. Bioinformatics, 24,901–907.

23. Kurgan,L., Razib,A.A., Aghakhani,S., Dick,S., Mizianty,M. andJahandideh,S. (2009) CRYSTALP2: sequence-based proteincrystallization propensity prediction. BMC Struct. Biol., 9, 50.

24. Kandaswamy,K.K., Pugalenthi,G., Suganthan,P.N. and Gangal,R.(2010) SVMCRYS: an SVM approach for the prediction ofprotein crystallization propensity from protein sequence.Protein Pept. Lett., 17, 423–430.

25. Schneider,G. and Wrede,P. (1994) The rational design of aminoacid sequences by artificial neural networks and simulatedmolecular evolution: de novo design of an idealized leaderpeptidase cleavage site. Biophys J., 66, 335–344.

26. Cui,J., Han,L.Y., Lin,H.H., Zhang,H.L., Tang,Z.Q., Zheng,C.J.,Cao,Z.W. and Chen,Y.Z. (2007) Prediction of MHC-bindingpeptides of flexible lengths from sequence-derived structural andphysicochemical properties. Mol. Immunol., 44, 866–877.

27. Fjell,C.D., Jenssen,H., Hilpert,K., Cheung,W.A., Pante,N.,Hancock,R.E. and Cherkasov,A. (2009) Identification of novelantibacterial peptides by chemoinformatics and machine learning.J. Med. Chem., 52, 2006–2015.

28. Khatun,J., Hamlett,E. and Giddings,M.C. (2008) Incorporatingsequence information into the scoring function: a hidden Markovmodel for improved peptide identification. Bioinformatics, 24,674–681.

29. Shah,A.R., Agarwal,K., Baker,E.S., Singhal,M.,Mayampurath,A.M., Ibrahim,Y.M., Kangas,L.J., Monroe,M.E.,Zhao,R., Belov,M.E. et al. (2010) Machine learning basedprediction for peptide drift times in ion mobility spectrometry.Bioinformatics, 26, 1601–1607.

30. Jacob,L. and Vert,J.P. (2008) Efficient peptide-MHC-I bindingprediction for alleles with few known binders. Bioinformatics, 24,358–366.

31. Li,Z.R., Lin,H.H., Han,L.Y., Jiang,L., Chen,X. and Chen,Y.Z.(2006) PROFEAT: a web server for computing structural and

Nucleic Acids Research, 2011, Vol. 39, Web Server issue W389

physicochemical features of proteins and peptides from aminoacid sequence. Nucleic Acids Res., 34, W32–W37.

32. Shen,H.B. and Chou,K.C. (2008) PseAAC: a flexible web serverfor generating various kinds of protein pseudo amino acidcomposition. Anal. Biochem., 373, 386–388.

33. Ren,B. (2003) Atomic-level-based AI topological descriptors forstructure-property correlations. J. Chem. Inf. Comput. Sci., 43,161–169.

34. Fernandez,L., Caballero,J., Abreu,J.I. and Fernandez,M. (2007)Amino acid sequence autocorrelation vectors andBayesian-regularized genetic neural networks for modeling proteinconformational stability: gene V protein mutants. Proteins, 67,834–852.

35. Niwa,T. (2006) Elucidation of characteristic structural featuresof ligand binding sites of protein kinases: a neural networkapproach. J. Chem. Inf. Model, 46, 2158–2166.

36. Niu,B., Jin,Y., Lu,L., Fen,K., Gu,L., He,Z., Lu,W., Li,Y. andCai,Y. (2009) Prediction of interaction between small moleculeand enzyme using AdaBoost. Mol. Divers, 13, 313–320.

37. Todeschini,R. and Consonni,V. (2000) Handbook of MolecularDescriptors. Wiley-VCH, Weinheim.

38. Gromiha,M.M. and Suwa,M. (2006) Influence of amino acidproperties for discriminating outer membrane proteins at betteraccuracy. Biochim. Biophys. Acta, 1764, 1493–1497.

39. Huang,L.T. and Gromiha,M.M. (2008) Analysis and predictionof protein folding rates using quadratic response surface models.J. Comput. Chem., 29, 1675–1683.

40. Gromiha,M.M. (2003) Importance of native-state topology fordetermining the folding rate of two-state proteins. J. Chem. Inf.Comput. Sci., 43, 1481–1485.


update of profeat: a web server for computing structural ... · atomic-level topological...

Documents