ARTICLE IN PRESS
0925-2312/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2005.08.005
E-mail address: [email protected].
Neurocomputing 69 (2006) 866–868
www.elsevier.com/locate/neucom
Letters
Machine learning algorithms for T-cell epitopes prediction
Loris Nanni
DEIS, IEIIT—CNR, Universita di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy
Received 13 June 2005; received in revised form 22 August 2005; accepted 24 August 2005
Available online 27 December 2005
Communicated by R.W. Newcomb
Abstract
The T-cell receptor, a major histocompatibility complex (MHC) molecule, and a bound antigenic peptide play major roles in the
process of antigen-specific T-cell activation. The performance of the various classifiers was compared by the area under their receiver
operating characteristic curve. The fusion of machine-learning classifiers showed improved performance over the best results
previously published in the literature. In particular, the best performance is achieved by combining a support vector machine and
support vector data description.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Support vector data description; Support vector machine; T-cell epitope prediction
1. Introduction
Deciphering the patterns of peptides that elicit a major histocompatibility complex (MHC) restricted T-cell response is critical for vaccine development. In the past, a number of methods have been developed to study the interaction between peptide and MHC. Broadly, these methods are based on:
• structural information [7];
• mathematical approaches including binding motifs [2];
• quantitative matrices [10];
• artificial neural networks (ANN) [8];
• support vector machines (SVM) [12].
MHC binders are not always T-cell epitopes, however. Several efforts have therefore been made to predict candidate T-cell epitopes directly, using ANN and SVM classifiers [3,4,12]. The method proposed in [12] combines a complex feature extraction, based on physical properties of the amino acids, with a support vector machine (SVM) as classifier. In [4], a new peptide encoding scheme is proposed for use with support vector machines for the direct recognition of T-cell epitopes. The method enables the presentation of information on both amino acid positions in peptides and similarity between amino acids through the use of sparse indicator vectors and the BLOSUM50 matrix [4].
In this work, the encoding scheme based on the
BLOSUM50 matrix is investigated. The computational results demonstrate the superior performance of this encoding scheme in comparison with the methods proposed in the literature. Moreover, we show that a simple fusion between SVM and support vector data description (SV) yields an error under the receiver operating characteristic curve (EAUC) of 6.6, which is lower than that of the best previous approach (7.6 [12]).
In particular, in this paper we study two one-class classifiers: support vector data description (SV) and linear programming description (LP). The problem in one-class classification is to make a description of a target set of objects. The difference from conventional classification is that in one-class classification only examples of one class are available. The objects from this class are called the target objects; all other objects are by definition the outlier objects. These classifiers were selected because they have performed very well in comparison with the SVM, which is considered to be the state of the art in pattern recognition. In addition, to the best of our knowledge, they have never been applied to any bioinformatics tasks. As the pattern-recognition approach in bioinformatics has made considerable progress, it is natural to explore new, still untried techniques. We are aware that no single method can solve the task to be tackled. Therefore, our intention is to find methods that can be combined. It is believed [6] that classifiers based on different methodologies offer complementary information about the patterns to be classified. In Section 3, we show that the fusion, using the "Sum Rule" [5] (SUM), between SV and SVM outperforms the best result published in the literature. This system is detailed in Fig. 1.

Fig. 1. System proposed (blocks: amino acids; feature extraction based on BLOSUM50; SVM; SV; SUM).
The rest of the paper is organized as follows: Section 2 gives a brief description of the methods combined and tested in this work, Section 3 discusses the results of the experiments, and Section 4 draws some conclusions.
2. System
One of the important elements that influences the effectiveness of a pattern recognition model is the design of the encoding method for the data. In this work, we study the encoding technique that combines the amino acid substitution matrix BLOSUM50 with the sequence order of the amino acid composition. This is achieved by replacing each non-zero entry in the orthonormal encoding model by the corresponding value appearing in the diagonal entries of the BLOSUM50 matrix. The BLOSUM50 score contains prior knowledge about which amino acids are similar or dissimilar to each other in distantly related proteins. In the standard orthonormal representation [9], each amino acid is represented by a 20-bit vector with 19 bits set to zero and one bit set to one, and each amino acid vector is orthogonal to all other amino acid vectors.
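The encoding described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `encode_peptide`, the amino acid ordering, and the diagonal values (quoted from the commonly tabulated BLOSUM50 matrix and worth checking against an authoritative copy) are assumptions introduced here.

```python
# Amino acid order is an assumption; any fixed order works as long as it is used consistently.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Diagonal entries of BLOSUM50 in the order above (assumed values, verify before use).
BLOSUM50_DIAGONAL = dict(zip(
    AMINO_ACIDS,
    [5, 7, 7, 8, 13, 7, 6, 8, 10, 5, 5, 6, 7, 8, 10, 5, 5, 15, 8, 5],
))


def encode_peptide(peptide, diagonal=BLOSUM50_DIAGONAL):
    """Concatenate one 20-dimensional block per residue.

    Each block is the orthonormal (one-hot) vector of the residue, with the
    single non-zero entry replaced by the residue's BLOSUM50 diagonal score.
    """
    features = []
    for residue in peptide:
        block = [0] * len(AMINO_ACIDS)
        block[AMINO_ACIDS.index(residue)] = diagonal[residue]
        features.extend(block)
    return features
```

A peptide of length n therefore maps to a vector of 20n values with exactly n non-zero entries, so the position of each residue is preserved by the block it occupies.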
2.1. Support vector data description (SV) [11]
To describe the domain of a dataset, we enclose the data with a hypersphere of minimum volume. By minimizing the volume of the captured feature space, we hope to minimize the chance of accepting outlier objects. Analogous to the "standard" SVMs, we can replace the inner products (x · y) by kernel functions K(x, y), which gives a much more flexible method. The Gaussian kernel in particular appears to provide a good data transformation. This classifier is implemented as in dd_tools 0.95 (www-ict.ewi.tudelft.nl/~davidt/dd_tools.html).
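To make the hypersphere idea concrete, the sketch below evaluates the squared kernel-space distance from a test object to the sphere center, given support objects and their weights. It does not call dd_tools and does not solve the optimization that finds the weights: the function names, the kernel parameterization exp(-||x - y||^2 / sigma^2), and the uniform weights used in the usage note are assumptions for illustration only.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel; this parameterization of the width is an assumption."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / sigma ** 2)


def squared_distance_to_center(z, support, alphas, sigma=1.0):
    """Squared distance between z and the center a = sum_i alpha_i * phi(x_i).

    Expands ||phi(z) - a||^2 = K(z,z) - 2 sum_i alpha_i K(z,x_i)
                               + sum_ij alpha_i alpha_j K(x_i,x_j).
    """
    k_zz = gaussian_kernel(z, z, sigma)
    cross = sum(a * gaussian_kernel(z, x, sigma) for a, x in zip(alphas, support))
    center_norm = sum(
        ai * aj * gaussian_kernel(xi, xj, sigma)
        for ai, xi in zip(alphas, support)
        for aj, xj in zip(alphas, support)
    )
    return k_zz - 2.0 * cross + center_norm
```

An object would be accepted as a target when this distance does not exceed the squared radius of the sphere. With hypothetical support objects `[(0, 0), (1, 0), (0, 1)]` and uniform weights of 1/3 each, a point close to the data such as `(0.3, 0.3)` gets a smaller distance than a remote point such as `(5, 5)`.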
2.2. Linear programming description (LP) [11]
This data descriptor is specifically constructed to describe target objects which are represented in terms of distances to other objects. In some cases it might be much easier to define distances between objects than informative features. The classifier has basically the following form:

$$f(x) = \sum_i w_i \, d(x, x_i). \tag{1}$$

The weights w are optimized such that just a few weights stay non-zero, and the boundary is as tight as possible around the data. This classifier is implemented as in dd_tools 0.95 (www-ict.ewi.tudelft.nl/~davidt/dd_tools.html).
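Eq. (1) can be evaluated with a few lines of code once the weights are known. The sketch below only computes the weighted sum of distances; it does not reproduce the linear program that selects the weights, and the function name, the Euclidean distance, and the example weights are assumptions made for illustration.

```python
import math


def lp_description_score(x, prototypes, weights, distance=math.dist):
    """Evaluate f(x) = sum_i w_i * d(x, x_i) for given prototypes and weights.

    `distance` defaults to the Euclidean distance, which is an assumption;
    any dissimilarity between objects could be supplied instead.
    """
    return sum(w * distance(x, p) for w, p in zip(weights, prototypes))
```

Because most weights are zero after training, only the few prototypes with non-zero weight contribute to the score. For instance, with hypothetical prototypes `[(0, 0), (3, 4)]` and weights `[1.0, 0.0]`, only the distance to the first prototype matters.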
2.3. Support vector machine [1]
The goal is to establish the equation of a hyperplane that divides the training set, leaving all the points of the same class on the same side while maximizing the distance between the two classes and the hyperplane. Given a linearly separable set S, the optimal separating hyperplane is the separating hyperplane for which the distance of the closest point of S is maximum.
If the set S is not linearly separable, the problem can be solved by introducing nonnegative slack variables, one per training pattern, and a parameter C that can be regarded as a regularization parameter. The main mathematical property of SVMs is that they map the input vectors into a high-dimensional feature space and construct an optimal hyperplane, which maximizes the margin. The mapping is performed by a kernel function which defines an inner product in the new space. In this paper we use the radial basis function kernel.
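For reference, the textbook soft-margin problem that this description paraphrases can be written as below. The notation is introduced here and is not taken from the source: $\xi_i$ are the slack variables, $\phi$ is the feature map, $n$ is the number of training patterns, $y_i \in \{-1, +1\}$ are the class labels, and $\gamma$ is the kernel width parameter.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
y_i\bigl(w\cdot\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\quad i = 1,\dots,n,

K(x, y) = \phi(x)\cdot\phi(y) = \exp\bigl(-\gamma\,\lVert x - y\rVert^{2}\bigr).
```

A larger C penalizes margin violations more heavily, while a smaller C allows a wider margin at the cost of more training errors.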
3. Experiments
All the tests have been conducted on the same dataset used in [12,4]. Peptides were synthesized by the simultaneous multiple peptide synthesis method and characterized using HPLC and mass spectrometry. LAU203-1.5 is an A*0201 restricted T-cell clone (TCC) from tumor-infiltrated lymph node cells of a melanoma patient. Two hundred and three synthetic peptides were selected based on results using single- and multiple-amino acid substitutions and combinatorial peptide library experiments with a chromium release antigen recognition assay. These peptides were tested against the LAU203-1.5 clone using the same assay. A peptide with percentage-specific lysis higher than 10% was considered positive.
Due to the imbalance of the two classes in the data set (36 stimulatory peptides and 167 non-stimulatory peptides), we first divided the data into positive and negative groups.
Table 1
Error under the receiver operating characteristic curve (EAUC) obtained by several methods in the T-cell epitope prediction problem

Method       EAUC
SCO          16.7
SZH          8.1
SBL          7.6
SV           9.3
LP           9
LP+SBL       6.8
SV+SBL       6.6
SV+LP+SBL    7.3

Note: SBL is the best result published in the literature; our method SV+SBL outperforms SBL.
Then, in each group, random sampling was used to select 80% of the peptides for training and 20% as a test set. Finally, the positive and negative groups were combined separately in the training and test sets. This procedure was repeated independently 10 times. The parameters of the classifiers were obtained by leave-one-out validation [1] on the training set. Both the training and validation data are used to train the classifier when we classify the test set.
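The sampling procedure can be sketched as a per-class random split repeated with ten different seeds. This is a hedged illustration under stated assumptions: the function name, the use of seeds 0 to 9, and the rounding of the 80% cut to the nearest integer are choices made here, since the source does not specify them.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split indices into train and test sets, sampling each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Rounding rule is an assumption; the source only states 80% / 20%.
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# 36 stimulatory (1) and 167 non-stimulatory (0) peptides, as in the dataset.
labels = [1] * 36 + [0] * 167
splits = [stratified_split(labels, seed=run) for run in range(10)]
```

Splitting each class separately keeps the proportion of stimulatory peptides roughly equal in the training and test sets, which a single random split over all 203 peptides would not guarantee.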
Table 1 reports the EAUC [11] obtained by methods previously published in the literature [12,4]:
• Score matrix (SCO);
• SVM using physical properties of the amino acid as features (SZH);
• SVM using the feature based on BLOSUM50 [4] (SBL).
And by:
• Support vector data description (SV);
• Linear programming description (LP);
• Sum Rule between SV and SBL (SV+SBL);
• Sum Rule between LP and SBL (LP+SBL);
• Sum Rule among SV, LP, and SBL (SV+LP+SBL).
Please note that the scores of the classifiers are normalized to mean 0 and variance 1 before the fusion.
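The normalization and fusion steps, together with the evaluation measure, can be sketched as follows. The function names are hypothetical, and the sketch assumes that EAUC denotes 100 × (1 − AUC), with the AUC computed as the fraction of positive/negative pairs that are ranked correctly; the source does not state the formula explicitly.

```python
import statistics


def zscore(scores):
    """Normalize one classifier's scores to mean 0 and variance 1."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / std for s in scores]


def sum_rule(*score_lists):
    """Fuse classifiers by summing their normalized scores per test object."""
    normalized = [zscore(scores) for scores in score_lists]
    return [sum(column) for column in zip(*normalized)]


def eauc(labels, scores):
    """Error under the ROC curve, assumed here to be 100 * (1 - AUC)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    auc = wins / (len(positives) * len(negatives))
    return 100.0 * (1.0 - auc)
```

Normalizing before summing matters because the raw outputs of an SVM and of a one-class descriptor live on different scales; without it, the classifier with the larger score range would dominate the sum.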
4. Conclusions
In this paper, we investigated the fusion of classifiers, applied to T-cell epitope prediction, and tested it on a real-world dataset. It is well known in the literature [6] that classifier ensembles that enforce diversity fare better than the ones that do not. To enforce diversity, we combined classifiers based on different methodologies. It is believed that classifiers based on different methodologies or different features offer complementary information about the patterns to be classified. We proposed an ensemble of classifiers that combines one-class classifiers (SV and LP) and a "standard" SVM. The results obtained are very encouraging: all the combinations tested (SV+SBL, LP+SBL, SV+LP+SBL) yield an error under the receiver operating characteristic curve lower than that of the best previous approaches [12,4]. The best performance is obtained by SV+SBL. It is interesting to note that the performance of SV+LP+SBL is worse than that of SV+SBL. A possible explanation is that the average "diversity" [6] among these three classifiers is lower than the diversity between SV and SBL. Moreover, both SV and LP obtain an EAUC higher than that obtained by SBL.
References
[1] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, New York,
2001.
[2] J. Hammer, New methods to predict MHC-binding sequences within
protein antigens, Curr. Opinion Immunol. 7 (2) (1995) 263–269.
[3] M.C. Honeyman, V. Brusic, N.L. Stone, L.C. Harrison, Neural
network-based prediction of candidate T-cell epitopes, Nature
Biotechnol. 16 (10) (1998) 966–969.
[4] L. Huang, Y. Dai, A support vector machine approach for prediction
of T cell epitopes, in: Proceedings of the Third Asia-Pacific
Bioinformatics Conference (APBC2005), 17–21 January, Singapore,
2005, pp. 312–328.
[5] J. Kittler, M. Hatef, R. Duin, J. Matas, On combining classifiers,
IEEE Trans. Pattern Anal. Mach. Intell. 20 (3) (1998) 226–239.
[6] L.I. Kuncheva, Diversity in multiple classifier systems, Inf. Fusion 6
(1) (2005) 3–4.
[7] D.R. Madden, The three-dimensional structure of peptide—MHC
complexes, Ann. Rev. Immunol. 13 (5) (1995) 587–622.
[8] M. Milik, D. Sauer, A.P. Brunmark, L. Yuan, A. Vitiello, M.R.
Jackson, P.A. Peterson, J. Skolnick, C.A. Glass, Application of an
artificial neural network to predict specific class I MHC binding
peptide sequences, Nature Biotechnol. 16 (8) (1998) 753–756.
[9] T. Rognvaldsson, L. You, Why neural networks should not be used
for HIV-1 protease cleavage site prediction, Bioinformatics 20 (11)
(2003) 1702–1709.
[10] T. Sturniolo, E. Bono, J. Ding, L. Raddrizzani, O. Tuereci, U. Sahin,
M. Braxenthaler, F. Gallazzi, M.P. Protti, F. Sinigaglia, J. Hammer,
Generation of tissue-specific and promiscuous HLA ligand databases
using DNA microarrays and virtual HLA class II matrices, Nature
Biotechnol. 17 (6) (1999) 555–561.
[11] D.M.J. Tax, One-class classification; concept-learning in the absence
of counter-examples, Delft University of Technology, June 2001,
ISBN: 90-75691-05-x.
[12] Y. Zhao, C. Pinilla, D. Valmori, R. Roland Martin, R. Simon,
Application of support vector machines for T-cell epitopes prediction,
Bioinformatics 19 (15) (2003) 1978–1984.
Loris Nanni is a Ph.D. Candidate in Computer
Engineering at the University of Bologna, Italy.
He received his Master Degree cum laude in 2002
from the University of Bologna. In 2002 he started
his Ph.D. in Computer Engineering at DEIS,
University of Bologna. His research interests
include pattern recognition, and biometric systems
(fingerprint classification and recognition, signature
verification, face recognition).