

0925-2312/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2005.08.005
E-mail address: [email protected].

Neurocomputing 69 (2006) 866–868

www.elsevier.com/locate/neucom

Letters

Machine learning algorithms for T-cell epitopes prediction

Loris Nanni

DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy

Received 13 June 2005; received in revised form 22 August 2005; accepted 24 August 2005

Available online 27 December 2005

Communicated by R.W. Newcomb

Abstract

The T-cell receptor, a major histocompatibility complex (MHC) molecule, and a bound antigenic peptide play major roles in the process of antigen-specific T-cell activation. Performance of the various classifiers was compared by the area under their receiver operating characteristic curve. The fusion of machine-learning-type classifiers showed improved performance over the best results previously published in the literature. In particular, the best performance is achieved by combining support vector machine and support vector data description.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Support vector data description; Support vector machine; T-cell epitope prediction

1. Introduction

Deciphering the patterns of peptides that elicit a major histocompatibility complex (MHC) restricted T-cell response is critical for vaccine development. In the past, a number of methods have been developed to study the interaction between peptide and MHC. Broadly, these methods are based on:

• structural information [7];
• mathematical approaches including binding motifs [2];
• quantitative matrices [10];
• artificial neural networks (ANN) [8];
• support vector machines (SVM) [12].

MHC binders are not always T-cell epitopes, however. Efforts to predict candidate T-cell epitopes directly have used ANN and SVM [3,4,12]. The method proposed in [12] combines a complex feature extraction, based on physical properties of the amino acids, with a support vector machine (SVM) as classifier. In [4] a new peptide encoding scheme is proposed for use with support vector machines for the direct recognition of T-cell epitopes. The method enables presentation of information on both amino acid positions in peptides and similarity between amino acids through the use of sparse indicator vectors and the BLOSUM50 matrix [4].

In this work, the encoding scheme based on the BLOSUM50 matrix is investigated. The computational results demonstrate the superior performance of this encoding scheme in comparison with the methods proposed in the literature. Moreover, we show that a simple fusion between SVM and support vector data description (SV) yields an error under the receiver operating characteristic curve (EAUC) of 6.6, which is lower than that of the best previous approach (7.6 [12]).

In particular, in this paper we study two one-class classifiers: support vector data description (SV) and linear programming description (LP). The problem in one-class classification is to make a description of a target set of objects. The difference from conventional classification is that in one-class classification only examples of one class are available. The objects from this class are called the target objects; all other objects are by definition the outlier objects. These classifiers were selected because they have shown very good performance relative to the SVM, which is considered to be the state of the art in pattern recognition. In addition, to the best of our knowledge, they have never been applied to any bioinformatics tasks. As the pattern-recognition approach in bioinformatics has made considerable progress, it is natural to explore new, still untried techniques. We are aware that no single method can solve the task to be tackled. Therefore, our intention is to find methods that can be combined. It is believed [6] that classifiers based on different methodologies offer complementary information about the patterns to be classified. In Section 3, we show that the fusion of SV and SVM using the "Sum Rule" [5] (SUM) outperforms the best result published in the literature. This system is detailed in Fig. 1.

Fig. 1. System proposed (blocks: amino acids; feature extraction based on BLOSUM50; SVM; SV; SUM).

The rest of the paper is organized as follows: Section 2 gives a brief description of the methods combined and tested in this work, Section 3 discusses the results of the experiments, and Section 4 draws some conclusions.

2. System

One of the important elements that influences the effectiveness of a pattern recognition model is the design of the encoding method for the data. In this work, we study the encoding technique that combines the amino acid substitution matrix BLOSUM50 with the sequence order of the amino acid composition. In the standard orthonormal representation [9], each amino acid is represented by a 20-bit vector with 19 bits set to zero and one bit set to one, so that each amino acid vector is orthogonal to all other amino acid vectors. The encoding studied here is obtained by replacing each non-zero entry in the orthonormal encoding by the corresponding diagonal entry of the BLOSUM50 matrix. The BLOSUM50 scores contain prior knowledge about which amino acids are similar or dissimilar to each other in distantly related proteins.
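As an illustration of the encoding described above, the following Python sketch builds the position-wise feature vector of a peptide. It is only a sketch under stated assumptions: the reduced three-letter alphabet, the placeholder diagonal scores, and the names `TOY_ALPHABET`, `TOY_DIAGONAL`, and `encode_peptide` are hypothetical and introduced here for illustration. They are not the actual BLOSUM50 entries; a real encoder would use all 20 amino acids and the diagonal of BLOSUM50.

```python
from typing import Dict, List, Sequence

# Hypothetical reduced alphabet and placeholder diagonal scores (illustration only).
# A real encoder would use the 20 amino acids and the BLOSUM50 diagonal.
TOY_ALPHABET: Sequence[str] = ("A", "C", "W")
TOY_DIAGONAL: Dict[str, float] = {"A": 1.0, "C": 2.0, "W": 3.0}


def encode_peptide(peptide: str,
                   alphabet: Sequence[str] = TOY_ALPHABET,
                   diagonal: Dict[str, float] = TOY_DIAGONAL) -> List[float]:
    """Concatenate one block per residue, in sequence order.

    Each block has the orthonormal (one-hot) layout, but its single
    non-zero entry holds the diagonal score of that residue instead of 1.
    """
    index = {aa: i for i, aa in enumerate(alphabet)}
    features: List[float] = []
    for residue in peptide:
        block = [0.0] * len(alphabet)
        block[index[residue]] = diagonal[residue]
        features.extend(block)
    return features


encoded = encode_peptide("CAW")
```

With the full alphabet, a peptide of length n would give a vector of 20·n entries, of which n are non-zero.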

2.1. Support vector data description (SV) [11]

To describe the domain of a dataset, we enclose the data in a hypersphere of minimum volume. By minimizing the volume of the captured feature space, we hope to minimize the chance of accepting outlier objects. Analogous to the "standard" SVMs, we can replace the inner products (x · y) by kernel functions K(x, y), which gives a much more flexible method. The Gaussian kernel in particular appears to provide a good data transformation. This classifier is implemented as in dd_tools 0.95 (www-ict.ewi.tudelft.nl/~davidt/dd_tools.html).
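For reference, the minimum-volume hypersphere can be written in the form commonly used for support vector data description [11]. The symbols below (center a, radius R, slack variables ξ_i, trade-off parameter C, and kernel width s) are notation assumed here; the letter itself does not state the formulation.

```latex
\begin{aligned}
\min_{R,\,a,\,\xi}\quad & R^2 + C \sum_i \xi_i \\
\text{subject to}\quad & \lVert x_i - a \rVert^2 \le R^2 + \xi_i,
\qquad \xi_i \ge 0 \quad \text{for all } i,
\end{aligned}
\qquad\qquad
K(x, y) = \exp\!\left( -\frac{\lVert x - y \rVert^2}{s^2} \right)
```

Under this formulation, a new object is accepted as a target object when its distance to the center a, computed through the kernel, does not exceed R.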

2.2. Linear programming description (LP) [11]

This data descriptor is specifically constructed to describe target objects which are represented in terms of distances to other objects. In some cases it might be much easier to define distances between objects than informative features. The classifier has basically the following form:

$$f(x) = \sum_i w_i \, d(x, x_i) \tag{1}$$

The weights w are optimized such that just a few weights stay non-zero, and the boundary is as tight as possible around the data. This classifier is implemented as in dd_tools 0.95 (www-ict.ewi.tudelft.nl/~davidt/dd_tools.html).
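To make Eq. (1) concrete, the sketch below evaluates f(x) for given weights. It assumes a Euclidean distance and uses hypothetical training objects and weights; in the actual method the weights result from a linear program, which is not solved here. The names `euclidean` and `lp_descriptor_output` are illustrative and are not part of dd_tools.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance; any other dissimilarity d(x, x_i) could be substituted."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def lp_descriptor_output(x: Vector,
                         training_objects: Sequence[Vector],
                         weights: Sequence[float],
                         distance: Callable[[Vector, Vector], float] = euclidean) -> float:
    """Evaluate f(x) = sum_i w_i * d(x, x_i) as in Eq. (1)."""
    return sum(w * distance(x, xi) for w, xi in zip(weights, training_objects))


# Hypothetical training objects and sparse weights (only two are non-zero).
training_objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
weights = [0.5, 0.0, 0.5]
score = lp_descriptor_output((0.0, 0.0), training_objects, weights)
```

Objects with zero weight do not contribute to f(x), which is why a sparse weight vector keeps the descriptor cheap to evaluate.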

2.3. Support vector machine [1]

The goal is to establish the equation of a hyperplane that divides the training set, leaving all the points of the same class on the same side while maximizing the distance between the two classes and the hyperplane. Given a linearly separable set S, the optimal separating hyperplane is the separating hyperplane for which the distance of the closest point of S is maximum.

If the set S is not linearly separable, the problem can be solved by introducing Q nonnegative (slack) variables p and a parameter C that can be regarded as a regularization parameter. The main mathematical property of SVMs is that they map the input vectors into a high-dimensional feature space and construct an optimal hyperplane, which maximizes the margin. The mapping is performed by a kernel function which defines an inner product in the new space. In this paper we use the radial basis function kernel.
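The role of the kernel can be sketched in a few lines of Python. The example assumes the common parameterization K(x, y) = exp(−γ‖x − y‖²) of the radial basis function kernel and the usual kernel expansion of the decision function; the support vectors, coefficients, bias, and γ below are hypothetical values, not parameters from the experiments.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, gamma: float = 0.5) -> float:
    """Radial basis function kernel, K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)


def svm_decision(x: Vector,
                 support_vectors: Sequence[Vector],
                 dual_coefs: Sequence[float],
                 bias: float,
                 gamma: float = 0.5) -> float:
    """Kernel expansion f(x) = sum_i (alpha_i * y_i) * K(x_i, x) + b."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for c, sv in zip(dual_coefs, support_vectors)) + bias


# Hypothetical trained model: one positive and one negative support vector.
support_vectors = [(0.0, 0.0), (2.0, 0.0)]
dual_coefs = [1.0, -1.0]  # alpha_i * y_i
bias = 0.0

near_positive = svm_decision((0.0, 0.0), support_vectors, dual_coefs, bias)
near_negative = svm_decision((2.0, 0.0), support_vectors, dual_coefs, bias)
```

The sign of the decision value gives the predicted class, while the value itself can serve as the score that is later normalized and fused.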

3. Experiments

All the tests have been conducted on the same dataset used in [12,4]. Peptides were synthesized by the simultaneous multiple peptide synthesis method and characterized using HPLC and mass spectrometry. LAU203-1.5 is an A*0201-restricted T-cell clone (TCC) from tumor-infiltrated lymph node cells of a melanoma patient. Two hundred and three synthetic peptides were selected based on results using single- and multiple-amino acid substitutions and combinatorial peptide library experiments with a chromium release antigen recognition assay. These peptides were tested against the LAU203-1.5 clone using the same assay. A peptide with percentage-specific lysis higher than 10% was considered positive.

Due to the imbalance of the two classes in the data set (36 stimulatory peptides and 167 non-stimulatory peptides), we first divided the data into positive and negative groups.

Table 1
Error under the receiver operating characteristic curve (EAUC) obtained by several methods in the T-cell epitopes prediction problem

Method   EAUC    Method       EAUC
SCO      16.7    LP           9
SZH      8.1     LP+SBL       6.8
SBL      7.6     SV+SBL       6.6
SV       9.3     SV+LP+SBL    7.3

Note that SBL is the best result published in the literature; our method SV+SBL outperforms SBL.

Then, in each group, random sampling was used to select 80% of the peptides for training and 20% as a test set. Finally, the positive and negative groups were combined, separately for the training and test sets. This procedure was repeated independently 10 times. The parameters of the classifiers were obtained using leave-one-out validation [1] on the training set. Both the training set and the validation set are used to train the classifier when we classify the test set.
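The sampling protocol can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the peptide identifiers are placeholders, the fixed random seed is an assumption made for reproducibility, and rounding the 80% share to the nearest integer is also an assumption, since the text does not say how fractional counts were handled.

```python
import random
from typing import List, Optional, Sequence, Tuple

Labeled = Tuple[str, int]


def stratified_split(positives: Sequence[str],
                     negatives: Sequence[str],
                     train_fraction: float = 0.8,
                     rng: Optional[random.Random] = None
                     ) -> Tuple[List[Labeled], List[Labeled]]:
    """Sample each class separately, then merge the per-class parts."""
    rng = rng or random.Random()
    train: List[Labeled] = []
    test: List[Labeled] = []
    for label, group in ((1, positives), (0, negatives)):
        shuffled = list(group)
        rng.shuffle(shuffled)
        n_train = round(train_fraction * len(shuffled))  # rounding is an assumption
        train += [(item, label) for item in shuffled[:n_train]]
        test += [(item, label) for item in shuffled[n_train:]]
    return train, test


# Placeholder identifiers standing in for the 36 stimulatory and
# 167 non-stimulatory peptides of the dataset.
positives = [f"pos_{i}" for i in range(36)]
negatives = [f"neg_{i}" for i in range(167)]

rng = random.Random(0)  # fixed seed, chosen here only for reproducibility
splits = [stratified_split(positives, negatives, 0.8, rng) for _ in range(10)]
```

Sampling within each class keeps the proportion of stimulatory peptides roughly equal in the training and test sets of every repetition.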

Table 1 reports the EAUC [11] obtained by methods previously published in the literature [12,4]:

• Score matrix (SCO);
• SVM using physical properties of the amino acids as features (SZH);
• SVM using the features based on BLOSUM50 [4] (SBL).

And by:

• Support vector data description (SV);
• Linear programming description (LP);
• Sum Rule between SV and SBL (SV+SBL);
• Sum Rule between LP and SBL (LP+SBL);
• Sum Rule among SV, LP, and SBL (SV+LP+SBL).

Note that the scores of the classifiers are normalized to mean 0 and variance 1 before the fusion.
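The normalization and fusion steps can be sketched as below. Several points are assumptions made for this illustration: the population standard deviation is used for the normalization, every classifier is oriented so that a higher score means "more likely an epitope", and the EAUC is read as 100·(1 − AUC). The scores and labels are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to mean 0 and variance 1 (population variance assumed)."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]


def sum_rule(*score_lists: Sequence[float]) -> List[float]:
    """Sum Rule fusion: add the normalized scores of the classifiers per sample."""
    normalized = [z_normalize(scores) for scores in score_lists]
    return [sum(values) for values in zip(*normalized)]


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def eauc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Error under the ROC curve, taken here as 100 * (1 - AUC) (an assumption)."""
    return 100.0 * (1.0 - auc(scores, labels))


# Hypothetical test-set scores of two classifiers and the true labels.
svm_scores = [2.0, 1.0, 0.5, -1.0]
sv_scores = [0.2, 0.9, 0.1, 0.0]
labels = [1, 1, 0, 0]

fused = sum_rule(svm_scores, sv_scores)
fused_eauc = eauc(fused, labels)
```

Normalizing first keeps a classifier with a wide score range from dominating the sum.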

4. Conclusions

In this paper, we investigated the fusion of classifiers, applied to T-cell epitopes prediction, and tested it on a real-world dataset. It is well known in the literature [6] that classifier ensembles that enforce diversity fare better than the ones that do not. To enforce diversity, we combined classifiers based on different methodologies. It is believed that classifiers based on different methodologies or different features offer complementary information about the patterns to be classified. We proposed an ensemble of classifiers that combines one-class classifiers (SV and LP) and a "standard" SVM. The obtained results are very encouraging: all the combinations tested (SV+SBL, LP+SBL, SV+LP+SBL) yield an error under the receiver operating characteristic curve lower than that of the best previous approaches [12,4]. The best performance is obtained by SV+SBL. It is interesting to note that the performance of SV+LP+SBL is worse than that of SV+SBL. A possible explanation is that the average "diversity" [6] among these three classifiers is lower than the diversity between SV and SBL. Moreover, taken individually, both SV and LP obtain an EAUC higher than that obtained by SBL.

References

[1] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, New York, 2001.
[2] J. Hammer, New methods to predict MHC-binding sequences within protein antigens, Curr. Opinion Immunol. 7 (2) (1995) 263–269.
[3] M.C. Honeyman, V. Brusic, N.L. Stone, L.C. Harrison, Neural network-based prediction of candidate T-cell epitopes, Nature Biotechnol. 16 (10) (1998) 966–969.
[4] L. Huang, Y. Dai, A support vector machine approach for prediction of T cell epitopes, in: Proceedings of the Third Asia-Pacific Bioinformatics Conference (APBC2005), 17–21 January, Singapore, 2005, pp. 312–328.
[5] J. Kittler, M. Hatef, R. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20 (3) (1998) 226–239.
[6] L.I. Kuncheva, Diversity in multiple classifier systems, Inf. Fusion 6 (1) (2005) 3–4.
[7] D.R. Madden, The three-dimensional structure of peptide-MHC complexes, Ann. Rev. Immunol. 13 (5) (1995) 587–622.
[8] M. Milik, D. Sauer, A.P. Brunmark, L. Yuan, A. Vitiello, M.R. Jackson, P.A. Peterson, J. Skolnick, C.A. Glass, Application of an artificial neural network to predict specific class I MHC binding peptide sequences, Nature Biotechnol. 16 (8) (1998) 753–756.
[9] T. Rognvaldsson, L. You, Why neural networks should not be used for HIV-1 protease cleavage site prediction, Bioinformatics 20 (11) (2003) 1702–1709.
[10] T. Sturniolo, E. Bono, J. Ding, L. Raddrizzani, O. Tuereci, U. Sahin, M. Braxenthaler, F. Gallazzi, M.P. Protti, F. Sinigaglia, J. Hammer, Generation of tissue-specific and promiscuous HLA ligand databases using DNA microarrays and virtual HLA class II matrices, Nature Biotechnol. 17 (6) (1999) 555–561.
[11] D.M.J. Tax, One-class classification; concept-learning in the absence of counter-examples, Delft University of Technology, June 2001, ISBN: 90-75691-05-x.
[12] Y. Zhao, C. Pinilla, D. Valmori, R. Roland Martin, R. Simon, Application of support vector machines for T-cell epitopes prediction, Bioinformatics 19 (15) (2003) 1978–1984.

Loris Nanni is a Ph.D. Candidate in Computer Engineering at the University of Bologna, Italy. He received his Master Degree cum laude in 2002 from the University of Bologna. In 2002 he started his Ph.D. in Computer Engineering at DEIS, University of Bologna. His research interests include pattern recognition and biometric systems (fingerprint classification and recognition, signature verification, face recognition).