Estimation of protein secondary structure from FTIR spectra using neural networks
Mete Severcan a,*, Feride Severcan b, Parvez I. Haris c
a Electrical and Electronics Engineering Department, Middle East Technical University, Ankara 06531, Turkey
b Department of Biology, Middle East Technical University, Ankara 06531, Turkey
c Department of Biological Sciences, De Montfort University, Leicester LE1 9BH, UK
Received 31 August 2000; revised 15 January 2001; accepted 15 January 2001
Abstract
The secondary structure of proteins has been predicted from their Fourier transform infrared (FTIR) spectra using neural networks (NN). A leave-one-out approach has been used to demonstrate the applicability of the method. A form of cross-validation is used to train the NN to prevent the overfitting problem. Multiple neural network outputs are averaged to reduce the variance of predictions. The networks realized have been tested, and rms errors of 7.7% for α-helix, 6.4% for β-sheet and 4.8% for turns have been achieved. These results indicate that the methodology introduced is effective, and the estimation accuracies are in some cases better than those previously reported in the literature. © 2001 Elsevier Science B.V. All rights reserved.
Keywords: Protein; Secondary structure prediction; Neural networks; FTIR; Spectroscopy
1. Introduction
Currently, the complete structure of a protein can be determined at high resolution only by X-ray crystallography or multidimensional NMR spectroscopy. However, these techniques have some
disadvantages. For example, X-ray crystallography
requires high quality single crystals, which are not
always available, and the static nature of a protein
crystal may not represent the dynamic nature of a
protein in solution. In addition to this, the process is
very slow. NMR spectroscopy has the advantage that
it can be used to determine the structure of a protein in
solution. However, the interpretation of NMR spectra
of large proteins is very complicated, and the technique is presently limited to small proteins (<30 kDa).
These practical limitations and complexities encoun-
tered in high resolution structural studies of proteins
stimulated the development of low-resolution techni-
ques such as Fourier transform infrared (FTIR) spec-
troscopy, which can be utilized for estimating the
secondary structure contents of proteins very rapidly.
For the analysis of secondary structure of proteins
from FTIR spectra, the amide I region (1700–1600 cm⁻¹) is commonly utilized. Different conformational types, such as helix, sheet, turns, etc., result in different discrete bands in the amide I region, which are usually broad and overlapping. Therefore, to identify the bands from FTIR spectra, mathematical resolution enhancement techniques have to be applied. Different techniques, such as curve fitting [1], partial least squares analysis [2], factor analysis [3], and neural networks (NN) [4], have been developed to predict
Journal of Molecular Structure 565–566 (2001) 383–387
0022-2860/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved.
PII: S0022-2860(01)00505-1
www.elsevier.nl/locate/molstruc
* Corresponding author. Tel.: +90-312-210-2332; fax: +90-312-210-1261.
E-mail addresses: [email protected] (M. Severcan), [email protected] (P.I. Haris).
the secondary structure of proteins from FTIR spectra
by using the correlation between the FTIR spectral
bands and the crystallographic data for proteins
whose X-ray data is available. All of these methods
have varying degrees of advantages and disadvan-
tages [5]. As an alternative, we use a different
approach for the estimation of secondary structure
of a protein, by using NN with a different algorithm
than those reported previously [4]. The results indi-
cate that the methodology introduced is effective and
estimation accuracies are good and sometimes better
than those previously reported in the literature.
2. Experimental
2.1. Data sets
We have used the same data set as used previously
by Lee et al. [3] in the determination of protein
secondary structure using factor analysis of the
FTIR spectra of 18 water-soluble proteins recorded
in water. The secondary structure contents of these
proteins are known from X-ray crystallography. The
details about the sources of the proteins, sample
preparation for infrared spectroscopy etc. can be
found in Ref. [3].
2.2. Neural networks
In this work a feed-forward multilayer perceptron trained with the resilient backpropagation algorithm has been used to predict the secondary structures of proteins.
Resilient backpropagation training [6] has better
convergence properties compared to the conventional
backpropagation algorithm in which the convergence
heavily depends on the learning rate chosen.
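The sign-based update that gives resilient backpropagation this robustness can be sketched as follows. This is an illustrative NumPy implementation of the standard Rprop rule described in Ref. [6]; the function name, parameter names and default values are ours, not taken from the paper.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5,
                 step_max=50.0, step_min=1e-6):
    """One Rprop update: adapt each weight's step size from the sign
    of its gradient, ignoring the gradient's magnitude."""
    sign_change = grad * prev_grad
    # Same gradient sign as last time: grow the step; sign flip: shrink it.
    step = np.where(sign_change > 0,
                    np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0,
                    np.maximum(step * eta_minus, step_min), step)
    # Move each weight against its gradient by its own step size.
    w = w - np.sign(grad) * step
    # After a sign flip, standard Rprop zeroes the stored gradient so
    # the next iteration neither grows nor shrinks that step.
    prev_grad = np.where(sign_change < 0, 0.0, grad)
    return w, prev_grad, step
```

Because only the sign of the gradient is used, no global learning rate has to be tuned, which is the convergence advantage referred to above.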
The most important problem in developing a neural
network is generalization: a neural network is
expected to make good predictions when data not
present in the training set is used as an input. When
a network is not sufficiently complex, for example if it has very few hidden units, it can fail to fully detect the signal in a complicated data set, leading to underfitting. In this case high training error and high generalization error may result due to underfitting and high statistical bias. On the other hand, a complex NN with too many hidden units may yield low training error but high generalization error due to overfitting and high variance. If the network is overtrained, by forcing the error to very low levels during training, there is the danger of overfitting, in which case the NN memorizes the input data, and no good generalization can be made. Generalization depends on the size and efficiency of the training set, the architecture of the network and the physical complexity of the problem [7]. For good generalization the NN has to have a sufficiently large data set and a proper number of hidden units. In the literature one can find a number of rules of thumb for the minimum size of the training set for a network with a given number of connection weights [8,9]. However, most of the time these parameters depend on the efficiency of the data and the complexity of the problem. Therefore, they are usually determined experimentally when the data set is not sufficiently large.
In our study, due to the small size of the available data set, the leave-one-out method has been used: 17 of the spectra are used for training, and the remaining spectrum is used for testing. Each spectrum is represented by a 101-point sequence over the 1700–1600 cm⁻¹ region. A 101-input NN having a few hidden neurons requires a data set much larger than 17 spectra for training. Therefore the data set first has to be preprocessed by employing efficient data reduction methods. We have followed the method described below for data reduction.
The FTIR spectra are first range scaled to the interval between 0 and 1. Then, the discrete cosine transform (DCT) of each spectrum is taken and the first 13 coefficients, including the dc term, are retained. 13 coefficients are sufficient to recover the original spectrum with less than 1% error. Theoretically, the Karhunen–Loève transform (KLT) is the best transform for representing an ensemble of signals, using the least number of coefficients for the same reconstruction error in a statistical sense. However, it can be shown that for signals with high correlation between neighbouring samples, as in the spectra we are dealing with, the DCT closely approximates the KLT [10].
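The data reduction step can be illustrated as follows. Since the paper's data set is not reproduced here, the spectrum below is a synthetic two-band curve standing in for a real amide I profile sampled at 101 points; the band positions and widths are our assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

# Synthetic smooth two-band "spectrum" over 101 points, a stand-in
# for a real amide I profile on the 1700-1600 cm^-1 region.
x = np.linspace(0.0, 1.0, 101)
spectrum = (np.exp(-((x - 0.4) / 0.15) ** 2)
            + 0.6 * np.exp(-((x - 0.7) / 0.12) ** 2))

# Range-scale to the interval [0, 1], as described in the paper.
scaled = (spectrum - spectrum.min()) / (spectrum.max() - spectrum.min())

# Take the DCT and retain the first 13 coefficients (dc term included).
coeffs = dct(scaled, norm='ortho')
truncated = np.zeros_like(coeffs)
truncated[:13] = coeffs[:13]

# Reconstruct from the truncated coefficients and measure the error.
recovered = idct(truncated, norm='ortho')
rel_err = np.linalg.norm(recovered - scaled) / np.linalg.norm(scaled)
print(f"relative reconstruction error: {rel_err:.4f}")
```

For a smooth, highly correlated signal like this, the retained 13 coefficients reconstruct the curve to well under 1% relative error, which is the energy-compaction property the paper relies on.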
2.3. Training
Training vectors p1, p2, …, p18 corresponding to each protein spectrum are vectors of size 13 whose elements are the DCT coefficients. The target values, i.e. the desired outputs of the NNs, t1, t2, …, t18, are the X-ray characterizations of the respective proteins. In the leave-one-out method, first one of the vectors is left out as a test vector, then a NN is trained using the remaining 17 vectors and target values, and finally the resulting NN is tested with the removed vector.
For training the NN with 17 vectors, we used a cross-validation approach: we trained 17 different NNs, all starting with the same initial weight values, each time leaving a different vector out for validation and training with the remaining 16 vectors. After training, each network is tested with its validation vector and the rms error is calculated for the given set of initial values. This procedure is repeated until the rms error falls below an acceptable level. Finally, the weight values of the NN having the smallest validation error are taken as the initial values of a NN, and this network is trained with all of the 17 vectors.
The training algorithm can be summarized as
follows:
1. initialize a NN with random weights;
2. train the NN using 16 training vectors leaving one
vector out for validation;
3. test the NN with the removed 17th vector;
4. repeat steps 2 and 3 for all of the 17 training
vectors;
5. calculate the rms error of prediction;
6. repeat steps 1 to 5 for different initial weights until
rms error is below an acceptable level;
7. of the 17 networks trained, select the one with the smallest prediction error;
8. use weights obtained in step 7 as the initial weights
of a NN and train this network using all of the 17
vectors.
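The eight steps above can be sketched as follows. This is an illustrative Python version using synthetic stand-in data and scikit-learn's MLPRegressor in place of the paper's MATLAB networks; note that MLPRegressor is trained here with L-BFGS rather than resilient backpropagation, step 6 is approximated by a fixed list of seeds instead of repeating until an error threshold is met, and step 8 restarts from the winning random initialization rather than warm-starting from the trained weights. The data, target function and seeds are our assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: 18 "spectra", each reduced to 13 DCT
# coefficients, with a made-up fractional target (e.g. % helix).
X = rng.normal(size=(18, 13))
y = 0.5 + 0.05 * X[:, 0]

def make_net(seed):
    # 13 inputs -> 3 hidden neurons -> 1 output, as in the paper.
    return MLPRegressor(hidden_layer_sizes=(3,), solver='lbfgs',
                        max_iter=500, random_state=seed)

def train_with_inner_validation(X_tr, y_tr, seeds=(0, 1)):
    """Steps 1-7: for each random initialization, run an inner
    leave-one-out pass over the 17 training vectors and keep the
    initialization with the smallest validation rms error."""
    best_seed, best_rms = None, np.inf
    for seed in seeds:
        sq_errs = []
        for j in range(len(X_tr)):
            mask = np.arange(len(X_tr)) != j
            net = make_net(seed).fit(X_tr[mask], y_tr[mask])
            sq_errs.append((net.predict(X_tr[~mask])[0] - y_tr[j]) ** 2)
        rms = np.sqrt(np.mean(sq_errs))
        if rms < best_rms:
            best_seed, best_rms = seed, rms
    # Step 8 (approximated): retrain from the winning initialization
    # on all 17 training vectors.
    return make_net(best_seed).fit(X_tr, y_tr)

# Outer leave-one-out over all 18 spectra.
preds = []
for i in range(len(X)):
    mask = np.arange(len(X)) != i
    model = train_with_inner_validation(X[mask], y[mask])
    preds.append(model.predict(X[~mask])[0])
rms = np.sqrt(np.mean((np.array(preds) - y) ** 2))
print(f"leave-one-out rms error: {rms:.3f}")
```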
After the training is finished, the resulting NN is tested using the left-out 18th spectrum. To test the performance of the method, the above stated procedure is repeated for each spectrum and the rms error of prediction is calculated.
To further improve the prediction accuracy we have
generated a jury of 10 networks for each prediction
and averaged the predicted values. Such an approach
to improve prediction accuracy is known as decision
fusion or statistical combining of multiple NN
outputs. The reason for the improvement can be explained as follows: the error terms of networks that have fallen into different local minima will not be strongly
Table 1
Neural network prediction results

                          % α-Helix                  % β-Sheet                  % Turns
                          X-ray  Lee et  Present    X-ray  Lee et  Present    X-ray  Lee et  Present
                                 al.     study             al.     study             al.     study
Alcohol dehydrogenase     29     26      21         40     40      38         19     19      19
Trypsin inhibitor         26     46      38         45     26      33         16     13      19
Carbonic anhydrase        16      6      10         45     54      57         25     23      22
Concanavalin A             3     13       8         65     57      59         22     28      24
Chymotrypsinogen          12     20      18         49     45      44         23     24      21
Chymotrypsin              11     13      15         50     50      48         25     22      23
Cytochrome c              49     53      59         11     17      14         22     17      14
Elastase                  10     11      13         46     51      53         28     21      21
Hemoglobin                86     76      75          0     16       5          8      4      13
Insulin                   61     53      50         15     20      19         12     14      16
Lysozyme                  46     54      60         19      7      10         23     19      13
Myoglobin                 88     94      90          0      2       1         13      7       4
Nuclease                  26     32      16         37     36      42         23     22      22
Prealbumin                 6     11       8         61     40      55         19     25      23
Papain                    28     23      30         29     34      31         18     22      20
Protease                  11     13      17         57     49      48         18     24      19
Ribonuclease A            23     19      19         46     49      47         21     16      22
Ribonuclease S            23     18      17         53     49      47         15     21      23
rms error                  –      7.8     7.7        –      9.7     6.4        –      4.3     4.8
correlated. Therefore, averaging will help reduce the error variance [11–13].
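The variance reduction from averaging weakly correlated errors can be demonstrated numerically. The sketch below draws error terms for a jury of 10 networks from an equicorrelated Gaussian model; the correlation value 0.2 and unit variance are our assumptions for illustration, not measurements from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Error terms of a jury of 10 networks: unit-variance Gaussians with
# pairwise correlation rho (weakly correlated, as argued in the text).
n_nets, n_trials, rho = 10, 100_000, 0.2
cov = np.full((n_nets, n_nets), rho) + (1.0 - rho) * np.eye(n_nets)
errors = rng.multivariate_normal(np.zeros(n_nets), cov, size=n_trials)

single_var = errors[:, 0].var()          # one network alone
jury_var = errors.mean(axis=1).var()     # jury average of 10 networks
print(f"single-net error variance: {single_var:.3f}")
print(f"10-net jury error variance: {jury_var:.3f}")
```

For equicorrelated errors the jury variance tends to rho + (1 - rho)/n = 0.2 + 0.8/10 = 0.28, versus 1.0 for a single network; fully correlated errors (rho = 1) would gain nothing, which is why falling into different local minima matters.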
3. Results and discussion
Using the method described above, NNs were trained using the MATLAB Neural Network Toolbox. The resilient backpropagation algorithm was used during training. The networks had 13 inputs, one hidden layer, and an output layer with one neuron. It was found that three neurons in the hidden layer resulted in smaller prediction errors. The standard sigmoid transfer function has been modified to avoid the problem caused by target values that are zero or close to zero. For the prediction of each structural component a separate neural network was used. NNs with three outputs did not perform as well as networks with a single output.
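The paper does not spell out how the sigmoid was modified. One common fix, shown here purely as an illustration and not necessarily the authors' choice, is to stretch the logistic function so its range extends slightly beyond [0, 1]; a target of exactly zero (e.g. 0% β-sheet for hemoglobin in Table 1) then has a finite preimage instead of sitting at an asymptote. The eps value below is our assumption.

```python
import numpy as np

def stretched_sigmoid(x, eps=0.1):
    """Logistic function rescaled to output in (-eps, 1 + eps), so a
    target of exactly 0 or 1 can be reached at a finite input instead
    of only as x tends to +/- infinity."""
    return (1.0 + 2.0 * eps) / (1.0 + np.exp(-x)) - eps
```

With eps = 0.1 the output 0 is attained at the finite input x = -ln(11), whereas the standard sigmoid would force the weighted input toward minus infinity for the same target.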
The training parameters of the NNs were chosen as:

sum squared error goal sse = 0.025;
initial weight change delta0 = 0.1 (0.005 for training in step 8);
maximum weight change deltamax = 0.1 (0.005 for training in step 8);
increment to weight change delta_inc = 1.2;
decrement to weight change delta_dec = 0.5;
minimum gradient min_grad = 0.
The maximum number of epochs was chosen as 2000, although this number was never reached. The predictions obtained are presented in Table 1. The table also includes comparisons with the previous results obtained by Lee et al. [3] using the factor analysis method for the same proteins. Factor analysis is one of the methods used for the determination of protein secondary structure from the FTIR spectra of proteins.
As seen from the table, the estimation accuracies obtained are as good as, and sometimes better than, those reported by Lee et al. [3]. The rms errors for % α-helix, % β-sheet and % turns in our case are 7.7, 6.4, and 4.8, respectively, as compared to 7.8, 9.7 and 4.3 for the factor analysis results of Lee et al. [3]. Lee et al., using the same factor analysis approach but leaving out trypsin inhibitor from the calibration set, obtained better estimates only for α-helix (3.9). The rms values for the β-sheet and turn structures (8.3 and 6.6, respectively) are not better than those obtained in this study. Since our objective is to develop an algorithm that can be utilized for a broad class of proteins, we have decided to compare the performance of the NN method with the factor analysis approach for the case where all available protein spectra, including trypsin inhibitor, are included in the calibration set. Under such conditions, the NN approach shows better prediction for α-helix and β-sheet structures. The precise reason for the differences between the two methods is as yet unclear.
The factor analysis method has several advantages over the other methods. It does not use deconvolution, which sometimes may generate negative side lobes on the absorption bands or artificial bands due to noise. Manipulation of the spectrum is kept to a minimum and no curve fitting is necessary. In addition, no amide I band components need to be assigned. All of these advantages carry over to the present NN method. Due to the unavailability of a large data set, the number of inputs of the NN used in the present work had to be kept small. For this reason, a data reduction method, in particular the DCT, has been utilized. Although the DCT compacts most of the information present in each spectrum into a small number of coefficients, it does not preserve local properties of the spectrum, as is also the case with the discrete Fourier transform. Therefore, from each DCT coefficient it would not be possible to get any information on the localisation of spectral peaks. The NN makes use of the DCT coefficients solely for pattern recognition purposes. However, when the size of the data set becomes large enough, other transforms such as wavelets or the Gabor transform, which can also give information about the local behavior of the spectra, could be used.
4. Conclusion
In the present study we have developed an accurate
infrared method using NNs for determining the
secondary structure of proteins in water in terms of
% α-helix, % β-sheet, and % turns. The target values
of the NN outputs are taken as the known structure
values obtained from X-ray crystallography. The
effectiveness of the method is tested using leave-
one-out strategy. Using this procedure we have
obtained results that are in some cases as good as, if
not better than, those previously reported in the
literature. Work is in progress to further improve the accuracy of our method, which should be significantly enhanced once the number of FTIR spectra of proteins in the reference set is increased.
Acknowledgements
This work has been supported by British Council
Academic Link Program and State Planning Organi-
zation of Turkey (project number: DPT98K112530/
AFP98010805).
References
[1] D.M. Byler, H. Susi, Biopolymers 25 (1986) 469.
[2] F. Dousseau, M. Pézolet, Biochemistry 29 (1990) 8771.
[3] D.C. Lee, P.I. Haris, D. Chapman, R.C. Mitchell, Biochemistry 29 (1990) 9185.
[4] P. Pancoska, J. Kubelka, T.A. Keiderling, Appl. Spectrosc. 53 (1999) 655.
[5] W.K. Surewicz, H.H. Mantsch, D. Chapman, Biochemistry 32 (1993) 389.
[6] M. Riedmiller, H. Braun, in: Proceedings of the IEEE International Conference on Neural Networks, San Francisco, 1993, pp. 586–591.
[7] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, Englewood Cliffs, NJ, 1994, pp. 176–179.
[8] E.B. Baum, D. Haussler, Neural Comput. 1 (1989) 151.
[9] R. Lange, R. Maenner, in: Proceedings of the International Conference on Artificial Neural Networks, London, 1994, pp. 497–500.
[10] A.K. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1986, pp. 150–154.
[11] K. Tumer, J. Ghosh, in: A. Sharkey (Ed.), Combining Artificial Neural Networks, Springer, Berlin, 1999, pp. 127–162.
[12] J.A. Benediktsson, J.R. Sveinsson, O.K. Ersoy, P.H. Swain, IEEE Trans. Neural Networks 8 (1997) 54.
[13] M.P. Perrone, L.N. Cooper, in: R.J. Mammone (Ed.), Artificial Neural Networks for Speech and Vision, Chapman & Hall, London, 1993, pp. 126–141.