Estimation of protein secondary structure from FTIR spectra using neural networks
Mete Severcan a,*, Feride Severcan b, Parvez I. Haris c
a Electrical and Electronics Engineering Department, Middle East Technical University, Ankara 06531, Turkey
b Department of Biology, Middle East Technical University, Ankara 06531, Turkey
c Department of Biological Sciences, De Montfort University, Leicester LE1 9BH, UK
Received 31 August 2000; revised 15 January 2001; accepted 15 January 2001
Abstract
The secondary structure of proteins has been predicted from their Fourier transform infrared (FTIR) spectra using neural networks (NN). A leave-one-out approach has been used to demonstrate the applicability of the method. A form of cross-validation is used to train the NN to prevent the overfitting problem. Multiple neural network outputs are averaged to reduce the variance of predictions. The networks realized have been tested, and rms errors of 7.7% for α-helix, 6.4% for β-sheet and 4.8% for turns have been achieved. These results indicate that the methodology introduced is effective, and the estimation accuracies are in some cases better than those previously reported in the literature. © 2001 Elsevier Science B.V. All rights reserved.
Keywords: Protein; Secondary structure prediction; Neural networks; FTIR; Spectroscopy
1. Introduction
Currently, the complete structure of a protein can be determined at high resolution only by X-ray crystallography or multidimensional NMR spectroscopy. However, these techniques have some
disadvantages. For example, X-ray crystallography
requires high quality single crystals, which are not
always available, and the static nature of a protein
crystal may not represent the dynamic nature of a
protein in solution. In addition to this, the process is
very slow. NMR spectroscopy has the advantage that
it can be used to determine the structure of a protein in
solution. However, the interpretation of NMR spectra
of large proteins is very complicated, and the technique is presently limited to small proteins (<30 kDa).
These practical limitations and complexities encoun-
tered in high resolution structural studies of proteins
stimulated the development of low-resolution techni-
ques such as Fourier transform infrared (FTIR) spec-
troscopy, which can be utilized for estimating the
secondary structure contents of proteins very rapidly.
For the analysis of secondary structure of proteins
from FTIR spectra, the amide I region (1700–1600 cm⁻¹) is commonly utilized. Different conformational types, such as helix, sheet, turns, etc., result in different discrete bands in the amide I region, which are usually broad and overlapping. Therefore, to identify the bands from FTIR spectra, mathematical resolution enhancement techniques have to be applied. Different techniques, such as curve fitting [1], partial least squares analysis [2], factor analysis [3], and neural networks (NN) [4], have been developed to predict
Journal of Molecular Structure 565–566 (2001) 383–387
0022-2860/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved.
PII: S0022-2860(01)00505-1
www.elsevier.nl/locate/molstruc
* Corresponding author. Tel.: +90-312-210-2332; fax: +90-312-210-1261.
E-mail addresses: [email protected] (M. Severcan), [email protected] (P.I. Haris).
the secondary structure of proteins from FTIR spectra
by using the correlation between the FTIR spectral
bands and the crystallographic data for proteins
whose X-ray data is available. All of these methods
have varying degrees of advantages and disadvan-
tages [5]. As an alternative, we use a different
approach for the estimation of secondary structure
of a protein, by using NN with a different algorithm
than those reported previously [4]. The results indi-
cate that the methodology introduced is effective and
estimation accuracies are good and sometimes better
than those previously reported in the literature.
2. Experimental
2.1. Data sets
We have used the same data set as used previously
by Lee et al. [3] in the determination of protein
secondary structure using factor analysis of the
FTIR spectra of 18 water-soluble proteins recorded
in water. The secondary structure contents of these
proteins are known from X-ray crystallography. The
details about the sources of the proteins, sample
preparation for infrared spectroscopy etc. can be
found in Ref. [3].
2.2. Neural networks
In this work a feed-forward multilayer perceptron trained with the resilient backpropagation algorithm has been used to predict the secondary structures of proteins.
Resilient backpropagation training [6] has better
convergence properties compared to the conventional
backpropagation algorithm in which the convergence
heavily depends on the learning rate chosen.
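The sign-based update that gives resilient backpropagation this robustness can be sketched as follows. This is an illustrative NumPy implementation of the standard Rprop rule described in Ref. [6]; the function name, parameter names and default values are ours, not taken from the paper.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5,
                 step_max=50.0, step_min=1e-6):
    """One Rprop update: adapt each weight's step size from the sign
    of its gradient, ignoring the gradient's magnitude."""
    sign_change = grad * prev_grad
    # Same gradient sign as last time: grow the step; sign flip: shrink it.
    step = np.where(sign_change > 0,
                    np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0,
                    np.maximum(step * eta_minus, step_min), step)
    # Move each weight against its gradient by its own step size.
    w = w - np.sign(grad) * step
    # After a sign flip, standard Rprop zeroes the stored gradient so
    # the next iteration neither grows nor shrinks that step.
    prev_grad = np.where(sign_change < 0, 0.0, grad)
    return w, prev_grad, step
```

Because only the sign of the gradient is used, no global learning rate has to be tuned, which is the convergence advantage referred to above.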
The most important problem in developing a neural
network is generalization: a neural network is
expected to make good predictions when data not
present in the training set is used as an input. When
a network is not sufficiently complex, for example if it has very few hidden units, it can fail to fully detect the signal in a complicated data set, leading to underfitting. In this case high training error and high generalization error may result due to underfitting and high statistical bias. On the other hand, a complex NN with too many hidden units may yield low training error but high generalization error due to overfitting and high variance. If the network is overtrained, by forcing the error to very low levels during training, there is the danger of overfitting, in which case the NN memorizes the input data, and no good generalization can be made. Generalization depends on the size and efficiency of the training set, the architecture of the network and the physical complexity of the problem [7]. For good generalization the NN has to have a sufficiently large data set and a proper number of hidden units. In the literature one can find a number of rules of thumb for the minimum size of the training set for a network with a given number of connection weights [8,9]. However, most of the time these parameters depend on the efficiency of the data and the complexity of the problem. Therefore, they are usually determined experimentally when the data set is not sufficiently large.
In our study, due to the small size of the available data set, the leave-one-out method has been used: 17 of the spectra are used for training, and the remaining spectrum is used for testing. Each spectrum is represented by a 101-point sequence over the 1700–1600 cm⁻¹ region. A 101-input NN having a few hidden neurons requires a data set much larger than 17 spectra for training. Therefore the data set first has to be preprocessed by employing efficient data reduction methods. We have followed the method described below for data reduction.
The FTIR spectra are first range scaled to the interval between 0 and 1. Then, the discrete cosine transform (DCT) of each spectrum is taken and the first 13 coefficients, including the dc term, are retained. 13 coefficients are sufficient to recover the original spectrum with less than 1% error. Theoretically, the Karhunen–Loève transform (KLT) is the best transform for representing an ensemble of signals, using the least number of coefficients for the same reconstruction error in a statistical sense. However, it can be shown that for signals with high correlation between neighbouring samples, as in the spectra we are dealing with, the DCT closely approximates the KLT [10].
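The data reduction step can be illustrated as follows. Since the paper's data set is not reproduced here, the spectrum below is a synthetic two-band curve standing in for a real amide I profile sampled at 101 points; the band positions and widths are our assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

# Synthetic smooth two-band "spectrum" over 101 points, a stand-in
# for a real amide I profile on the 1700-1600 cm^-1 region.
x = np.linspace(0.0, 1.0, 101)
spectrum = (np.exp(-((x - 0.4) / 0.15) ** 2)
            + 0.6 * np.exp(-((x - 0.7) / 0.12) ** 2))

# Range-scale to the interval [0, 1], as described in the paper.
scaled = (spectrum - spectrum.min()) / (spectrum.max() - spectrum.min())

# Take the DCT and retain the first 13 coefficients (dc term included).
coeffs = dct(scaled, norm='ortho')
truncated = np.zeros_like(coeffs)
truncated[:13] = coeffs[:13]

# Reconstruct from the truncated coefficients and measure the error.
recovered = idct(truncated, norm='ortho')
rel_err = np.linalg.norm(recovered - scaled) / np.linalg.norm(scaled)
print(f"relative reconstruction error: {rel_err:.4f}")
```

For a smooth, highly correlated signal like this, the retained 13 coefficients reconstruct the curve to well under 1% relative error, which is the energy-compaction property the paper relies on.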
2.3. Training
Training vectors p1, p2, …, p18 corresponding to each protein spectrum are vectors of size 13 whose elements are the DCT coefficients. The target values, i.e. the desired outputs of the NNs, t1, t2, …, t18, are the X-ray characterizations of the respective proteins. In the leave-one-out method, first one of the vectors is left out as a test vector, then a NN is trained using the remaining 17 vectors and target values, and finally the resulting NN is tested with the removed vector.
For training the NN with 17 vectors, we used a cross-validation approach: we trained 17 different NNs, all starting with the same initial weight values, each time leaving a different vector out for validation and training with the remaining 16 vectors. After training, each network is tested with its validation vector and the rms error is calculated for the given set of initial values. This procedure is repeated until the rms error falls below an acceptable level. Finally, the weight values of the NN having the smallest validation error are taken as the initial values of a NN, and this network is trained with all of the 17 vectors.
The training algorithm can be summarized as
follows:
1. initialize a NN with random weights;
2. train the NN using 16 training vectors leaving one
vector out for validation;
3. test the NN with the removed 17th vector;
4. repeat steps 2 and 3 for all of the 17 training
vectors;
5. calculate the rms error of prediction;
6. repeat steps 1 to 5 for different initial weights until
rms error is below an acceptable level;
7. of the 17 networks trained, select the one with the smallest prediction error;
8. use weights obtained in step 7 as the initial weights
of a NN and train this network using all of the 17
vectors.
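The eight steps above can be sketched as follows. This is an illustrative Python version using synthetic stand-in data and scikit-learn's MLPRegressor in place of the paper's MATLAB networks; note that MLPRegressor is trained here with L-BFGS rather than resilient backpropagation, step 6 is approximated by a fixed list of seeds instead of repeating until an error threshold is met, and step 8 restarts from the winning random initialization rather than warm-starting from the trained weights. The data, target function and seeds are our assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: 18 "spectra", each reduced to 13 DCT
# coefficients, with a made-up fractional target (e.g. % helix).
X = rng.normal(size=(18, 13))
y = 0.5 + 0.05 * X[:, 0]

def make_net(seed):
    # 13 inputs -> 3 hidden neurons -> 1 output, as in the paper.
    return MLPRegressor(hidden_layer_sizes=(3,), solver='lbfgs',
                        max_iter=500, random_state=seed)

def train_with_inner_validation(X_tr, y_tr, seeds=(0, 1)):
    """Steps 1-7: for each random initialization, run an inner
    leave-one-out pass over the 17 training vectors and keep the
    initialization with the smallest validation rms error."""
    best_seed, best_rms = None, np.inf
    for seed in seeds:
        sq_errs = []
        for j in range(len(X_tr)):
            mask = np.arange(len(X_tr)) != j
            net = make_net(seed).fit(X_tr[mask], y_tr[mask])
            sq_errs.append((net.predict(X_tr[~mask])[0] - y_tr[j]) ** 2)
        rms = np.sqrt(np.mean(sq_errs))
        if rms < best_rms:
            best_seed, best_rms = seed, rms
    # Step 8 (approximated): retrain from the winning initialization
    # on all 17 training vectors.
    return make_net(best_seed).fit(X_tr, y_tr)

# Outer leave-one-out over all 18 spectra.
preds = []
for i in range(len(X)):
    mask = np.arange(len(X)) != i
    model = train_with_inner_validation(X[mask], y[mask])
    preds.append(model.predict(X[~mask])[0])
rms = np.sqrt(np.mean((np.array(preds) - y) ** 2))
print(f"leave-one-out rms error: {rms:.3f}")
```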
After the training is finished, the resulting NN is tested using the left-out 18th spectrum. To test the performance of the method, the above stated procedure is repeated for each spectrum and the rms error of prediction is calculated.
To further improve the prediction accuracy we have
generated a jury of 10 networks for each prediction
and averaged the predicted values. Such an approach
to improve prediction accuracy is known as decision
fusion or statistical combining of multiple NN
outputs. The reason for the improvement can be explained as follows: the error terms of networks that have fallen into different local minima will not be strongly
Table 1
Neural network prediction results

                          % α-Helix                  % β-Sheet                  % Turns
                          X-ray  Lee et  Present    X-ray  Lee et  Present    X-ray  Lee et  Present
                                 al.     study             al.     study             al.     study
Alcohol dehydrogenase     29     26      21         40     40      38         19     19      19
Trypsin inhibitor         26     46      38         45     26      33         16     13      19
Carbonic anhydrase        16      6      10         45     54      57         25     23      22
Concanavalin A             3     13       8         65     57      59         22     28      24
Chymotrypsinogen          12     20      18         49     45      44         23     24      21
Chymotrypsin              11     13      15         50     50      48         25     22      23
Cytochrome c              49     53      59         11     17      14         22     17      14
Elastase                  10     11      13         46     51      53         28     21      21
Hemoglobin                86     76      75          0     16       5          8      4      13
Insulin                   61     53      50         15     20      19         12     14      16
Lysozyme                  46     54      60         19      7      10         23     19      13
Myoglobin                 88     94      90          0      2       1         13      7       4
Nuclease                  26     32      16         37     36      42         23     22      22
Prealbumin                 6     11       8         61     40      55         19     25      23
Papain                    28     23      30         29     34      31         18     22      20
Protease                  11     13      17         57     49      48         18     24      19
Ribonuclease A            23     19      19         46     49      47         21     16      22
Ribonuclease S            23     18      17         53     49      47         15     21      23
rms error                  –      7.8     7.7        –      9.7     6.4        –      4.3     4.8
correlated. Therefore, averaging will help reduce the error variance [11–13].
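The variance reduction from averaging weakly correlated errors can be demonstrated numerically. The sketch below draws error terms for a jury of 10 networks from an equicorrelated Gaussian model; the correlation value 0.2 and unit variance are our assumptions for illustration, not measurements from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Error terms of a jury of 10 networks: unit-variance Gaussians with
# pairwise correlation rho (weakly correlated, as argued in the text).
n_nets, n_trials, rho = 10, 100_000, 0.2
cov = np.full((n_nets, n_nets), rho) + (1.0 - rho) * np.eye(n_nets)
errors = rng.multivariate_normal(np.zeros(n_nets), cov, size=n_trials)

single_var = errors[:, 0].var()          # one network alone
jury_var = errors.mean(axis=1).var()     # jury average of 10 networks
print(f"single-net error variance: {single_var:.3f}")
print(f"10-net jury error variance: {jury_var:.3f}")
```

For equicorrelated errors the jury variance tends to rho + (1 - rho)/n = 0.2 + 0.8/10 = 0.28, versus 1.0 for a single network; fully correlated errors (rho = 1) would gain nothing, which is why falling into different local minima matters.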
3. Results and discussion
Using the method described above, NNs were trained using the MATLAB Neural Network Toolbox. The resilient backpropagation algorithm was used during training. The networks had 13 inputs, one hidden layer, and an output layer with one neuron. It was found that three neurons in the hidden layer resulted in smaller prediction errors. The standard sigmoid transfer function has been modified to avoid the problem caused by target values that are zero or close to zero. For the prediction of each structural component a separate neural network was used. NNs with three outputs did not perform as well as networks with a single output.
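The paper does not spell out how the sigmoid was modified. One common fix, shown here purely as an illustration and not necessarily the authors' choice, is to stretch the logistic function so its range extends slightly beyond [0, 1]; a target of exactly zero (e.g. 0% β-sheet for hemoglobin in Table 1) then has a finite preimage instead of sitting at an asymptote. The eps value below is our assumption.

```python
import numpy as np

def stretched_sigmoid(x, eps=0.1):
    """Logistic function rescaled to output in (-eps, 1 + eps), so a
    target of exactly 0 or 1 can be reached at a finite input instead
    of only as x tends to +/- infinity."""
    return (1.0 + 2.0 * eps) / (1.0 + np.exp(-x)) - eps
```

With eps = 0.1 the output 0 is attained at the finite input x = -ln(11), whereas the standard sigmoid would force the weighted input toward minus infinity for the same target.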
The training parameters of the NNs were chosen as:

sum squared error goal sse = 0.025;
initial weight change delta0 = 0.1 (0.005 for training in step 8);
maximum weight change deltamax = 0.1 (0.005 for training in step 8);
increment to weight change delta_inc = 1.2;
decrement to weight change delta_dec = 0.5;
minimum gradient min_grad = 0.
The maximum number of epochs was chosen as 2000, although this number was never reached. The predictions obtained are presented in Table 1. The table also includes comparisons with the previous results obtained by Lee et al. [3] using the factor analysis method for the same proteins. Factor analysis is one of the methods used for the determination of protein secondary structure from the FTIR spectra of proteins.
As seen from the table, the estimation accuracies obtained are as good as, and sometimes better than, those reported by Lee et al. [3]. The rms errors for % α-helix, % β-sheet and % turns in our case are 7.7, 6.4, and 4.8, respectively, as compared to 7.8, 9.7 and 4.3 for the factor analysis results of Lee et al. [3]. Lee et al., using the same factor analysis approach but leaving out trypsin inhibitor from the calibration set, obtained better estimates only for α-helix (3.9). The rms values for the β-sheet and turn structures (8.3 and 6.6, respectively) are not better than those obtained in this study. Since our objective is to develop an algorithm that can be utilized for a broad class of proteins, we have decided to compare the performance of the NN method with the factor analysis approach for the case where all available protein spectra, including trypsin inhibitor, are included in the calibration set. Under such conditions, the NN approach shows better prediction for α-helix and β-sheet structures. The precise reason for the differences between the two methods is as yet unclear.
The factor analysis method has several advantages over the other methods. It does not use deconvolution, which sometimes may generate negative side lobes on the absorption bands or artificial bands due to noise. Manipulation of the spectrum is kept to a minimum and no curve fitting is necessary. In addition, no amide I band components need to be assigned. All of these advantages carry over to the present NN method. Due to the unavailability of a large data set, the number of inputs of the NN used in the present work had to be kept small. For this reason, a data reduction method, in particular the DCT, has been utilized. Although the DCT compacts most of the information present in each spectrum into a small number of coefficients, it does not preserve local properties of the spectrum, as is also the case with the discrete Fourier transform. Therefore, from each DCT coefficient it would not be possible to get any information on the localisation of spectral peaks. The NN makes use of the DCT coefficients solely for pattern recognition purposes. However, when the size of the data set becomes large enough, other transforms such as wavelets or the Gabor transform, which can also give information about the local behavior of the spectra, could be used.
4. Conclusion
In the present study we have developed an accurate
infrared method using NNs for determining the
secondary structure of proteins in water in terms of
% α-helix, % β-sheet, and % turns. The target values
of the NN outputs are taken as the known structure
values obtained from X-ray crystallography. The
effectiveness of the method is tested using leave-
one-out strategy. Using this procedure we have
obtained results that are in some cases as good as, if
not better than, those previously reported in the
literature. Work is in progress to further improve the accuracy of our method, which should be significantly enhanced once the number of FTIR spectra of proteins in the reference set is increased.
Acknowledgements
This work has been supported by British Council
Academic Link Program and State Planning Organi-
zation of Turkey (project number: DPT98K112530/
AFP98010805).
References
[1] D.M. Byler, H. Susi, Biopolymers 25 (1986) 469.
[2] F. Dousseau, M. Pézolet, Biochemistry 29 (1990) 8771.
[3] D.C. Lee, P.I. Haris, D. Chapman, R.C. Mitchell, Biochemistry 29 (1990) 9185.
[4] P. Pancoska, J. Kubelka, T.A. Keiderling, Appl. Spectrosc. 53 (1999) 655.
[5] W.K. Surewicz, H.H. Mantsch, D. Chapman, Biochemistry 32 (1993) 389.
[6] M. Riedmiller, H. Braun, in: Proceedings of the IEEE International Conference on Neural Networks, San Francisco, 1993, pp. 586–591.
[7] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, Englewood Cliffs, NJ, 1994, pp. 176–179.
[8] E.B. Baum, D. Haussler, Neural Comput. 1 (1989) 151.
[9] R. Lange, R. Maenner, in: Proceedings of the International Conference on Artificial Neural Networks, London, 1994, pp. 497–500.
[10] A.K. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1986, pp. 150–154.
[11] K. Tumer, J. Ghosh, in: A. Sharkey (Ed.), Combining Artificial Neural Networks, Springer, Berlin, 1999, pp. 127–162.
[12] J.A. Benediktsson, J.R. Sveinsson, O.K. Ersoy, P.H. Swain, IEEE Trans. Neural Networks 8 (1997) 54.
[13] M.P. Perrone, L.N. Cooper, in: R.J. Mammone (Ed.), Artificial Neural Networks for Speech and Vision, Chapman & Hall, London, 1993, pp. 126–141.