www.elsevier.com/locate/yabio
Analytical Biochemistry 332 (2004) 238–244
ANALYTICAL
BIOCHEMISTRY
Using artificially generated spectral data to improve protein secondary structure prediction from Fourier
transform infrared spectra of proteins
Mete Severcan a,*, Parvez I. Haris b, Feride Severcan c
a Department of Electrical and Electronics Engineering, Middle East Technical University, Ankara 06531, Turkey
b Faculty of Health and Life Sciences, De Montfort University, The Gateway, Leicester, LE1 9BH, UK
c Department of Biology, Middle East Technical University, Ankara 06531, Turkey
Received 8 February 2004
Available online 28 July 2004
Abstract
Secondary structures of proteins have been predicted using neural networks from their Fourier transform infrared spectra. To
improve the generalization ability of the neural networks, the training data set has been artificially increased by linear interpolation.
The leave-one-out approach has been used to demonstrate the applicability of the method. Bayesian regularization has been used to
train the neural networks and the predictions have been further improved by the maximum-likelihood estimation method. The
networks have been tested and standard error of prediction (SEP) of 4.19% for α helix, 3.49% for β sheet, and 3.15% for turns have been
achieved. The results indicate that there is a significant decrease in the SEP for each type of structure parameter compared to
previous works.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Protein; Secondary structure prediction; Neural networks; FTIR; Spectroscopy
0003-2697/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.ab.2004.06.030
* Corresponding author. Fax: +90-312-210-1261.
E-mail address: [email protected] (M. Severcan).
¹ Abbreviations used: FTIR, Fourier transform infrared; NN, neural network; SEP, standard error of prediction;
DCT, discrete cosine transform; pdf, probability density function.
Fourier transform infrared (FTIR)¹ spectroscopy is being increasingly used for investigating protein
structure and stability (for a recent review, see [1]). FTIR spectra of proteins contain valuable
information about the structure of proteins. Different conformational types such as helix, sheet,
turns, etc. result in different absorption bands which are usually broad and overlapping.
Unfortunately, there are still problems with accurate quantification of protein secondary structure
from FTIR spectra. Various techniques such as curve fitting [2], partial least squares analysis [3],
factor analysis [4], and neural networks [5] have been used to predict secondary structure of
proteins from their infrared spectra. All of these techniques have varying degrees of advantages
and disadvantages. In a previous study [6], we proposed a neural network (NN) method that is able
to provide predictions better than previously used methods. In that work, we searched for a neural
network model that performs best for all the validation vectors taken one at a time from a
collection of training vectors.
Recently, one of us [7,8] proposed two new NN im-
plementations which provide improved results. In the
first study, it was revealed that providing the neural net-
work analysis with only part of the amide I region from
empirically determined structure-sensitive regions in
combination with appropriate preprocessing of the spectral
data produced better results. This led to an SEP of 4.47% for α helix, an SEP of 6.16% for β sheet, and an
SEP of 4.61% for turns. The second study incorporated
M. Severcan et al. / Analytical Biochemistry 332 (2004) 238–244 239
an automatic amide I frequency selection procedure utilizing a genetic algorithm. The corresponding
SEP values for this method are 4.58, 5.72, and 4.42%, respectively. The highly limited data set
available is the main problem in obtaining a NN with good generalization properties and small
prediction errors. In this study, we extend our previous work by increasing the size of the training
data set artificially, by using Bayesian regularization to train the neural networks, which
minimizes a combination of squared errors and weights, and by using maximum-likelihood estimation
to improve the predictions further. The SEP values obtained in this case are 4.19% for α helix,
3.49% for β sheet, and 3.15% for turns. After presenting the method and the results, we discuss
the advantages and limitations of the NN methods for predictions of secondary structure of proteins
from FTIR spectra.
Materials and methods
Data set
We have used the same data set that we used previ-
ously, taken from the work of Lee et al. [4] who used
them to test their factor analysis method. The data set
contains FTIR spectra of 18 water-soluble proteins recorded
in water. The secondary structures of these proteins
are known from X-ray crystallography. The
details about the sources of the proteins, sample preparations for infrared spectroscopy, etc. can be found in
Lee et al. [4]. Details about FTIR spectroscopic analysis
of citrate synthase have been reported by Severcan and
Haris [9].
Neural network
A NN is a nonlinear parameterized mapping from an input vector x to an output y = g(x; w), where
the parameter vector w consists of the connection weights of the NN. It can be trained to perform
regression and classification tasks. The weights w are adjusted during training in such a way as to
minimize a cost function which is a function of the error e = t − g(p; w), where t is the known
target value for a given training input vector x = p.
In this work, target values are the structure parameters obtained from X-ray crystallography while
the input vectors are the infrared absorption values measured at 101 points from 1600 to
1700 cm⁻¹, namely the amide I band. The FTIR spectral shape within the amide I band is mainly
determined by the secondary structure content and the contribution to the prediction of structure
parameters of the other bands outside the amide I band is not significant [7]. The highly limited size
of the training data set constitutes the most important
problem encountered in this work. FTIR spectra of only
18 proteins and their corresponding secondary structure
parameters are currently available to us. Using the
leave-one-out method for testing the performance of
the NN model, only at most 17 sets of data are available
for training the NN. This constraint forces the use of a NN with a few hidden units and a restricted number of
connection weights. Therefore, a data reduction tech-
nique should be applied to the spectral data. For this
purpose we have employed discrete cosine transform
(DCT) [10]. DCT is a good approximation to the Karhunen–Loève
transform by which it is possible to represent
a signal with the least number of coefficients for a
specified amount of error in a statistical sense. The magnitude of a DCT coefficient is an indication of the content
of a particular frequency in the signal. Features of
the signal with sharp variations require the inclusion
of higher frequencies in the DCT representation, imply-
ing more DCT coefficients. In this work we have taken
the DCT of the FTIR spectra and determined the number
of coefficients to be used by optimizing the NN. Depending
on the protein structure this number varies between 10 and 20 in our case. The limited size of the
training data set also raises the problem of generaliza-
tion: a neural network is expected to make good predic-
tions when data not present in the training set are used
as an input. To improve the generalization ability, the
algorithm that we have developed uses (1) an artificially
increased data set, (2) training with Bayesian regularization,
which minimizes a combination of squared errors and weights and determines the correct combination to
produce a NN which generalizes well, and (3) maxi-
mum-likelihood estimation to determine the most likely
prediction from the conditional histogram of predic-
tions, as described below.
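The data-reduction step described above can be illustrated with a short Python sketch. It computes an orthonormal DCT-II of a spectrum sampled at 101 points and keeps only the first L coefficients as NN inputs. The function names, the synthetic Gaussian band, and the choice of 13 coefficients are illustrative assumptions; the DCT scaling used in the original MATLAB implementation is not stated in the text.

```python
import math

def dct_ii_orthonormal(signal):
    """Return the orthonormal DCT-II coefficients of a 1-D sequence."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        # Orthonormal scaling: sqrt(1/n) for k = 0, sqrt(2/n) otherwise.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi * (m + 0.5) * k / n)
                    for m, x in enumerate(signal))
        coeffs.append(scale * total)
    return coeffs

def dct_features(spectrum, n_coeffs):
    """Keep only the first n_coeffs (lowest-frequency) DCT coefficients."""
    return dct_ii_orthonormal(spectrum)[:n_coeffs]

# Synthetic stand-in for an amide I band: 101 points, 1600..1700 cm^-1.
# This is NOT measured data; it only gives the functions something to act on.
wavenumbers = [1600 + i for i in range(101)]
spectrum = [math.exp(-((v - 1650.0) / 20.0) ** 2) for v in wavenumbers]

features = dct_features(spectrum, 13)  # 13 inputs is an arbitrary example
print(len(features))
```

Because the transform is orthonormal, the energy of the spectrum is preserved across the full set of coefficients, so truncating to the first L coefficients discards only the high-frequency part of the band shape.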
We have modeled each structure parameter by a different neural network. In the preprocessing step,
first the DCTs of the normalized spectra of K proteins (K = 18) are obtained. Then, the first L
coefficients of DCT are retained, forming length-L training vectors p_i, i = 1, ..., K. The target
values t_i, i = 1, ..., K, are the corresponding known structure parameters. Fig. 1 illustrates the
basic steps of the NN training and testing algorithm. To test the performance of the method the
leave-one-out method has been used, removing one element of the protein data in turn, as
illustrated by the parallel paths in Fig. 1. After removing one element of the protein data, the
size of the data set is artificially increased by averaging p_i with p_{i+1}, p_i with p_{i+2},
p_i with p_{i+3}, and the corresponding target values t_i with t_{i+1}, t_i with t_{i+2}, t_i with
t_{i+3}, in a cyclic manner. In this way the number of training vectors is increased from 17 to 68
(the 17 original vectors plus 3 × 17 averaged ones). Then a NN is trained using a Bayesian
regularization back-propagation algorithm [11,12]. The trained NN is tested with the left-out
protein data and a prediction is obtained. Since the training algorithm starts with random initial
weights, this prediction is a random variable, and therefore a maximum-likelihood estimate of this
variable can be obtained from the most likely value of the conditional probability density function
(pdf) of the prediction conditioned on the target value being estimated. The conditional pdf can be
approximated numerically from the histogram of the predictions. For this purpose, the training step
is repeated with different initial weights a large number of times (200 in our case), and a
histogram is obtained. The most likely value of this conditional histogram, i.e., the prediction
where the histogram is maximum, gives the maximum-likelihood estimate. In the final step, the
standard error of prediction, defined as the square root of the mean square value of the prediction
errors, is computed.
Fig. 1. Basic training algorithm.
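The augmentation and histogram steps described above can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB code: the function names are hypothetical, the toy vectors are made up, and the histogram bin width is an assumption, since the text does not state how the histogram was binned.

```python
import math

def augment_by_cyclic_averaging(vectors, targets, offsets=(1, 2, 3)):
    """Append the average of item i with items i+1, i+2, i+3 (indices wrap around)."""
    n = len(vectors)
    aug_vectors = [list(v) for v in vectors]
    aug_targets = list(targets)
    for k in offsets:
        for i in range(n):
            j = (i + k) % n  # cyclic neighbour
            aug_vectors.append([(a + b) / 2.0
                                for a, b in zip(vectors[i], vectors[j])])
            aug_targets.append((targets[i] + targets[j]) / 2.0)
    return aug_vectors, aug_targets

def histogram_mode(predictions, bin_width=1.0):
    """Centre of the most populated bin; ties go to the lowest bin.

    bin_width is an assumed parameter, not a value taken from the paper.
    """
    counts = {}
    for p in predictions:
        b = math.floor(p / bin_width)
        counts[b] = counts.get(b, 0) + 1
    best_bin = max(sorted(counts), key=lambda b: counts[b])
    return (best_bin + 0.5) * bin_width

# Toy data: 17 training vectors remain after one protein is left out.
toy_vectors = [[float(i), float(2 * i)] for i in range(17)]
toy_targets = [float(i) for i in range(17)]
aug_x, aug_t = augment_by_cyclic_averaging(toy_vectors, toy_targets)
print(len(aug_x))  # prints 68

# Toy predictions from repeated training runs (hypothetical values).
print(histogram_mode([24.1, 25.2, 25.4, 25.9, 27.3]))  # prints 25.5
```

In practice the list passed to `histogram_mode` would hold the 200 predictions obtained from retraining with different random initial weights.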
The SEP of the training algorithm described above
depends on the structure of the neural network and its
parameters. The number of inputs, i.e., the number of
DCT coefficients used, the number of units in the hidden
layer, and the sum squared error (performance goal) of
the NN have to be optimized. Since the spectral shapes are highly dependent on the contents of different protein
secondary structures it is expected that NN models for
different secondary structures will have different numbers
of inputs and different numbers of hidden units. For example,
sharp peaks due to β sheet structure can better be
described by higher frequency components in the DCT
expansion. Consequently, the optimized network has more inputs for the case of β sheet content prediction.
This fact explains why lower SEP values can be obtained
when different NN models are designed for different sec-
ondary structures.
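One way to picture the optimization described above is a search over candidate input counts, scoring each candidate by its leave-one-out SEP. The sketch below is an assumption-laden illustration: a simple nearest-neighbour predictor stands in for the trained NN, the data are invented, and the function names are hypothetical.

```python
import math

def leave_one_out_sep(vectors, targets, fit_predict):
    """Leave each sample out in turn, predict it, and return the SEP."""
    errors = []
    for i in range(len(vectors)):
        train_x = vectors[:i] + vectors[i + 1:]
        train_t = targets[:i] + targets[i + 1:]
        errors.append(targets[i] - fit_predict(train_x, train_t, vectors[i]))
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def nearest_neighbour(train_x, train_t, test_x):
    """Stand-in model (NOT a neural network): target of the closest training vector."""
    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, test_x))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i]))
    return train_t[best]

def select_input_count(vectors, targets, candidates, fit_predict):
    """Score each candidate number of inputs and return the one with the lowest SEP."""
    scores = {}
    for n_inputs in candidates:
        truncated = [v[:n_inputs] for v in vectors]
        scores[n_inputs] = leave_one_out_sep(truncated, targets, fit_predict)
    return min(scores, key=scores.get), scores

# Invented data: the first coordinate is informative, the second is a distractor.
toy_x = [[0.0, 5.0], [1.0, 0.0], [2.0, 5.1], [3.0, 0.1]]
toy_t = [0.0, 1.0, 2.0, 3.0]
best, scores = select_input_count(toy_x, toy_t, [1, 2], nearest_neighbour)
print(best, scores)  # prints 1 {1: 1.0, 2: 2.0}
```

The same loop could be nested over the number of hidden units when `fit_predict` wraps an actual network trainer.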
Results and discussion
The algorithm described above has been implemented
using the Neural Network toolbox of MATLAB. For
each structure parameter a different neural network
model has been realized. The network parameters have
been optimized to yield the minimum SEP. As an example, for the case of the NN modeling the β sheet
structure, the minimum value of SEP was obtained for 19 DCT coefficients and 2 hidden neurons, as
shown in Fig. 2. Optimization results for α helix have yielded 12 DCT coefficients and 2 hidden
units, while for the NN modeling of turns structure 11 DCT coefficients and 3 hidden units have
been obtained. Table 1 presents the results together with our previous results for comparison
purposes. As can be seen from this table, there is a considerable improvement in prediction
accuracy for all of the three types of structures. The prediction error has dropped from 7.7 to
4.19% for α helix, from 6.4 to 3.49% for β sheet, and from 4.8 to 3.15% for turns.
Fig. 2. Variation of SEP as a function of number of DCT coefficients used as input, with number of
hidden neurons as a parameter.
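As a worked check of the SEP definition, the short Python sketch below recomputes the α helix SEP from the X-ray and present-study values listed in Table 1. The numbers are copied from that table; the function name is hypothetical, and dividing by the number of proteins (18) is the convention assumed here.

```python
import math

# % alpha helix from X-ray crystallography (Table 1, 18 proteins, table order).
XRAY_HELIX = [29, 26, 16, 3, 12, 11, 49, 10, 86,
              61, 46, 88, 26, 6, 28, 11, 23, 23]
# % alpha helix predicted in the present study (Table 1, same order).
PREDICTED_HELIX = [25.41, 27.19, 8.70, 7.48, 10.69, 14.47, 52.18, 12.04, 76.36,
                   61.32, 48.15, 88.30, 20.71, 12.92, 27.26, 8.65, 26.84, 20.59]

def standard_error_of_prediction(targets, predictions):
    """Square root of the mean squared prediction error."""
    errors = [t - p for t, p in zip(targets, predictions)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

sep_helix = standard_error_of_prediction(XRAY_HELIX, PREDICTED_HELIX)
print(f"SEP for alpha helix: {sep_helix:.2f}%")  # prints "SEP for alpha helix: 4.19%"
```

The result agrees with the 4.19% reported for α helix, which suggests the mean is taken over all 18 leave-one-out predictions.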
Table 2 shows a comparison of our results with the
recent NN prediction results of Hering et al. [7,8] in
two different studies, using the same data set. As observed,
there is a significant reduction in prediction error for all three structure parameters, especially for β sheet
and turns contents.
Predicting secondary structure of proteins using neural networks can be considered a
multidimensional curve-fitting problem. The NN training algorithm tries to fit a surface to the
data points (p_i, t_i), where the coordinate p_i is an N-dimensional vector representing the FTIR
spectrum and the value of the surface at p_i is the target value, t_i, representing the known
structure parameter of the ith protein. When the FTIR spectrum of a new protein is provided, the
corresponding vector p is calculated and fed to the NN that simply reads out, or interpolates, the
structure parameter t from the surface that the model represents. The accuracy of the model depends
on how well the surface represents the physical phenomenon behind the FTIR measurement process.
Table 1
Comparison of the standard error of prediction results of this study with the results of the previous study
                       % α Helix                     % β Sheet                     % Turns
                       X-ray     Previous  Present   X-ray     Previous  Present   X-ray     Previous  Present
Alcohol dehydrogenase  29        21        25.41     40        38        41.64     19        19        17.44
Trypsin inhibitor      26        38        27.19     45        33        46.20     16        19        18.94
Carbonic anhydrase     16        10        8.70      45        57        45.87     25        22        26.97
Concanavalin A         3         8         7.48      65        59        62.45     22        24        22.24
Chymotrypsinogen       12        18        10.69     49        44        51.93     23        21        21.52
Chymotrypsin           11        15        14.47     50        48        47.52     25        23        27.87
Cytochrome c           49        59        52.18     11        14        13.78     22        14        21.62
Elastase               10        13        12.04     46        53        51.06     28        21        21.19
Hemoglobin             86        75        76.36     0         5         3.87      8         13        9.02
Insulin                61        50        61.32     15        19        17.42     12        16        16.29
Lysozyme               46        60        48.15     19        10        12.10     23        13        17.27
Myoglobin              88        90        88.30     0         3         0.60      7         13        8.55
Nuclease               26        16        20.71     37        42        42.29     23        22        22.02
Prealbumin             6         8         12.92     61        55        57.05     19        23        18.04
Papain                 28        30        27.26     29        31        35.39     18        20        21.02
Protease               11        17        8.65      57        48        54.58     18        19        20.10
Ribonuclease A         23        19        26.84     46        47        45.79     21        22        16.66
Ribonuclease S         23        17        20.59     53        47        52.79     15        23        19.26
SEP                              7.7       4.19                6.4       3.49                4.8       3.15
One of the factors that affects the accuracy is the number of FTIR spectra available for training
and how well they cover different types of structural classes. With a small number of spectra, if
the NN is overtrained to make the training error very small, the surface might deviate considerably
from the actual surface. Therefore, the prediction error may be large when the network is tested
with a new vector p. This is related to the well-known generalization problem. To improve the
generalization the complexity of the NN should be low enough. It has been shown that the required
training set size for no overfitting and good generalization of a feedforward NN is N � 38nw, where
nw is the number of connection weights [13]. Thus, using the available data set of 18 FTIR spectra,
it should be possible to train NNs with two or three hidden units, one output unit, and 13 inputs,
with good generalization. Based on this assumption, in our previous work we had obtained the
prediction error values mentioned above using resilient back-propagation training and a form of
cross validation. Clearly, a larger set for training data should improve generalization. In the
absence of original data, artificial data generated by averaging different combinations of input
data, as described above, force the multidimensional surface to pass through linearly interpolated
points. This method helps to reduce prediction error and error variance as long as the input data
are reliable. The interpolated data do not introduce any new features to the training set. However,
they help to regularize the network by forcing its response to be smoother and less likely to
overfit. In fact, when we applied the artificially increased data set for training in our earlier
method, we obtained an SEP of 3.9% for α helix; however, the SEP for β sheet increased slightly to
7.0% and that of turns decreased slightly to 4.4%. The almost 50% reduction in the SEP for α helix
clearly demonstrates the advantage of incorporating artificial data. The algorithm that we used in
this work is much simpler and yields much lower SEP on the average. Therefore, we preferred to
present this method. However, it should be stressed that if one of the input data points is in
error (which could simply be due to normalization of the spectra as will be discussed below),
averaging of this data point with other input points may spread the local error to other regions
of the interpolation surface.
Table 2
Comparison of the standard error of prediction results of this study with the results of Hering et al. [7,8]
                       SEP % α Helix   SEP % β Sheet   SEP % Turns
This study             4.19            3.49            3.15
Hering et al. [7]      4.5             6.2             4.6
Hering et al. [8]      4.6             5.7             4.4
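To make the network sizes mentioned above concrete, the following Python sketch counts the adjustable parameters of a one-hidden-layer feedforward network and evaluates its output y = g(x; w). Counting one bias per unit, using tanh hidden units, and using a linear output unit are all assumptions made for illustration; the text does not specify these details, and the sketch does not attempt to evaluate the training-set-size rule from [13].

```python
import math

def count_weights(n_inputs, n_hidden, n_outputs=1, with_biases=True):
    """Number of adjustable parameters in a one-hidden-layer network."""
    extra = 1 if with_biases else 0  # one bias per unit (assumed convention)
    return (n_inputs + extra) * n_hidden + (n_hidden + extra) * n_outputs

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """y = g(x; w) with tanh hidden units and one linear output (assumed form)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(hidden_weights, hidden_biases)]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

for n_hidden in (2, 3):
    # 13 inputs and one output, as in the configuration discussed in the text.
    print(n_hidden, count_weights(13, n_hidden))  # prints "2 31" then "3 46"
```

Even the smallest configuration has more adjustable parameters than there are measured spectra, which is one way to see why regularization and the enlarged training set matter here.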
The second factor affecting the accuracy of the method
is related to the accuracy of the training data. FTIR spectra are plots of log absorption values as a function
of infrared wavelength. The absorption depends not on-
ly on the protein but also on sample thickness and pro-
tein concentrations, parameters which are often difficult
to control. In addition, subtraction of the background
water absorption is usually done by eye, contributing
additional uncertainty to the data. It is obvious from
the multidimensional surface description of the NN that if the input data points, i.e., vectors p_i, are not correctly
positioned the surface will be distorted and the resulting
predictions will be in error. If it were possible to keep
relative levels of the spectra of different proteins at a
constant value for a predefined concentration and sam-
ple thickness, no such problem would arise. However, in
practice this ideal condition is difficult to realize even in
the same laboratory. Normalization of the spectra helps to reduce this problem if the proteins have similar properties.
However, to normalize the spectra usually one
has to scale and shift the spectra to satisfy certain con-
ditions. These operations may change the spectral
shapes of proteins in relation to each other and therefore
cause incorrect training of the NN, hence errors in pre-
diction occur. Typical normalizations made in practice
normalize the area under the amide I band while keeping the absorption value at 1700 cm⁻¹ at a fixed value, normalize
the absorbance maximum swing in the amide I
band to a constant value, and, in the case of DCT, nor-
malize the transform coefficients of the absorbance spec-
tra with the average value, etc. One can increase the
number of different normalizations by including other
parameters, such as statistical moments. Clearly, each
normalization results in a different set of p_i and therefore predictions may be slightly different. We have presented
in this report predictions using a normalization that fix-
es the absorbance maximum of the spectra in the amide
I band constant, which gave better results than other
normalizations. Which type of normalization is best is
an open question.
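Two of the normalizations mentioned above can be sketched in Python as follows. The function names are hypothetical, the fixed value at 1700 cm⁻¹ is taken to be zero as an assumption, and the spectra are assumed to be ordered from 1600 to 1700 cm⁻¹ so that the last sample lies at 1700 cm⁻¹.

```python
def normalize_peak(spectrum, peak_value=1.0):
    """Scale the spectrum so its maximum absorbance equals peak_value."""
    top = max(spectrum)
    if top <= 0:
        raise ValueError("spectrum has no positive absorbance")
    return [peak_value * a / top for a in spectrum]

def normalize_area(spectrum, step=1.0, area=1.0):
    """Shift so the last point is zero, then scale to a fixed trapezoidal area."""
    shifted = [a - spectrum[-1] for a in spectrum]
    current = step * (sum(shifted) - 0.5 * (shifted[0] + shifted[-1]))
    if current <= 0:
        raise ValueError("band area must be positive after shifting")
    return [area * a / current for a in shifted]

# Invented example: the same band measured with and without a baseline offset.
band = [0.2, 0.6, 1.0, 0.6, 0.2]
offset_band = [a + 0.2 for a in band]
print(normalize_peak(band)[0])         # 0.2
print(normalize_peak(offset_band)[0])  # about 0.333
```

The two peak-normalized results differ at the band edge even though the underlying band is identical, which illustrates how scaling and shifting choices can change spectral shapes relative to each other.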
Table 3 shows comparison of the absolute value of
the prediction errors of the current method with those of the previous method and the methods used by Lee
et al. [4] and Hering et al. [7]. The average absolute
error and the standard deviation of absolute error
have decreased by almost 50% partly due to the
smoothing effect of the newly generated artificial data
and partly due to the training algorithm used. On the
other hand, although the correlation of our previous
work with that of Hering et al. was 0.8, as was reported by Hering, the current errors have much less correlation.
This can be expected due to the considerable
decrease in the SEP of the present method. However,
the maximum absolute error for the current work oc-
curs for lysozyme and this agrees with the previous
work and with the results of Hering et al. [7]. Lyso-
zyme also has the second largest absolute error in
the results of Lee et al. [4]. Similarly, the average absolute error for hemoglobin is also large and it agrees with
other results. At the opposite end, average absolute er-
rors for trypsin inhibitor and myoglobin have dropped
dramatically.
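The per-protein entries of Table 3 appear to be the absolute prediction errors averaged over the three structure types. The Python sketch below checks this reading for two proteins using the X-ray and predicted values from Table 1; the function name is hypothetical, and the interpretation is an inference from the numbers rather than a statement in the text.

```python
def average_absolute_error(xray, predicted):
    """Mean absolute error over the (helix, sheet, turns) triple of one protein."""
    return sum(abs(x - p) for x, p in zip(xray, predicted)) / len(xray)

# (alpha helix, beta sheet, turns) percentages taken from Table 1.
alcohol_dehydrogenase = average_absolute_error((29, 40, 19), (25.41, 41.64, 17.44))
lysozyme_current = average_absolute_error((46, 19, 23), (48.15, 12.10, 17.27))
lysozyme_previous = average_absolute_error((46, 19, 23), (60, 10, 13))

print(round(alcohol_dehydrogenase, 2))  # prints 2.26
print(round(lysozyme_current, 2))       # prints 4.93
print(round(lysozyme_previous, 2))      # prints 11.0
```

These values match the corresponding Table 3 entries (2.26, 4.93, and 11), including the observation that lysozyme carries the largest error in both the current and previous methods.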
To test the effectiveness of the method in predicting
secondary structure content of proteins outside the ori-
ginal 18-protein data set, we trained NNs using all of
the 18 proteins and tested these networks by applying the FTIR spectrum of pig citrate synthase as input.
Two sets of NNs were obtained, one set for the extended
training data set and the other for the original data set.
Each NN was separately optimized. The SEP values for
the NNs trained with the original data set were 7.03%
for α helix, 12.00% for β sheet, and 4.72% for turn structures.
The corresponding values for the NNs with the
extended data set were 4.19, 3.49, and 3.15%, respectively. Table 4 shows the secondary structure contents as determined
by X-ray crystallography and the
corresponding NN predictions. The absolute errors are
within or close to the SEPs obtained. Average absolute
error of structure parameters for NNs trained by the ori-
ginal data set alone is considerably larger than that for
NNs trained by the extended data set. However, this result
should not imply that the prediction errors for any newly introduced protein will always be within SEP.
SEP is a statistical measure of confidence of the NN.
Table 3
Comparison of the average absolute prediction errors of the current method with the previous method and the methods of Lee et al. and Hering et al.
                         Current method   Previous method  Lee et al.       Hering et al.
Alcohol dehydrogenase    2.26             3.33             1                2
Trypsin inhibitor        1.78             9                14               5.01
Carbonic anhydrase       3.38             7                7                7.28
Concanavalin A           2.42             4.33             8                1.87
Chymotrypsinogen         1.91             4.33             4.33             3.41
Chymotrypsin             2.94             2.67             1.67             2.2
Cytochrome c             2.11             7                5                6.43
Elastase                 4.64             5.67             4.33             5.49
Hemoglobin               4.84             7                10               3.46
Insulin                  2.34             6.33             5                4.42
Lysozyme                 4.93             11               8                8.68
Myoglobin                0.81             3.67             6.67             2.36
Nuclease                 3.85             5.33             2.67             3.99
Prealbumin               3.94             4                10.67            2.93
Papain                   3.38             2                4.67             4.21
Protease                 2.29             5.33             5.33             5.74
Ribonuclease A           2.80             2                4                1.29
Ribonuclease S           2.29             6.67             5                6
Average                  2.94             5.37             5.96             4.27
Standard deviation       1.14             2.38             3.26             2.05
Correlation coefficient                   0.28             0.10             0.31
Table 4
Secondary structure predictions for pig citrate synthase, a protein outside the original training set
                                        % α Helix   % β Sheet   % Turns
X-ray crystallography                   56.5        1.4         16.5
NN prediction with extended data set    52.3        3.9         19.1
Absolute error with extended data set   4.2         2.5         2.6
NN prediction with original data set    54.0        14.0        14.4
Absolute error with original data set   2.5         12.6        2.1
Prediction errors larger than SEP are possible although
their probability of occurrence becomes smaller as the prediction error increases.
The significant reduction in the error of prediction as
presented in our study must be viewed with caution. This
is because, in addition to the problems highlighted earli-
er, there are other factors that may influence the accuracy
of the results. Analysis of secondary structure content by
FTIR spectroscopy, using methods such as that used by
us, depends on protein reference spectra whose secondary structures have been deduced from X-ray crystallography,
which is a direct method for determining complete
three-dimensional structure of proteins. However, there
is more than one way of defining the beginning and ending
of certain secondary structural elements from X-ray
data. Various methods such as those reported by Levitt
and Greer [14], and Kabsch and Sander [15] are used
to quantify the contents of different secondary structures from the X-ray data. Levitt and Greer's method is based
on dihedral angles and bond distances associated with
the a carbons in conjunction with hydrogen bonding.
In contrast, Kabsch and Sander's method is based on hydrogen-bonding
patterns. Often the values of secondary
structures deduced by these methods differ significantly
for the same protein. This results in another possible source of error in secondary structure predictions using
FTIR spectroscopy. Furthermore, there are specific lim-
itations associated with the various methodologies used
for predicting secondary structure from FTIR spectra
such as curve fitting and multivariate and neural network
analyses. With multivariate and neural network ap-
proaches, the number and diversity of protein FTIR
spectra used in the reference set will also influence the quality of predictions. With curve-fitting techniques, assignment
of peaks can often be a serious problem in accurate
secondary structure estimations. Therefore, one
has to be aware of how these will have an impact on the
error associated with the estimation of secondary structures
from FTIR spectra.
In summary, results presented in this paper demonstrate
that in the absence of sufficient training data, artificially generated data help to increase the prediction
accuracy. However, the ideal solution to the problem
is to sufficiently enrich the reference set by including ori-
ginal spectra of proteins covering the diversity of protein
structural classes.
Acknowledgments
This work has been supported by the Royal Society
of UK through a collaborative travel grant and by the
Scientific and Technical Research Council of Turkey
(Project No. 100E036).
References
[1] P.I. Haris, F. Severcan, FTIR spectroscopic characterization of
protein structure in aqueous and non-aqueous media, J. Mol.
Catal. B-Enzymatic 7 (1999) 207–221.
[2] D.M. Byler, H. Susi, Examination of the secondary structure of
proteins by deconvolved FTIR spectra, Biopolymers 25 (1986)
469–487.
[3] F. Dousseau, M. Pezolet, Determination of the secondary
structure content of proteins in aqueous solutions from their
Amide-I and Amide-II infrared bands: comparison between
classical and partial least-squares methods, Biochemistry 29
(1990) 8771–8779.
[4] D.C. Lee, P.I. Haris, D. Chapman, C.R. Mitchell, Determination
of protein secondary structure using factor-analysis of infrared
spectra, Biochemistry 29 (1990) 9185–9193.
[5] P. Pancoska, V. Janota, T.A. Keiderling, Novel matrix descriptor
for secondary structure segments in proteins: demonstration of
predictability from circular dichroism spectra, Anal. Biochem. 267
(1999) 72–83.
[6] M. Severcan, F. Severcan, P.I. Haris, Estimation of protein
secondary structure from FTIR spectra using neural networks, J.
Mol. Struct. 565-566 (2001) 383–387.
[7] J.A. Hering, P.R. Innocent, P.I. Haris, An alternative method for
rapid quantification of protein secondary structure from FTIR
spectra using neural networks, Spectroscopy 16 (2002) 53–69.
[8] J.A. Hering, P.R. Innocent, P.I. Haris, Automatic amide I frequency
selection for rapid quantification of protein secondary structure
from Fourier transform infrared spectra of proteins, Proteomics 2
(2002) 839–849.
[9] F. Severcan, P.I. Haris, Fourier transform infrared spectroscopy
suggests unfolding of loop structures precedes complete unfolding
of pig citrate synthase, Biopolymers 69 (2003) 440–447.
[10] A.K. Jain, Fundamentals of Digital Image Processing, Prentice-
Hall, Englewood Cliffs, 1986 pp. 150–154.
[11] D.J.C. MacKay, Bayesian interpolation, Neural Comput. 4 (1992)
415–447.
[12] F.D. Foresee, M.T. Hagan, Gauss-Newton approximation to
Bayesian learning, in: Proceedings of the International Joint
Conference on Neural Networks, 1997.
[13] R. Lange, R. Maenner, Quantifying a critical training set size for
generalization and overfitting using teacher neural networks, in:
M. Marinaro, P. Morasso (Eds.), Proceedings of the International
Conference on Artificial Neural Networks (ICANN), London,
GB 1, 1994 pp. 497–500.
[14] M. Levitt, J. Greer, Automatic identification of secondary
structure in globular proteins, J. Mol. Biol. 114 (1977) 181–239.
[15] W. Kabsch, C. Sander, Dictionary of protein secondary structure-
pattern recognition of hydrogen-bonded and geometrical features,
Biopolymers 22 (1983) 2577–2637.