

Analytical Biochemistry 319 (2003) 114–121

www.elsevier.com/locate/yabio

Protein concentration is not an absolute prerequisite for the determination of secondary structure from circular dichroism spectra: a new scaling method

Vincent Raussens, Jean-Marie Ruysschaert, and Erik Goormaghtigh*

Laboratory for Structure and Function of Biological Membranes, Structural Biology and Bioinformatics Center, Free University of Brussels,

Campus Plaine, CP 206/2, Boulevard du Triomphe, Brussels B-1050, Belgium

Received 25 February 2003

Abstract

We present here a simple and rapid method to extract good estimates of protein secondary structure content from circular dichroism (CD) spectra without any prior knowledge of the sample concentration. The method involves two steps: first, a single-wavelength normalization procedure and, second, the application for each secondary structure of a quadratic model based on one or two wavelength intensities. These quadratic models were derived by a cross-validation analysis of a new protein CD spectrum database. Tested on CD spectra of proteins at different concentrations, the normalization was shown to render the method virtually independent of the sample concentration. Further tests on CD spectra not recorded in our laboratory showed that our quadratic models are of general applicability. Even though the success of the present approach is less than that for currently available methods, its simplicity and the fact that the concentration is not needed may be very attractive for the study of small amounts of membrane proteins or peptides for which an accurate concentration determination might be very difficult or impossible to obtain.

© 2003 Elsevier Science (USA). All rights reserved.

Keywords: Circular dichroism; Protein secondary structure; Protein concentration

Circular dichroism (CD) is one of the most widely used techniques for determining the structure of proteins. Far-UV CD (below 240–250 nm) ellipticity is highly sensitive to the different secondary structures found in proteins and polypeptides, each secondary structure type having bands with characteristic wavelengths and intensities. Many mathematical methods have been devised to extract this structural information from CD spectra. They are all based on the representation of a spectrum as a linear combination of basis spectra. The basis spectra are characteristic of the various secondary structure elements or a combination of these. The major methods for extracting information from spectra are multilinear regression [1], singular value decomposition [2], ridge regression [3], principal component factor analysis [4], convex constraint analysis [5], neural networks [6], and the self-consistent method [7] (for a review see [8,9]). Because band intensity depends on the secondary structure content, the use of these methods requires precise knowledge of the sample concentration. In practice, this step becomes limiting when working with small amounts of rare biological materials. Colorimetric assays (such as Lowry et al. [10], Bradford [11], or bicinchoninic acid [12]) are not accurate enough because most of them depend (at least in part) on the specific amino acid content of the protein studied, and this content can differ from that of the commonly used standard (BSA).1 The discrepancy between the calculated and the actual concentration can sometimes be as high as several tens of percent [13], especially with membrane proteins or small peptides. Probably the most widely used method for protein concentration determination is the measurement of the solution absorbance at 280 nm [14]. Yet, this technique requires precise knowledge of the protein extinction coefficient in native conditions and might be perturbed by light scattering, especially for membrane proteins in the presence of micelle-forming detergent or in the presence of proteoliposomes. For insoluble proteins, such as prions or amyloid, thin films or gels are examined but concentrations cannot be precisely defined. The most accurate concentration determination is certainly quantitative amino acid analysis, but the technique is not always accessible and requires a relatively large amount of material. Alternatively, methods directly related to the amide bond content, such as Biuret, or to the total nitrogen content can be used, but these methods also require large amounts of material because they are not sensitive. Furthermore, they are also subject to interference from agents quite commonly used in protein studies (e.g., reducing agents for the Biuret).

* Corresponding author. Fax: +32-26505382. E-mail address: [email protected] (E. Goormaghtigh).

1 Abbreviations used: BSA, bovine serum albumin; RaSP, rationally selected proteins; PC, principal component; PCA, principal component analysis.

0003-2697/03/$ - see front matter © 2003 Elsevier Science (USA). All rights reserved. doi:10.1016/S0003-2697(03)00285-9

We investigated here the possibility of using a simple normalization of the CD spectra that does not require knowledge of the protein concentration. Even though the success of the present approach is less than that for current methods that require the input of protein concentration, the fact that the concentration is not needed may be very attractive for the study of small amounts of materials, for membrane proteins and peptides for which an accurate concentration determination might be very difficult to obtain.

In 2001, McPhie [15] presented for the first time a procedure that does not require knowledge of the protein concentration; it is based on the analysis of an intensive property of CD spectra, the Kuhn g factor. We present here another approach that is much simpler to use and reaches the same final accuracy.

Materials and methods

RaSP database

The set of reference proteins used for this study is an “optimal” basis set that is described in another paper (K.O. Oberg, J.-M. Ruysschaert, and E. Goormaghtigh, unpublished). It represents a wide range of helix and sheet fractional content values and 60 different protein domain folds. Briefly, the set members were chosen from 120 proteins that had been identified as potential reference set proteins through a search of the protein crystal structure databases CATH [16], SCOP [17–19], and PDB_SELECT [20,21] and commercial protein sources (Sigma-Aldrich and Fluka). Proteins were chosen based on their fold. The final selection was based on other criteria including available purity as checked by densitometry of SDS–PAGE analyses, crystal structure quality, nonprotein contaminants, sufficient solubility, and stability. The final set of 50 proteins fully spans several different “conformational spaces” as described by CATH, has fractional content in the different secondary structures, and has distributions of structures that reflect the natural abundances found in the PDB. The proteins (sorted by helix content as in Fig. 2) are the following: (1) trypsin inhibitor (soy bean), (2) avidin, (3) erabutoxin b, (4) concanavalin A, (5) metallothionein II, (6) α-hemolysin (alpha-toxin), (7) lectin (lentil), (8) superoxide dismutase (Cu, Zn), (9) immunoglobulin gamma, (10) xylanase, (11) trypsinogen, (12) α-chymotrypsinogen A, (13) carbonic anhydrase, (14) thaumatin, (15) pepsinogen, (16) rennin (chymosin b), (17) pepsin, (18) trypsin inhibitor (BPTI), (19) ubiquitin, (20) monellin, (21) ribonuclease A, (22) ricin, (23) papain, (24) alcohol dehydrogenase, (25) glucose oxidase, (26) ovalbumin, (27) subtilisin BPN′ (nagarse), (28) α-lactalbumin, (29) subtilisin Carlsberg, (30) lysozyme, (31) penicillin amidohydrolase, (32) DD-transpeptidase, (33) lipoxygenase-1, (34) phosphoglyceric kinase, (35) peroxidase, (36) dihydropteridine reductase, (37) triose phosphate isomerase, (38) insulin, (39) cytochrome c, (40) phospholipase A2, (41) superoxide dismutase (Fe), (42) glutathione S-transferase, (43) parvalbumin, (44) citrate synthetase, (45) troponin, (46) apolipoprotein E3, N-terminal domain (residues 1–183), (47) hemoglobin, (48) ferritin (apo), (49) colicin A, C-terminal domain, and (50) myoglobin.

Protein secondary structure evaluation from the DSSP program output

The secondary structure of the RaSP proteins was determined with the DSSP program [22]. There are eight assignments made by DSSP. Six are familiar to protein chemists: α helix (denoted by H), 3₁₀ helix (G), π helix (I), β sheet (E), turn (T), and unassigned structure (indicated by a blank space in the DSSP program output). Unassigned structure has been referred to by many names, such as irregular, other, disordered, random, or coil. Because of their extremely low frequency, π helices were not considered in this study.

CD data collection and processing

All protein preparations were desalted by dialysis or size-exclusion chromatography. CD spectra were collected on a JASCO J-710 CD spectrometer using filtered protein solutions in 2 mM Hepes, pH 7.2, with an absorbance of ~0.5–0.8 at 192 nm (~0.1 mg/ml) in a 0.1-cm cell. Each CD spectrum was the accumulation of eight scans at 50 nm/min with a 1-nm slit width and a time constant of 0.5 s for a nominal resolution of 1.7 nm. Data were collected from 185 to 260 nm.


Linear regression models—cross-validation

We present here a two-step procedure: (1) in the first step, the best wavelengths to use for both the normalization and the regression equations are searched for; (2) in the second step, the normalization and the regression equations are applied to any unknown protein. This second step can be carried out easily with a pocket calculator using the equations reported in Table 1. The first, “calibration” step is presented in detail here.
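The second step can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' software: the function names `normalize` and `predict_structure` are hypothetical, the coefficients are those read from Table 1 (they should be checked against the published table), and linear interpolation between sampled wavelengths is an assumption.

```python
import numpy as np

NORMALIZATION_NM = 207.0  # normalization wavelength stated in Table 1


def normalize(wavelengths_nm, ellipticity, at_nm=NORMALIZATION_NM):
    """Rescale a spectrum so that its ellipticity at `at_nm` equals 1."""
    w = np.asarray(wavelengths_nm, dtype=float)
    e = np.asarray(ellipticity, dtype=float)
    order = np.argsort(w)              # np.interp needs increasing x values
    w, e = w[order], e[order]
    reference = np.interp(at_nm, w, e)
    if reference == 0.0:
        raise ValueError("ellipticity at the normalization wavelength is zero")
    return w, e / reference


def predict_structure(wavelengths_nm, ellipticity):
    """Apply the Table 1 quadratic models; results are percentages."""
    w, e = normalize(wavelengths_nm, ellipticity)
    e193, e196, e211, e234 = (float(np.interp(nm, w, e))
                              for nm in (193.0, 196.0, 211.0, 234.0))
    return {
        "helix": 27.58 - 14.46 * e193 - 5.66 * e193**2
                 + 1.86 * e211 - 14.72 * e211**2,
        "sheet": 8.66 + 11.97 * e196 + 7.36 * e196**2
                 - 0.80 * e211 + 15.38 * e211**2,
        "turn": 12.49 + 0.28 * e234 - 0.49 * e234**2,
        "random": 38.9 + 3.14 * e193 - 0.56 * e193**2,
    }
```

Because the spectrum is divided by its own value at 207 nm, multiplying the raw signal by any constant (concentration, pathlength, or instrument units) leaves the predictions unchanged.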

In an attempt to build a model describing the secondary structure content, we used either the ellipticities at selected wavelengths or the PC scores. In the latter case, principal component analysis was performed.

The simplest model relates the ellipticity at one wavelength to one secondary structure content. For the sake of simplicity, we consider the helix content in the following description. The model used is linear in the ellipticity $E_{j,\lambda_i}$ and the square of the ellipticity $E_{j,\lambda_i}^2$, where $j$ is the spectrum number and $\lambda_i$ is the wavelength index. $E$ represents here the ellipticity of the rescaled spectra (see below). The model includes a constant, $a_1$, and two proportionality factors ($a_2$ and $a_3$), one for the ellipticity and one for the square of the ellipticity. For the α helix content, this one-wavelength ($\lambda_i$) model can be written

$$
\begin{pmatrix}
1 & E_{1,\lambda_i} & E_{1,\lambda_i}^2 \\
\vdots & \vdots & \vdots \\
1 & E_{50,\lambda_i} & E_{50,\lambda_i}^2
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}
=
\begin{pmatrix}
f_{\mathrm{helix},1} \\ f_{\mathrm{helix},2} \\ f_{\mathrm{helix},3} \\ \vdots \\ f_{\mathrm{helix},49} \\ f_{\mathrm{helix},50}
\end{pmatrix}.
\tag{1}
$$
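Eq. (1) is an ordinary linear least-squares problem in the three constants. The following sketch, which is not taken from the paper, builds the design matrix for one wavelength and solves it with NumPy. The function name `fit_quadratic_model` is hypothetical and the data are synthetic, generated only to show that known constants are recovered.

```python
import numpy as np


def fit_quadratic_model(E, f):
    """Least-squares solution of Eq. (1): f ~ a1 + a2*E + a3*E**2.

    E: rescaled ellipticities at one wavelength, one value per spectrum.
    f: reference secondary structure contents for the same spectra.
    Returns the array (a1, a2, a3).
    """
    E = np.asarray(E, dtype=float)
    f = np.asarray(f, dtype=float)
    X = np.column_stack([np.ones_like(E), E, E**2])   # rows: 1, E_j, E_j^2
    coeffs, *_ = np.linalg.lstsq(X, f, rcond=None)
    return coeffs


# Synthetic illustration only (these are not RaSP data).
rng = np.random.default_rng(0)
E_demo = rng.uniform(-3.0, 1.0, size=50)
f_demo = 20.0 - 10.0 * E_demo - 4.0 * E_demo**2
a_demo = fit_quadratic_model(E_demo, f_demo)
```

With noise-free synthetic input, `a_demo` reproduces the constants used to generate `f_demo`; with real spectra the residuals are what the cross-validation step evaluates.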

Eq. (1) can easily be solved for the best constants $a_k$ in the least-squares sense.

To identify the best wavelength, cross-validation was carried out. For this purpose, each spectrum was removed in turn from the database, a model was built from the remaining spectra, and the removed spectrum was predicted, yielding a predicted content $f_{\mathrm{helix},j}$. This prediction was repeated for all the spectra of the series. The standard deviation of the difference between the predicted values $f_{\mathrm{helix},j}$ and the “real” values $f^{\mathrm{ref}}_{\mathrm{helix},j}$ was used to evaluate the quality of the model. This standard deviation characterizes the quality of the model built at wavelength $\lambda_i$.

The whole cross-validation procedure was then repeated for every wavelength $\lambda_i$ of the spectra (every nanometer). The wavelength with the smallest standard deviation was retained for further building of a more accurate model containing a second ellipticity value in an ascending stepwise manner. The selected wavelength is called $\lambda_1$ below.

When PCA decomposition was carried out, the scores were used instead of the ellipticities in the procedure described above. In the course of the cross-validation procedure, PCA was carried out for every model.

Table 1
Predictive models for the main secondary structures

| Structure | Predictive model (a) | Std. dev. (%) (b) |
|---|---|---|
| α helix | $27.58 - 14.46\,E_{193} - 5.66\,E_{193}^2 + 1.86\,E_{211} - 14.72\,E_{211}^2$ | 11.9 |
| β sheet | $8.66 + 11.97\,E_{196} + 7.36\,E_{196}^2 - 0.80\,E_{211} + 15.38\,E_{211}^2$ | 11.1 |
| turn | $12.49 + 0.28\,E_{234} - 0.49\,E_{234}^2$ | 4.15 |
| random | $38.9 + 3.14\,E_{193} - 0.56\,E_{193}^2$ | 10.3 |

(a) All CD spectra were normalized at 207 nm prior to the application of the models. E represents the ellipticity at a given wavelength.
(b) Standard deviation obtained after a cross-validation process for each secondary structure.

Ascending stepwise model building

For the ascending stepwise building of the model, the best wavelength $\lambda_1$ selected in the cross-validation process described above was retained in the model and a second one was added as described in Eq. (2) for the α helix content model.

$$
\begin{pmatrix}
1 & E_{1,\lambda_1} & E_{1,\lambda_1}^2 & E_{1,\lambda_i} & E_{1,\lambda_i}^2 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & E_{50,\lambda_1} & E_{50,\lambda_1}^2 & E_{50,\lambda_i} & E_{50,\lambda_i}^2
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \end{pmatrix}
=
\begin{pmatrix}
f_{\mathrm{helix},1} \\ f_{\mathrm{helix},2} \\ f_{\mathrm{helix},3} \\ \vdots \\ f_{\mathrm{helix},49} \\ f_{\mathrm{helix},50}
\end{pmatrix}.
\tag{2}
$$

As described before, the cross-validation procedure was used: the constants $a_1$ to $a_5$ were obtained and used to predict the secondary structure content $f_{\mathrm{helix},j}$ from Eq. (2). The standard deviation of the difference between the predicted values $f_{\mathrm{helix},j}$ and the reference values $f^{\mathrm{ref}}_{\mathrm{helix},j}$ was used to evaluate the quality of the new model. The cross-validation procedure was repeated for every wavelength $\lambda_i$ of the spectra. The wavelength with the smallest standard deviation was retained. The model now contains two wavelengths, $\lambda_1$ and $\lambda_2$. The same procedure was repeated up to eight times in this study, defining the eight best wavelengths for the description of a secondary structure. Finally, the entire procedure was repeated for all the secondary structures considered.
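The leave-one-out cross-validation and the ascending stepwise selection described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' implementation: the helper names (`design_matrix`, `loo_std`, `stepwise_select`) are hypothetical, the sample standard deviation (`ddof=1`) is an assumption, and the demonstration data are synthetic.

```python
import numpy as np


def design_matrix(spectra, columns):
    """Constant term plus E and E**2 for each selected wavelength column."""
    blocks = [np.ones(spectra.shape[0])]
    for c in columns:
        blocks += [spectra[:, c], spectra[:, c] ** 2]
    return np.column_stack(blocks)


def loo_std(spectra, f_ref, columns):
    """Standard deviation of (predicted - reference) in leave-one-out mode."""
    X = design_matrix(spectra, columns)
    n = X.shape[0]
    errors = np.empty(n)
    for j in range(n):
        keep = np.arange(n) != j                 # remove spectrum j
        a, *_ = np.linalg.lstsq(X[keep], f_ref[keep], rcond=None)
        errors[j] = X[j] @ a - f_ref[j]          # predict the removed one
    return errors.std(ddof=1)


def stepwise_select(spectra, f_ref, n_steps=2):
    """Add, one at a time, the wavelength giving the smallest LOO std."""
    selected, history = [], []
    for _ in range(n_steps):
        candidates = [c for c in range(spectra.shape[1]) if c not in selected]
        scores = [loo_std(spectra, f_ref, selected + [c]) for c in candidates]
        best = int(np.argmin(scores))
        selected.append(candidates[best])
        history.append(float(scores[best]))
    return selected, history


# Synthetic demonstration: the content depends on columns 3 and 7 only.
rng = np.random.default_rng(1)
spectra_demo = rng.normal(size=(50, 12))
f_demo = (30.0 - 8.0 * spectra_demo[:, 3] - 2.0 * spectra_demo[:, 3] ** 2
          + 4.0 * spectra_demo[:, 7] + rng.normal(scale=0.5, size=50))
selected_demo, history_demo = stepwise_select(spectra_demo, f_demo, n_steps=2)
```

In this sketch `history_demo` plays the role of the curve in Fig. 2: continuing the loop past the informative columns would show the standard deviation levelling off or increasing.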

Normalization

Normalization consisted of multiplying every spectrum by a factor such that its ellipticity at a given wavelength was equal to 1. Each wavelength (every nanometer) was used in turn as the normalization wavelength, and the whole process described above was repeated. The normalization wavelength yielding the lowest standard deviation was determined for every secondary structure.
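The reason this rescaling removes the concentration dependence can be written out in one line. The derivation below is a sketch that assumes the measured signal $\theta_{\mathrm{obs}}$ is proportional to the concentration $c$ and the pathlength $l$ (linear range of the instrument); the symbols $K$, $[\theta]$, and $\lambda_n$ are introduced here for illustration and are not used in the original text.

$$
\theta_{\mathrm{obs}}(\lambda) = K\,c\,l\,[\theta](\lambda)
\quad\Longrightarrow\quad
E(\lambda) = \frac{\theta_{\mathrm{obs}}(\lambda)}{\theta_{\mathrm{obs}}(\lambda_n)}
= \frac{[\theta](\lambda)}{[\theta](\lambda_n)}
$$

Here $K$ is a unit-dependent constant, $[\theta](\lambda)$ is the concentration-normalized ellipticity, and $\lambda_n$ is the normalization wavelength. The product $K\,c\,l$ cancels, so the rescaled spectrum $E(\lambda)$ depends only on the spectral shape, provided $[\theta](\lambda_n) \neq 0$.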

Asynchronous map

Generalized two-dimensional correlation spectra were calculated according to Noda [23,24] and Sasic et al. [25].
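For readers unfamiliar with generalized two-dimensional correlation, the sketch below shows the commonly used discrete form based on the Hilbert-Noda transformation matrix. It is an assumption that this matches the exact computation used in the paper; the function name `two_d_correlation` is hypothetical, and any per-wavelength scaling of the data is left to the caller.

```python
import numpy as np


def two_d_correlation(spectra):
    """Synchronous and asynchronous 2D correlation maps.

    spectra: array of shape (m, n_wavelengths); rows are ordered along the
    perturbation (for example, spectra sorted by helix content).
    """
    Y = np.asarray(spectra, dtype=float)
    m = Y.shape[0]
    Y = Y - Y.mean(axis=0)                 # dynamic spectra (mean removed)
    idx = np.arange(m)
    diff = idx[None, :] - idx[:, None]     # k - j for element (j, k)
    N = np.zeros((m, m))                   # Hilbert-Noda matrix
    off_diagonal = diff != 0
    N[off_diagonal] = 1.0 / (np.pi * diff[off_diagonal])
    synchronous = Y.T @ Y / (m - 1)
    asynchronous = Y.T @ N @ Y / (m - 1)
    return synchronous, asynchronous


# Small synthetic check: two identical signals and one shifted in phase.
t = np.linspace(0.0, 1.0, 20)
demo = np.column_stack([np.sin(2 * np.pi * t),
                        np.sin(2 * np.pi * t),
                        np.cos(2 * np.pi * t)])
sync_demo, async_demo = two_d_correlation(demo)
```

Signals that vary in phase give a vanishing asynchronous cross peak, whereas signals that vary out of phase give a nonzero one, which is why the asynchronous map is useful for locating wavelengths that carry independent information.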

Results

α helix structure

In the first part of the work, we searched for the best spectral normalization. To select the wavelength that gives the best normalization, we set the ellipticity of the RaSP50 CD spectra to 1 at every wavelength in turn. Linear models describing the α helix structure content were built with each of these normalized sets of spectra. Each model was obtained by a combined cross-validation/ascending stepwise approach (see Materials and methods) and the corresponding standard deviation was calculated.

Fig. 1 shows the standard deviation as a function of the normalization wavelength for models using only one wavelength (Eq. (1)). The best normalization wavelength appears to be around 206–207 nm, with an additional region of interest near 245 nm. Such a profile computed with two- to eight-wavelength models indicates that the normalization region, 206–207 nm, remains the best in all cases (not shown).

Fig. 1. Evolution of the best standard deviation for the α helix prediction in cross-validation for the 50 proteins of the database as a function of the wavelength at which the CD spectra were normalized. The model contains one wavelength.

In the second part of the work, we investigated the effect of incorporating an increasing number of wavelengths in the construction of the model. Fig. 2 shows the evolution of the prediction standard deviation for the determination of the α helix structure content by cross-validation as a function of the number of wavelengths included in the model after normalization of all the spectra at 207 nm. Clearly, the ellipticity at a single wavelength (193 nm) contains most of the information necessary to build the model. In Fig. 3, the normalized spectra have been sorted according to the helix content (inset) and plotted. It is apparent from the rescaled spectra (Fig. 3) that the region near 190 nm is correlated with the helical content. After normalization (by a negative factor for most spectra), the ellipticity decreases as the helix content increases. Addition of a second wavelength (211 nm) significantly improves the description of the helix content (standard deviation ~12%). Addition of a third wavelength does not improve the prediction model any further, and further additions degrade the prediction, indicating that these other wavelengths do not contain more useful independent information for describing the helix content but rather bring in noisy unrelated information. Fig. 4 illustrates the relation between the actual α helix content and the predicted one obtained by cross-validation for a model containing two wavelengths, 193 and 211 nm, after normalization of the spectra at 207 nm. The standard deviation of the prediction is 11.9%. It is important to compare this value to the standard deviation of the actual content distribution. For the α helix, the spread of the helical content in the RaSP50 database is characterized by a standard deviation of 22%. This reference value is shown in Fig. 2 as the starting point (this would correspond to a model calculated with 0 wavelengths taken into account). It can be observed in Fig. 4 that a series of spectra (circled spectra 45, 46, 47, 49, and 50) with high helix content is predicted with too low a helix content. We could not find any rational explanation behind the poor prediction for these five proteins.

Fig. 2. Evolution of the standard deviation (%) for the determination of the α helix structure content by cross-validation as a function of the number of added wavelengths to the model. All the spectra were normalized at 207 nm. The wavelengths at which the best standard deviations are found are indicated on the curve. The 0 added wavelength point refers to the standard deviation characterizing the distribution of the helix content in the RaSP database.

Fig. 3. Series of spectra sorted according to the α helix content after rescaling at 207 nm. Inset: evolution of the α helix content with the spectrum number.

Fig. 4. Relation between the actual α helix content and the predicted α helix content obtained by cross-validation for a model containing the ellipticities at two wavelengths: 193 and 211 nm. All the spectra were normalized at 207 nm for building this model. The model built by linear regression is the central dotted line. The external dotted lines are drawn at ±1 standard deviation. The numbers identify the spectra in the RaSP database. The circled proteins are troponin (45); N-terminal apolipoprotein E3(1–183) (46); hemoglobin (47); colicin A (49); and myoglobin (50). The predictive model used to obtain the helix content is described in Table 1.

Decomposition of the spectrum series into principal components is another way to extract uncorrelated information (principal components are the eigenvectors of the spectrum correlation matrix). When the normalization/cross-validation process was repeated for building models from the first three PCs, the best model was obtained with a normalization at 207 nm, in perfect agreement with the previous data. A model using the 190–240 nm range and including three PCs described the helix content with a standard deviation of 11.7% (data not shown). Interestingly, the series of spectra circled in Fig. 4 appeared at the same position with respect to the regression line, underlining the intrinsic inability to correctly predict the helix content in these proteins.
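How PC scores can stand in for ellipticities may be easier to see in code. The sketch below computes scores with a singular value decomposition of the mean-centered spectra; this is covariance-based PCA and is an assumption, since the text speaks of the correlation matrix without giving details. The function name `pca_scores` is hypothetical and the input is synthetic.

```python
import numpy as np


def pca_scores(spectra, n_components=3):
    """Return (scores, loadings) of the first principal components.

    spectra: array of shape (n_spectra, n_wavelengths), already rescaled.
    scores:  shape (n_spectra, n_components), usable as regressors in place
             of the ellipticities at selected wavelengths.
    """
    X = np.asarray(spectra, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each wavelength
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components]                  # one row per component
    return scores, loadings


# Synthetic illustration only: 50 spectra sampled at 56 wavelengths.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 56))
scores_demo, loadings_demo = pca_scores(X_demo, n_components=3)
```

The score columns are mutually uncorrelated by construction, which is the property exploited when they replace individual wavelengths in the regression. In a strict cross-validation, the decomposition would be recomputed each time a spectrum is left out, as the Materials and methods section indicates.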

β sheet structure

A similar analysis was carried out for the β sheet structure. The best normalization wavelength was again found to be 207 nm (not shown). As was the case for the α helix, two wavelengths, 196 and 211 nm, are sufficient to describe the information contained in the spectra that is related to the β sheet structure content. As for the helix structure, adding more than two wavelengths resulted in poorer predictive models when tested in cross-validation mode. The standard deviation of the prediction is now 11.1% (Fig. 5) (the actual standard deviation in RaSP50 for β sheet is 17.7%).

Fig. 5. Relation between the actual β sheet content and the predicted β sheet content obtained by cross-validation for a model containing two wavelengths: 196 and 211 nm. The model built by linear regression is the central dotted line. The external dotted lines are drawn at ±1 standard deviation. The numbers identify the spectra in the RaSP database. The predictive model used to obtain the β sheet content is described in Table 1.

Fig. 6. Prediction of the α helix content for hemoglobin (Hemo alpha) and pepsin (Peps alpha) and of the β sheet content for hemoglobin (Hemo beta) and pepsin (Peps beta). The spectra of pepsin and hemoglobin were recorded for different protein concentrations and different pathlengths as explained in the text. They were normalized at 207 nm before prediction. The arrows indicate the corresponding values determined for the proteins in the database.

Other structures

The other structures defined by DSSP have been tested in the same way (see Table 1). For turns, random structures, and the sum of the two, the standard deviations of the prediction are 4, 11, and 12%, respectively. Yet, this is barely 1% better than the reference standard deviation characterizing the distribution of the structures in the database. Summing the 3₁₀ helix and the α helix contents did not improve the prediction for the helices in general (not shown).

Precision

The investigation described so far deals with the accuracy of the structure determination from rescaled CD spectra. Another question is the precision obtained for data recorded at different protein concentrations and with different cell pathlengths. To address this question, an investigator who took no part in the work described so far prepared a series of dilutions of two test proteins: pepsin (17) and hemoglobin (47). Spectra were recorded at 0.05, 0.1, 0.5, and 1 mg/ml in cells with pathlengths of 0.1, 0.2, 0.5, and 1 mm, yielding 16 spectra for each protein. Fig. 6 shows the predicted α helix and β sheet contents obtained with the predictive models determined previously (Table 1) after normalization at 207 nm. The most extreme conditions (i.e., the two most dilute samples with a pathlength of 0.1 mm (spectra too noisy) and the most concentrated sample with a pathlength of 1 mm (intensity too high)) were out of range and were removed from the figure. The remaining spectra, as judged from Fig. 6, allowed a quite precise determination of the secondary structure content, demonstrating the validity of the normalization procedure. Detailed inspection of the data reported in Fig. 6 reveals that there is no correlation between the deviation of a particular measurement and the protein concentration or the cell pathlength (not shown).

Portability

To establish the portability of the method, we obtained the CD spectra of 18 proteins identical to proteins included in the series used for this work (RaSP50 database). The 18 CD spectra were extracted from a 42-protein database combining spectra from different origins (for a description, see [26,27]). These spectra were rescaled at 207 nm as previously described, and α helix and β sheet structures were predicted with the predictive models (Table 1). The predictions for both structures agreed with the values obtained from our own database, with a standard deviation of 7% (not shown). It can therefore be concluded that the parameters determined with the RaSP database used in this work can be transferred to other spectra recorded under completely different conditions with reasonably good success.

Fig. 7. Asynchronous correlation map of the spectra series presented in Fig. 3 after rescaling at 207 nm. To weight the spectral variations equally at each wavelength, the data were normalized across the spectra at each wavelength before computation.


Discussion

All the current methods available for protein CD spectra analysis need an accurate protein concentration determination prior to any calculation. This step is crucial to obtain good and valuable results. In practice, the protein concentration determination with the accuracy needed by these methods is not always easy to obtain for reasons briefly described in the introduction. In some cases, the protein concentration of the sample is almost impossible to assess. Available quantities can be very small, and the proteins or peptides can be difficult to handle for quantitative measurements (e.g., membrane and hydrophobic proteins or peptides, proteins that aggregate such as prion, β-amyloid peptide, etc., and proteins that have to be analyzed in gels or thin films). In an attempt to overcome this problem, we present here a simple and rapid method to obtain rather good estimates of the secondary structure contents without prior knowledge of the sample concentration.

The method involves two steps: first, a normalization procedure at one wavelength and, second, the application of a quadratic model equation including one or two wavelength intensities.

It is important to assess the rationale behind the process described above. It appears that at least three wavelengths are correlated with the secondary structure content and are independent of one another. The first is the normalization wavelength (207 nm). A good correlation must indeed exist between the intensity at the normalization wavelength and the structure content; without such a correlation, normalization would destroy the structural information in the spectra. The first wavelength included in the model must obviously bring information that is independent of the information contained in the wavelength used for normalization. Once a first wavelength is added, the information content at any wavelength that is correlated with it becomes zero. The cross-validation procedure indicates that, in the cases of α helix and β sheet, there is also a second unrelated wavelength that can be included in the model. The intensity at this second added wavelength has to be uncorrelated with the first one and with the normalization wavelength. To test this, we show in Fig. 7 the asynchronous correlation map for the spectra normalized at 207 nm. A major asynchronous correlation was found between the spectral regions around 193 and 211 nm, in good agreement with the wavelengths selected in the α helix and β sheet predictive models (Table 1). This observation strengthens the selection of the wavelengths obtained from Fig. 2 and validates the approach presented here. Altogether, it seems that three unrelated types of information in a CD spectrum can be used to describe and predict protein secondary structures; one of them, probably the most useful, is used for normalization. The loss of information with respect to concentration-normalized spectra is therefore probably significant.
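For readers unfamiliar with asynchronous correlation maps, the following minimal sketch computes synchronous and asynchronous maps with the Hilbert-Noda transformation matrix used in generalized two-dimensional correlation spectroscopy [23,24]. It assumes the spectra are stacked as rows of an array, already normalized, and ordered along some perturbation variable; how the spectra behind Fig. 7 were ordered is not restated here, so that ordering, like the function names, is an assumption of the sketch.

```python
import numpy as np


def hilbert_noda_matrix(m):
    """Return N with N[j, k] = 0 if j == k, else 1 / (pi * (k - j))."""
    idx = np.arange(m)
    diff = idx[None, :] - idx[:, None]  # diff[j, k] = k - j
    n = np.zeros((m, m))
    off_diagonal = diff != 0
    n[off_diagonal] = 1.0 / (np.pi * diff[off_diagonal])
    return n


def correlation_maps(spectra):
    """Synchronous and asynchronous 2D correlation maps.

    spectra: array of shape (m, n_wavelengths); rows are assumed to be
    ordered along the perturbation variable.
    """
    spectra = np.asarray(spectra, dtype=float)
    m = spectra.shape[0]
    if m < 2:
        raise ValueError("at least two spectra are required")
    dynamic = spectra - spectra.mean(axis=0)  # mean-centered (dynamic) spectra
    sync = dynamic.T @ dynamic / (m - 1)
    asyn = dynamic.T @ hilbert_noda_matrix(m) @ dynamic / (m - 1)
    return sync, asyn
```

The asynchronous map is antisymmetric and vanishes for any pair of wavelengths whose intensities vary strictly in proportion, which is why a strong off-diagonal peak points to wavelengths carrying unrelated information.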

Prediction of α helix content results in a standard deviation of 11.9%. Yet, a group of proteins appears specifically ill predicted in Fig. 4, namely troponin (45), N-terminal apolipoprotein E3(1–183) (46), hemoglobin (47), colicin (49), and myoglobin (50). It can be hypothesized that information useful for predicting these proteins with high helix content has been lost in the normalization process. This is particularly striking in view of the slightly better result (standard deviation = 11.1%) achieved for the β sheet structure. The other structures defined by DSSP (turns, 3₁₀ helices, random) were predicted with standard deviations of 4, 3, and 11%, respectively, i.e., barely better than the standard deviations of the distribution of these structures in the database. Consequently, summing the α and 3₁₀ helix contents did not improve the prediction (not shown).
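The comparison between a prediction error and the spread of a structure in the database can be illustrated with a leave-one-out cross-validation of a one-wavelength quadratic fit. This is a hedged sketch: the function names are hypothetical, the root-mean-square leave-one-out error is assumed to play the role of the standard deviations quoted above, and no data from the protein database are reproduced.

```python
import numpy as np


def design_matrix(x):
    """Columns 1, x, x**2 for a one-wavelength quadratic model."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x ** 2])


def loo_rmse(x, y):
    """Root-mean-square leave-one-out error of the quadratic fit.

    x: normalized intensity at one wavelength, one value per protein.
    y: reference content of one secondary structure (%), one per protein.
    """
    y = np.asarray(y, dtype=float)
    design = design_matrix(x)
    errors = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i  # leave protein i out of the fit
        coeffs, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        errors[i] = y[i] - design[i] @ coeffs
    return float(np.sqrt(np.mean(errors ** 2)))


def baseline_spread(y):
    """Spread of the reference contents: the error of always predicting the mean."""
    return float(np.std(np.asarray(y, dtype=float)))
```

A model is informative only when `loo_rmse` is clearly below `baseline_spread`; values close to each other, as reported for turns, 3₁₀ helices, and random structure, mean the spectrum adds little beyond the database average.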

The analysis of a series of spectra of pepsin and hemoglobin recorded at different concentrations with different pathlengths demonstrates that the results obtained are essentially independent of the protein concentration, provided that the spectra are of sufficient quality. This confirms the validity of our normalization procedure. Finally, the analysis of spectra recorded and published by others [26,27] using our predictive models indicates that the simple equations (Table 1) that we derived from the analysis of our new rationally selected protein database are widely applicable.
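The concentration independence observed for the pepsin and hemoglobin series can be made explicit with a short derivation. It is a sketch under the usual assumption that the measured ellipticity scales linearly with concentration c and pathlength l; the symbols θ_obs, [θ], and k (a unit-conversion constant) are notation introduced here for illustration and are not taken from the original equations.

```latex
\[
\theta_{\mathrm{obs}}(\lambda) = k\,c\,l\,[\theta](\lambda)
\quad\Longrightarrow\quad
\frac{\theta_{\mathrm{obs}}(\lambda)}{\theta_{\mathrm{obs}}(207\,\mathrm{nm})}
= \frac{k\,c\,l\,[\theta](\lambda)}{k\,c\,l\,[\theta](207\,\mathrm{nm})}
= \frac{[\theta](\lambda)}{[\theta](207\,\mathrm{nm})}
\]
```

The product c·l cancels in the ratio, so any model built only on intensities normalized at 207 nm cannot depend on concentration or pathlength, as long as noise at 207 nm is small compared with the signal.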

Recently, McPhie [15] presented a procedure, based on the analysis of an intensive quantity of CD spectra, the Kuhn g factor, that for the first time does not require knowledge of the protein concentration. We presented here another approach that reaches the same final accuracy. These two approaches are definitely nonoptimal compared to the most accurate methods available, but those accurate methods all require a highly precise knowledge of the sample concentration. Therefore, McPhie's method and ours can provide very useful assessments of protein secondary structure when this knowledge is lacking. In addition, we believe that our method is much simpler to use. It requires only a normalization of the data at 207 nm and the application of the predictive model equations reported in Table 1. This can easily be done with a pocket calculator.

References

[1] N. Greenfield, G.D. Fasman, Computed circular dichroism spectra for the evaluation of protein conformation, Biochemistry 8 (1969) 4108–4116.

[2] J.P. Hennessey, W.C. Johnson, Information content in the circular dichroism of proteins, Biochemistry 20 (1981) 1085–1094.

[3] S.W. Provencher, J. Glöckner, Estimation of globular protein secondary structure from circular dichroism, Biochemistry 20 (1981) 33–37.

[4] R. Pribić, Principal component analysis of Fourier transform infrared and/or circular dichroism spectra of proteins applied in a calibration of protein secondary structure, Anal. Biochem. 223 (1994) 26–34.

[5] A. Perczel, M. Hollosi, G. Tusnady, G.D. Fasman, Convex constraint analysis: a natural deconvolution of circular dichroism curves of proteins, Protein Eng. 4 (1991) 669–679.

[6] G. Bohm, R. Muhr, R. Jaenicke, Quantitative analysis of protein far UV circular dichroism spectra by neural networks, Protein Eng. 5 (1992) 191–195.

[7] N. Sreerama, R.W. Woody, A self-consistent method for the analysis of protein secondary structure from circular dichroism, Anal. Biochem. 209 (1993) 32–44.

[8] N. Greenfield, Methods to estimate the conformation of proteins and polypeptide from circular dichroism data, Anal. Biochem. 235 (1996) 1–10.

[9] S.Y. Venyaminov, J.T. Yang, in: G.D. Fasman (Ed.), Circular Dichroism and the Conformational Analysis of Biomolecules, Plenum Press, New York, 1996, pp. 69–107.

[10] O.H. Lowry, N.J. Rosebrough, A.L. Farr, R.J. Randall, Protein measurement with the Folin phenol reagent, J. Biol. Chem. 193 (1951) 265–275.

[11] M. Bradford, A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein–dye binding, Anal. Biochem. 72 (1976) 248–254.

[12] P.K. Smith, R.I. Krohn, G.T. Hermanson, A.K. Mallia, F.H. Gartner, M.D. Provenzano, E.K. Fujimoto, N.M. Goeke, B.J. Olson, D.C. Klenk, Measurement of protein using bicinchoninic acid, Anal. Biochem. 150 (1985) 76–85.

[13] W.H. Peters, A.M. Fleuren-Jakobs, K.M. Kamps, J.J. de Pont, S.L. Bonting, Lowry protein determination on membrane preparations: need for standardization by amino acid analysis, Anal. Biochem. 124 (1982) 349–352.

[14] C.N. Pace, F. Vajdos, L. Fee, G. Grimsley, T. Gray, How to measure and predict the molar absorption coefficient of a protein, Protein Sci. 4 (1995) 2411–2423.

[15] P. McPhie, Circular dichroism studies on proteins in films and in solution: estimation of secondary structure by g-factor analysis, Anal. Biochem. 293 (2001) 109–119.

[16] C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, J.M. Thornton, CATH—a hierarchic classification of protein domain structures, Structure 5 (1997) 1093–1108.

[17] T.J.P. Hubbard, A.G. Murzin, S.E. Brenner, C. Chothia, SCOP: a structural classification of proteins database, Nucleic Acids Res. 25 (1997) 236–239.

[18] A.G. Murzin, S.E. Brenner, T. Hubbard, C. Chothia, SCOP: a structural classification of proteins database for the investigation of sequences and structures, J. Mol. Biol. 247 (1995) 536–540.

[19] G.J. Barton, SCOP: structural classification of proteins, Trends Biochem. Sci. 19 (1994) 554–555.

[20] U. Hobohm, C. Sander, Enlarged representative set of protein structures, Protein Sci. 3 (1994) 522–524.

[21] U. Hobohm, M. Scharf, R. Schneider, C. Sander, Selection of representative protein data sets, Protein Sci. 1 (1992) 409–417.

[22] W. Kabsch, C. Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers 22 (1983) 2577–2637.

[23] I. Noda, Generalized 2-dimensional correlation method applicable to infrared, Raman, and other types of spectroscopy, Appl. Spectrosc. 47 (1993) 1329–1336.

[24] I. Noda, Determination of two-dimensional correlation spectra using the Hilbert transform, Appl. Spectrosc. 54 (2000) 994–999.

[25] S. Sasic, A. Muszynski, Y. Ozaki, New insight into the mathematical background of generalized two-dimensional correlation spectroscopy and the influence of mean normalization pretreatment on two-dimensional correlation spectra, Appl. Spectrosc. 55 (2001) 343–349.

[26] N. Sreerama, S.Y. Venyaminov, R.W. Woody, Estimation of protein secondary structure from circular dichroism spectra: inclusion of denatured proteins with native proteins in the analysis, Anal. Biochem. 287 (2000) 243–251.

[27] N. Sreerama, R.W. Woody, Estimation of protein secondary structure from circular dichroism spectra: comparison of CONTIN, SELCON, and CDSSTR methods with an expanded reference set, Anal. Biochem. 287 (2000) 252–260.