

Toward a quantitative theory of intrinsically disordered proteins and their function

Jintao Liu^a, James R. Faeder^b, and Carlos J. Camacho^b,1

Departments of ^aPhysics and Astronomy and ^bComputational Biology, University of Pittsburgh, Pittsburgh, PA 15260

Edited by Michael E. Fisher, University of Maryland, College Park, MD, and approved September 23, 2009 (received for review July 13, 2009)

A large number of proteins are sufficiently unstable that their full 3D structure cannot be resolved. The origins of this intrinsic disorder are not well understood, but its ubiquitous presence undercuts the principle that a protein’s structure determines its function. Here we present a quantitative theory that makes predictions regarding the role of intrinsic disorder in protein structure and function. In particular, we discuss the implications of analytical solutions of a series of fundamental thermodynamic models of protein interactions in which disordered proteins are characterized by positive folding free energies. We validate our predictions by assigning protein function by using the gene ontology classification (in which “protein binding”, “catalytic activity”, and “transcription regulator activity” are the three largest functional categories) and by performing genome-wide surveys of both the amount of disorder in these functional classes and binding affinities for both prokaryotic and eukaryotic genomes. Specifically, without assuming any a priori structure–function relationship, the theory predicts that both catalytic and low-affinity binding ($K_d \gtrsim 10^{-7}$ M) proteins prefer ordered structures, whereas only high-affinity binding proteins (found mostly in eukaryotes) can tolerate disorder. Relevant to both transcription and signal transduction, the theory also explains how increasing disorder can tune the binding affinity to maximize the specificity of promiscuous interactions. Collectively, these studies provide insight into how natural selection acts on folding stability to optimize protein function.

binding | catalysis | intrinsic disorder | specificity | transcription

Most proteins are not stable enough for current technologies to resolve their full 3D structure (1). In fact, estimates suggest that anywhere between 25% and 41% of the proteins in eukaryotic genomes contain long disordered regions (2). It has been suggested that disorder itself plays a functional role by, e.g., allowing for multiple interaction partners (3) and functional diversity (4–6), which are particularly important in cell signaling and cancer (7). The correlation between intrinsic disorder and protein function, however, is still nebulous and led us to look for more general principles that might relate protein function and disorder. Unlike the aforementioned bioinformatics approaches and other heuristic models (8), here we examine the linkage between disorder and protein function from a thermodynamic point of view.

Without assuming any structure–function relationship, we look for experimentally derived parameters that might relate protein function and disorder. As described by Dyson and Wright (9), proteins in the cellular environment may have disorder in long loops, end terminals, hinge regions, domains, and even covering their full sequences. However, in a complex, these motifs acquire well-defined 3D structures. Common descriptors to all these forms of disorder are the folding free energy ($\Delta G_f$) of the motifs participating in the molecular interaction and the dissociation constant ($K_d$) of the interaction, where a positive folding free energy corresponds to a disordered protein (10).

We find that binding interactions between proteins become increasingly tolerant of the native disordered state ($\Delta G_f > 0$) as the strength of the physical interaction of the bound state (i.e., the “complementarity” of the complex) is increased. Indeed, for μM concentrations, only binding affinities stronger than $10^{-7}$ M can optimally bind disordered proteins. More interestingly, we show that this intrinsic protein disorder can tune the binding free energy of the complex to maximize the specificity of promiscuous interactions. On the other hand, optimal catalytic conversion of substrates to products requires ordered structures with $\Delta G_f \lesssim -1$ kcal/mol. These results demonstrate the possibility that evolution may act on the stability of proteins to optimize basic functions such as binding and catalysis. A comparative genomic analysis of the amount of disorder in proteomes across all kingdoms further supports this conjecture and also reveals intriguing differences in the role of disorder between eukaryotes and prokaryotes for both binding and transcription proteins.

Results

Genome-wide surveys of protein disorder have shown that disorder is more prevalent in some functional categories than others (5, 6). We revisit this question by analyzing the fraction of amino acid residues in disordered regions of both eukaryotic and prokaryotic genomes for the three largest functional categories in the gene ontology (11) classification (see Materials and Methods): “protein binding”, “catalytic activity”, and “transcription regulator activity”. Fig. 1 shows the distributions of the amount of disorder in human, yeast, and Escherichia coli proteins (also shown are the distributions after removing proteins with more than one function; see also Fig. S1 of the SI Appendix). In contrast to the striking bias of human catalytic and transcription proteins toward order and disorder, respectively, disorder is neither strongly favored nor disfavored in binding proteins. These distinctions are still visible in yeast but are less obvious in bacterial genomes such as E. coli, whose proteins are found to be significantly more ordered than those found in eukaryotes across all functional categories.

Based on a more comprehensive analysis of the preference of disorder among the different functional categories, we classify the genomes into three types (Fig. S2 of the SI Appendix): (type I) no strong preference for ordered structures in binding proteins but preference for disorder in transcription proteins, among which are human, mouse, zebrafish, chicken, rice, fruit fly, Arabidopsis thaliana, and Dictyostelium discoideum; (type II) no strong preference for ordered structures for either binding or transcription proteins, among which one finds yeast, Schizosaccharomyces pombe, and Caenorhabditis elegans; and (type III) strong preference for ordered structures in both binding and transcription proteins, among which there are E. coli, Bacillus anthracis, and Pseudomonas fluorescens. For catalysis, all genomes show a strong preference for ordered proteins. We note that the smaller bacterial genomes are all type III, whereas eukaryotes are either type I or II, with type I genomes being generally larger than type II.

Author contributions: J.R.F. and C.J.C. designed research; J.L. performed research; J.L., J.R.F., and C.J.C. analyzed data; and J.L., J.R.F., and C.J.C. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Freely available online through the PNAS open access option.

^1To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/cgi/content/full/0907710106/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.0907710106 | PNAS | November 24, 2009 | vol. 106 | no. 47 | 19819–19823

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

This analysis suggests that selection pressures act on protein disorder to optimize particular aspects of protein function. This raises a question: what universal properties may have driven proteins involved in binding, catalysis, and transcription to evolve along different pathways?

Thermodynamic Model. We show here that a simple thermodynamic model of molecular interactions can elucidate the role of disorder in binding and catalysis. In this model, folding is defined as a two-state equilibrium between the unfolded state (U) and the folded state (F) (see SI Appendix for a three-state folding model) (12). Thus, the ratio of folded to unfolded proteins is given by $[F]_{\mathrm{eq}}/[U]_{\mathrm{eq}} = e^{-\Delta G_f/RT}$ (“eq” denotes equilibrium), where $\Delta G_f$ is the free energy of folding, $R$ is the ideal gas constant, and $T$ is temperature. Molecular interactions are described by a simple binding model that assumes that only folded proteins bind the substrate, i.e.,

$$U \;\overset{\Delta G_f}{\rightleftharpoons}\; F, \qquad F + S \;\overset{K_d^c}{\rightleftharpoons}\; FS. \tag{1}$$

By decoupling folding and binding, one can define $K_d^c = [F]_{\mathrm{eq}}[S]_{\mathrm{eq}}/[FS]_{\mathrm{eq}}$ as the complementary affinity, which implicitly accounts for the effects of interface area, shape, hydrogen bonds, and other interactions. Note that the size of the interface provides a natural upper bound on the number of contacts contributing to the interaction. In this sense, higher complementarity is often associated with a large interface, although in some cases it can be caused by other factors (e.g., small-molecule drugs often have binding affinities between $10^{-9}$ and $10^{-12}$ M). Hence, $K_d^c$ is equivalent to the experimental binding affinity $K_d^{\mathrm{exp}} = ([U]_{\mathrm{eq}} + [F]_{\mathrm{eq}})[S]_{\mathrm{eq}}/[FS]_{\mathrm{eq}}$ if protein F is folded before binding. On the other hand, if protein F is unstable (or disordered), then

$$K_d^{\mathrm{exp}} = K_d^c \left(1 + e^{\Delta G_f/RT}\right). \tag{2}$$
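As a brief worked step (not spelled out in the text), Eq. 2 follows directly from the two definitions above together with the two-state folding ratio $[U]_{\mathrm{eq}}/[F]_{\mathrm{eq}} = e^{\Delta G_f/RT}$:

$$K_d^{\mathrm{exp}} = \frac{([U]_{\mathrm{eq}} + [F]_{\mathrm{eq}})[S]_{\mathrm{eq}}}{[FS]_{\mathrm{eq}}} = \frac{[F]_{\mathrm{eq}}[S]_{\mathrm{eq}}}{[FS]_{\mathrm{eq}}}\left(1 + \frac{[U]_{\mathrm{eq}}}{[F]_{\mathrm{eq}}}\right) = K_d^c\left(1 + e^{\Delta G_f/RT}\right).$$

Two limits are worth keeping in mind: for $\Delta G_f \ll 0$ the correction factor tends to 1 and $K_d^{\mathrm{exp}} \to K_d^c$, whereas at $\Delta G_f = 0$ (equal populations of U and F) the experimental dissociation constant is exactly $2K_d^c$.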

We note that in this formulation, $K_d^c$ characterizes the strength of the binding interaction for the folded protein and is independent of the folding free energy, $\Delta G_f$, allowing for a clear distinction between binding and folding. Aside from conformational selection (13), disordered proteins could also function through induced folding (1, 14) or a combination of partial folding/unfolding (15). However, as demonstrated in the SI Appendix, our conclusions do not lose generality because we only rely on (quasi)equilibrium properties. For each functional category, we relate a measure of optimal performance to $\Delta G_f$ over the range of parameters found in nature. With the exception of transcription, where further discussion is needed, we will show that this general model accounts for the observed distributions in Fig. 1 if one assumes that natural selection acts on $\Delta G_f$ to optimize protein function. In the following, we discuss the key relations between folding stability and function.

For binding proteins, the equilibrium complex concentration is given by

$$[FS]_{\mathrm{bind}} = \frac{1}{2}\left[\left(c_p + c_s + K_d^{\mathrm{exp}}\right) - \sqrt{\left(c_p + c_s + K_d^{\mathrm{exp}}\right)^2 - 4 c_p c_s}\,\right], \tag{3}$$

where $c_p = [U] + [F] + [FS]$ and $c_s = [S] + [FS]$ are the total protein and substrate concentrations, respectively. Hence, it is clear that $[FS]_{\mathrm{bind}}$ reaches a maximum if $\Delta G_f \ll 0$. The curves in Fig. 2A show the ratio $[FS]_{\mathrm{bind}}/[FS]_{\mathrm{bind}}^{\mathrm{max}}$ as a function of folding free energy ($\Delta G_f$), in the absence of excess protein or substrate ($c_p = c_s = 1$ μM). Given $K_d^c$, this ratio defines a measure of the efficiency of protein binding to produce the maximum amount of complex. For the physiologically relevant range of $K_d^c$ between $10^{-5}$ and $10^{-10}$ M, a binding efficiency of, say, 90% or higher is obtained for folding-stability thresholds of $\Delta G_f \le -1.2$ kcal/mol and $\Delta G_f \le 2.9$ kcal/mol, respectively (see ref. 16, where a similar analysis was used to relate peptide immunogenicity and folding stability). Specifically, we note that only strongly interacting proteins with $K_d^{\mathrm{exp}} \lesssim 1.2 \times 10^{-7}$ M can efficiently bind disordered proteins ($\Delta G_f > 0$). As shown in Fig. 2A, a more stringent criterion of 97% binding efficiency also leads to a wide range of stability thresholds, where now $K_d^{\mathrm{exp}} \lesssim 1.0 \times 10^{-8}$ M can tolerate disorder. An excess of protein ($c_p > c_s$) or substrate ($c_s > c_p$) can accommodate a slightly larger amount of disorder (Fig. S3 of the SI Appendix), but this does not affect our main conclusion that highly complementary interactions are more tolerant of disorder, whereas the binding efficiency of low-complementarity interactions is rapidly diminished by disorder.
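The stability thresholds quoted above can be checked numerically. The following is a minimal sketch, not the authors' code: it combines Eqs. 2 and 3 and bisects on the folding free energy to find where the binding efficiency falls to a target value. The value RT = 0.593 kcal/mol (roughly 298 K) and all function names are assumptions made for illustration.

```python
import math

RT = 0.593  # kcal/mol; assumes a temperature near 298 K (not stated in the text)

def kd_exp(kd_c, dg_f):
    """Eq. 2: experimental dissociation constant (M) for folding free energy dg_f (kcal/mol)."""
    return kd_c * (1.0 + math.exp(dg_f / RT))

def complex_conc(kd, c_p, c_s):
    """Eq. 3: equilibrium complex concentration [FS] (M) for total concentrations c_p, c_s."""
    b = c_p + c_s + kd
    return 0.5 * (b - math.sqrt(b * b - 4.0 * c_p * c_s))

def binding_efficiency(kd_c, dg_f, c_p=1e-6, c_s=1e-6):
    """Ratio [FS]_bind / [FS]_bind^max, where the maximum is the dg_f << 0 limit (Kd_exp -> Kd_c)."""
    return complex_conc(kd_exp(kd_c, dg_f), c_p, c_s) / complex_conc(kd_c, c_p, c_s)

def stability_threshold(kd_c, target=0.90, lo=-10.0, hi=10.0):
    """Largest dg_f (kcal/mol) that still reaches the target efficiency.

    Efficiency decreases monotonically with dg_f, so plain bisection is enough.
    """
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if binding_efficiency(kd_c, mid) >= target:
            lo = mid
        else:
            hi = mid
    return lo

if __name__ == "__main__":
    for kd_c in (1e-5, 1e-10):
        dg = stability_threshold(kd_c)
        print(f"Kd_c = {kd_c:.0e} M: 90% threshold at dG_f = {dg:+.2f} kcal/mol")
```

Under these assumptions the two thresholds should land close to the −1.2 and 2.9 kcal/mol values quoted in the text; changing `target` to 0.97 gives the stricter criterion.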

For catalysis, we further considered the rate of substrate conversion to product via the FS complex, which within the Michaelis–Menten limit leads to the conversion rate

$$V_{\mathrm{cat}} = \frac{k_{\mathrm{cat}}\, c_p\, [S]}{K_m^c \left(1 + e^{\Delta G_f/RT}\right) + [S]}, \tag{4}$$

where $K_m^c$ is the Michaelis constant and $k_{\mathrm{cat}}$ is the enzyme turnover rate. Fig. 2B shows that for typical $K_m^c$ values between $10^{-1}$ M and $10^{-6}$ M there is a relatively invariant threshold of the folding free energy, $\Delta G_f \approx -1.0$ kcal/mol, above which catalysis becomes suboptimal (i.e., $V_{\mathrm{cat}}/V_{\mathrm{cat}}^{\mathrm{max}} < 90\%$, where $V_{\mathrm{cat}}^{\mathrm{max}}$ is reached when $\Delta G_f \ll 0$). This threshold is maintained even for substrate concentrations as high as $10^{-5}$ M (Fig. S3 of the SI Appendix). Thus, catalytic function is optimized when thermodynamics strongly favor the ordered state. Interestingly, a fast conversion rate requires the enzyme–substrate interaction, characterized by the Michaelis constant $K_m$, to be much weaker than a standard protein–protein $K_d$. Enzymes can therefore also be thought of as a special case of extremely weak binding proteins, i.e., proteins that prefer ordered structures.

Fig. 1. Disorder distribution. Normalized histograms of the percentage of disordered residues (see Materials and Methods) in the sequence of human (H. sapiens), yeast (S. cerevisiae), and E. coli (K-12) proteins within the gene ontology (11) categories of “protein binding”, “catalytic activity”, and “transcription regulator activity”. The distributions after removing the overlap between the three categories are shown by the lower bars (shaded). All distributions are normalized to the total number of proteins in each category noted in the upper right corner of each frame. In humans, in contrast to the bias of transcription and catalytic proteins to be significantly more disordered and ordered, respectively, disorder in binding proteins is neither strongly favored nor disfavored. The statistical significance of these results, based on a Kolmogorov–Smirnov test (37), is $P < 10^{-150}$. In yeast, although binding and catalytic proteins show the same trend as occurs in higher eukaryotes, transcription proteins overall show no significant preference for order or disorder. In E. coli, all three functions show strikingly similar distributions favoring ordered structures. Similar distributions were found in other eukaryotic and prokaryotic genomes.
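The near-invariance of the catalytic threshold can be illustrated with a short calculation. This is a minimal sketch under stated assumptions rather than the authors' implementation: it evaluates the ratio $V_{\mathrm{cat}}/V_{\mathrm{cat}}^{\mathrm{max}}$ from Eq. 4 (in which $k_{\mathrm{cat}}$ and $c_p$ cancel) and bisects for the folding free energy at which it drops to 90%. RT = 0.593 kcal/mol, [S] = 1 μM, and the function names are illustrative assumptions.

```python
import math

RT = 0.593  # kcal/mol; assumes a temperature near 298 K (not stated in the text)

def catalytic_efficiency(km_c, dg_f, s=1e-6):
    """V_cat / V_cat^max from Eq. 4; k_cat and c_p cancel in the ratio.

    V_cat^max is the dg_f << 0 limit, where the factor (1 + exp(dg_f/RT)) tends to 1.
    """
    v = s / (km_c * (1.0 + math.exp(dg_f / RT)) + s)
    v_max = s / (km_c + s)
    return v / v_max

def catalytic_threshold(km_c, target=0.90, s=1e-6, lo=-10.0, hi=10.0):
    """Largest dg_f (kcal/mol) that still reaches the target efficiency (bisection)."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if catalytic_efficiency(km_c, mid, s) >= target:
            lo = mid
        else:
            hi = mid
    return lo

if __name__ == "__main__":
    for km_c in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
        dg = catalytic_threshold(km_c)
        print(f"Km_c = {km_c:.0e} M: 90% threshold at dG_f = {dg:+.2f} kcal/mol")
```

When $K_m^c \gg [S]$ the ratio reduces to $1/(1 + e^{\Delta G_f/RT})$, which no longer depends on $K_m^c$; this is why the threshold barely moves across five orders of magnitude in the Michaelis constant.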

Specificity of Promiscuous Interactions. Our model also demonstrates that disorder provides a mechanism to distinguish between two substrates that differ in binding affinity by a relatively small amount, say 1.5 kcal/mol (Fig. 3). For strong binding ($K_d^{\mathrm{exp}}$ small), the amount of complex formation with each substrate is almost indistinguishable. A positive $\Delta G_f$, however, can tune $K_d^{\mathrm{exp}}$ (Eq. 2) to maximize the discrimination between binding of the two substrates while at the same time maintaining a high level of binding to the higher-affinity substrate. Note that the experimental affinity required to bring about this optimal specificity is lower the higher the concentration of protein or substrate. Our finding is reminiscent of Schulz’s high-complementarity (or small $K_d^c$), low-affinity (or large $K_d^{\mathrm{exp}}$) rationalization of the flexibility of nucleotide binding proteins (10), which has also been applied in the context of signal transduction (9), as well as the suggestion of Dunker et al. (17) that disorder uncouples complementarity ($K_d^c$) and affinity ($K_d^{\mathrm{exp}}$). We note that here the quantitative theory defines “specificity” as simply providing better discrimination among similar physical interactions, a more common usage of the concept (14) that is likely to play a critical role in complex cellular networks.
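The tuning argument can be made concrete with a small scan over folding free energies. The sketch below is illustrative only and rests on several assumptions not fixed by the text: each substrate is treated in isolation (no competition), the 1.5 kcal/mol difference is applied to the complementary affinity as a factor $e^{1.5/RT}$, the stronger substrate has $K_d^c = 1$ nM, $c_p = c_s = 1$ μM, and RT = 0.593 kcal/mol. All function names are hypothetical.

```python
import math

RT = 0.593  # kcal/mol; assumes a temperature near 298 K (not stated in the text)

def kd_exp(kd_c, dg_f):
    """Eq. 2: experimental dissociation constant (M)."""
    return kd_c * (1.0 + math.exp(dg_f / RT))

def bound_fraction(kd, c=1e-6):
    """Eq. 3 with c_p = c_s = c, normalized by the strong-binding limit ([FS] -> c as kd -> 0)."""
    b = 2.0 * c + kd
    return 0.5 * (b - math.sqrt(b * b - 4.0 * c * c)) / c

def discrimination(dg_f, kd_c_strong=1e-9, ddg=1.5, c=1e-6):
    """Difference in normalized complex between two substrates whose
    binding free energies differ by ddg (kcal/mol), at a common dg_f."""
    kd_c_weak = kd_c_strong * math.exp(ddg / RT)
    strong = bound_fraction(kd_exp(kd_c_strong, dg_f), c)
    weak = bound_fraction(kd_exp(kd_c_weak, dg_f), c)
    return strong - weak

def best_stability(lo=-5.0, hi=10.0, steps=1500):
    """Grid search for the dg_f (kcal/mol) that maximizes the discrimination."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=discrimination)

if __name__ == "__main__":
    dg = best_stability()
    print(f"optimal dG_f = {dg:+.2f} kcal/mol")
    print(f"Kd_exp of the stronger substrate = {kd_exp(1e-9, dg):.2e} M")
    print(f"discrimination at optimum = {discrimination(dg):.2f}")
    print(f"discrimination at dG_f = -5 = {discrimination(-5.0):.2f}")
```

In this toy setting the optimum sits at a positive folding free energy, that is, a marginally disordered protein, which is the qualitative point of Fig. 3. The exact numbers depend on the assumed concentrations and on how "discrimination" is scored.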

Discussion

Our survey indicates that the distribution of the amount of disorder depends strongly on protein function, and a first-principles thermodynamic analysis explains the nature of this relationship. For proteins whose main function is to bind other proteins, the amount of disorder that can be tolerated without degrading function is quite broad, depending on the complementarity of the interaction. Catalytic proteins have a strong preference for a stable folded state with $\Delta G_f \lesssim -1$ kcal/mol, consistent with the notion that catalysis has strong conformational requirements, as conjectured by Pauling (18) in the prestructure age and more recently discussed by other researchers (see, e.g., ref. 19). Note, however, that although protein stability below the aforementioned threshold (Fig. 2B and Fig. S3B of the SI Appendix) does not improve catalysis any further (20), this preorganized state leaves ample room for conformational changes that might be required to bring about efficient catalysis. Finally, we show that disorder can be used to maximize the specificity of promiscuous interactions relevant to transcription and signal transduction.

Instead of rationalizing our findings in terms of adaptability or other processes that are not easily quantifiable, we restrict our discussion to the experimentally derived parameters defined in our models, making our predictions both experimentally and quantitatively more relevant. For instance, Fig. 3 shows that for μM concentrations, highly complementary complexes, say, $K_d^c \sim$ nM, will yield maximum discrimination if folding instability weakens the experimental affinity $K_d^{\mathrm{exp}}$ to the μM range. This extra discrimination is likely to play a role in the differential regulation of promiscuous binding domains such as SH2/3s, whose typical affinities agree with the predictions of the model (21). More interestingly, the theory also elucidates the dependence on concentration of the experimental affinity that optimizes specificity (Fig. 3).

Fig. 2. Binding and catalytic efficiency. (A) Ratio of complex concentration $[FS]_{\mathrm{bind}}$ as given by Eq. 3 to maximum concentration $[FS]_{\mathrm{bind}}^{\mathrm{max}}$ ($\Delta G_f \ll 0$). $c_p = c_s = 1$ μM. Vertical dash-dotted lines indicate the folding free energy for 90% (dashed lines for 97%) binding efficiency ($[FS]_{\mathrm{bind}}/[FS]_{\mathrm{bind}}^{\mathrm{max}}$) with $K_d^c = 10^{-5}$ and $10^{-10}$ M, respectively. To maintain high binding efficiency, weak binding requires negative $\Delta G_f$ (prefers order), whereas strong binding allows positive $\Delta G_f$ (tolerates disorder). (B) Fractional production rate for catalytic activity relative to maximum catalytic rate $V_{\mathrm{cat}}^{\mathrm{max}}$ ($\Delta G_f \ll 0$) as given by Eq. 4 ($[S] = 1$ μM). The vertical dash-dotted line indicates the folding free energy for 90% (dashed line for 97%) catalytic efficiency ($V_{\mathrm{cat}}/V_{\mathrm{cat}}^{\mathrm{max}}$) with all relevant $K_m^c$. To maintain high catalytic efficiency, negative $\Delta G_f$ (ordered structure) is required for the whole range of physiological parameters shown here. Note that to allow for fast conversion, enzyme–substrate interactions (characterized by the Michaelis constant $K_m$) are limited to much weaker interactions than those of binding proteins ($K_d$).

Fig. 3. Maximum discrimination in binding to similar substrates. The solid curve shows the equilibrium complex concentration $[FS]_{\mathrm{bind}}$ (Eq. 3) normalized by the strong binding limit $[FS]_{\mathrm{bind}}^{\mathrm{strong}}$ ($K_d^{\mathrm{exp}} \to 0$). $c_p = c_s$ is used without loss of generality. Each pair of vertical lines shows the relative amount of bound complex formed by two different substrates with a binding free energy difference of 1.5 kcal/mol. For strong binding, the complex concentration saturates, and there is almost no difference in the amount of complex formed by either substrate (dashed lines). On the other hand, decreasing the experimental binding affinity by destabilizing the folded state (F) enhances complex formation by the stronger binding substrate relative to the weaker one (dash-dotted lines).

The theory predicts that lower-affinity interactions are expected to involve proteins with less disorder, which may help explain why disorder is less prevalent in prokaryotes (type III) than eukaryotes (types I and II). Indeed, the strikingly similar distributions for E. coli shown in Fig. 1 suggest that disorder does not play a role in function (similar data are observed for other prokaryotes). Without disorder, protein binding efficiency would imply $K_d^{\mathrm{exp}} \gtrsim 10^{-7}$ M. A survey of the protein–ligand interactions in the Protein Data Bank (PDB) PDBbind database (22) (Fig. 4) confirms not only that bacterial proteins may indeed bind small ligand molecules more weakly than human proteins but also that there is a sharp drop in the number of E. coli ligands (20% compared with 50% for human) with $K_d^{\mathrm{exp}}$ smaller than the predicted threshold of $10^{-7}$ M. From the point of view of evolution, this scarcity of high-affinity complexes is also consistent with the intuition that short-lived microorganisms have less need to form long-lived complexes.

It is important to stress that protein-functional assignments are still incomplete (11). Indeed, for the genomes we analyzed, only a subset of all proteins has at least one assigned function, e.g., ∼75%, 88%, and 32% of human, yeast, and E. coli, respectively. As already mentioned, our analysis encompasses motifs participating in the molecular interactions. Hence, for multisite/domain proteins a specific function should not necessarily require folding of the entire protein. Fig. 5 further expands on the amount of intrinsic disorder in multifunctional proteins as well as on the correlation of disorder and protein length. For the most part, we find that proteins with both binding and transcription functions have a disorder distribution similar to transcription, whereas the distribution for proteins with binding and catalytic functions is more similar to catalytic. For these subsets, we failed to observe significant correlations between disorder and protein length. For E. coli, most proteins are ordered. However, the few highly disordered proteins involved in transcription are all relatively small, resulting in a weak negative correlation. The small sets of proteins with both catalytic and transcription functions as well as all three functions (including binding) show a positive correlation with length while seemingly encompassing a combination of the disorder distributions of each individual functional category. Further analysis of disorder as a local property of the functioning site is likely to reveal insights into how evolution has coupled structure and functions to cope with the increasing complexity of higher organisms.

Fig. 4. Distributions of experimentally measured protein–ligand binding affinities. Data are taken from the PDBbind database (version 2007). The overall distributions are consistent with our hypothesis that the lack of disorder in prokaryotes could be due to their relatively weaker binding affinities ($\gtrsim 10^{-7}$ M).

Fig. 5. Intrinsic disorder as a function of protein length for proteins with (nonoverlapping) binding, transcription, and catalytic function (large circles), and for proteins with more than one function, as indicated by the colored arrows from each individual functional category (smaller circles). For each polar coordinate plot, the radial and angular (counterclockwise) coordinates correspond to protein length in a log-scale and the percentage of residues that are classified as disordered for the protein (as in Fig. 1), respectively. For clarity, percent disorder and protein length are labeled only in transcription and catalysis plots, respectively. Indicated outside each circle is the percentage of proteins in each functional category relative to the total number of proteins for which the function has been annotated for each organism (i.e., 15,260, 5,900, and 1,362 for human, yeast, and E. coli, respectively). The figure shows that disorder does not correlate with protein length for well-sampled functional categories. The analysis of disorder in multifunctional proteins also reveals interesting patterns. Specifically, binding does not seem to impact the level of disorder of either transcription or catalytic proteins, whereas disorder in proteins with both catalytic and transcription functionalities appears to follow either one of the patterns found for the individual functions.

Ultimately, the theory might provide more subtle quantitative predictions for the interplay between disorder and function for specific proteins. Although current experimental technologies cannot readily analyze weakly stable proteins, let alone positive folding free energies, computational techniques might help to fill this gap. Although there are other aspects not considered here, such as the role of disorder in aggregation and degradation, our findings show how disorder has opened a new dimension in the regulation of molecular interactions for eukaryotes and, most certainly, humans. Collectively, our findings suggest that protein folding should be viewed as a continuum in which folding stability is just one more parameter that evolution uses to optimize function.
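To make the idea of folding stability as a continuous parameter concrete, the sketch below evaluates the folded population of a generic two-state (folded/unfolded) equilibrium as a function of the folding free energy. This is the textbook Boltzmann relation, not necessarily the exact model solved in this work; the function name, units, and default temperature are assumptions.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def folded_fraction(delta_g_fold, temperature=298.15):
    """Equilibrium fraction of folded molecules in a two-state model.

    delta_g_fold: G(folded) - G(unfolded) in kcal/mol. A positive
    value means the unfolded (disordered) state is favored.
    """
    return 1.0 / (1.0 + math.exp(delta_g_fold / (R_KCAL * temperature)))


# Scan from stable (negative) to disordered (positive) free energies.
for dg in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(f"dG_fold = {dg:+.1f} kcal/mol -> folded fraction = {folded_fraction(dg):.3f}")
```

In this picture there is no sharp boundary between ordered and disordered proteins: the folded fraction changes smoothly through 0.5 as the folding free energy crosses zero.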

Materials and Methods

Gene Ontology and Genome Databases. To assign protein function, we use the gene ontology classification, in which protein binding, catalytic activity, and transcription regulator activity are the three largest functional categories. The gene ontology annotations and protein sequences were from the member databases of the Gene Ontology Consortium. Gene ontology annotations at the European Bioinformatics Institute (23) for sequences in the Swiss-Prot database (24) were used for human (Homo sapiens), mouse (Mus musculus), zebrafish (Danio rerio), chicken (Gallus gallus), and A. thaliana; the Saccharomyces Genome Database (25) for yeast (Saccharomyces cerevisiae); EcoCyc and EcoliHub (26) for E. coli (K-12); the Gramene database (27) for rice (Oryza sativa); FlyBase (28) for fruit fly (Drosophila melanogaster); WormBase (29) for C. elegans; dictyBase (30) for D. discoideum; the Schizosaccharomyces pombe GeneDB database (31) for S. pombe; and the TIGR database (32) for B. anthracis and P. fluorescens. The data were current as of January 2009.

Disorder Prediction. For each protein, the percentage of disordered amino acids was estimated by using the VSL2B predictor (33), which was trained with experimental data by using machine learning techniques. The method has been validated in comprehensive blind experiments (33). The predictor uses the protein sequence as the input and gives the probability that each amino acid is in a disordered region. A probability ≥0.5 predicts a residue to be disordered. We also verified the distributions with two other predictors, FoldIndex (34) and DisEMBL (35) (see Fig. S1 of the SI Appendix), and similar results were obtained.
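The per-protein disorder percentage described above reduces to a simple count once per-residue probabilities are available. The sketch below assumes the probabilities have already been produced by a predictor and are supplied as a plain list; it does not call VSL2B or any other tool, and the function name is hypothetical. Counting a score exactly at the 0.5 cutoff as disordered is an assumption about how the boundary is handled.

```python
def percent_disordered(probabilities, threshold=0.5):
    """Percentage of residues whose disorder probability meets the cutoff.

    probabilities: one value in [0, 1] per residue, as output by a
    per-residue disorder predictor (input format is an assumption).
    threshold: cutoff for calling a residue disordered; scores equal
    to the cutoff are counted as disordered here (assumed convention).
    """
    if not probabilities:
        raise ValueError("at least one residue is required")
    n_disordered = sum(1 for p in probabilities if p >= threshold)
    return 100.0 * n_disordered / len(probabilities)


# Made-up scores for a 10-residue peptide: 4 residues reach the cutoff.
example_scores = [0.12, 0.30, 0.48, 0.51, 0.77, 0.93, 0.64, 0.41, 0.22, 0.09]
example_percent = percent_disordered(example_scores)
```

Repeating this calculation over every protein in a functional category yields the disorder distributions that are compared across categories and organisms.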

Protein-Ligand Binding Affinity. The PDBbind database provides the experimentally measured binding affinities of protein-ligand complexes. Organism information was obtained from the PDB database (36) by using the PDB codes provided by PDBbind.
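Measured affinities reported as dissociation constants can be placed on a free energy scale with the standard relation ΔG = RT ln(Kd / 1 M). The sketch below is a generic conversion, not a PDBbind utility; the function name, the kcal/mol units, and the default temperature of 298.15 K are assumptions.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def binding_free_energy(kd_molar, temperature=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation constant.

    kd_molar: dissociation constant in mol/L, relative to a 1 M
    standard state. Tighter binding (smaller Kd) gives a more
    negative free energy.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature * math.log(kd_molar)


# Example: a Kd of 1e-7 M corresponds to roughly -9.5 kcal/mol.
dg_example = binding_free_energy(1e-7)
```

Expressing affinities this way makes it straightforward to compare binding strength with folding stability, since both are then in the same energy units.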

ACKNOWLEDGMENTS. We are grateful to Drs. Ivet Bahar, Jeffrey Brodsky, and George Makhatadze for valuable comments and suggestions on the manuscript. This work was partially supported by National Science Foundation Grant MCB-0444291/0744077.

1. Wright PE, Dyson HJ (1999) Intrinsically unstructured proteins: Re-assessing the protein structure-function paradigm. J Mol Biol 293:321–331.

2. Dunker AK, Obradovic Z, Romero P, Garner EC, Brown CJ (2000) Intrinsic protein disorder in complete genomes. Genome Informatics 11:161–171.

3. Haynes C, et al. (2006) Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comput Biol 2:e100.

4. Romero PR, et al. (2006) Alternative splicing in concert with protein intrinsic disorder enables increased functional diversity in multicellular organisms. Proc Natl Acad Sci USA 103:8390–8395.

5. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337:635–645.

6. Xie H, et al. (2007) Functional anthology of intrinsic disorder. 1. Biological processes and functions of proteins with long disordered regions. J Proteome Res 6:1882–1898.

7. Iakoucheva LM, Brown CJ, Lawson JD, Obradovic Z, Dunker AK (2002) Intrinsic disorder in cell-signaling and cancer-associated proteins. J Mol Biol 323:573–584.

8. Shoemaker BA, Portman JJ, Wolynes PG (2000) Speeding molecular recognition by using the folding funnel: The fly-casting mechanism. Proc Natl Acad Sci USA 97:8868–8873.

9. Dyson HJ, Wright PE (2005) Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol 6:197–208.

10. Schulz GE (1979) In Molecular Mechanism of Biological Recognition (Elsevier, Amsterdam), pp. 79–94.

11. Ashburner M, et al. (2000) Gene ontology: Tool for the unification of biology. Nat Genet 25:25–29.

12. Zwanzig R (1997) Two-state models of protein folding kinetics. Proc Natl Acad Sci USA 94:148–150.

13. Tsai CJ, Ma B, Sham YY, Kumar S, Nussinov R (2001) Structured disorder and conformational selection. Proteins 44:418–427.

14. Spolar RS, Record MT Jr (1994) Coupling of local folding to site-specific binding of proteins to DNA. Science 263:777–784.

15. Rajamani D, Thiel S, Vajda S, Camacho CJ (2004) Anchor residues in protein–protein interactions. Proc Natl Acad Sci USA 101:11287–11292.

16. Camacho CJ, Katsumata Y, Ascherman DP (2008) Structural and thermodynamic approach to peptide immunogenicity. PLoS Comput Biol 4:e1000231.

17. Dunker AK, et al. (1998) Protein disorder and the evolution of molecular recognition: Theory, predictions and observations. Pac Symp Biocomput 473–484.

18. Pauling L (1948) Nature of forces between large molecules of biological interest. Nature 161:707–709.

19. Yang LW, Bahar I (2005) Coupling between catalytic site and collective dynamics: A requirement for mechanochemical activity of enzymes. Structure 13:893–904.

20. Shoichet BK, Baase WA, Kuroki R, Matthews BW (1995) A relationship between protein stability and protein function. Proc Natl Acad Sci USA 92:452–456.

21. Ladbury JE, Arold S (2000) Searching for specificity in SH domains. Chem Biol 7:R3–R8.

22. Wang R, Fang X, Lu Y, Yang CY, Wang S (2005) The PDBbind database: Methodologies and updates. J Med Chem 48:4111–4119.

23. Barrell D, et al. (2009) The GOA database in 2009—An integrated gene ontology annotation resource. Nucleic Acids Res 37:D396–D403.

24. UniProt Consortium (2008) The universal protein resource (UniProt). Nucleic Acids Res 36:D190–D195.

25. Hong EL, et al. (2008) Gene ontology annotations at SGD: New data sources and annotation methods. Nucleic Acids Res 36:D577–D581.

26. Keseler IM, et al. (2009) EcoCyc: A comprehensive view of Escherichia coli biology. Nucleic Acids Res 37:D464–D470.

27. Liang C, et al. (2008) Gramene: A growing plant comparative genomics resource. Nucleic Acids Res 36:D947–D953.

28. Drysdale R, FlyBase Consortium (2008) FlyBase: A database for the Drosophila research community. Methods Mol Biol 420:45–59.

29. Bieri T, et al. (2007) WormBase: New content and better access. Nucleic Acids Res 35:D506–D510.

30. Fey P, et al. (2009) dictyBase—A Dictyostelium bioinformatics resource update. Nucleic Acids Res 37:D515–D519.

31. Aslett M, Wood V (2006) Gene ontology annotation status of the fission yeast genome: Preliminary coverage approaches 100%. Yeast 13:913–919.

32. Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31:371–373.

33. Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z (2006) Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics 7:208.

34. Prilusky J, et al. (2005) FoldIndex: A simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics 21:3435–3438.

35. Linding R, et al. (2003) Protein disorder prediction: Implications for structural proteomics. Structure 11:1453–1459.

36. Berman HM, et al. (2000) The protein data bank. Nucleic Acids Res 28:235–242.

37. DeGroot MH (1991) In Probability and Statistics (Addison-Wesley, Reading, MA), 3rd Ed.

Liu et al. PNAS | November 24, 2009 | vol. 106 | no. 47 | 19823

BIOPHYSICS AND COMPUTATIONAL BIOLOGY
