arXiv:1211.1281v2 [q-bio.QM] 12 Jan 2013




Improved contact prediction in proteins:

Using pseudolikelihoods to infer Potts models

Magnus Ekeberg1, Cecilia Lövkvist3, Yueheng Lan5, Martin Weigt6, Erik Aurell2,3,4,†∗

(Dated: January 15, 2013)

Spatially proximate amino acids in a protein tend to coevolve. A protein's 3D structure hence leaves an echo of correlations in the evolutionary record. Reverse engineering 3D structures from such correlations is an open problem in structural biology, pursued with increasing vigor as more and more protein sequences continue to fill the data banks. Within this task lies a statistical inference problem, rooted in the following: correlation between two sites in a protein sequence can arise from firsthand interaction, but can also be network-propagated via intermediate sites; observed correlation is not enough to guarantee proximity. To separate direct from indirect interactions is an instance of the general problem of inverse statistical mechanics, where the task is to learn model parameters (fields, couplings) from observables (magnetizations, correlations, samples) in large systems. In the context of protein sequences, the approach has been referred to as direct-coupling analysis. Here we show that the pseudolikelihood method, applied to 21-state Potts models describing the statistical properties of families of evolutionarily related proteins, significantly outperforms existing approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques. This improved performance also relies on a modified score for the coupling strength. The results are verified using known crystal structures of specific sequence instances of various protein families. Code implementing the new method can be found at http://plmdca.csc.kth.se/.

PACS numbers: 02.50.Tt – Inference methods, 87.10.Vg – Biological information, 87.15.Qt – Sequence analysis, 87.14.E- – Proteins

I. INTRODUCTION

In biology, new and refined experimental techniques have triggered a rapid increase in data availability during the last few years. Such progress needs to be accompanied by the development of appropriate statistical tools to treat growing data sets. An example of a branch undergoing intense growth in the amount of existing data is protein structure prediction (PSP), which, due to the strong relation between a protein's structure and its function, is one central topic in biology. As we shall see, one can accurately estimate the 3D structure of a protein by identifying which amino-acid positions in its chain are statistically coupled over evolutionary time scales [1–5]. Much of the experimental output is today readily accessible through public databases such as Pfam [6], which collects over 13,000 families of evolutionarily related protein domains, the largest of them containing more than $2 \times 10^5$ different amino-acid sequences. Such databases allow researchers to easily access data, to extract information from it, and to compare their results.

∗ 1 Engineering Physics Program, KTH Royal Institute of Technology, 100 44 Stockholm, Sweden; 2 ACCESS Linnaeus Center, KTH, Sweden; 3 Department of Computational Biology, AlbaNova University Center, 106 91 Stockholm, Sweden; 4 Aalto University School of Science, Helsinki, Finland; 5 Department of Physics, Tsinghua University, Beijing 100084, P. R. China; 6 Université Pierre et Marie Curie, UMR7238 – Laboratoire de Génomique des Microorganismes, 15 rue de l'École de Médecine, 75006 Paris, France; †[email protected]

A recurring difficulty when dealing with interacting systems is distinguishing direct interactions from interactions mediated via multi-step paths across other elements. Correlations are, in general, straightforward to compute from raw data, whereas the parameters describing the true causal ties are not. The network of direct interactions can be thought of as hidden beneath observable correlations, and untwisting it is a task of inherent intricacy. In PSP, using mathematical means to dispose of the network-mediated correlations observable in alignments of evolutionarily related (and structurally conserved) proteins is necessary to get optimal results [1, 2, 7–9], and yields improvements worth the computational strain put on the analysis. This approach to PSP, which we refer to as direct-coupling analysis (DCA), is the focus of this paper.

In a more general setting, the problem of inferring interactions from observations of instances amounts to inverse statistical mechanics, a field which has been intensively pursued in statistical physics over the last decade [10–23]. Similar tasks were formulated earlier in statistics and machine learning, where they have been called model learning and inference [24–27]. To illustrate this concretely, let us start from an Ising model,

$$P(\sigma_1, \ldots, \sigma_N) = \frac{1}{Z} \exp\left( \sum_{i=1}^{N} h_i \sigma_i + \sum_{1 \le i < j \le N} J_{ij} \sigma_i \sigma_j \right) \qquad (1)$$

and its magnetizations $m_i = \partial_{h_i} \log Z$ and connected correlations $c_{ij} = \partial_{J_{ij}} \log Z - m_i m_j$. Counting the number of observables ($m_i$ and $c_{ij}$) and the number of parameters ($h_i$ and $J_{ij}$), it is clear that perfect knowledge of the magnetizations and correlations should suffice to determine the external fields and the couplings exactly. It is, however, also clear that such a process must be computationally expensive, since it requires the computation of the partition function $Z$ for an arbitrary set of parameters. The exact but iterative procedure known as Boltzmann machines [28] does in fact work on small systems, but it is out of the question for the problem sizes considered in this paper. On the other hand, the mean-field equations of (1) read [29–31]:

$$\tanh^{-1} m_i = h_i + \sum_{j} J_{ij} m_j. \qquad (2)$$

From (2) and the fluctuation-dissipation relations, an equation can be derived connecting the coupling coefficients $J_{ij}$ and the correlation matrix $c = (c_{ij})$ [10]:

$$J_{ij} = -(c^{-1})_{ij}. \qquad (3)$$
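As a minimal numerical sketch of Eq. (3) (our illustration, not part of the paper), the inversion can be applied directly to a sample of spin configurations. The data below are synthetic and purely illustrative; in a real application they would be equilibrium samples of model (1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: B spin configurations (+/-1) at N sites.
B, N = 5000, 10
sigma = rng.choice([-1, 1], size=(B, N))

# Empirical magnetizations m_i and connected correlations c_ij.
m = sigma.mean(axis=0)
c = sigma.T @ sigma / B - np.outer(m, m)

# Mean-field inversion, Eq. (3): J_ij = -(c^{-1})_ij for i != j.
J = -np.linalg.inv(c)
np.fill_diagonal(J, 0.0)
```

Note that this requires the empirical correlation matrix to be of full rank, which is the sampling caveat discussed in the text.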

Equations (2) and (3) exemplify typical aspects of inverse statistical mechanics, and of inference in large systems in general. On one hand, the parameter reconstruction using these two equations is not exact. It is only approximate, because the mean-field equations (2) are themselves only approximate. It also demands reasonably good sampling, as the matrix of correlations is not invertible unless it is of full rank, and small noise on its $O(N^2)$ entries may result in large errors in estimating the $J_{ij}$. On the other hand, this method is fast, as fast as inverting a matrix, because one does not need to compute $Z$. Apart from mean-field methods as in (2), approximate methods recently used to solve the inverse Ising problem can be grouped as expansions in correlations and clusters [16, 19], methods based on the Bethe approximation [17, 18, 20–22], and the pseudolikelihood method [23, 27].

For PSP, it is not the Ising model but a 21-state Potts model which is pertinent [1]: the model is to be learned such that it represents the statistical features of large multiple sequence alignments of homologous (evolutionarily related) proteins, and reproduces the statistics of correlated amino-acid substitutions. This can be done with the Potts equivalent of Eq. (1), i.e., using a model with pairwise interactions. As will be detailed below, strong interactions can be interpreted as indicators of spatial vicinity of residues in the three-dimensional protein fold, even if the residues are well separated along the sequence, thus linking evolutionary sequence statistics with protein structure. But which of all the inference methods in inverse statistical mechanics, machine learning, or statistics is most suitable for treating real protein sequence data? How do the test results obtained for independently generated equilibrium configurations of Potts models translate to the case of protein sequences, which are neither independent nor equilibrium configurations of any well-defined statistical-physics model? The main goal of this paper is to move towards an answer to this question by showing that the pseudolikelihood method is a very powerful means to perform DCA, with a prediction accuracy considerably outperforming that of methods previously assessed.

The paper is structured as follows: in Section II, we review the ideas underlying PSP and DCA and explain the biological hypotheses linking protein 3D structure to correlations in amino-acid sequences. We also review earlier approaches to DCA. In Section III, we describe the Potts model in the context of DCA and the properties of exponential families. We further detail a maximum-likelihood (ML) approach as brought to bear on the inverse Potts problem and discuss in more detail why this is impractical for realistic system sizes, and we introduce, similarly to (3) above, the inverse Potts mean-field algorithm for DCA (mfDCA) and a pseudolikelihood maximization procedure (plmDCA). This section also covers algorithmic details of both methods, such as regularization and sequence reweighting. A further important issue is the selection of an interaction score, which reduces coupling matrices to a scalar score and allows for ranking of couplings according to their 'strength'. In Section IV, we present results from prediction experiments using mfDCA and plmDCA, assessed against known crystal structures. In Section V, we summarize our findings, put their implications into context, and discuss possible future developments. The appendixes contain additional material supporting the main text.

II. PROTEIN STRUCTURE PREDICTION AND DIRECT-COUPLING ANALYSIS

Proteins are essential players in almost all biological processes. Primarily, proteins are linear chains, each site being occupied by 1 out of 20 different amino acids. Their function relies, however, on the protein fold, which refers to the 3D conformation into which the amino-acid chain curls. This fold guarantees, e.g., that the right amino acids are exposed at the right positions on the protein surface, thus forming biochemically active sites, or that the correct pairs of amino acids are brought into contact to keep the fold thermodynamically stable.

Experimentally determining the fold, using e.g. x-ray crystallography or NMR tomography, is still a rather costly and time-consuming process. On the other hand, every newly sequenced genome results in a large number of newly predicted proteins. The number of sequenced organisms now exceeds 3,700, and continues to grow exponentially (genomesonline.org [32]). The most prominent database for protein sequences, Uniprot (uniprot.org [33]), lists about 25,000,000 different protein sequences, whereas the number of experimentally determined protein structures is only around 85,000 (pdb.org [34]).

However, the situation of structural biology is not as hopeless as these numbers might suggest. First, proteins have a modular architecture; they can be subdivided into domains which, to a certain extent, fold and evolve as units. Second, domains of a common evolutionary origin, i.e., so-called homologous domains, are expected to almost share their 3D structure and to have related function. They can therefore be collected in protein domain families: the Pfam database (pfam.sanger.ac.uk [6]) lists almost 14,000 different domain families, and the number of sequences collected in each single family ranges roughly from $10^2$ to $10^5$. In particular the larger families, with more than 1,000 members, are of interest to us, as we argue that their natural sequence variability contains important statistical information about the 3D structure of their member proteins, and can be exploited to successfully address the PSP problem.

FIG. 1. (Color online) Left panel: small MSA with two positions of correlated amino-acid occupancy. Right panel: hypothetical corresponding spatial conformation, bringing the two correlated positions into direct contact.

Two types of data accessible via the Pfam database are especially important to us. The first is the multiple sequence alignment (MSA), a table of the amino-acid sequences of all the protein domains in the family, lined up to be as similar as possible. A (small and illustrative) example is shown in Fig. 1 (left panel). Normally, not all members of a family can be lined up perfectly, and the alignment therefore contains both amino acids and gaps. At some positions, an alignment will be highly specific (cf. the second, fully conserved column in Fig. 1), while at others it will be more variable. The second data type concerns the crystal structure of one or several members of a family. Not all families provide this second type of data. We discuss its use for an a posteriori assessment of our inference results in detail in Sec. IV.

The Potts-model-based inference uses only the first data type, i.e., the sequence data. Small spatial separation between amino acids in a protein, cf. the right panel of Fig. 1, encourages co-occurrence of favorable amino-acid combinations, cf. the left panel of Fig. 1. This spices the sequence record with correlations, which can be reliably determined in sufficiently large MSAs. However, the use of such correlations for predicting 3D contacts as a first step to solve the PSP problem remained of limited success [35–37], since they can be induced both by direct interactions (amino acid A is close to amino acid B) and by indirect interactions (amino acids A and B are both close to amino acid C). Lapedes et al. [38] were the first to address, in a purely theoretical setting, these ambiguities of a correlation-based route to protein sequence analysis, and these authors also outlined a maximum-entropy approach to get at direct interactions. Weigt et al. [1] successfully executed this program, subsequently called direct-coupling analysis: the accuracy in predicting contacts strongly increases when direct interactions are used instead of raw correlations.

To computationally solve the task of inferring interactions in a Potts model, [1] employed a generalization of the iterative message-passing algorithm susceptibility propagation, previously developed for the inverse Ising problem [17]. Methods in this class are expected to outperform mean-field-based reconstruction methods similar to (3) if the underlying graph of direct interactions is locally close to tree-like, an assumption which may or may not be true in a given application such as PSP. A substantial drawback of susceptibility propagation as used in [1] is that it requires a rather large number of auxiliary variables, so that DCA could only be carried out on protein sequences that are not too long. In [2], this obstacle was overcome by using instead a simpler mean-field method, i.e., the generalization of (3) to a 21-state Potts model. As discussed in [2], this broadens the reach of DCA to practically all families currently in Pfam. It improves the computational speed by a factor of about $10^3$–$10^4$, and it appears also to be more accurate than the susceptibility-propagation-based method of [1] in predicting contact pairs. The reason behind this third advantage of mean-field over susceptibility propagation as an approximate method of DCA is unknown at this time.

Pseudolikelihood maximization (PLM) is an alternative method, developed in mathematical statistics to approximate maximum-likelihood inference, which breaks down the a priori exponential time complexity of computing partition functions in exponential families [39]. On the inverse Ising problem it was first used by Ravikumar et al. [27], albeit in the context of graph sign-sparsity reconstruction; two of us showed recently that it outperforms many other approximate inverse Ising schemes on the Sherrington-Kirkpatrick model, and in several other examples [23]. Although this paper reports the first use of the PLM method in DCA, the idea of using pseudolikelihoods for PSP is not completely novel. Balakrishnan et al. [8] devised a version of this idea, but using a set-up rather different from that in [2] regarding, for example, what portions of the databases and which measures of prediction accuracy were used; it was also not couched in the language of inverse statistical mechanics. While a competitive evaluation between [2] and [8] is still open, we have not attempted such a comparison in this work.

Other ways of deducing direct interactions in PSP, not motivated from the Potts model but in somewhat similar probabilistic settings, have been suggested in the last few years. A fast method utilizing Bayesian networks was provided by Burger and van Nimwegen [7]. More recently, Jones et al. [9] introduced a procedure called PSICOV (Protein Sparse Inverse COVariance). While DCA and PSICOV both appear capable of outperforming the Bayesian network approach [2, 9], their relative efficiency is currently open to investigation, and is not assessed in this work.

Finally, predicting amino-acid contacts is not only a goal in itself, but also a means to assemble protein complexes [40, 41] and to predict full 3D protein structures [3, 4, 42]. Such tasks require additional work, using the DCA results as input, and are outside the scope of the present paper.

III. METHOD DEVELOPMENT

A. The Potts model

Let $\sigma = (\sigma_1, \sigma_2, \cdots, \sigma_N)$ represent the amino-acid sequence of a domain with length $N$. Each $\sigma_i$ takes on values in $\{1, 2, \ldots, q\}$, with $q = 21$: one state for each of the 20 naturally occurring amino acids, and one additional state to represent gaps. Thus, an MSA with $B$ aligned sequences from a domain family can be written as an integer array $\{\sigma^{(b)}\}_{b=1}^{B}$, with one row per sequence and one column per chain position. Given an MSA, the empirical individual and pairwise frequencies can be calculated as

$$f_i(k) = \frac{1}{B} \sum_{b=1}^{B} \delta(\sigma_i^{(b)}, k), \qquad f_{ij}(k,l) = \frac{1}{B} \sum_{b=1}^{B} \delta(\sigma_i^{(b)}, k)\,\delta(\sigma_j^{(b)}, l), \qquad (4)$$

where $\delta(a,b)$ is the Kronecker symbol, taking value 1 if both arguments are equal and 0 otherwise. $f_i(k)$ is hence the fraction of sequences for which the entry at position $i$ is amino acid $k$, gaps counted as a 21st amino acid. Similarly, $f_{ij}(k,l)$ is the fraction of sequences in which the position pair $(i,j)$ holds the amino-acid combination $(k,l)$. Connected correlations are given as

$$c_{ij}(k,l) = f_{ij}(k,l) - f_i(k)\, f_j(l). \qquad (5)$$
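The frequency counts of Eqs. (4) and (5) translate directly into array operations. The sketch below is our own illustration (not the authors' code), using a 0-based state convention where the text uses states $1, \ldots, q$; the function names are hypothetical.

```python
import numpy as np

q = 21  # 20 amino acids + 1 gap state

def frequencies(msa, q=q):
    """Empirical single-site and pair frequencies, Eq. (4), from an
    integer MSA array of shape (B, N) with entries in 0..q-1."""
    B, N = msa.shape
    # One-hot encoding: X[b, i, k] = delta(sigma_i^(b), k).
    X = np.eye(q)[msa]                          # shape (B, N, q)
    f1 = X.mean(axis=0)                         # f_i(k), shape (N, q)
    f2 = np.einsum('bik,bjl->ijkl', X, X) / B   # f_ij(k,l), shape (N, N, q, q)
    return f1, f2

def connected_correlations(f1, f2):
    """c_ij(k,l) = f_ij(k,l) - f_i(k) f_j(l), Eq. (5)."""
    return f2 - np.einsum('ik,jl->ijkl', f1, f1)
```

The one-hot tensor makes both counts a single contraction, at the cost of $O(B N q)$ memory; a loop over position pairs would trade speed for memory on long domains.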

A (generalized) Potts model is the simplest probabilistic model $P(\sigma)$ which can reproduce the empirically observed $f_i(k)$ and $f_{ij}(k,l)$. In analogy to (1), it is defined as

$$P(\sigma) = \frac{1}{Z} \exp\left( \sum_{i=1}^{N} h_i(\sigma_i) + \sum_{1 \le i < j \le N} J_{ij}(\sigma_i, \sigma_j) \right), \qquad (6)$$

in which $h_i(\sigma_i)$ and $J_{ij}(\sigma_i, \sigma_j)$ are parameters to be determined through the constraints

$$P(\sigma_i = k) = \sum_{\sigma:\,\sigma_i = k} P(\sigma) = f_i(k), \qquad P(\sigma_i = k, \sigma_j = l) = \sum_{\sigma:\,\sigma_i = k,\,\sigma_j = l} P(\sigma) = f_{ij}(k,l). \qquad (7)$$

It is immediate that the probabilistic model which maximizes the entropy while satisfying Eq. (7) must take the Potts-model form. Finding a Potts model which matches empirical frequencies and correlations is therefore referred to as maximum-entropy inference [43, 44]; cf. also [1, 38] for a formulation in terms of protein-sequence modeling.

On the other hand, we can use Eq. (6) as a variational ansatz and maximize the probability of the input data set $\{\sigma^{(b)}\}_{b=1}^{B}$ with respect to the model parameters $h_i(\sigma)$ and $J_{ij}(\sigma, \sigma')$; this maximum-likelihood perspective is used in the following. We note that the Ising and the Potts models (and most models which are normally considered in statistical mechanics) are examples of exponential families, and have the property that means and correlations are sufficient statistics [45–47]. Given unlimited computing power to determine $Z$, reconstruction cannot be done better using all the data than using only the (empirical) means and (empirical) correlations. It is only when one cannot compute $Z$ exactly, and has to resort to approximate methods, that using all the data directly can bring any advantage.

B. Model parameters and gauge invariance

The total number of parameters of Eq. (6) is $Nq + \frac{N(N-1)}{2} q^2$, but, in fact, the model as it stands is overparameterized in the sense that distinct parameter sets can describe the same probability distribution. It is easy to see that the number of nonredundant parameters is $N(q-1) + \frac{N(N-1)}{2}(q-1)^2$; cf. an Ising model ($q = 2$), which has $\frac{N(N+1)}{2}$ parameters if written as in Eq. (1) but would have $2N^2$ parameters if written in the form of Eq. (6).
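These counts are easy to check numerically; $N = 100$ below is an arbitrary illustrative domain length, not a value from the paper's data.

```python
# Parameter counts for the Potts model (6), Sec. III B.
N, q = 100, 21  # illustrative domain length and number of Potts states

total = N * q + N * (N - 1) // 2 * q**2
nonredundant = N * (q - 1) + N * (N - 1) // 2 * (q - 1) ** 2
print(total, nonredundant)

# Sanity check against the Ising case (q = 2): the nonredundant count
# reduces to N(N+1)/2, the number of parameters of Eq. (1).
ising_nonredundant = lambda n: n * 1 + n * (n - 1) // 2 * 1
assert all(ising_nonredundant(n) == n * (n + 1) // 2 for n in range(1, 50))
```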

A gauge choice for the Potts model, which eliminates the overparametrization in a similar manner as in the Ising model (and reduces to that case for $q = 2$), is

$$\sum_{s=1}^{q} J_{ij}(k,s) = \sum_{s=1}^{q} J_{ij}(s,l) = \sum_{s=1}^{q} h_i(s) = 0, \qquad (8)$$

for all $i$, $j(>i)$, $k$, and $l$. Alternatively, we can choose a gauge where one state, say the $q$th, is special, and measure all interaction energies with respect to this value, i.e.,

$$J_{ij}(q, l) = J_{ij}(k, q) = h_i(q) = 0, \qquad (9)$$

for all $i$, $j(>i)$, $k$, and $l$, cf. [2]. This gauge choice corresponds to a lattice-gas model with $q - 1$ different particle types and a maximum occupation number of one.

Using either (8) or (9), reconstruction is well-defined, and it is straightforward to translate results obtained in one gauge to the other.


C. The inverse Potts problem

Given a set of independent equilibrium configurations $\{\sigma^{(b)}\}_{b=1}^{B}$ of the model Eq. (6), the ordinary statistical approach to inferring the parameters $\{h, J\}$ would be to let those parameters maximize the likelihood (i.e., the probability of generating the data set for a given set of parameters). This is equivalent to minimizing the (rescaled) negative log-likelihood function

$$l = -\frac{1}{B} \sum_{b=1}^{B} \log P(\sigma^{(b)}). \qquad (10)$$

For the Potts model (6), this becomes

$$l(\{h, J\}) = \log Z - \sum_{i=1}^{N} \sum_{k=1}^{q} f_i(k)\, h_i(k) - \sum_{1 \le i < j \le N} \sum_{k,l=1}^{q} f_{ij}(k,l)\, J_{ij}(k,l). \qquad (11)$$

$l$ is differentiable, so minimizing it means looking for a point at which $\partial_{h_i(k)} l = 0$ and $\partial_{J_{ij}(k,l)} l = 0$. Hence, ML estimates will satisfy

$$\partial_{h_i(k)} \log Z - f_i(k) = 0, \qquad \partial_{J_{ij}(k,l)} \log Z - f_{ij}(k,l) = 0. \qquad (12)$$

To achieve this minimization computationally, we need to be able to calculate the partition function $Z$ of Eq. (6) for any realization of the parameters $\{h, J\}$. This problem is computationally intractable for any reasonable system size. Approximate minimization is essential, and we will show that even relatively simple approximation schemes lead to accurate PSP results.

D. Naive mean-field inversion

The mfDCA algorithm in [2] is based on the simplest and computationally most efficient approximation, i.e., naive mean-field inversion (NMFI). It starts from the proper generalization of (2), cf. [48], and then uses linear response: the $J$'s in the lattice-gas gauge Eq. (9) become

$$J_{ij}^{\mathrm{NMFI}}(k,l) = -(C^{-1})_{ab}, \qquad (13)$$

where $a = (q-1)(i-1) + k$ and $b = (q-1)(j-1) + l$. The matrix $C$ is the $N(q-1) \times N(q-1)$ covariance matrix assembled by joining the $(q-1) \times (q-1)$ blocks of values $c_{ij}(k,l)$ as defined in Eq. (5), but leaving out the last state $q$. In Eq. (13), $i, j \in \{1, \ldots, N\}$ are site indices, and $k, l$ run from 1 to $q-1$. Under gauge Eq. (9), all the other coupling parameters are zero. The term 'naive' has become customary in the inverse statistical mechanics literature, often used to highlight the difference to a Thouless-Anderson-Palmer level inversion or one based on the Bethe approximation. The original meaning of this term lies, as far as we are aware, in information geometry [49, 50].

E. Pseudolikelihood maximization

Pseudolikelihood substitutes the probability in (10) by the conditional probability of observing one variable $\sigma_r$ given observations of all the other variables $\sigma_{\backslash r}$. That is, the starting point is

$$P(\sigma_r = \sigma_r^{(b)} \mid \sigma_{\backslash r} = \sigma_{\backslash r}^{(b)}) = \frac{\exp\Bigl( h_r(\sigma_r^{(b)}) + \sum_{i=1,\, i \ne r}^{N} J_{ri}(\sigma_r^{(b)}, \sigma_i^{(b)}) \Bigr)}{\sum_{l=1}^{q} \exp\Bigl( h_r(l) + \sum_{i=1,\, i \ne r}^{N} J_{ri}(l, \sigma_i^{(b)}) \Bigr)}, \qquad (14)$$

where, for notational convenience, we take $J_{ri}(l,k)$ to mean $J_{ir}(k,l)$ when $i < r$. Given an MSA, we can maximize the conditional likelihood by minimizing

$$g_r(h_r, J_r) = -\frac{1}{B} \sum_{b=1}^{B} \log P_{h_r, J_r}(\sigma_r = \sigma_r^{(b)} \mid \sigma_{\backslash r} = \sigma_{\backslash r}^{(b)}). \qquad (15)$$

Note that this only depends on $h_r$ and $J_r = \{J_{ri}\}_{i \ne r}$, i.e., on the parameters featuring node $r$. If (15) is used separately for each $r$, this leads to unique values for the parameters $h_r$, but typically to different predictions for $J_{ri}$ and $J_{ir}$ (which should be the same). Minimizing (15) must therefore be supplemented by some procedure to reconcile the different values of $J_{ri}$ and $J_{ir}$; one way would be to simply use their average $\frac{1}{2}(J_{ri} + J_{ir}^{T})$ [27].

We here reconcile the different $J_{ri}$ and $J_{ir}$ by minimizing an objective function adding up $g_r$ for all nodes:

$$l_{\mathrm{pseudo}}(\{h, J\}) = \sum_{r=1}^{N} g_r(h_r, J_r) = -\frac{1}{B} \sum_{r=1}^{N} \sum_{b=1}^{B} \log P_{h_r, J_r}(\sigma_r = \sigma_r^{(b)} \mid \sigma_{\backslash r} = \sigma_{\backslash r}^{(b)}). \qquad (16)$$

Minimizers of $l_{\mathrm{pseudo}}$ generally do not minimize $l$; the replacement of likelihood with pseudolikelihood alters the outcome. Note, however, that replacing $l$ by $l_{\mathrm{pseudo}}$ resolves the computational intractability of the parameter optimization problem: instead of depending on the full partition function, the normalization of the conditional probability (14) contains only a single summation over the $q = 21$ Potts states. The intractable average over the $N-1$ conditioning spin variables is replaced by an empirical average over the data set in Eq. (16).

F. Regularization

A Potts model describing a protein family with sequences of 50-300 amino acids requires ca. $5 \cdot 10^5$ to $2 \cdot 10^7$ parameters. At present, few protein families are in this range in size, and regularization is therefore needed to avoid overfitting. In NMFI, the problem results in an empirical covariance matrix which typically is not of full rank, so that Eq. (13) is not well-defined. In [2], one of the authors therefore used the pseudocount method, in which frequencies and empirical correlations are adjusted using a regularization variable $\lambda$:

$$f_i(k) = \frac{1}{\lambda + B} \left[ \frac{\lambda}{q} + \sum_{b=1}^{B} \delta(\sigma_i^{(b)}, k) \right], \qquad f_{ij}(k,l) = \frac{1}{\lambda + B} \left[ \frac{\lambda}{q^2} + \sum_{b=1}^{B} \delta(\sigma_i^{(b)}, k)\,\delta(\sigma_j^{(b)}, l) \right]. \qquad (17)$$

The pseudocount is a proxy for many observations which, if they existed, would increase the rank of the correlation matrix; the pseudocount method hence promotes invertibility of the matrix in Eq. (13). It was observed in [2] that for good performance in DCA, the pseudocount parameter $\lambda$ has to be taken fairly large, on the order of $B$.
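Eq. (17) in code form, under the same 0-based convention as the earlier sketches (our illustration; the function name is hypothetical):

```python
import numpy as np

def pseudocount_frequencies(msa, lam, q):
    """Pseudocount-regularized frequencies, Eq. (17).
    msa: (B, N) integers in 0..q-1; lam: pseudocount strength,
    taken on the order of B in [2]."""
    B = msa.shape[0]
    X = np.eye(q)[msa]  # one-hot encoding of the MSA
    f1 = (lam / q + X.sum(axis=0)) / (lam + B)
    f2 = (lam / q**2 + np.einsum('bik,bjl->ijkl', X, X)) / (lam + B)
    return f1, f2
```

Both arrays stay properly normalized (each site's $f_i$ sums to 1, each pair's $f_{ij}$ sums to 1 over $(k,l)$), and for $\lambda \to 0$ Eq. (17) reduces to the raw counts of Eq. (4).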

In the PLM method, we take the standard route of adding a penalty term to the objective function:

$$\{h^{\mathrm{PLM}}, J^{\mathrm{PLM}}\} = \underset{\{h, J\}}{\operatorname{argmin}}\; \bigl\{ l_{\mathrm{pseudo}}(\{h, J\}) + R(\{h, J\}) \bigr\}. \qquad (18)$$

The outcome is then a trade-off between likelihood maximization and whatever qualities $R$ is pushing for. Ravikumar et al. [27] pioneered the use of $l_1$ regularizers for the inverse Ising problem; the $l_1$ penalty forces a fraction of the parameters to assume the value zero, thus effectively reducing the number of parameters. This approach is not appropriate here, since we are concerned with the accuracy (resp. divergence) of the strongest predicted couplings; for our purposes it makes no substantial difference whether weak couplings are inferred to be small or set precisely to 0. Our choice for $R$ is therefore the simpler $l_2$ norm

R_{l_2}(h,J) = \lambda_h \sum_{r=1}^{N} \|h_r\|_2^2 + \lambda_J \sum_{1 \le i < j \le N} \|J_{ij}\|_2^2,   (19)

using two regularization parameters λh and λJ for field and coupling parameters. An advantage of a regularizer is that it eliminates the need to fix a gauge, since among all parameter sets related by a gauge transformation, i.e., all parameter sets resulting in the same Potts model, there will be exactly one set which minimizes the strictly convex regularizer. For the case of the l2 norm, it can be shown that this leads to a gauge where

\sum_{s=1}^{q} J_{ij}(k,s) = \frac{\lambda_h}{\lambda_J}\, h_i(k), \qquad \sum_{s=1}^{q} J_{ij}(s,l) = \frac{\lambda_h}{\lambda_J}\, h_j(l), \qquad \sum_{s=1}^{q} h_i(s) = 0

for all i, j(> i), k, and l.

To summarize this discussion: for NMFI, we regularize with pseudocounts under the gauge constraints of Eq. (9). For PLM, we regularize with R_{l2} under the full parametrization.

G. Sequence reweighting

Maximum-likelihood inference of Potts models relies – as discussed above – on the assumption that the B sample configurations in our data set are independently generated from Eq. (6). This assumption is not correct for biological sequence data, which have a phylogenetic bias. In particular, the databases contain many protein sequences from related species, which did not have enough time of independent evolution to reach statistical independence. Furthermore, the selection of sequenced species in the genomic databases is dictated by human interest, and not by the aim of sampling the space of all functional amino-acid sequences as independently as possible. A way to mitigate the effects of uneven sampling, employed in [2], is to equip each sequence σ^{(b)} with a weight wb which regulates its impact on the parameter estimates. Sequences deemed unworthy of independent-sample status (too similar to other sequences) can then have their weight lowered, whereas sequences which are quite different from all other sequences will contribute with a higher weight to the amino-acid statistics.

A simple but efficient way (cf. [2]) is to measure the similarity sim(σ^{(a)}, σ^{(b)}) of two sequences σ^{(a)} and σ^{(b)} as the fraction of conserved positions (i.e., identical amino acids), and compare this fraction to a preselected threshold x, 0 < x < 1. The weight given to a sequence σ^{(b)} can then be set to w_b = 1/m_b, where m_b is the number of sequences in the MSA similar to σ^{(b)}:

m_b = \left| \left\{ a \in \{1, \dots, B\} : \mathrm{sim}(\sigma^{(a)}, \sigma^{(b)}) \ge x \right\} \right|.   (20)

In [2], a suitable threshold x was found to be 0.8, with results only weakly dependent on this choice throughout 0.7 < x < 0.9. We have here followed the same procedure using threshold x = 0.9. The corresponding reweighted frequency counts then become

f_i(k) = \frac{1}{\lambda + B_{eff}} \left[ \frac{\lambda}{q} + \sum_{b=1}^{B} w_b\, \delta(\sigma_i^{(b)}, k) \right],   (21)

f_{ij}(k,l) = \frac{1}{\lambda + B_{eff}} \left[ \frac{\lambda}{q^2} + \sum_{b=1}^{B} w_b\, \delta(\sigma_i^{(b)}, k)\, \delta(\sigma_j^{(b)}, l) \right],

where B_{eff} = \sum_{b=1}^{B} w_b becomes a measure of the number of effectively nonredundant sequences.
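The reweighting scheme of Eq. (20) can be sketched as follows (function and variable names are our own; the toy MSA is random, so in this example no two sequences exceed the similarity threshold and every weight comes out as 1):

```python
import numpy as np

def sequence_weights(msa, x=0.9):
    """w_b = 1/m_b, with m_b the number of sequences at least x-similar to b (Eq. 20)."""
    B, N = msa.shape
    # B x B matrix of pairwise fractions of identical positions
    sim = (msa[:, None, :] == msa[None, :, :]).mean(axis=2)
    m = (sim >= x).sum(axis=1)  # each sequence counts itself, so m_b >= 1
    return 1.0 / m

rng = np.random.default_rng(2)
msa = rng.integers(0, 21, size=(50, 30))  # toy alignment, B=50, N=30
w = sequence_weights(msa, x=0.9)
B_eff = w.sum()  # effective number of nonredundant sequences
```

For real alignments with clusters of near-duplicate sequences, members of a cluster of size m each receive weight 1/m, so the cluster contributes one effective sequence to B_eff.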

In the pseudolikelihood we use the direct analog of Eq. (21), i.e.,

l_{pseudo}(h,J) = -\frac{1}{B_{eff}} \sum_{b=1}^{B} w_b \sum_{r=1}^{N} \log\left[ P_{h_r,J_r}\!\left(\sigma_r = \sigma_r^{(b)} \,\middle|\, \sigma_{\backslash r} = \sigma_{\backslash r}^{(b)}\right) \right].   (22)

As in the frequency counts, each sequence is considered to contribute a weight wb, instead of the standard weight one used for i.i.d. samples.
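A minimal sketch of the reweighted objective of Eq. (22) (our own names and toy sizes, without the regularization term): with all fields and couplings at zero, every conditional distribution is uniform, so the objective equals N log q, which serves as a sanity check.

```python
import numpy as np

def neg_log_pseudolikelihood(h, J, msa, w):
    """Reweighted negative log-pseudolikelihood of Eq. (22), no regularizer."""
    B, N = msa.shape
    total = 0.0
    for b in range(B):
        sigma = msa[b]
        for r in range(N):
            e = h[r].copy()
            for i in range(N):
                if i != r:
                    e += J[r, i, :, sigma[i]]
            e -= e.max()  # numerical stability
            logp = e[sigma[r]] - np.log(np.exp(e).sum())
            total -= w[b] * logp
    return total / w.sum()

q, N, B = 4, 8, 12  # small toy sizes to keep the example fast
rng = np.random.default_rng(3)
h = np.zeros((N, q))
J = np.zeros((N, N, q, q))
msa = rng.integers(0, q, size=(B, N))
w = np.ones(B)
l = neg_log_pseudolikelihood(h, J, msa, w)  # equals N * log(q) at zero parameters
```

In an actual optimization this function (plus R_{l2}) would be handed to a gradient-based minimizer; the triple loop here is written for clarity, not speed.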


H. Interaction scores

In the inverse Ising problem, each interaction is scored by one scalar coupling strength Jij. These can easily be ordered, e.g., by absolute size. In the inverse Potts problem, each position pair (i, j) is characterized by a whole q × q matrix Jij, and some scalar score Sij is needed in order to evaluate the 'coupling strength' of two sites. In [1] and [2] the score used is based on direct information (DI), i.e., the mutual information of a restricted probability model not including any indirect coupling effects between the two positions to be scored. Its construction goes as follows: for each position pair (i, j), (the estimate of) Jij is used to set up a 'direct distribution' involving only nodes i and j,

P_{ij}^{(dir)}(k,l) \sim \exp\left( J_{ij}(k,l) + h'_{i,k} + h'_{j,l} \right).   (23)

h'_{i,k} and h'_{j,l} are new fields, computed so as to ensure agreement of the marginal single-site distributions with the empirical individual frequency counts fi(k) and fj(l). The score S^{DI}_{ij} is now calculated as the mutual information of P^{(dir)}:

S_{ij}^{DI} = \sum_{k,l=1}^{q} P_{ij}^{(dir)}(k,l)\, \log\left( \frac{P_{ij}^{(dir)}(k,l)}{f_i(k)\, f_j(l)} \right).   (24)

A nice characteristic of S^{DI}_{ij} is its invariance with respect to the gauge freedom of the Potts model, i.e., both choices Eqs. (8) and (9) (or any other valid choice) generate identical S^{DI}_{ij}.

In the pseudolikelihood approach, we prefer not to use S^{DI}_{ij}, as this would require a pseudocount λ to regularize the frequencies in (24), introducing a third regularization variable in addition to λh and λJ. Another possible scoring function, already mentioned but not used in [1], is the Frobenius norm (FN)

\|J_{ij}\|_2 = \sqrt{\sum_{k,l=1}^{q} J_{ij}(k,l)^2}.   (25)

Unlike S^{DI}_{ij}, (25) is not independent of the gauge choice, so one must be a bit careful. As was noted in [1], the zero-sum gauge (8) minimizes the Frobenius norm, in a sense making (8) the most appropriate gauge choice for the score (25). Recall from above that our pseudolikelihood uses the full representation and fixes the gauge by the regularization terms R_{l2}. Our procedure is therefore to first infer the interaction parameters using the pseudolikelihood and the regularization, and then to change to the zero-sum gauge:

J'_{ij}(k,l) = J_{ij}(k,l) - J_{ij}(\cdot,l) - J_{ij}(k,\cdot) + J_{ij}(\cdot,\cdot),   (26)

where '·' denotes an average over the concerned position. One can show that (26) preserves the probabilities of (6) (after altering the fields appropriately) and that the J'_{ij}(k,l) satisfy (8). A possible Frobenius norm score is hence

S_{ij}^{FN} = \|J'_{ij}\|_2 = \sqrt{\sum_{k,l=1}^{q} J'_{ij}(k,l)^2}.   (27)
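The gauge shift of Eq. (26) and the score of Eq. (27) can be sketched for a single q × q coupling block as follows (names are our own; the test data is random):

```python
import numpy as np

def zero_sum_gauge(Jij):
    """Eq. (26): subtract row, column, and total means of the q x q block."""
    row = Jij.mean(axis=1, keepdims=True)  # J_ij(k, .)
    col = Jij.mean(axis=0, keepdims=True)  # J_ij(., l)
    tot = Jij.mean()                        # J_ij(., .)
    return Jij - row - col + tot

rng = np.random.default_rng(4)
Jij = rng.normal(size=(21, 21))             # one inferred coupling block
Jp = zero_sum_gauge(Jij)                    # now every row and column sums to zero
S_FN = np.sqrt((Jp ** 2).sum())             # Frobenius-norm score, Eq. (27)
```

Since (26) is an orthogonal projection (it removes the row and column means), the shifted block always has a Frobenius norm no larger than the original one, consistent with the remark that the zero-sum gauge minimizes the norm.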

Lastly, we borrow an idea from Jones et al. [9], whose PSICOV method also uses a norm-based ranking (l1 norm instead of Frobenius norm), but adjusted by an average-product correction (APC) term. APC was introduced in [51] to suppress effects from phylogenetic bias and insufficient sampling. Incorporating also this correction, we have our scoring function

S_{ij}^{CN} = S_{ij}^{FN} - \frac{S_{\cdot j}^{FN}\, S_{i \cdot}^{FN}}{S_{\cdot \cdot}^{FN}},   (28)

where CN stands for 'corrected norm'. An implementation of plmDCA in MATLAB is available at http://plmdca.csc.kth.se/.
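The correction of Eq. (28) can be sketched as below. Note the function name and the exact averaging convention (here the diagonal is excluded by zeroing it first) are our own assumptions; implementations differ in whether self-pairs enter the row and total averages.

```python
import numpy as np

def apc(S):
    """Average-product correction, Eq. (28), for a symmetric score matrix."""
    S = S.copy()
    np.fill_diagonal(S, 0.0)               # exclude self-pairs from the averages
    N = S.shape[0]
    row_mean = S.sum(axis=1) / (N - 1)     # S_FN(i, .)
    total_mean = S.sum() / (N * (N - 1))   # S_FN(., .)
    return S - np.outer(row_mean, row_mean) / total_mean

rng = np.random.default_rng(5)
A = np.abs(rng.normal(size=(30, 30)))
S = (A + A.T) / 2       # toy symmetric Frobenius-norm scores
S_CN = apc(S)           # corrected scores; entries can now be negative
```

Because the subtracted term is an outer product of the row averages, the correction preserves the symmetry of the score matrix while deflating positions that score highly against everything, which is the signature of sampling or phylogenetic bias.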

IV. EVALUATING THE PERFORMANCE OF MFDCA AND PLMDCA ACROSS PROTEIN FAMILIES

We have performed numerical experiments using mfDCA and plmDCA on a number of domain families from the Pfam database; here we report and discuss the results.

A. Domain families, native structures, and true-positive rates

The speed of mfDCA enabled Morcos et al. [2] to conduct a large-scale analysis using 131 families. PLM is computationally more demanding than NMFI, so we chose to start with a smaller collection of 17 families, listed in Table I. To ease the numerical effort, we chose families with relatively small N and B. More precisely, we selected families out of the first 115 Pfam entries (low Pfam ID) which have (i) at most N = 130 residues, (ii) between 2,000 and 22,000 sequences, and (iii) reliable structural information (cf. the PDB entries provided in the table). No selection based on DCA performance was done. In the appendix, a number of longer proteins are studied. The mfDCA performance on the selected families was found to be consistent with the larger data set of Morcos et al.

To reliably assess how good a contact prediction is, something to regard as a 'gold standard' is helpful. For each of the 17 families we have therefore selected one representative high-resolution X-ray crystal structure (resolution below 3 Å); see Table I for the corresponding PDB identification.

From these native protein structures, we have extracted position-position distances d(i, j) for each pair of


Family ID N B Beff (90%) PDB ID UniProt entry UniProt residues

PF00011 102 5024 3481 2bol TSP36 TAESA 106-206

PF00013 58 6059 3785 1wvn PCBP1 HUMAN 281-343

PF00014 53 2393 1812 5pti BPT1 BOVIN 39-91

PF00017 77 2732 1741 1o47 SRC HUMAN 151-233

PF00018 48 5073 3354 2hda YES HUMAN 97-144

PF00027 91 12129 9036 3fhi KAP0 BOVIN 154-238

PF00028 93 12628 8317 2o72 CADH1 HUMAN 267-366

PF00035 67 3093 2254 1o0w RNC THEMA 169-235

PF00041 85 15551 10631 1bqu IL6RB HUMAN 223-311

PF00043 95 6818 5141 6gsu GSTM1 RAT 104-192

PF00046 57 7372 3314 2vi6 NANOG MOUSE 97-153

PF00076 70 21125 14125 1g2e ELAV4 HUMAN 48-118

PF00081 82 3229 1510 3bfr SODM YEAST 27-115

PF00084 56 5831 4345 1elv C1S HUMAN 359-421

PF00105 70 2549 1277 1gdc GCR RAT 438-507

PF00107 130 17864 12114 1a71 ADH1E HORSE 203-338

PF00111 78 7848 5805 1a70 FER1 SPIOL 58-132

TABLE I. Domain families included in our study, listed with Pfam ID, length N, number of sequences B (after removal of duplicate sequences), number of effective sequences Beff (under x = 0.9, i.e., 90% threshold for reweighting), and the PDB and UniProt specifications for the structure used to assess the DCA prediction quality.


FIG. 2. (Color online) Histograms of crystal-structure distances pooled from all 17 families, when including all pairs (left), pairs with |j − i| > 4 (center), and pairs with |j − i| > 14 (right). The vertical line is our contact cutoff 8.5 Å.

sequence positions, by measuring the minimal distance between any two heavy atoms belonging to the amino acids present in these positions. The left panel of Fig. 2 shows the distribution of these distances in all considered families. Three peaks protrude from the background distribution: one at small distances below 1.5 Å, a second at about 3-5 Å, and a third at about 7-8 Å. The first peak corresponds to the peptide bonds between sequence neighbors, whereas the other two peaks correspond to nontrivial contacts between amino acids, which may be distant along the protein backbone, as can be seen from the center and right panels of Fig. 2, which collect only distances between positions i and j with minimal separation |j − i| ≥ 5 and |j − i| ≥ 15, respectively. Following [2], we take the peak at 3-5 Å to presumably correspond to short-range interactions like hydrogen bonds or secondary-structure contacts, whereas the last peak likely corresponds to long-range, possibly water-mediated interactions. These peaks contain the nontrivial information we would like to extract from sequence data using DCA. In order to accept the full second peak, we have chosen a distance cutoff of 8.5 Å for true contacts, slightly larger than the value of 8 Å used in [2].

Accuracy results are here reported primarily using true-positive (TP) rates, also the principal measurement in [2] and [9]. The TP rate for p is the fraction of the p strongest-scored pairs which are actually contacts in the crystal structure, defined as described above. To exemplify TP rates, let us jump ahead and look at Fig. 3. For PLM and protein family PF00076, the TP rate is 1 up to p = 80, which means that all 80 top-S^{CN}_{ij} pairs are genuine contacts in the crystal structure. At p = 200, the TP rate has dropped to 0.78, so 0.78 · 200 = 156 of the top 200 S^{CN}_{ij} pairs are contacts, while 44 are not.
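The TP rate calculation can be sketched as follows (function name, argument layout, and the toy scores and distances are all our own; the 8.5 Å cutoff and the |j − i| > 4 separation filter are from the text):

```python
def tp_rate(scored_pairs, distances, p, cutoff=8.5, min_sep=5):
    """Fraction of the p top-scored pairs that are true contacts.

    scored_pairs: list of (score, i, j); distances: dict (i, j) -> distance in Angstrom.
    """
    # keep only pairs with sufficient sequence separation, |j - i| >= min_sep
    pairs = [(s, i, j) for (s, i, j) in scored_pairs if abs(j - i) >= min_sep]
    pairs.sort(key=lambda t: t[0], reverse=True)  # strongest scores first
    top = pairs[:p]
    hits = sum(1 for (_, i, j) in top if distances[(i, j)] < cutoff)
    return hits / p

# toy example: of the top 3 scored pairs, (1,10) and (3,30) are true contacts
distances = {(1, 10): 4.0, (2, 20): 12.0, (3, 30): 7.5, (4, 40): 9.0}
scores = [(1.2, 1, 10), (0.9, 2, 20), (0.8, 3, 30), (0.1, 4, 40)]
rate = tp_rate(scores, distances, p=3)  # -> 2/3
```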

B. Parameter settings

To set the stage for comparison, we started by running initial trials on the 17 families using both NMFI and PLM with many different regularization and reweighting strengths. Reweighting indeed raised the TP rates, and, as reported in [2] for 131 families, results seemed robust toward the exact choice of the limit x within 0.7 ≤ x ≤ 0.9. We chose x = 0.9 to use throughout the study.

In what follows, NMFI results are reported using the same list of pseudocounts as in Fig. S11 in [2]: λ = w · Beff with w = 0.11, 0.25, 0.43, 0.67, 1.0, 1.5, 2.3, 4.0, 9.0. During our analysis we also ran intermediate values, and we found this covering to be sufficiently dense. We give outputs from two versions of NMFI: NMFI-DI and NMFI-DI(true). The former uses pseudocounts for all calculations, whereas the latter switches to true frequencies when it gets to the evaluation of S^{DI}_{ij}. We append 'DI' to the NMFI name, since, later on, we will also try the S^{CN}_{ij} score for NMFI (which we will call NMFI-CN).

With l2 regularization in the PLM algorithm, outcomes were robust against the precise choice of λh; TP rates were almost identical when λh was changed between 0.001 and 0.1. We therefore chose λh = 0.01 for all experiments. What mattered, rather, was the coupling regularization parameter λJ, for which we did a systematic scan from λJ = 0 and up using step size 0.005.

So, to summarize, the results reported here are based on x = 0.9, cutoff 8.5 Å, λh = 0.01, and λ and λJ drawn from the collections of values described above.

C. Main comparison of mfDCA and plmDCA

Fig. 3 shows TP rates for the different families and methods. We see that the TP rates of plmDCA (PLM) are consistently higher than those of mfDCA (NMFI), especially for families with large Beff. As for the two NMFI versions: NMFI-DI(true) avoids the strong failure seen in NMFI-DI for PF00084, but for most other families, see in particular PF00014 and PF00081, the performance instead drops when using marginals without pseudocounts in the S^{DI}_{ij} calculation. For both NMFI-DI and NMFI-DI(true), the best regularization was found to be λ = 1 · Beff, the same value as used in [2]. For PLM, the best parameter choice was λJ = 0.01. Interestingly, this same regularization parameter was optimal for basically all families. This is somewhat surprising, since both N and Beff span quite wide ranges (48-130 and 1277-14125, respectively).


FIG. 3. (Color online) Contact-detection results for the 17 families, sorted by Beff. The y-axes are TP rates and x-axes are the number of predicted contacts p, based on pairs with |j − i| > 4. The three curves for each method are the three regularization levels yielding the highest TP rates across all families. The thickened curve highlights the best one out of these three (λ = Beff for NMFI and λJ = 0.01 for PLM).

In the following discussion, we leave out all results for NMFI-DI(true) and focus on PLM vs. NMFI-DI, i.e., the version used in [2]. All plots remaining in this section use the optimal regularization values: λ = Beff for NMFI and λJ = 0.01 for PLM.

TP rates only classify pairs as contacts (d(i, j) < 8.5 Å) or noncontacts (d(i, j) ≥ 8.5 Å). To give a more detailed view of how scores correlate with spatial separation, we show in Fig. 4 a scatter plot of score vs. distance for all pairs in all 17 families. PLM and NMFI-DI both manage to detect the peaks seen in the true distance distribution of Fig. 2, in the sense that high scores are observed almost exclusively at distances below 8.5 Å. Both methods agree that interactions get, on average, progressively weaker going from peak one, to two, to three, and finally to the bulk. We note that the dots scatter differently across the PLM and NMFI-DI figures, reflecting the two separate scoring techniques: S^{DI}_{ij} are strictly nonnegative, whereas APC-corrected norms can assume negative values. We also observe how sparse the extracted signal is: most spatially close pairs do not show elevated scores.


FIG. 4. (Color online) Score plotted against distance for all position pairs in all 17 families for PLM (left) and NMFI-DI (right). The vertical line is our contact cutoff at 8.5 Å.

Seen from the other side, however, almost all strongly coupled pairs are close, so the biological hypothesis of Sec. II is well supported here.

Fig. 5 shows scatter plots of scores for PLM and NMFI-DI for some selected families. Qualitatively the same patterns were observed for all families. The points are clearly correlated, so, to some extent, PLM and NMFI-DI agree on the interaction strengths. Due to the different scoring schemes, we would not expect numerical coincidence of scores. Many of PLM's top-scoring position pairs also have top scores for NMFI-DI, and vice versa. The largest discrepancy is in how much more strongly NMFI-DI responds to pairs with small |j − i|; the crosses tend to shoot out to the right. PLM agrees that many of these neighbor pairs interact strongly, but, unlike NMFI-DI, it also shows rivaling strengths for many pairs with |j − i| > 4.

An even more detailed picture is given by considering contact maps, see Fig. 6. The tendency observed in the last scatter plots remains: NMFI-DI has a larger portion of highly scored pairs in the neighbor zone, which are the middle stretches in these figures. An important observation is, however, that clusters of contacting pairs with long 1D sequence separation are captured by both algorithms.

Note that only a relatively small fraction of contacts is uncovered by DCA before false positives start to appear, and that many native contacts are missed. However, the aim of DCA cannot be to reproduce a complete contact map: it is based on sequence correlations alone (e.g., it cannot predict contacts for 100% conserved residues), it does not consider any physico-chemical property of amino acids (they are seen as abstract letters), and it does not consider the need to be embeddable in 3D. Furthermore, proteins may undergo conformational changes (e.g., allostery) or homo-dimerize, so coevolution may be induced by pairs which are not in contact in the X-ray crystal structure used for evaluating prediction accuracy. The important point – and this seems to be a distinguishing feature of maximum-entropy models as compared to simpler correlation-based methods – is to find a sufficient number of well-distributed contacts to enable de-novo 3D structure prediction [3-5, 40-42]. In this sense, it is more important to achieve high accuracy for the first predicted contacts than intermediate accuracy for many predictions.

In summary, the results suggest that the PLM method offers some interesting progress compared to NMFI. However, let us also note that in the comparison we also had to change both scoring and regularization styles. It is thus conceivable that NMFI with the new scoring and regularization could be more competitive with PLM. Indeed, upon further investigation, detailed in Appendix B, we found that part of the improvement in fact does stem from the new score. In Appendix C, where we extend our results to longer Pfam domains, we therefore add results from NMFI-CN, an updated version of the code used in [2] which scores by S^{CN}_{ij} instead of S^{DI}_{ij}.

D. Run times

In general, NMFI, which is merely a matrix inversion, is very quick compared with PLM; most families in this study took only seconds to run through the NMFI code. In contrast to the message-passing based method used in [1], a DCA using PLM is nevertheless feasible for all protein families in Pfam. The objective function in PLM is a sum over nodes and samples, and its execution time is therefore expected to depend both on B (number of members of a protein family in Pfam) and N (length of


[Fig. 5 shows four panels, one per family: PF00018, PF00028, PF00043, and PF00111. Legend: neighbor pairs (|i − j| < 5); contacts (true distance < 8.5 Å and |i − j| ≥ 5); noncontacts (true distance ≥ 8.5 Å and |i − j| ≥ 5).]

FIG. 5. (Color online) Scatter plots of interaction scores for PLM (CN, vertical axis) and NMFI-DI (DI, horizontal axis) from four families. For all plots, the axes are as indicated by the top left figure. The distance unit in the top box is Å.

the aligned sequences in a protein family).

On a standard desktop computer, using a basic MATLAB-interfaced C implementation of conjugate gradient descent, the run times for PF00014, PF00017, and PF00018 (small N and B) were 50, 160, and 90, respectively. For PF00041 (small N but larger B) one run took 15 min. For domains with larger N, like those in Appendix C, run times grow approximately apace. For example, the run times for PF00026 (N = 314) and PF00006 (N = 215) were 80 and 65 min, respectively.

A well-known alternative use of pseudolikelihoods is to minimize each g_r separately. While slightly more crude, such an 'asymmetric' variant of plmDCA would amount to N independent (multiclass) logistic regression problems, which would make parallel execution trivial on up to N cores. A rigorous examination of the asymmetric version's performance is beyond the scope of the present work, but initial tests suggest that even on a single processor it requires convergence times almost an order of magnitude smaller than the symmetric one (which we used in this work), while still yielding almost exactly the same TP rates. Using N processors, the above-mentioned run times could thus, in principle, be dropped by a factor

[Fig. 6 shows four contact-map panels: families PF00046, PF00043, PF00035, and PF00028, each split into an NMFI-DI (left) and a PLM (right) half.]

FIG. 6. (Color online) Predicted contact maps for PLM (right part, black symbols) and NMFI-DI (left part, red symbols online) for four families. A pair (i, j)'s placement in the plots is found by matching positions i and j on the axes. Native contacts are indicated in gray. True and false positives are represented by circles and crosses, respectively. Each figure shows the 1.5N strongest ranked pairs (including neighbors) for that family.

as large as 10N, suggesting that plmDCA can be made competitive not only in terms of accuracy, but also in terms of computational speed.

Finally, all of these times were obtained by cold-starting with all fields and couplings at 0. Presumably, one could improve on this by using an appropriate initial guess obtained, say, from NMFI. This has, however, not been implemented here.

V. DISCUSSION

In this work, we have shown that a direct-coupling analysis built on pseudolikelihood maximization (plmDCA) consistently outperforms the previously described mean-field based analysis (mfDCA), as assessed across a number of large protein-domain families. The advantage of the pseudolikelihood approach was found to be partly intrinsic, and partly contingent on using a sampling-corrected Frobenius norm to score inferred direct statistical coupling matrices.

On one hand, this improvement might not be surprising: it is known that for very large data sets PLM becomes asymptotically equivalent to full maximum-likelihood inference, whereas mean-field inference remains intrinsically approximate, and this may result in improved PLM performance also for finite data sets [23].

On the other hand, the above advantage holds if and only if the following two conditions are fulfilled: data are drawn independently from a probability distribution, and this probability distribution is the Boltzmann distribution of a Potts model. Neither of these two conditions actually holds for real protein sequences. On artificial data, refined mean-field methods (Thouless-Anderson-Palmer equations, Bethe approximation) also lead to improved model inference as compared to NMFI, cf. e.g. [14, 16, 17, 21], but no such improvement has been observed on real protein data [2]. The results of the paper are therefore interesting and highly nontrivial. They also suggest that other model-learning methods from statistics, such as 'Contrastive Divergence' [52] or the more recent 'Noise-Contrastive Estimation' [53], could be explored to further increase our capacity to extract structural information from protein sequence data.

Disregarding the improvements, we find that overall the predicted contact pairs for plmDCA and mfDCA are highly overlapping, illustrating the robustness of DCA results with respect to the algorithmic implementation. This observation suggests that, in the context of modeling the sequence statistics by pairwise Potts models, most extractable information might already be extracted from the MSA. However, it may well also be that there is alternative information hidden in the sequences, for which we would need to go beyond pairwise models, or integrate the physico-chemical properties of different amino acids into the procedure, or extract even more information from large sets of evolutionarily related amino-acid sequences. DCA is only a step in this direction.

In our work we have seen that simple sampling corrections, more precisely sequence reweighting and the average-product correction of interaction scores, lead to an increased accuracy in predicting 3D contacts of amino acids which are distant on the protein's backbone. It is, however, clear that these somewhat heuristic statistical fixes cannot correct for the complicated hierarchical phylogenetic relationships between proteins, and that more sophisticated methods would be needed to disentangle phylogenetic from functional correlations in massive sequence data. To do so is an open challenge, which would leave the field of equilibrium inverse statistical mechanics, but where methods of inverse statistical mechanics may still play a useful role.

ACKNOWLEDGMENTS

This work was supported by the Academy of Finland as part of its Finland Distinguished Professor program, project 129024/Aurell, and through the Center of Excellence COIN. Discussions with S. Cocco and R. Monasson are gratefully acknowledged.

[1] M. Weigt, R. White, H. Szurmant, J. Hoch, and T. Hwa,Proc. Natl. Acad. Sci. U. S. A. 106, 67 (2009)

[2] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, Proc. Natl. Acad. Sci. U. S. A. 108, E1293 (2011)

[3] D. S. Marks, L. J. Colwell, R. P. Sheridan, T. A. Hopf,A. Pagnani, R. Zecchina, and C. Sander, PLoS One 6,e28766 (2011)

[4] J. Sulkowska, F. Morcos, M. Weigt, T. Hwa, andJ. Onuchic, Proc. Natl. Acad. Sci. U. S. A. 109, 10340(2012)

[5] T. Nugent and D. T. Jones, Proc. Natl. Acad. Sci. U. S.A. 109, E1540 (2012)

[6] M. Punta, P. C. Coggill, R. Y. Eberhardt, J. Mistry,J. G. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric,J. Clements, A. Heger, L. Holm, E. L. L. Sonnhammer,S. R. Eddy, A. Bateman, and R. D. Finn, Nucleic AcidsRes. 40, D290 (2012)

[7] L. Burger and E. van Nimwegen, PLoS Comput. Biol. 6,E1000633 (2010)

[8] S. Balakrishnan, H. Kamisetty, J. Carbonell, S. Lee, andC. Langmead, Proteins: Struct., Funct., Bioinf. 79, 1061(2011)

[9] D. T. Jones, D. W. A. Buchan, D. Cozzetto, and M. Pon-til, Bioinformatics 28, 184 (2012)

[10] H. J. Kappen and F. B. Rodriguez, in Advances in Neural Information Processing Systems (The MIT Press, 1998) pp. 280–286

[11] E. Schneidman, M. Berry, R. Segev, and W. Bialek, Na-ture 440, 1007 (2006)

[12] A. Braunstein and R. Zecchina, Phys. Rev. Lett. 96,030201 (2006)

[13] A. Braunstein, A. Pagnani, M. Weigt, and R. Zecchina,J. Stat. Mech. 2008, P12001 (2008)

[14] Y. Roudi, J. A. Hertz, and E. Aurell, Front. Comput. Neurosci. 3 (2009), doi:10.3389/neuro.10.022.2009

[15] S. Cocco, S. Leibler, and R. Monasson, Proc. Natl. Acad.Sci. U. S. A. 106, 14058 (2009)

[16] V. Sessak and R. Monasson, J. Phys. A: Math. Theor.42, 055001 (2009)

[17] M. Mezard and T. Mora, J. Physiol. (Paris) 103, 107(2009)

[18] E. Marinari and V. V. Kerrebroeck, J. Stat. Mech. (2010), doi:10.1088/1742-5468/2010/02/P02008

[19] S. Cocco and R. Monasson, Phys. Rev. Lett. 106, 090601(2011)

[20] F. Ricci-Tersenghi, J. Stat. Mech. (2012)
[21] H. Nguyen and J. Berg, J. Stat. Mech. (2012)
[22] H. Nguyen and J. Berg, Phys. Rev. Lett. 109 (2012)


[23] E. Aurell and M. Ekeberg, Phys. Rev. Lett. 108, 090201(2012)

[24] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, Adaptive and Learning Systems for Signal Processing, Communications, and Control (John Wiley & Sons, 2001)

[25] J. Rissanen, Information and Complexity in Statistical Modeling (Springer, 2007)
[26] M. J. Wainwright and M. I. Jordan, Foundations and Trends in Machine Learning 1, 1 (2008)

[27] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, Ann.Stat. 38, 1287 (2010)

[28] H. Ackley, E. Hinton, and J. Sejnowski, Cogn. Sci. 9, 147(1985)

[29] G. Parisi, Statistical Field Theory (Addison Wesley,1988)

[30] L. Peliti, Statistical Mechanics in a Nutshell (PrincetonUniversity Press, 2011) ISBN ISBN-13: 9780691145297

[31] K. Fischer and J. Hertz, Spin Glasses (Cambridge Uni-versity Press, 1993) ISBN 0521447771, 9780521447775

[32] I. Pagani, K. Liolios, J. Jansson, I. Chen, T. Smirnova,B. Nosrat, and M. V.M., Nucleic Acids Res. 40, D571(2012)

[33] The Uniprot Consortium, Nucleic Acids Res. 40, D71(2012)

[34] H. Berman, G. Kleywegt, H. Nakamura, and J. Markley,Structure 20, 391 (2012)

[35] U. Gobel, C. Sander, R. Schneider, and A. Valencia,Proteins: Struct., Funct., Genet. 18, 309 (1994)

[36] S. W. Lockless and R. Ranganathan, Science 286, 295(1999)

[37] A. A. Fodor and R. W. Aldrich, Proteins: Struct., Funct.,Bioinf. 56, 211 (2004)

[38] A. S. Lapedes, B. G. Giraud, L. Liu, and G. D. Stormo,Lecture Notes-Monograph Series: Statistics in MolecularBiology and Genetics 33, 236 (1999)

[39] J. Besag, The Statistician 24, 179–195 (1975)
[40] A. Schug, M. Weigt, J. Onuchic, T. Hwa, and H. Szurmant, Proc. Natl. Acad. Sci. U. S. A. 106, 22124 (2009)
[41] A. E. Dago, A. Schug, A. Procaccini, J. A. Hoch, M. Weigt, and H. Szurmant, Proc. Natl. Acad. Sci. U. S. A. 109, 10148 (2012)

[42] T. A. Hopf, L. J. Colwell, R. Sheridan, B. Rost,C. Sander, and D. S. Marks, Cell 149, 1607 (2012)

[43] E. T. Jaynes, Physical Review Series II 106, 620–630 (1957)
[44] E. T. Jaynes, Physical Review Series II 108, 171–190 (1957)
[45] G. Darmois, C.R. Acad. Sci. Paris 200, 1265–1266 (1935), in French
[46] E. Pitman and J. Wishart, Math. Proc. Cambridge Philos. Soc. 32, 567–579 (1936)
[47] B. Koopman, Trans. Am. Math. Soc. 39, 399–409 (1936)
[48] A. Kholodenko, J. Stat. Phys. 58, 357 (1990)
[49] T. Tanaka, Neural Comput. 12, 1951–1968 (2000)
[50] S.-i. Amari, S. Ikeda, and H. Shimokawa, "Information geometry and mean field approximation: the alpha-projection approach," in Advanced Mean Field Methods – Theory and Practice (MIT Press, 2001) Chap. 16, pp. 241–257, ISBN 0-262-15054-9

[51] S. D. Dunn, L. M. Wahl, and G. B. Gloor, Bioinformatics24, 333 (2008)

[52] G. Hinton, Neural Comput. 14, 1771–1800 (2002)

[53] M. Gutmann and A. Hyvarinen, J. Mach. Learn. Res. 13,307 (2012)

Appendix A: Circle plots

To get a sense of how false positives are distributed across the domains, we draw interactions into circles in Fig. 7. Among erroneously predicted contacts there is some tendency towards loopiness, especially for NMFI-DI; the black lines tend to 'bounce around' in the circles. It hence seems that relatively few nodes are responsible for many of the false positives. We performed an explicit check of the data columns belonging to these 'bad' nodes, and we found that they often contained strongly biased data, i.e., had a few large fi(k). In such cases, NMFI-DI seemed more prone than PLM to report a (predicted) interaction.

[Fig. 7 shows one row of circles per family (PF00046, PF00043, PF00035, PF00028), with three columns: Crystal structure, PLM, and NMFI-DI.]

FIG. 7. (Color online) Connections for four families overlaid on circles. Position '1' is indicated by a dash. The leftmost column shows contacts in the crystal structure (dark gray for d(i, j) < 5 Å and light gray for 5 Å ≤ d(i, j) < 8.5 Å). The other two columns show the top 1.5N strongest ranked |j − i| > 4 pairs for PLM and NMFI-DI, with gray for true positives and black for false positives.


Appendix B: Other scores for naive mean-field inversion

We also investigated NMFI performance using the APC term for the S^{DI}_{ij} scoring and using our new S^{CN}_{ij} score. In the second case we first switch the parameter constraints from (9) to (8) using (26). Mean TP rates using the modified score are shown in Fig. 8. We observe that APC in S^{DI}_{ij} scoring increases TP rates slightly, while S^{CN}_{ij} scoring can improve TP rates overall. We remark, however, that for the second-highest ranked interaction (p = 2) NMFI with the original S^{DI}_{ij} (NMFI-DI) ties with NMFI with S^{CN}_{ij} (NMFI-CN).
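As a concrete illustration, the average-product correction (APC) of Dunn et al. [51] can be sketched as follows. This is our own minimal NumPy sketch, not the paper's implementation; the function name is ours, and the input is assumed to be a symmetric matrix of raw pair scores with an unused diagonal:

```python
import numpy as np

def apc_correct(S):
    """Average-product correction of a symmetric N x N score matrix:
    subtract (row mean * column mean) / overall mean, with all means
    taken over off-diagonal entries."""
    N = S.shape[0]
    S = S.copy()
    np.fill_diagonal(S, 0.0)              # self-scores are never ranked
    row_mean = S.sum(axis=1) / (N - 1)
    total_mean = S.sum() / (N * (N - 1))
    corrected = S - np.outer(row_mean, row_mean) / total_mean
    np.fill_diagonal(corrected, 0.0)
    return corrected
```

The correction removes the per-site background signal (e.g., from conservation), which is what the C in S^{CDI}_{ij} refers to.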

FIG. 8. (Color online) Mean TP rates (y-axis) against the number of predicted pairs (x-axis), using pairs with |j − i| > 4, for NMFI with the old score S^{DI}_{ij} (NMFI-DI), the new APC score S^{CDI}_{ij} (NMFI-CDI), and the norm score S^{CN}_{ij} (NMFI-CN). Each curve corresponds to the best λ for that particular score.

Motivated by the results of Fig. 8, we decided to compare NMFI and PLM under the S^{CN}_{ij} score. All figures in this paragraph show the best regularization for each method, unless otherwise stated. Figure 9 shows score vs. distance for all pairs in all families. Unlike Fig. 4, the two plots now show very similar profiles. We note, however, that NMFI’s S^{CN}_{ij} scores trend two to three times larger than PLM’s (the scales on the vertical axes are different). Perhaps this is an inherent feature of these methods, or simply a consequence of the different types of regularization.

Figure 10 shows the same situation as Fig. 3, but using S^{CN}_{ij} to score NMFI. The three best regularization choices for NMFI-CN turned out the same as before, i.e., λ = 1 · Beff, λ = 1.5 · Beff and λ = 2.3 · Beff, but the best of these three was now λ = 2.3 · Beff (instead of λ = 1 · Beff). Comparing Fig. 3 and Fig. 10 one can see that the difference between the two methods is now smaller; for several families, the prediction quality is in fact about the same for both methods. Still, PLM maintains somewhat higher TP rates overall.

[Figure 10: one TP-rate panel per family, sorted by Beff: PF00076 (N=70, Beff=14125), PF00107 (N=130, Beff=12114), PF00041 (N=85, Beff=10631), PF00027 (N=91, Beff=9036), PF00028 (N=93, Beff=8317), PF00111 (N=78, Beff=5805), PF00043 (N=95, Beff=5141), PF00084 (N=56, Beff=4345), PF00013 (N=58, Beff=3785), PF00018 (N=48, Beff=3554), PF00011 (N=102, Beff=3481), PF00046 (N=57, Beff=3314), PF00035 (N=67, Beff=2254), PF00014 (N=53, Beff=1812), PF00017 (N=77, Beff=1741), PF00081 (N=82, Beff=1510), PF00105 (N=70, Beff=1277); curves: PLM, NMFI-CN.]

FIG. 10. (Color online) Contact-detection results for all the families in our study (sorted by Beff), now with the S^{CN}_{ij} score for NMFI. The y-axes are TP rates and the x-axes are the number of predicted contacts p, based on pairs with |j − i| > 4. The three curves for each method are the three regularization levels yielding the highest TP rates across all families. The thickened curve highlights the best of these three (λ = 2.3 · Beff for NMFI-CN and λJ = 0.01 for PLM).

Figure 11 shows scatter plots for the same families as in Fig. 5 but using the S^{CN}_{ij} scoring for NMFI. The points now lie more clearly on a line, from which we conclude that the bends in Fig. 5 were likely a consequence of differing scores. Yet, the trends seen in Fig. 5 remain: NMFI gives more attention to neighbor pairs than does PLM.

In Fig. 12, we recreate the contact maps of Fig. 6 with NMFI-CN in place of NMFI-DI and find that the plots are more symmetric. As expected, asymmetry is seen primarily for small |j − i|; NMFI tends to crowd these regions with many loops.

To investigate why NMFI assembles so many top-scored pairs in certain neighbor regions, we performed an explicit check of the associated MSA columns. A relevant regularity was observed: when gaps appear in a sequence, they tend to do so in long strands.

FIG. 9. (Color online) Score plotted against distance for all position pairs in all 17 families for PLM (left) and NMFI-CN (right). The vertical line is our contact cutoff at 8.5 Å.

The picture can be illustrated by the following hypothetical MSA (in our implementation, the gap state is 1):

· · ·
· · · 6 5 9 7 2 6 8 7 4 4 2 2 · · ·
· · · 1 1 1 1 1 1 1 1 1 1 2 8 · · ·
· · · 6 5 2 7 2 3 8 9 5 4 2 3 · · ·
· · · 3 7 4 7 2 6 8 7 9 4 2 3 · · ·
· · · 3 7 4 7 2 3 8 8 9 4 2 9 · · ·
· · · 1 1 1 1 1 1 1 4 5 4 2 9 · · ·
· · · 8 5 9 7 2 9 8 7 4 4 2 4 · · ·
· · · 1 1 1 1 1 1 1 1 1 1 2 4 · · ·
· · ·

We recall that gaps (‘1’ states) are necessary for satisfactory alignment of the sequences in a family, and that in our procedure we treat a gap as just another amino acid, with its associated interaction parameters. We then make the obvious observation that independent samples from a Potts model will only contain long subsequences of the same state with low probability. In other words, the model to which we fit the data cannot describe long stretches of ‘1’ states, which is a feature of the data. It is hence quite conceivable that the two methods handle this discrepancy between data and model differently, since we do expect this gap effect to generate large Jij(1, 1) for at least some pairs with small |j − i|.
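The point about long gap strands can be made concrete in a few lines; this is our own illustrative helper, using the integer encoding above (states 1..21, gap = 1):

```python
def longest_gap_run(seq, gap_state=1):
    """Length of the longest contiguous run of the gap state in one
    aligned sequence, with states encoded as integers 1..21 (gap = 1)."""
    best = run = 0
    for s in seq:
        run = run + 1 if s == gap_state else 0
        best = max(best, run)
    return best

# Two rows from the hypothetical MSA above: an ordinary sequence and a
# heavily gapped one.  Under an independent-site model the probability
# of a gap run of length L decays roughly like f(gap)^L, so runs of
# length ~10 are far more frequent in real alignments than such a
# model would allow.
print(longest_gap_run([6, 5, 9, 7, 2, 6, 8, 7, 4, 4, 2, 2]))   # 0
print(longest_gap_run([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 8]))   # 10
```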

FIG. 11. (Color online) Scatter plots of interaction scores for PLM (vertical axis, CN score) and NMFI-CN (horizontal axis, CN score) from four families: PF00018, PF00028, PF00043, PF00111. For all plots, the axes are as indicated by the top left figure. Markers distinguish neighbor pairs (|i − j| < 5), contacts (0 ≤ true distance < 8.5 Å and |i − j| ≥ 5), and non-contacts (true distance ≥ 8.5 Å and |i − j| ≥ 5). The distance unit in the top box is Å.


FIG. 12. (Color online) Predicted contact maps for PLM (right part, black symbols) and NMFI-CN (left part, green symbols online) from four families: PF00046, PF00043, PF00035, PF00028. A pair (i, j)’s placement in the plots is found by matching positions i and j on the axes. Contacts are indicated by gray (dark for d(i, j) < 5 Å and light for 5 Å ≤ d(i, j) < 8.5 Å). True and false positives are represented by circles and crosses, respectively. Each figure shows the 1.5N strongest ranked pairs (including neighbors) for that family.

Figure 13 shows scatter plots for all coupling parameters Jij(k, l) in PF00014, which has a modest amount of gap sections, and in PF00043, which has relatively many. As outlined above, the Jij(1, 1) parameters are among the largest in magnitude, especially for PF00043. We also note that the black dots steer to the right; NMFI clearly reacts more strongly to the gap–gap interactions than PLM.

FIG. 13. Scatter plots of estimated Jij,kl = Jij(k, l) from PF00014 (top) and PF00043 (bottom). Black dots are ‘gap–gap’ interactions (k = l = 1), dark gray dots are ‘gap–amino-acid’ interactions (k = 1 and l ≠ 1, or k ≠ 1 and l = 1), and light gray dots are ‘amino-acid–amino-acid’ interactions (k ≠ 1 and l ≠ 1).

Jones et al. [9] disregarded contributions from gaps in their scoring by simply skipping the gap state when doing their norm summations. We tried this but found no significant improvement for either method. The change seemed to affect only pairs with small |j − i| (which is reasonable), and our TP rates are based on pairs with |j − i| > 4. If gap interactions are indeed responsible for reduced prediction quality, removing their input during scoring is just a Band-Aid type solution. A better way would be to suppress them already in the parameter-estimation step; that way, all interplay would have to be accounted for without them. Whether or not there are ways to effectively handle the inference problem in PSP by ignoring gaps or treating them differently is an issue which goes beyond the scope of this work.
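A gap-skipping variant of the norm score can be sketched as follows. This is our own illustrative code, not the implementation of [9]; for convenience it indexes the gap state as 0 rather than 1:

```python
import numpy as np

def norm_score(J_block, skip_gap=False, gap_state=0):
    """Frobenius-norm score of one q x q coupling block J_ij(., .).
    With skip_gap=True, the rows and columns belonging to the gap
    state are excluded from the summation before taking the norm."""
    if skip_gap:
        keep = np.ones(J_block.shape[0], dtype=bool)
        keep[gap_state] = False
        J_block = J_block[np.ix_(keep, keep)]
    return float(np.sqrt((J_block ** 2).sum()))
```

A large Jij(gap, gap) entry then contributes fully to the default score but not at all to the gap-skipping one, which is why only gap-rich (small |j − i|) pairs are affected.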

We also investigated whether the gap effect depends on the sequence-similarity reweighting factor x, which up to here was chosen as x = 0.9. Perhaps the gap effect can be dampened by a stricter definition of sequence uniqueness? In Fig. 14 we show another set of TP rates, but now for x = 0.75. We also include results for NMFI run on alignment files from which all sequences with more than 20% gaps have been removed. The best regularization choice for each method turned out the same as in Fig. 10: λ = 2.3 · Beff for NMFI-CN and λJ = 0.01 for PLM. Overall, PLM maintains the same advantage over NMFI-CN it had in Fig. 10. Removing gappy sequences seems to lower more TP rates than it raises, probably since useful information in the nongappy parts is discarded unnecessarily.
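For reference, the reweighting factor x enters as follows; a minimal sketch under our reading of the procedure (sequence b gets weight 1/m_b, where m_b counts the sequences at least x-identical to b, and Beff is the sum of the weights):

```python
import numpy as np

def sequence_weights(msa, x=0.9):
    """Reweighting of a B x N integer-encoded MSA: sequence b receives
    weight 1/m_b, where m_b is the number of sequences (b included)
    sharing a fraction >= x of identical positions with b.  Returns
    the weight vector and Beff = sum of the weights."""
    identity = (msa[:, None, :] == msa[None, :, :]).mean(axis=2)  # B x B
    m = (identity >= x).sum(axis=1)
    return 1.0 / m, float((1.0 / m).sum())
```

Lowering x to 0.75 makes the uniqueness criterion stricter, so more sequences get grouped together and Beff drops, as seen in the panel headings of Fig. 14. (The B x B comparison is quadratic in the number of sequences; real implementations batch or approximate it.)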

[Figure 14: one TP-rate panel per family, sorted by Beff at x = 0.75: PF00076 (N=70, Beff=8298), PF00041 (N=85, Beff=7718), PF00027 (N=91, Beff=7082), PF00107 (N=130, Beff=7050), PF00028 (N=93, Beff=5474), PF00043 (N=95, Beff=3467), PF00084 (N=56, Beff=3176), PF00111 (N=78, Beff=3110), PF00011 (N=102, Beff=2367), PF00013 (N=58, Beff=2095), PF00018 (N=48, Beff=2030), PF00035 (N=67, Beff=1645), PF00014 (N=53, Beff=1319), PF00046 (N=57, Beff=1300), PF00017 (N=77, Beff=1133), PF00081 (N=82, Beff=645), PF00105 (N=70, Beff=644); curves: PLM (75% rew.), NMFI-CN (75% rew.), NMFI-CN (90% rew., no >20%-gap sequences).]

FIG. 14. (Color online) Contact-detection results for all the families in our study. The y-axes are TP rates and the x-axes are the number of predicted contacts p, based on pairs with |j − i| > 4. Two curves are included for reweighting margin x = 0.75, and one for reweighting margin x = 0.9 after deletion of all sequences with more than 20% gaps. Results for each method correspond to the regularization level yielding the highest TP rates across all families (λ = 2.3 · Beff for both NMFI-CN curves and λJ = 0.01 for PLM).

Appendix C: Extension to 28 protein families

To sample a larger set of families, we conducted an additional survey of 28 families, now covering lengths across the wider range of 50–400. The list is given in Table II. We here kept the reweighting level at x = 0.8 as in [2], while the TP rates were again calculated using the cutoff 8.5 Å. The pseudocount strength for NMFI was varied in the same interval as in the main text. We did not try to optimize the PLM regularization parameters for this trial, but merely used λh = λJ = 0.01 as determined for the smaller families in the main text.

Figure 15 shows qualitatively the same behavior as in the smaller set of families: TP rates increase partly from changing from the S^{DI}_{ij} score to the S^{CN}_{ij} score, and partly from changing from NMFI to PLM. Our positive results thus do not seem to be particular to short-length families.

Apart from the average TP rate for each value of p (the p strongest predicted interactions) one can also evaluate performance by different criteria. In this larger survey we investigated the distribution of values of p such that the TP rate in a family is one. Figure 16 shows histograms of the number of families for which the top p predictions are all correct, clearly showing that the difference between PLM and NMFI (using the two scores) primarily occurs at the high end. The difference in average performance between PLM and NMFI thus at least partially stems from PLM delivering more top-ranked contact predictions with 100% accuracy.
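The two evaluation criteria used here can be sketched as follows; the helper names are our own, and contacts are assumed given as a set of position pairs below the 8.5 Å cutoff:

```python
def tp_rate_curve(ranked_pairs, contact_set, min_sep=5):
    """TP rate after each accepted prediction, counting only pairs with
    |j - i| >= min_sep (i.e., |j - i| > 4 as in the text)."""
    tp, seen, rates = 0, 0, []
    for i, j in ranked_pairs:
        if abs(j - i) < min_sep:
            continue                      # neighbor pairs are not scored
        seen += 1
        if (i, j) in contact_set or (j, i) in contact_set:
            tp += 1
        rates.append(tp / seen)
    return rates

def perfect_prefix(rates):
    """Largest p such that the TP rate stays at one over the top-p
    predictions; Fig. 16 histograms this quantity over families."""
    p = 0
    for r in rates:
        if r < 1.0:
            break
        p += 1
    return p
```

The mean of `tp_rate_curve` over families gives curves like Fig. 15, while `perfect_prefix` gives the per-family quantity histogrammed in Fig. 16.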

FIG. 15. (Color online) Mean TP rates (y-axis) against the number of predicted pairs (x-axis) over the larger set of 28 families, for PLM with λJ = 0.01 and varying regularization values for NMFI-CN and NMFI-DI.


FIG. 16. (Color online) Distribution of ‘perfect’ accuracy for the three methods (PLM, NMFI-DI, NMFI-CN). The x-axis shows the number of top-ranked pairs for which the TP rate stays at one, and the y-axis shows the number of families.


ID N B Beff (80%) PDB ID UniProt entry UniProt residues

PF00006 215 10765 641 2r9v ATPA THEMA 149-365

PF00011 102 5024 2725 2bol TSP36 TAESA 106-206

PF00014 53 2393 1478 5pti BPT1 BOVIN 39-91

PF00017 77 2732 1312 1o47 SRC HUMAN 151-233

PF00018 48 5073 335 2hda YES HUMAN 97-144

PF00025 175 2946 996 1fzq ARL3 MOUSE 3-177

PF00026 314 3851 2075 3er5 CARP CRYPA 105-419

PF00027 91 12129 7631 3fhi KAP0 BOVIN 154-238

PF00028 93 12628 6323 2o72 CADH1 HUMAN 267-366

PF00032 102 14994 684 1zrt CYB RHOCA 282-404

PF00035 67 3093 1826 1o0w RNC THEMA 169-235

PF00041 85 15551 8691 1bqu IL6RB HUMAN 223-311

PF00043 95 6818 4052 6gsu GSTM1 RAT 104-192

PF00044 151 6206 1422 1crw G3P PANVR 1-148

PF00046 57 7372 1761 2vi6 NANOG MOUSE 97-153

PF00056 142 4185 1120 1a5z LDH THEMA 1-140

PF00059 108 5293 3258 1lit REG1A HUMAN 53-164

PF00071 161 10779 3793 5p21 RASH HUMAN 5-165

PF00073 171 9524 487 2r06 POLG HRV14 92-299

PF00076 70 21125 10113 1g2e ELAV4 HUMAN 48-118

PF00081 82 3229 890 3bfr SODM YEAST 27-115

PF00084 56 5831 3453 1elv C1S HUMAN 359-421

PF00085 104 10569 6137 3gnj VWF HUMAN 1691-1863

PF00091 216 8656 917 2r75 FTSZ AQUAE 9-181

PF00092 179 3936 1786 1atz VWF HUMAN 1691-1863

PF00105 70 2549 816 1gdc GCR RAT 438-507

PF00108 264 6839 2688 3goa FADA SALTY 1-254

TABLE II. Domain families included in our extended study, listed with Pfam ID, length N, number of sequences B (after removal of duplicate sequences), number of effective sequences Beff (under x = 0.8, i.e., 80% threshold for reweighting), and the PDB and UniProt specifications for the structure used to assess the DCA prediction quality.