

doi:10.1016/j.jmb.2004.10.019 J. Mol. Biol. (2004) 344, 1331–1346

A Domain Interaction Map Based on Phylogenetic Profiling

Philipp Pagel¹, Philip Wong¹ and Dmitrij Frishman²*

¹Institute for Bioinformatics, GSF-National Research Center for Environment and Health, Ingolstädter Landstraße 1, 85764 Neuherberg, Germany

²Department of Genome Oriented Bioinformatics, Technical University of Munich, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany

0022-2836/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.

Abbreviations used: DIMA, domain interaction map; TPP, thiamine pyrophosphate.

E-mail address of the corresponding author: [email protected]

Phylogenetic profiling is a well established method for predicting functional relations and physical interactions between proteins. We present a new method for finding such relations based on phylogenetic profiling of conserved domains rather than proteins, avoiding computationally expensive all versus all sequence comparisons among genomes. The resulting domain interaction map (DIMA) can be explored directly or mapped to a genome of interest. We demonstrate that the performance of DIMA is comparable to that of classical phylogenetic profiling and its predictions often yield information that cannot be detected by profiling of entire protein chains. We provide a list of novel domain associations predicted by our method.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: protein–protein interactions; phylogenetic profiling; protein domains; genome analysis

*Corresponding author

Introduction

Similarity-free methods for protein function prediction explore genomic context to establish relations between genes which are not detectable by standard sequence alignment techniques. In very general terms, genomic context can be described as any statistical, physical or biological property of genes, which can be observed or measured, such as chromosomal location, expression patterns, and taxonomic distribution. Genes displaying statistically significant resemblance of genomic context usually act together in some cellular process, typically a metabolic or regulatory pathway.¹

One of these methods, termed phylogenetic profiling, relies on the correlation of protein occurrence across a set of genomes to predict functional associations.² A protein is assigned a 1 for each genome in which an ortholog occurs and a 0 otherwise. Applied across genomes, this yields a string of 1s and 0s, the phylogenetic profile. When two or more proteins have similar patterns of occurrence, this may indicate that the proteins interact with each other directly or share a common functional role.³ The underlying idea is that many pathways or complexes require all their members to be present in order to fulfil their functions. This “all or none” pattern of occurrence tends to be characteristic for many interacting genes.⁴
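The profile construction described above can be illustrated with a minimal sketch. The genome names, family names and presence data below are invented for illustration, and ortholog detection itself is assumed to have happened elsewhere.

```python
from typing import Dict, List, Set


def build_profiles(genomes: List[str], presence: Dict[str, Set[str]]) -> Dict[str, str]:
    """Return one 0/1 string per protein family, one character per genome."""
    families = sorted(set().union(*presence.values()))
    return {
        fam: "".join("1" if fam in presence[g] else "0" for g in genomes)
        for fam in families
    }


def bit_distance(a: str, b: str) -> int:
    """Number of genomes in which two profiles disagree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical input: which protein families have an ortholog in each genome.
genomes = ["g1", "g2", "g3", "g4"]
presence = {
    "g1": {"famA", "famB"},
    "g2": {"famA", "famB", "famC"},
    "g3": {"famC"},
    "g4": {"famA", "famB"},
}

profiles = build_profiles(genomes, presence)
for family, profile in profiles.items():
    print(family, profile)
print(bit_distance(profiles["famA"], profiles["famB"]))
print(bit_distance(profiles["famA"], profiles["famC"]))
```

In this toy input famA and famB always occur together, so their profiles are identical, whereas famC follows a different pattern.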

One disadvantage of phylogenetic profiling, however, is its high computational cost. In order to assess existence or absence of proteins across genomes, all-against-all comparison of entire genomes by similarity searching techniques, such as BLAST,⁵ is required. With the number of available sequenced genomes in our PEDANT⁶ genome database approaching 300, and the total number of genes in these genomes exceeding one million, this process requires an astronomical number of pairwise sequence comparisons and enormous disk space to save resulting alignments. Maintaining such an all-against-all system and updating it to include new genomes represents a major technical challenge. We are aware of only two publicly available web resources offering such a service;⁷,⁸ no portable compact tools to perform phylogenetic profiling exist to our knowledge.

Molecular interactions are mediated by a great variety of widely spread interaction domains that are frequently combined in proteins in a complicated mosaic fashion.⁹ Quite often a protein A will use one of its domains to interact with protein B, and another domain to interact with protein C. It is those domains, and not entire protein chains, that often represent major functional entities in cellular interaction networks. Several high-quality sequence domain databases exist (PFAM,¹⁰ SMART¹¹), more recently integrated in the Interpro resource.¹² Extensive web sites offer rich data mining capabilities and allow the study of domain combinations in protein chains and their taxonomic distribution.

Here, we sought to explore phylogenetic profiling of individual protein domains, rather than entire protein chains, to build a map of domain–domain relations. While implementing our method, we drew inspiration from several innovative computational approaches to protein function prediction and analysis developed in recent years:

(1) The original method of protein phylogenetic profiling²,¹³ and a conceptually related technique exploiting the similarity of phylogenetic trees.¹⁴

(2) Exploration of gene fusion events whereby separate amino acid chains encoded in one (typically prokaryotic) genome are merged into a single gene product in another (eukaryotic) genome.¹⁵,¹⁶

(3) Analysis and representation of the taxonomic distribution of sequence domains available through domain database web sites, such as SMART,¹¹ PFAM,¹⁰ or CDART.¹⁷

(4) Investigation of domain combinations in proteins.¹⁸

(5) Analysis of occurrence patterns of structural domains¹⁹ as described in the SCOP database.²⁰

(6) Genome occurrence in Clusters of Orthologous Groups using principal component analysis.²¹

(7) Inferring domain interactions from known protein–protein interactions.²²

Our method, called domain interaction map (DIMA), represents a synthesis of many of the approaches and ideas listed above. The basic algorithm of phylogenetic profiling is combined with domain detection to delineate clusters of individual domains, rather than complete gene products, that occur in a coordinated fashion. These clusters may be represented in the form of domain–domain interaction networks, yielding novel insights into the complex interplay of protein modules in cellular processes. In particular, DIMA may provide hints about potential interaction domains. In addition to enhanced capabilities for predicting biomolecular interactions, DIMA has several technical advantages over traditional protein profiling. It does not require exhaustive all-against-all comparison of genomic proteins. Detection of sequence domains needs to be conducted only once for each genome added to the system, a task which is linear in the number of gene products in the genome. As soon as domain finding in freshly added genomes is finished, phylogenetic profiles and resulting domain clusters can be re-calculated instantly. Updating such a profiling system is only necessary when new releases of domain databases are made available.
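To make the update argument concrete, the following sketch derives domain profiles from per-genome domain annotations and then adds one more genome. It is an illustration only: the annotation dictionary, the protein names and the domain identifiers (PF_A, PF_B, PF_C) are hypothetical, and the domain search that would produce such annotations is not shown.

```python
from typing import Dict, Set

# Hypothetical annotations: genome -> protein -> set of detected domain IDs.
# In practice each genome would be scanned against the domain database once.
annotations: Dict[str, Dict[str, Set[str]]] = {
    "genome1": {"p1": {"PF_A", "PF_B"}, "p2": {"PF_C"}},
    "genome2": {"p3": {"PF_A"}, "p4": {"PF_B", "PF_C"}},
    "genome3": {"p5": {"PF_C"}},
}


def domains_in_genome(proteins: Dict[str, Set[str]]) -> Set[str]:
    """Collapse the proteins of one genome into the set of domains it encodes."""
    found: Set[str] = set()
    for domains in proteins.values():
        found |= domains
    return found


def domain_profiles(annotations: Dict[str, Dict[str, Set[str]]]) -> Dict[str, str]:
    """One 0/1 string per domain, one character per genome (insertion order)."""
    per_genome = [domains_in_genome(annotations[g]) for g in annotations]
    all_domains = sorted(set().union(*per_genome))
    return {
        d: "".join("1" if d in present else "0" for present in per_genome)
        for d in all_domains
    }


profiles = domain_profiles(annotations)
print(profiles)

# Adding a genome only requires annotating its own proteins; no comparison
# against the proteins of the genomes already in the system is needed.
annotations["genome4"] = {"p6": {"PF_A", "PF_B"}}
updated = domain_profiles(annotations)
print(updated)
```

The cost of the new genome is one pass over its proteins, after which every profile simply gains one more character.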

In Figure 1 we provide a graphical overview of the DIMA technique and highlight the key differences between DIMA and whole-protein phylogenetic profiling. In our example we consider six genomes and eight gene products, each consisting of one or more structural domains (Figure 1(a)). Figure 1(b) represents the phylogenetic profiles describing the occurrence of the five individual domains in genomes, and the resulting domain interaction network. Figure 1(c) illustrates the results obtained for the same example using standard phylogenetic profiling. Clearly, the two methods considered produce fundamentally different association networks and are in fact complementary.
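The thresholding step of the Figure 1 example can be reproduced for the profiles that the Figure 1 legend states explicitly. The sketch below uses only those values: domain A (present in genomes 1 to 4), domain B (111100) and domain D (011111). The profiles of domains C and E are not given in the text, so they are left out rather than guessed.

```python
from itertools import combinations

# Domain profiles quoted in the Figure 1 legend (six genomes).
profiles = {
    "A": "111100",  # present in genomes 1 to 4, absent in 5 and 6
    "B": "111100",
    "D": "011111",
}

MAX_BIT_DIFFERENCE = 2  # threshold used in the Figure 1 example


def bit_distance(a: str, b: str) -> int:
    """Number of positions in which two equally long profiles differ."""
    return sum(x != y for x, y in zip(a, b))


# Two domains are linked when their profiles differ in at most two positions.
edges = [
    (x, y)
    for x, y in combinations(sorted(profiles), 2)
    if bit_distance(profiles[x], profiles[y]) <= MAX_BIT_DIFFERENCE
]

print(bit_distance(profiles["B"], profiles["D"]))
print(edges)
```

With these three profiles only A and B are linked; B and D differ in three positions and therefore stay unconnected, as stated in the legend.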

Results

Throughout this study, we explored the properties of DIMA in direct comparison to the classical protein profiling method, which we term CLASSIC. The CLASSIC approach predicts relations between proteins, whereas DIMA predicts relations between domains. To facilitate comparison between the two methods, the predicted domain relations from DIMA were mapped to proteins containing the respective domains (see Methods). Each method could then be evaluated against functional annotation and interaction data of whole proteins. For this study, we evaluated DIMA based on the Saccharomyces cerevisiae proteome.

Distribution of profile distances, entropy filtering and clustering

A preliminary performance assessment of DIMA and CLASSIC was done by comparing the distribution of profile distances between interacting proteins and those belonging to control sets consisting of random protein pairs (see Methods). In the case of DIMA, given an interacting pair, at least one PFAM domain needs to be present in each protein in order to determine the profile distance. Therefore, only those pairs are included in the distance distributions for DIMA while the full dataset can be used for CLASSIC. Figure 2(a)–(d) shows the results for the BORKH dataset, which contains only high-confidence interaction pairs of proteins in S. cerevisiae (see Methods).

With both methods there is a pronounced bias towards low bit distances in the experimental dataset (Figure 2(a) and (b)). In the corresponding random controls, a similar bias is present although to a lesser extent. Inspection of the data revealed that many of the protein pairs with similar profiles in the control group consist of ubiquitous proteins that are present in almost all genomes (with phylogenetic profiles consisting of all 1s) and proteins specific to one genome (with profiles of all 0s except for that genome). As a result, proteins that are not likely to interact or share a common functional role have near-identical profiles, which would cause a large number of false positives to be predicted.

From the information theory point of view, the information content of a profile depends on the frequency of each possible letter (in our case 0 and 1) in the string.²³ Figure 2(e) depicts the entropy h (information content) of a 46 bit profile (corresponding to our profiles of 46 genomes) depending on the number of bits set to 1. Clearly, the information content is maximal when the profile contains as many 0s as 1s and rapidly decreases as the ratio of the two is shifted too much towards one side. In order to reduce the number of false positives based on low information content, we applied an entropy threshold, excluding all profiles with entropies of less than 0.3 from our analysis. After this filtering step, the bias towards near identical profiles in the experimental dataset remains significant while the distances in the control dataset almost follow a normal distribution for both DIMA and CLASSIC (Figure 2(c) and (d)). We find the same effect in all PPI datasets used to varying extents (see Supplementary Material).

The distance distributions of DIMA and CLASSIC suggest that both methods should be well suited to make valid predictions. But can we expect DIMA to generate predictions not produced by CLASSIC profiling? To provide an answer to this question, we plotted CLASSIC profile distances against those generated by DIMA for each protein pair in each PPI dataset used, excluding those pairs without any detectable PFAM domains. Figure 3 shows the resulting scatter plots for the BORKH dataset and its random control before and after entropy filtering. The Pearson correlation coefficients of the scatter plots range from 0.35 to 0.53 in the BORKH dataset and from 0.23 to 0.70 in all datasets (see Supplementary Material). Clearly, DIMA bit distances do not correlate well with those generated by CLASSIC. Thus, a weak overlap between DIMA and CLASSIC predictions is expected.

Figure 1. General overview of the DIMA method and comparison with whole-protein phylogenetic profiling. (a) Input data. Six genomes (Genome 1, Genome 2, ..., Genome 6) encode different combinations of eight proteins, denoted P1, ..., P8. Each protein may consist of one or two structural domains; there are a total of five different domains involved (A, B, C, D, E). (b) Results obtained using the DIMA technique. In the upper part the domain phylogenetic profiles are presented, with 1 and 0 indicating the presence or absence of a given domain in a particular genome. For example, domain A is present in genomes 1 to 4, and is absent in genomes 5 and 6. In the bottom part, the resulting domain interaction network is shown using a maximal allowed bit difference of 2 between phylogenetic profiles. This means that two domains are considered interacting (and get joined into a cluster) if their profiles are different in no more than two positions. In our example, this is the case with A, B, and C, as well as with C and D and also with D and E. There is no connection between B and D, for instance, because their profiles (111100 and 011111) are different in three positions. (c) Results obtained by the standard, whole-protein phylogenetic profiling. Phylogenetic profiles for the eight proteins considered are shown in the upper part, the resulting protein interaction network in genome 3 (also using the bit difference threshold equal 2) below.

Figure 2. Distance distributions and entropy filtering. (a)–(d) The profile distance distributions for interacting proteins and random controls based on the BORKH PPI dataset before (a) and (b), and after (c) and (d) entropy filtering. (a) DIMA distances: approximately 30% of all interaction pairs that have at least one PFAM domain in either protein have a DIMA distance of 0 while only ≈10% of the random pairs have identical DIMA profiles. As described in the text, many of those are low information profiles, which should be removed from the dataset. (b) CLASSIC distances: as for DIMA we see a bias towards identical and near identical profiles in both the PPI and the control pairs, which is stronger for the PPIs. (c) and (d) After discarding all profiles with an entropy below 0.3, the control distances in the pairs show no noticeable bias. In the PPI pairs the preference for similar profiles remains strong for both DIMA and CLASSIC. (e) Illustration of the entropy in a profile of 46 bits based on the number of bits set to 1. Filtering with an entropy threshold of h = 0.3 (dotted line) will eliminate all profiles with fewer than three zero or one bits in the profile.
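The entropy filter can be sketched as follows. The exact formula is given in Methods, which is not part of this excerpt, so the code assumes the standard binary Shannon entropy in bits. That assumption is consistent with the Figure 2(e) legend: with 46 genomes, a threshold of 0.3 rejects profiles that have fewer than three 1s or fewer than three 0s. The function names are illustrative.

```python
import math

ENTROPY_THRESHOLD = 0.3  # threshold stated in the text


def profile_entropy(profile: str) -> float:
    """Binary Shannon entropy (in bits) of a 0/1 profile string.

    Assumption: h = -(p*log2(p) + (1-p)*log2(1-p)), with p the fraction of 1s.
    """
    p = profile.count("1") / len(profile)
    if p in (0.0, 1.0):
        return 0.0  # a constant profile carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def passes_entropy_filter(profile: str) -> bool:
    """Keep a profile unless its entropy is below the threshold."""
    return profile_entropy(profile) >= ENTROPY_THRESHOLD


def profile_with_ones(k: int, length: int = 46) -> str:
    """Helper for the demonstration: a profile with k ones followed by zeros."""
    return "1" * k + "0" * (length - k)


for k in (1, 2, 3, 23, 43, 44):
    h = profile_entropy(profile_with_ones(k))
    print(k, round(h, 3), passes_entropy_filter(profile_with_ones(k)))
```

Under this assumption, near-ubiquitous and near-unique profiles are removed symmetrically, since the entropy depends only on the ratio of 1s to 0s.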

Figure 3. Only weak correlation between DIMA and CLASSIC distances. The graphs plot the DIMA distance against the CLASSIC distance of protein pairs in the BORKH dataset (a) and (c), and the corresponding control set (b) and (d) before and after entropy filtering. Visually, there appears to be only a very weak correlation between both measures, if any. This is confirmed by the Pearson correlation coefficients which indicate only a weak correlation (a: r = 0.53, b: r = 0.46, c: r = 0.35, d: r = 0.39). DIMA distances are largely independent of CLASSIC distances.

To predict functional associations between proteins using DIMA, we grouped PFAM domains into clusters using a distance threshold of three bits and applying the entropy filter and entropy based distance weighting (see Methods). The PFAM domain database used in our analysis contained 5049 distinct domain entries, 2947 of which were detected in the 46 genomes used in this study. A total of 1229 domains were excluded based on low entropy (h < 0.3). Out of the remaining 1718 domains, 562 were assigned to a total of 175 unique clusters, which contained between 2 and 63 members (median = 5, mean = 11.6). These domain clusters were then mapped onto the proteome of S. cerevisiae, producing 68 distinct protein clusters containing between 1 and 145 proteins (mean = 23.4, median = 10; 503 proteins total).
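The mapping of domain clusters onto a proteome can be sketched as below. The actual procedure is described in Methods and is not reproduced here, so this is one plausible reading: a protein joins a protein cluster if it carries at least one domain of the domain cluster, clusters with no matching protein are dropped, and identical protein clusters are reported once. All identifiers are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def map_clusters_to_proteins(
    domain_clusters: List[Set[str]],
    protein_domains: Dict[str, Set[str]],
) -> List[FrozenSet[str]]:
    """Translate domain clusters into distinct, non-empty protein clusters."""
    protein_clusters: List[FrozenSet[str]] = []
    for cluster in domain_clusters:
        # A protein is a member if it contains any domain of the cluster.
        members = frozenset(
            protein for protein, domains in protein_domains.items() if domains & cluster
        )
        if members and members not in protein_clusters:
            protein_clusters.append(members)
    return protein_clusters


# Hypothetical domain clusters and proteome annotation.
domain_clusters = [
    {"PF_A", "PF_B"},
    {"PF_A", "PF_Z"},  # maps to the same proteins as the first cluster
    {"PF_X", "PF_Y"},  # none of these domains occur in the proteome
    {"PF_C", "PF_X"},  # only one protein carries a member domain
]
protein_domains = {
    "prot1": {"PF_A"},
    "prot2": {"PF_B", "PF_Z"},
    "prot3": {"PF_C"},
}

protein_clusters = map_clusters_to_proteins(domain_clusters, protein_domains)
for cluster in protein_clusters:
    print(sorted(cluster))
```

Dropped and merged clusters in this toy case mirror why the number of protein clusters can be smaller than the number of domain clusters, and why single-protein clusters can arise.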

Out of 6723 proteins from S. cerevisiae 1663 had CLASSIC profiles above the entropy threshold of 0.3. Of those, 923 were grouped in 239 clusters of 2 to 236 proteins (mean = 16, median = 6) using the same distance threshold and weighting as for DIMA. Figure 4 shows the cluster size distributions for DIMA domain clusters, the corresponding protein clusters in S. cerevisiae and CLASSIC protein clusters.

Proteins in clusters are predicted to be functionally related. The weak correlation between bit distances generated by DIMA and those generated by CLASSIC suggests that there will be poor overlap between the two methods in terms of these relations. DIMA produced a total of 14,873 predicted relations. Only 3616 (24%) of those were predicted by CLASSIC profiling and 1332 (10%) are due to the presence of common domains. 70% are independent of both. Vice versa, CLASSIC yielded 27,464 predicted relations, 13.2% of which were predicted by DIMA. The majority of predictions made by DIMA and CLASSIC are exclusive to each method. Thus, the two techniques are complementary to each other.
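The overlap figures are simple set arithmetic on predicted protein pairs. The sketch below first shows the computation on invented relations and then recomputes the two shared-prediction percentages from the counts quoted above, assuming that the same 3616 shared relations underlie both percentages.

```python
def relation(a: str, b: str) -> frozenset:
    """An unordered predicted relation between two proteins."""
    return frozenset((a, b))


# Hypothetical predictions from the two methods.
dima = {relation("p1", "p2"), relation("p1", "p3"), relation("p2", "p4"), relation("p5", "p6")}
classic = {
    relation("p2", "p1"),  # same pair as ("p1", "p2"); order does not matter
    relation("p3", "p4"),
    relation("p5", "p6"),
    relation("p6", "p7"),
    relation("p7", "p8"),
}

shared = dima & classic
fraction_of_dima = len(shared) / len(dima)
fraction_of_classic = len(shared) / len(classic)
print(len(shared), fraction_of_dima, fraction_of_classic)

# Counts reported in the text.
DIMA_TOTAL = 14873
CLASSIC_TOTAL = 27464
SHARED = 3616

percent_of_dima = round(100 * SHARED / DIMA_TOTAL, 1)
percent_of_classic = round(100 * SHARED / CLASSIC_TOTAL, 1)
print(percent_of_dima, percent_of_classic)
```

The recomputed values agree with the 24% and 13.2% given in the text.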

Dissection of example DIMA predictions

DIMA clusters proteins with identical domains

An example in which DIMA excels over the CLASSIC method involves the copper chaperone Atx1 (YNL259C). Atx1 delivers copper to Ccc2 (YDR270W or Menkes and Wilson's disease protein in humans), a P-type ATPase present in the membranes of secretory vesicles, which mediates the import of copper into the secretory compartments from the cytosol. Atx1 physically binds the N-terminal domain of Ccc2 as shown by two-hybrid experiments.²⁴ The N termini of Ccc2 and Atx1 both contain HMA (PF00403) domains, which interact with each other to allow copper transfer between the two proteins. This mechanism is supported by analysis of the crystal structure of the human Hah1 complex containing two HMA domains linked by a copper ion.²⁵ The association of Atx1 and Ccc2 is easily predicted by DIMA as both contain HMA domains (Table 1A). However, Ccc2 also contains an E1-E2_ATPase (PF00122) and a Hydrolase (PF00702) domain, both of which are much more conserved than the HMA domain and occupy over 40% of the entire protein sequence. Because of this difference in sequence conservation, a 31 bit difference between the profiles of Atx1 and Ccc2 is generated by the CLASSIC method, making association of these proteins extremely difficult.

Figure 4. Cluster size distributions. (a) DIMA domain clusters. (b) DIMA protein clusters in S. cerevisiae. (c) CLASSIC protein clusters. The overall distribution of cluster sizes is similar for all three cases. After mapping domain clusters to the yeast proteome, the total number of clusters decreases while the cluster size increases. CLASSIC profiling produces more clusters, which on average are smaller than the DIMA protein clusters.

It is easy to see that DIMA will cluster proteins with identical domains, simply because the profile bit distance would be 0. Changes in cellular copper ion concentration will likely affect all proteins with exposed HMA domains. Thus, DIMA clusters many proteins that are functionally connected at the level of the common domain. However, if proteins sharing the same domains are required to be separated on the basis of function defined by sequences not in such domains, the CLASSIC approach will be more suitable.

DIMA clusters proteins without common domains

Out of the 68 S. cerevisiae DIMA clusters 42 contain proteins that do not share any common domain. Two such proteins clustered by DIMA are the DNA mismatch repair proteins, Msh6 and Mlh1 (Table 1B). Msh6 contains the MutS_V (PF00488) domain, which has a 1 bit difference with the DNA_mis_repair (PF01119) domain contained in Mlh1. Both proteins have been shown to form a complex on mismatch containing DNA.²⁶ Because of conservation differences in sequences outside the MutS_V and the DNA_mis_repair domains, Msh6 and Mlh1 show a CLASSIC profile distance of 9.
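The two distances quoted for Msh6 and Mlh1 can be recomputed from the profiles listed in Table 1B. The sketch below copies the four 46 bit strings from that table and counts differing positions; the dictionary layout and variable names are illustrative only.

```python
def bit_distance(a: str, b: str) -> int:
    """Number of positions in which two equally long profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Whole-protein (CLASSIC) profiles from Table 1B.
classic = {
    "Msh6": "0000000000001111011100101110011111101111000111",
    "Mlh1": "0010000001011111101101101110011111111101001111",
}

# Domain (DIMA) profiles from Table 1B.
domain = {
    "PF00488 (MutS_V)": "0010000001111111111111111110011111111111101111",
    "PF01119 (DNA_mis_repair)": "0010000001111111111111101110011111111111101111",
}

classic_distance = bit_distance(classic["Msh6"], classic["Mlh1"])
dima_distance = bit_distance(
    domain["PF00488 (MutS_V)"], domain["PF01119 (DNA_mis_repair)"]
)

# Table 1B reports a bit difference of 9 for CLASSIC and 1 for the domains.
print(classic_distance, dima_distance)
```

With a clustering threshold of three bits, only the domain profiles are close enough to link the two proteins.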

Suggestions of an ancient connection by DIMA

In Table 1C, we show a cluster of three domains (PF00475, PF00815, PF00977) belonging to three different proteins, which are related to histidine biosynthesis: His3, His4, and His7. This is an example of a cluster linking enzymes functionally related by their roles in a metabolic pathway. DIMA also clusters these enzymes with those containing thiamine pyrophosphate (TPP; PF02776) domains. TPP is an essential coenzyme²⁷ and has been associated with histidine biosynthesis pathways by mutational²⁸ and radioactive incorporation studies.²⁹ The profiles of TPP and histidine biosynthesis related domains show that these domains are spread across all three phylogenetic domains. Their ubiquity suggests an ancient connection between these domains. This connection is not as apparent if one examines the CLASSIC profiles of associated whole proteins because many large bit distances exist between them.

Table 1. Illustrations of DIMA predictions

Profile basis             Protein          Profile                                         Bit dif

A. The copper transporting P-type ATPase Ccc2 and copper chaperone Atx1 have been shown to interact, experimentally
CLASSIC                   YDR270W (Ccc2)   1111011111111010101111111111011110101111010111  31
CLASSIC                   YNL259C (Atx1)   0000000000000000000000000000001010100100000011
PF00403 (HMA)             YDR270W (Ccc2)   1111011111111010100111111111011110111111100111  0
PF00403 (HMA)             YNL259C (Atx1)   1111011111111010100111111111011110111111100111

B. DNA mismatch repair proteins Msh6 and Mlh1 form a complex on mismatch containing DNA
CLASSIC                   YDR097C (Msh6)   0000000000001111011100101110011111101111000111  9
CLASSIC                   YMR167W (Mlh1)   0010000001011111101101101110011111111101001111
PF00488 (MutS_V)          YDR097C (Msh6)   0010000001111111111111111110011111111111101111  1
PF01119 (DNA_mis_repair)  YMR167W (Mlh1)   0010000001111111111111101110011111111111101111

C. Enzymes related to TPP and the biosynthesis of histidine are grouped together by DIMA
CLASSIC                   YDL080C (Thi3)   0000001000000000000000001001000000101001000011  34
CLASSIC                   YDR380W (Aro10)  0000000000000000000000001001000000101000000011
CLASSIC                   YEL020C          0101101100110000000001101101010100101011001111
CLASSIC                   YGR087C (Pdc6)   0101000000000000000000001001000000101001000011
CLASSIC                   YJR085C          0000000000000000000000000000000000000000000011
CLASSIC                   YLR044C (Pdc1)   0000000000000000000000001001000000101010000011
CLASSIC                   YLR134W (Pdc5)   0101001000100000000000001001001100101001000011
CLASSIC                   YMR108W (Ilv2)   1101111111111011100111101111011110111011001111
CLASSIC                   YOR202W (His3)   0111110101111011100011101111011110111011001111
CLASSIC                   YCL030C (His4)   0111100101111011100011101111011110111011001111
CLASSIC                   YBR248C (His7)   0000001000000000000000001001000000101001000011
PF02776 (TPP_enzyme_N)    YDL080C (Thi3)   1111111111111011100111101111011110111011001111  4
PF02776 (TPP_enzyme_N)    YDR380W (Aro10)  1111111111111011100111101111011110111011001111
PF02776 (TPP_enzyme_N)    YEL020C          1111111111111011100111101111011110111011001111
PF02776 (TPP_enzyme_N)    YGR087C (Pdc6)   1111111111111011100111101111011110111011001111
PF02776 (TPP_enzyme_N)    YJR085C          1111111111111011100111101111011110111011001111
PF02776 (TPP_enzyme_N)    YLR044C (Pdc1)   1111111111111011100111101111011110111011001111
PF02776 (TPP_enzyme_N)    YLR134W (Pdc5)   1111111111111011100111101111011110111011001111
PF02776 (TPP_enzyme_N)    YMR108W (Ilv2)   1111111111111011100111101111011110111011001111
PF00475 (IGPD)            YOR202W (His3)   0111110101111011100011101111011110111011001111
PF00815 (Histidinol_dh)   YCL030C (His4)   0111110101111011100011101111011110111011001111
PF00977 (His_biosynth)    YBR248C (His7)   0111110101111011100011101111011110111011001111

D. A group of domains of unknown function from hypothetical T. maritima proteins; based on the DIMA prediction we expect those proteins to share a functional role which is yet to be elucidated
PF03750 (DUF310)          Hypothetical     0001100000000000000000000000000000000001000000  1
PF03757 (DUF314)          Hypothetical     0001100000000000000000000001000000000001000000
PF03787 (DUF324)          Hypothetical     0001100000000000000000000001000000000001000000

Phylogenetic profiles based on the CLASSIC method as well as DIMA based on PFAM domains (PFXXXXX) are shown. The profiles are listed in the same order as genomes shown in Table 4. The entry in the first column is a PFAM domain or CLASSIC, indicating whether the profile is generated from DIMA or CLASSIC profiling, respectively. The maximum bit distance corresponding to these profiles is shown in the last column, for each example. The first three examples involve proteins from S. cerevisiae and show cases in which DIMA excels over CLASSIC. The fourth example shows DIMA linking uncharacterized domains.

Relating domains of unknown function

The PFAM database contains many entries described as “domain of unknown function”, which are found in known and/or hypothetical proteins. In this situation, DIMA may produce relations that may lead to functional characterization. For example, the uncharacterized domains DUF310 (PF03750), DUF314 (PF03757), and DUF324 (PF03787) are clustered together by DIMA (Table 1D). Proteins containing these domains are predicted to have in common a yet unknown physiological role.

Protein and domain interaction networks

It is important to note that clusters derived from both DIMA and CLASSIC methods are redundant in the sense that individual domains or proteins may be grouped into more than one cluster (see Methods). Each cluster represents a group of direct neighbors in a complex network of relations. Figure 5 shows a network of domains based on DIMA domain clusters. Many of the graphs in the network contain a small number of connected nodes, representing domains predicted to be functionally related. A small number of graphs contain a large number of nodes. Some of these graphs represent large complexes such as those containing domains found in the ribosomal complex or the proteasome.
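The relationship between redundant clusters and the graphs of the network can be illustrated with a small sketch. It assumes one reading of the description above: each cluster consists of a domain together with all domains whose profiles lie within the distance threshold, and a graph of the network is a connected component formed by overlapping clusters. The entropy based distance weighting mentioned in Methods is omitted, and the profiles and threshold are invented.

```python
from typing import Dict, FrozenSet, List, Set


def bit_distance(a: str, b: str) -> int:
    """Number of positions in which two equally long profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def neighbor_clusters(profiles: Dict[str, str], max_distance: int) -> List[FrozenSet[str]]:
    """One cluster per domain: the domain plus its direct neighbors."""
    clusters: List[FrozenSet[str]] = []
    for center in sorted(profiles):
        members = frozenset(
            other
            for other in profiles
            if bit_distance(profiles[center], profiles[other]) <= max_distance
        )
        # Skip singletons and clusters that were already found.
        if len(members) > 1 and members not in clusters:
            clusters.append(members)
    return clusters


def connected_components(clusters: List[FrozenSet[str]]) -> List[Set[str]]:
    """Merge overlapping clusters into the separate graphs of the network."""
    components: List[Set[str]] = []
    for cluster in clusters:
        merged = set(cluster)
        untouched = []
        for component in components:
            if component & merged:
                merged |= component
            else:
                untouched.append(component)
        components = untouched + [merged]
    return components


# Hypothetical 8 bit profiles forming a chain w - x - y, plus an isolated z.
profiles = {
    "w": "11110000",
    "x": "11110001",
    "y": "11110011",
    "z": "00001111",
}

clusters = neighbor_clusters(profiles, max_distance=1)
components = connected_components(clusters)
print([sorted(c) for c in clusters])
print([sorted(c) for c in components])
```

Domain x appears in all three clusters, which is the kind of redundancy described above, while the three clusters together form a single graph.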

Figure 5. Domain interaction network. The Figure shows the complete network of domain relations without singlets. Domains are shown as nodes and edges represent predicted relations. Most graphs contain a small number of domains with related functions. The few very large graphs represent broad categories of nodes involved in, e.g. the proteasome or the ribosomal complex. Sometimes such large clusters consist of a big core of highly connected nodes and one or more loosely connected satellite clusters which represent different functional entities.

We map the domain network to S. cerevisiae proteins containing the respective domains (Figure 6). The resulting protein network is similar in appearance to the domain network. Because certain domains are found in multiple proteins, however, the mapping of the domain relationships to these proteins yields slightly larger sub-graphs than those found in the domain graph.

For comparison, we show the protein interaction network obtained by CLASSIC profiling in Figure 7. Overall, this network is similar in structure to the DIMA-based network. Since the clusters predicted by DIMA and CLASSIC overlap only slightly, the DIMA- and CLASSIC-based networks are complementary and can be merged to obtain a more complete picture of the S. cerevisiae interactome.

Validation against protein–protein interaction datasets

An important benchmark for the prediction of related proteins is the evaluation based on experimental datasets of interacting proteins. While functionally related proteins do not necessarily engage in physical binding, the reverse can be assumed to be true: if two proteins interact in vivo we can be reasonably certain that they are involved in at least one common cellular function. Therefore, PPI data represent a subset of functional links which is well suited for the validation of our prediction technique.

Figure 6. DIMA protein interaction network in S. cerevisiae. The domain interaction map in Figure 5 can be mapped onto a specific genome of interest based on the domains found in each protein. The resulting protein relation graph represents predicted functional relations between the respective proteins. The Figure shows the respective graph in S. cerevisiae. Nodes correspond to proteins and edges represent predicted relations. Edges drawn in blue can also be found by CLASSIC profiling. Since many domains (and proteins) are specific for certain species or groups of organisms, some domain clusters cannot be mapped to the selected genome. For example, the bacterial flagellum is not found in yeast.

For each of the datasets described in Methods, we generated a negative control of 100,000 random pairs, tabulated the DIMA and CLASSIC predictions against the known interactions and controls, respectively, and computed sensitivity and specificity values from the resulting contingency tables. Positive and negative predictive values could not be computed from these data because of the fixed sizes of the control and experimental groups. As shown in Table 2A, depending on the dataset, real interactions are 6 to 27 times more likely to be predicted by DIMA compared with randomly generated ones. χ²-tests show these ratios to be significant at the 0.1% level in all cases. Many interactions cannot be correctly predicted by DIMA because of the lack of PFAM domains in the proteins involved. As more domains are characterized and included in the PFAM database, we expect the performance of DIMA to improve. In order to estimate DIMA performance independent of the incompleteness of PFAM, we repeated the analysis of each dataset with the exclusion of all proteins without any PFAM domains detected (Table 2B). This filtering step substantially increased the sensitivity of our predictions.
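The remark about predictive values can be made concrete with a short Python sketch. The counts below are hypothetical and chosen only to show the effect; the function name is likewise an assumption. Sensitivity and specificity are each computed within one group, so they do not depend on the arbitrary size of the control set, whereas the predictive values mix both groups and therefore do.

```python
def contingency_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, PPV and NPV from the counts of a 2x2 table.

    tp / fn: known interactions that were / were not predicted
    fp / tn: control pairs that were / were not predicted
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts: 1,000 known interactions and 100,000 control pairs.
large_control = contingency_metrics(tp=50, fn=950, fp=100, tn=99_900)
# Same prediction rates, but a control set that is ten times smaller.
small_control = contingency_metrics(tp=50, fn=950, fp=10, tn=9_990)
```

Both calls give the same sensitivity and specificity, but the PPV changes from one third to five sixths purely because the ratio of control pairs to known interactions was changed.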

It is interesting to examine the change of sensitivity and “fold” (see Methods) at different reliability levels in a dataset. Both numbers appear to increase as we raise the confidence threshold from low to medium and high in the von Mering et al. data (BORKL, BORKM and BORKH).

With respect to all proteins, averaging over all datasets, CLASSIC is slightly more sensitive than DIMA while DIMA produces better “fold” ratios than CLASSIC (Table 2A). If we exclude proteins without PFAM domains, DIMA becomes more sensitive than CLASSIC and the difference in fold between the two methods is attenuated (Table 2B).

We would like to point out that our estimates are heavily influenced by the size and quality of the PPI data, which serve as the standard of truth. For a true estimate of performance, one would need a representative random sample from the proteome with complete information about protein interactions (and functional links). Such a sample does not exist and is unlikely to become available in the near future. All current datasets are highly incomplete and at the same time contaminated by false positives. Incompleteness results in overestimation of false-positive predictions while false-positive interactions in the validation data inflate the number of predictions counted as false-negative.

Figure 7. CLASSIC protein interaction network in S. cerevisiae. Nodes represent proteins and edges depict predicted relations. Edges drawn in blue were predicted by both CLASSIC and DIMA. The protein network derived from CLASSIC profiling shows the same global structure as the one produced by DIMA. Because of the higher sensitivity of CLASSIC the resulting network is larger. Nevertheless, the DIMA predictions add substantial information since the overlap between both networks is surprisingly small.

Validation against functional annotation

Functional links that do not involve physical interactions may be predicted by DIMA and CLASSIC, but are not represented in datasets of physical protein–protein interactions. To evaluate these two methods with respect to the prediction of functional associations, we tested both profiling techniques in terms of their ability to reconstruct functional roles of S. cerevisiae proteins as described in the MIPS functional catalogue (FUNCAT).30 FUNCAT is a hierarchical classification of yeast proteins according to their function. Each of the 16 main classes (e.g., metabolism, energy) may be further divided into subclasses of up to six hierarchy levels. The numeric designator of a functional class can include up to six numbers. For example, the yeast gene product YGL237c is attributed to the functional category 04.05.01.04, where the numbers, from left to right, mean transcription, mRNA transcription, mRNA synthesis, and transcriptional control. An essential feature of FUNCAT is its multi-dimensionality, meaning that any protein can be assigned to multiple categories. Thus, while many proteins of unknown function lack FUNCAT annotation, others are assigned to multiple categories according to their role in various aspects of cellular function.
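As a small illustration of the hierarchical designators, the following Python sketch truncates FUNCAT codes to a chosen number of levels and checks whether two proteins share a category at that depth. The function names and the example annotations other than 04.05.01.04 are hypothetical; the default of two levels is simply a parameter of this sketch.

```python
def funcat_prefix(code, levels=2):
    """Truncate a FUNCAT designator such as '04.05.01.04' to its upper levels."""
    return ".".join(code.split(".")[:levels])


def share_functional_category(codes_a, codes_b, levels=2):
    """True if two proteins have at least one common category at this depth.

    Each protein may carry several annotations (FUNCAT is multi-dimensional),
    so both arguments are collections of designators.
    """
    prefixes_a = {funcat_prefix(code, levels) for code in codes_a}
    prefixes_b = {funcat_prefix(code, levels) for code in codes_b}
    return bool(prefixes_a & prefixes_b)


# '04.05.01.04' is the YGL237c example from the text; the rest is made up.
same = share_functional_category(["04.05.01.04"], ["04.05.05", "01.01"])
different = share_functional_category(["04.05.01.04"], ["04.01"])
```

Note that in this sketch a designator shorter than the requested depth is compared as it stands, so an annotation of just `04` would not match `04.05`; how such cases should be treated is a design choice the text does not specify.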

In order to test how well DIMA and CLASSIC predict functional relations we picked random samples of 100,000 protein pairs from each of the following sets of proteins: the whole yeast proteome (ALL), the proteins with at least one FUNCAT annotation (FUNCAT), the proteins with at least one PFAM domain (PFAM) and the proteins which have at least one FUNCAT annotation and at least one PFAM domain (BOTH). DIMA and CLASSIC predictions were tabulated against common functional categories using the upper two levels of the FUNCAT hierarchy. From the resulting contingency tables, we calculated sensitivity, specificity, positive (ppv) and negative predictive value (npv) (Table 3). In χ²-tests we found all cases to be significant at the 0.1% level. As with the validation against PPI datasets we found the sensitivity to be low for DIMA and only slightly better for CLASSIC. DIMA sensitivity improves gradually when validated in the PFAM and BOTH datasets. While part of the poor coverage of functional links must be ascribed to the technique itself, part of this is due to the incompleteness of our current knowledge about domains. The predictive values, on the other hand, turned out to be more promising with similar performance of both techniques. Again, the functional annotations of proteins as well as the PFAM domain database are highly incomplete, biasing the performance parameters. Based on these results, DIMA covers only a small subset of all functional relations, but predicted pairs are likely to share a common function.

Table 2. Accuracy of PPI prediction evaluated against experimental PPI datasets

A. Unfiltered datasets

| Dataset | Pairs | Proteins | Control pairs | Control proteins | Sensitivity DIMA | Sensitivity CLASSIC | Specificity DIMA | Specificity CLASSIC | Fold DIMA | Fold CLASSIC |
|---|---|---|---|---|---|---|---|---|---|---|
| BORKH | 2455 | 988 | 100,000 | 988 | 5.54 | 7.21 | 99.7 | 99.3 | 17.6 | 10.6 |
| BORKM | 11,855 | 2617 | 100,000 | 2617 | 3.12 | 3.65 | 99.7 | 99.5 | 9.9 | 8.0 |
| BORKL | 78,390 | 5321 | 100,000 | 5321 | 0.94 | 1.16 | 99.9 | 99.8 | 11.5 | 7.1 |
| DIP | 14,850 | 4713 | 100,000 | 4713 | 1.34 | 1.57 | 99.9 | 99.8 | 19.1 | 9.1 |
| DIPC | 4696 | 2191 | 100,000 | 2191 | 3.02 | 3.49 | 99.8 | 99.5 | 18.9 | 7.4 |
| MIPSP | 1857 | 1276 | 100,000 | 1276 | 3.72 | 2.96 | 99.8 | 99.5 | 16.2 | 6.3 |
| MIPSG | 1332 | 960 | 100,000 | 960 | 1.43 | 2.85 | 99.8 | 99.4 | 6.4 | 5.0 |
| ITO | 4394 | 3277 | 100,000 | 3277 | 0.52 | 0.52 | 99.9 | 99.8 | 6.4 | 2.8 |
| UETZ | 907 | 989 | 100,000 | 989 | 3.09 | 2.76 | 99.9 | 99.7 | 26.8 | 9.0 |

Sensitivity is low for both CLASSIC and DIMA. The fold is higher for DIMA, on average. Since many proteins do not have any PFAM domains, DIMA cannot make predictions about them, which contributes to its low sensitivity.

B. Only proteins with PFAM hits (p-value ≤ 1×10⁻⁵)

| Dataset | Pairs | Proteins | Control pairs | Control proteins | Sensitivity DIMA | Sensitivity CLASSIC | Specificity DIMA | Specificity CLASSIC | Fold DIMA | Fold CLASSIC |
|---|---|---|---|---|---|---|---|---|---|---|
| BORKH | 497 | 308 | 15,784 | 392 | 27.4 | 27.0 | 98.0 | 98.0 | 13.8 | 13.8 |
| BORKM | 2874 | 921 | 16,651 | 1068 | 12.9 | 8.3 | 98.1 | 98.7 | 6.8 | 6.4 |
| BORKL | 13,859 | 1437 | 8628 | 1572 | 5.3 | 3.5 | 99.1 | 99.2 | 5.6 | 4.5 |
| DIP | 1917 | 989 | 8346 | 1349 | 10.4 | 8.1 | 99.2 | 99.2 | 12.4 | 9.8 |
| DIPC | 731 | 521 | 10,002 | 690 | 19.4 | 16.7 | 98.4 | 98.0 | 12.1 | 8.4 |
| MIPSP | 275 | 250 | 8783 | 378 | 25.1 | 13.5 | 97.4 | 97.6 | 9.6 | 5.7 |
| MIPSG | 165 | 193 | 9642 | 298 | 11.5 | 7.3 | 97.7 | 98.2 | 5.0 | 4.1 |
| ITO | 342 | 417 | 7274 | 882 | 6.7 | 5.9 | 98.9 | 99.1 | 6.0 | 6.2 |
| UETZ | 104 | 126 | 8695 | 293 | 26.9 | 21.2 | 98.7 | 98.3 | 20.4 | 12.8 |

When considering only proteins about which DIMA can make predictions, the sensitivity increases significantly and is now equivalent to or even better than for the CLASSIC method.

Sensitivity and specificity are given in percent. Fold is the number of true positives in a PPI dataset divided by the expectation based on the control data (see Methods). Differences between proportions were found to be significant on the 0.1% level in χ²-tests for all datasets.

Table 3. Accuracy of functional-link prediction evaluated against the MIPS FUNCAT annotation

| Dataset | Pairs | Proteins | Sensitivity DIMA | Sensitivity CLASSIC | Specificity DIMA | Specificity CLASSIC | NPV DIMA | NPV CLASSIC | PPV DIMA | PPV CLASSIC |
|---|---|---|---|---|---|---|---|---|---|---|
| ALL | 100,000 | 6707 | 0.36 | 0.75 | 99.97 | 99.9 | 62.9 | 58.4 | 89.3 | 89.4 |
| FUNCAT | 100,000 | 4426 | 0.30 | 0.53 | 99.93 | 99.9 | 57.8 | 55.1 | 75.4 | 75.5 |
| PFAM | 100,000 | 3461 | 0.54 | 0.97 | 99.84 | 99.8 | 51.4 | 56.2 | 76.5 | 76.5 |
| BOTH | 100,000 | 3121 | 0.56 | 0.87 | 99.83 | 99.7 | 56.8 | 56.5 | 71.0 | 71.0 |

Sensitivity, specificity, positive (ppv) and negative predictive value (npv) in percent for the prediction of functional relations as validated against the two upper levels of the FUNCAT classification of S. cerevisiae proteins. Random protein pairs (100,000) were sampled from all yeast proteins (ALL), yeast proteins with at least one functional annotation (FUNCAT), at least one PFAM domain (PFAM) and both a domain and functional annotation (BOTH), respectively, and tabulated against the DIMA and CLASSIC predictions. χ²-tests of the resulting contingency tables yielded p-values of <0.1% in all cases. Sensitivity is low but predictive values are encouraging for both methods.

Discussion

We have developed a new method we call DIMA for studying the associations between proteins and protein domains based on phylogenetic profiling of conserved domains. The process involving domain detection, profile generation and clustering yields protein/domain pairs predicted to be functionally related and/or physically interacting (Figure 8). We have demonstrated that the predictions produced by DIMA are complementary to those produced by CLASSIC profiling and therefore represent a true gain of information. The domain interaction network contains information from many different genomes without specifically relating to one of them. In other words, DIMA links domains with common functional roles, which need not be present together in a specific target organism. For example, DIMA connects several domains from the bacterial flagellar apparatus (data not shown), which clearly is not present in yeast. If a projection onto a specific species is desired the data can easily be mapped to the respective genome based on the presence of domains in its protein sequences. From the practical point of view, the technique is easy to apply and maintain when adding additional organisms to the analysis. No exhaustive all versus all comparisons are needed: a single run of the domain detection software produces all the data required for profile update and re-clustering.

Because of a lack of systematic functional annotation of the PFAM domains themselves, validation was done after mapping to S. cerevisiae. This organism is a natural choice for validation purposes, since it is one of the most extensively studied and best annotated models used today for which a wealth of experimental evidence for PPI from different sources is available. It appears safe to assume that any two proteins which engage in physical interaction usually do so in order to fulfil a common functional role whose physiological relevance may or may not be known. Therefore, protein–protein interactions represent a special case of experimentally verified functional association well suited for validation purposes. The results of our validation against PPI and functional annotation data (FUNCAT) demonstrated the suitability of our method for successful function prediction. Performance was found to be similar for DIMA and CLASSIC profiling in our tests. Both methods produced predictions significantly better than random expectation.

DIMA and CLASSIC methods share most of their limitations because both rely on homology detection. Especially in phylogenetically very distant organisms, the ortholog/domain finding may fail due to weak E-values or even lack of hits. For example, the ATP synthase delta subunit (PF00213) in Helicobacter pylori is only detected with an E-value of 0.024 or above, which exceeds our threshold of 1×10⁻⁵.

A major downside of both profiling approaches is their low sensitivity, although CLASSIC profiling performed slightly better in this respect. Clearly, some valid associations get lost in the entropy filtering process, which on the other hand quite effectively eliminates many false positives. Essentially, methods based on phylogenetic profiling cannot make predictions about entities with low information profiles. In the case of ubiquitous proteins or domains we may never be able to overcome this barrier because the species lacking the respective proteins are long extinct. For highly specific proteins the situation is different. As more genomes are sequenced, one can expect many of them to appear in the new genomes, improving the entropy of their profiles.

An important reason for the lower sensitivity of DIMA is the inevitable incompleteness of the PFAM domain database. As of today only 53% of all proteins in the S. cerevisiae genome contain PFAM domains with a p-value of ≤10⁻⁵. That means that we cannot make any predictions for almost half of the proteins in this genome. Accordingly, the sensitivity increases dramatically when we only consider proteins with at least one PFAM domain (Table 2). Nevertheless, even this filtered dataset does not come close to an ideal situation in which all conserved domains are known. Therefore, we are confident that the sensitivity of our prediction method will increase gradually as the underlying domain database grows. Eventually, such a domain database will probably cover the vast majority of protein sequences, with only a few highly specialized proteins which lack detectable conserved domains missing.

We would also like to point out that the choice of genomes used for profiling will influence the resulting predictions by influencing the profile entropies. Therefore, if we were interested in detecting functional relations between proteins/domains in a certain group of organisms, we might restrict profiling to those organisms in order to focus on proteins/domains mainly found in this group. A systematic study of how these parameters influence the results of profiling would be a worthwhile subject for future work.

Figure 8. The DIMA process. In the first step HMMER is used to identify all known PFAM domains in each protein of the chosen genomes. Based on the domain information, a phylogenetic profile for every PFAM domain is built. Bits in each position of the profile indicate the presence or absence of the respective domain in that genome. After removing low entropy domains from the list, the pairwise bit distance is calculated for all possible domain pairs. Based on those distances clusters of domains with similar profiles are generated. The domain clusters can be analyzed directly or, if desired, protein clusters can be generated by mapping the clusters back onto a genome of interest.

Although protein–protein interactions sometimes happen with great specificity, many conserved interaction domains and motifs have been identified in a large variety of different interaction pairs. Examples of such interaction domains include SH3, PDZ, and WW domains and their respective binding motifs. The specificity of such interactions is established by substitutions of a few amino acid residues in both domains/motifs while preserving the overall structure and sequence of the respective module. The presence of interaction domains in the PFAM database makes them an interesting case for our prediction: rather than viewing the domains as bearers of function we can treat them as binding modules in some cases. That is, in cases of a functional link via physical binding we may actually be detecting the interaction sites. Unfortunately, we cannot distinguish these from pure functional links without integrating additional information. Nevertheless, it appears reasonable to expect at least some of the proteins/domains in our clusters to engage in physical contact.

In summary, the DIMA technique represents a novel method for finding functionally related and/or physically interacting domains, which truly adds to the predictions obtained by other methods. We hope that researchers in experimental laboratories will adopt these methods to help them identify interesting targets for their work.

† http://pedant.gsf.de
‡ http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html
§ http://pedant.gsf.de/credits.html
‖ http://www.aisee.com

Methods

Software environment and genome data

The basis for the present study was the PEDANT genome analysis system.6,31 The PEDANT database† contains exhaustive functional and structural annotation of all completely sequenced genomes. In particular, detection of PFAM domains10 is conducted using the HMMER software.32 Gene products are also automatically assigned to yeast functional categories,30 SCOP folds,20 and enzyme classes33 based on similarity searches.

Out of ≈300 finished genomic sequences available at the time of writing, we selected 46 genomes from sufficiently distant species, guided by the NCBI taxonomy database‡. The complete list of genomes used is shown in Table 4 (URLs of the respective sequencing centres are available§). Throughout this work, these genomes are always used in this particular order to facilitate comparison of bit vectors representing individual phylogenetic profiles.

DIMA was implemented in the Perl programming language. For analysis, data were processed using Perl scripts and analyzed in R version 1.9.1 under Linux. Graphs were created using aiSee‖.

Table 4. Genomes used in this study

| Kingdom | Number | Species | Genes |
|---|---|---|---|
| Archaea | 1 | Aeropyrum pernix | 2694 |
| | 2 | Archaeoglobus fulgidus | 2407 |
| | 3 | Halobacterium sp. NRC-1 | 2058 |
| | 4 | Methanothermobacter thermautotrophicus | 1869 |
| | 5 | Methanocaldococcus jannaschii | 1715 |
| | 6 | Pyrobaculum aerophilum | 2605 |
| | 7 | Pyrococcus abyssi | 1765 |
| | 8 | Sulfolobus solfataricus | 2977 |
| | 9 | Thermoplasma acidophilum | 1507 |
| Bacteria | 10 | Agrobacterium tumefaciens | 2721 |
| | 11 | Anabaena sp. PCC 7120 | 6129 |
| | 12 | Aquifex aeolicus VF5 | 1522 |
| | 13 | Bacillus subtilis | 4112 |
| | 14 | Borrelia burgdorferi | 850 |
| | 15 | Brucella melitensis | 3198 |
| | 16 | Buchnera sp. APS | 574 |
| | 17 | Caulobacter crescentus | 3737 |
| | 18 | Chlamydia muridarum | 909 |
| | 19 | Chlamydophila pneumoniae | 1069 |
| | 20 | Clostridium perfringens | 2723 |
| | 21 | Deinococcus radiodurans | 3182 |
| | 22 | Escherichia coli | 4289 |
| | 23 | Haemophilus influenzae | 1709 |
| | 24 | Helicobacter pylori 26695 | 1576 |
| | 25 | Lactococcus lactis | 2266 |
| | 26 | Listeria monocytogenes | 2846 |
| | 27 | Mesorhizobium loti | 7275 |
| | 28 | Mycobacterium tuberculosis | 3924 |
| | 29 | Mycoplasma pneumoniae | 689 |
| | 30 | Neisseria meningitidis | 1989 |
| | 31 | Pasteurella multocida | 2014 |
| | 32 | Pseudomonas aeruginosa | 5565 |
| | 33 | Ralstonia solanacearum | 5116 |
| | 34 | Rickettsia conorii | 1374 |
| | 35 | Salmonella Typhi | 4767 |
| | 36 | Sinorhizobium meliloti | 3341 |
| | 37 | Staphylococcus aureus | 2714 |
| | 38 | Streptococcus pyogenes | 1696 |
| | 39 | Synechocystis sp. PCC 6803 | 3169 |
| | 40 | Thermotoga maritima | 1846 |
| | 41 | Treponema pallidum | 1031 |
| | 42 | Ureaplasma urealyticum | 613 |
| | 43 | Xylella fastidiosa | 2766 |
| | 44 | Yersinia pestis | 4083 |
| Eukaryota | 45 | Saccharomyces cerevisiae | 6707 |
| | 46 | Schizosaccharomyces pombe | 5010 |

We performed phylogenetic profiling in 46 completely sequenced genomes; nine from the kingdom of archaea, 35 bacteria and two eukaryotes. The number column indicates the position of the respective genome in the profile string. We picked a diverse set of genomes for our study to get results representative for many organisms. Rare domains/proteins will be eliminated by the entropy filtering under these conditions. If we wanted to learn more about such domains/proteins we might have to select a group of more closely related genomes, leading to profiles with better entropy values for specialized proteins present in part of the group.

Phylogenetic profiling using DIMA and the CLASSIC method

The DIMA profiling process involves the generation of a large matrix in which rows and columns represent PFAM domains and organisms under study, respectively. Each cell in this matrix can contain one of the two values, 1 or 0, indicating the presence or absence of the respective domain in at least one gene product of the genome corresponding to the respective column. Thus, each row represents the phylogenetic profile of a certain domain. Domain detection was done with HMMER in the PEDANT system for the full set of proteins of all 46 genomes used in this study (Table 4). An E-value cut-off of 10⁻⁵ was used in order to avoid false-positive domain hits.

Classical profiling was performed using the precomputed protein similarity map SIMAP34 based on FASTA alignments. We applied an E-value cut-off of 10⁻⁴ with a score to self-score ratio threshold of 0.1.

Bit distances between protein pairs represent the Hamming distance (= bit difference) of the profiles for the CLASSIC method. In contrast, the DIMA bit distance value is defined as the minimal bit distance over all pairs of PFAM domains, one from each protein.
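The profile matrix and the two distance definitions can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the original Perl code: the function names, the toy genomes `g1` to `g4` and the domains `A` to `C` are hypothetical, and domains without a profile are simply ignored when computing the protein-level distance.

```python
def build_profiles(domains_by_genome, genome_order):
    """Return domain -> tuple of 0/1 bits, one bit per genome in a fixed order."""
    all_domains = set().union(*domains_by_genome.values())
    return {
        domain: tuple(int(domain in domains_by_genome[g]) for g in genome_order)
        for domain in sorted(all_domains)
    }


def hamming(profile_a, profile_b):
    """Number of genome positions at which two profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def dima_protein_distance(domains_a, domains_b, profiles):
    """Minimal bit distance over all domain pairs, one domain per protein.

    Returns None if either protein has no domain with a profile.
    """
    distances = [
        hamming(profiles[x], profiles[y])
        for x in domains_a if x in profiles
        for y in domains_b if y in profiles
    ]
    return min(distances, default=None)


# Hypothetical toy input: four genomes and the domains detected in each.
genome_order = ["g1", "g2", "g3", "g4"]
domains_by_genome = {
    "g1": {"A", "B"},
    "g2": {"A", "B", "C"},
    "g3": {"C"},
    "g4": {"A"},
}
profiles = build_profiles(domains_by_genome, genome_order)
distance = dima_protein_distance({"A", "C"}, {"B"}, profiles)
```

Keeping the genome order fixed is what makes the bit vectors comparable position by position. In this sketch two proteins that share a domain obtain a distance of 0; whether such pairs should be reported is not specified in the text.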

Entropy filtering and weighting of phylogenetic profiles

The information content or Shannon entropy of each domain or protein profile was calculated as $h = -\sum_{i=0}^{1} p_i \log_2 p_i$, where $p_i$ is the relative frequency of the ith symbol of the “alphabet” ({0, 1}) in the profile.23 We excluded all profiles with low information content using a threshold of $h = 0.3$. Additionally, entropy weighting was performed when comparing profiles: the idea is to allow the maximal distance threshold of $k = 3$ bit in cases where both profiles have maximal information content while applying stricter thresholds for low entropy profiles. The weighting was performed by multiplying the maximal threshold by the lower of the two entropies of both profiles. For example, when comparing two profiles with $h_1 = 0.8$ and $h_2 = 0.5$, a threshold of $k_w = \min(h_1, h_2)\,k = 0.5 \times 3\ \text{bit} = 1.5\ \text{bit}$ is used.
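A minimal Python sketch of the entropy calculation and the weighted threshold follows. The function names are assumptions, as is the choice to keep profiles whose entropy equals the cut-off exactly, which the text does not specify.

```python
from math import log2


def profile_entropy(profile):
    """Shannon entropy (in bits) of a 0/1 phylogenetic profile."""
    n = len(profile)
    ones = sum(profile)
    h = 0.0
    for count in (ones, n - ones):
        if count:  # a symbol that never occurs contributes nothing
            p = count / n
            h -= p * log2(p)
    return h


def passes_entropy_filter(profile, h_min=0.3):
    """Keep profiles whose information content reaches the cut-off."""
    return profile_entropy(profile) >= h_min


def weighted_threshold(h1, h2, k=3.0):
    """Entropy-weighted maximal bit distance: k_w = min(h1, h2) * k."""
    return min(h1, h2) * k


# Profiles over 46 genomes with the domain present in only 2 or 3 of them.
two_of_46 = (1,) * 2 + (0,) * 44
three_of_46 = (1,) * 3 + (0,) * 43
k_w = weighted_threshold(0.8, 0.5)  # the worked example from the text
```

With 46 genomes, a profile with two occurrences has an entropy of about 0.26 bit and is filtered out, whereas three occurrences give about 0.35 bit and pass, so under this reading the filter removes domains found in fewer than three or in more than 43 of the genomes.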

Clustering

Each domain/protein left after entropy filtering represented a “cluster-centre”. All other domain profiles were compared to all cluster centres and added to the respective cluster if their bit-distance d to the centre was less than or equal to the entropy-weighted threshold kw. Therefore, the maximal distance between any two members of a cluster is 2kw. Identical clusters and clusters representing subsets of other clusters were removed. The resulting unique clusters show some overlap reflecting the fact that a given domain can play a role in more than one functional group. Each cluster contains a group of domains/proteins predicted to be direct neighbors.

Figure 8 gives an overview of the entire DIMA procedure from genome selection to clustering and mapping.
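The clustering step might be sketched as follows in Python. The sketch assumes that all profiles passed to it have already survived entropy filtering and that their entropies are precomputed; the names, the toy profiles and the rounded entropy values are hypothetical, and singleton clusters are kept rather than discarded.

```python
def hamming(profile_a, profile_b):
    """Number of positions at which two profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def cluster_profiles(profiles, entropies, k=3.0):
    """Centre-based clustering with an entropy-weighted distance threshold."""
    clusters = set()  # a set of frozensets removes identical clusters
    for centre, centre_profile in profiles.items():
        members = {centre}
        for other, other_profile in profiles.items():
            if other == centre:
                continue
            k_w = min(entropies[centre], entropies[other]) * k
            if hamming(centre_profile, other_profile) <= k_w:
                members.add(other)
        clusters.add(frozenset(members))
    # Drop clusters that are proper subsets of another cluster.
    return [c for c in clusters if not any(c < d for d in clusters)]


# Hypothetical toy profiles over six genomes with precomputed entropies.
profiles = {
    "A": (1, 1, 0, 0, 1, 0),
    "B": (1, 1, 0, 0, 1, 1),
    "C": (0, 0, 1, 1, 0, 1),
    "D": (0, 0, 1, 1, 0, 0),
}
entropies = {"A": 1.0, "B": 0.918, "C": 1.0, "D": 0.918}
clusters = cluster_profiles(profiles, entropies)
```

Because every element acts as a centre in turn, the same group is usually found several times; collecting clusters as frozensets takes care of the duplicates, and the final comprehension removes subsets.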

Datasets of protein–protein interactions

To evaluate our predictions against data on known PPI, predicted domain interactions were mapped to the proteome of S. cerevisiae. Each PFAM domain present in the resulting yeast-specific domain network was replaced by all yeast proteins containing this domain.

The following PPI datasets were used for evaluation: (i) 1575 physical and 1327 genetic interactions described in the literature and manually annotated at MIPS (further referred to as datasets MIPSP and MIPSG), (ii) 907 interactions identified by Uetz et al.35 by genome-wide two-hybrid screen (dataset UETZ), (iii) 4353 interactions published by Ito et al.,36 also based on two-hybrid analysis (dataset ITO), (iv) 14,811 interactions in the DIP37 database and its filtered “core set” of 4685 interactions (datasets DIP and DIPC),38 (v) the combined dataset published by von Mering et al.,39 split into three confidence levels (BORKH: high confidence, 2455 interactions; BORKM: high and medium confidence, 11,844 interactions; BORKL: high, medium and low confidence, 77,536 interactions). As opposed to physical interactions which involve direct molecular binding, genetic interactions (dataset MIPSG) do not necessarily imply such contact. Instead, they represent functional relationships such as synthetic lethality in which the loss of one protein can be compensated by the other while mutation or loss of both kills the organism. We limited our analysis to binary interactions and did not consider multi-protein complexes where particular contacts between proteins are unknown, such as those published recently.40,41 Each of the datasets was refined by eliminating interaction pairs where one or both partners had no systematic yeast ORF code assigned, those describing self-interactions (homodimers), as well as redundant entries including permuted interactions (i.e., a with b and b with a).

For the purpose of our study it would also be quite desirable to have a negative dataset, i.e. information on proteins that are known not to interact. Since such data are not available at present (and are unlikely to be available in the future) we generated negative control sets as follows: for each PPI dataset we generated 100,000 random pairs sampling from the set of proteins present in the parent interaction dataset and thus replicating any bias towards certain proteins. Pairs that were present in the parent set as well as “homodimers” were excluded from the controls. While such random data certainly contain pairs that do in fact interact, the number of such cases has been estimated to be very low (under 1%).38
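The generation of negative controls can be sketched in Python as simple rejection sampling. The function name, the fixed random seed and the assumption that control pairs are unique and unordered are choices made for this illustration; the text does not state how duplicates were handled.

```python
import random


def sample_negative_controls(positive_pairs, n_pairs, seed=0):
    """Draw random protein pairs from the proteins of a PPI dataset.

    Known interactions and self-pairs ("homodimers") are excluded.
    """
    positives = {frozenset(pair) for pair in positive_pairs}
    proteins = sorted({protein for pair in positive_pairs for protein in pair})

    # Guard against asking for more pairs than can exist.
    n_possible = len(proteins) * (len(proteins) - 1) // 2 - len(positives)
    if n_pairs > n_possible:
        raise ValueError("not enough non-interacting pairs available")

    rng = random.Random(seed)
    controls = set()
    while len(controls) < n_pairs:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in positives:
            controls.add(pair)
    return sorted(tuple(sorted(pair)) for pair in controls)


# Hypothetical toy dataset: four proteins, three known interactions.
known = [("A", "B"), ("B", "C"), ("C", "D")]
controls = sample_negative_controls(known, 3)
```

Sampling only from proteins that occur in the parent dataset is what carries any bias towards certain proteins over into the control set.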

Statistical evaluation

For evaluation of our predictions we created 2×2 contingency tables using protein–protein interaction pairs and functional annotations as our standard of truth. We performed χ²-tests and calculated sensitivity, specificity, positive and negative predictive values based on the contingency tables in order to assess the performance of the prediction. We also calculated the normalized true-positive to false-positive ratio (called relative risk in epidemiology; “Fold” in Table 2):

$$\text{Fold} = \frac{TP}{FP} \times \frac{\text{size of control set}}{\text{size of PPI set}}$$

where TP is the number of true positives and FP the number of false positives.
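As a worked illustration of this ratio, the following Python sketch evaluates it for hypothetical counts; the function name and the numbers are assumptions for this example only.

```python
def fold_enrichment(tp, fp, n_ppi, n_control):
    """Normalized true-positive to false-positive ratio ("Fold")."""
    if fp == 0:
        raise ValueError("fold is undefined without false positives")
    return (tp / fp) * (n_control / n_ppi)


# Hypothetical counts: 50 of 1,000 known interactions predicted,
# 100 of 100,000 control pairs predicted.
fold = fold_enrichment(tp=50, fp=100, n_ppi=1_000, n_control=100_000)
```

Rearranged, the same quantity is (TP / size of PPI set) divided by (FP / size of control set), that is, sensitivity divided by (1 − specificity). Here this gives 0.05 / 0.001 = 50. Applying the rearranged form to the rounded BORKH values in Table 2A (5.54% and 99.7%) gives roughly 18, in line with the reported 17.6 given the rounding of the specificity.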

Availability

All relations predicted by DIMA and CLASSIC are available as Supplementary Material from the journal web page.

Acknowledgements

We are indebted to Grigory Kolesov and Martin Mokrejs for their assistance with the PEDANT database. Thomas Rattei and Roland Arnold were extremely helpful with the usage of SIMAP. This work was funded by a grant of the German Federal Ministry of Education and Research (BMBF) within the BFAM framework (031U112C).

Supplementary Data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.jmb.2004.10.019

References

1. Huynen, M. A., Snel, B., von Mering, C. & Bork, P. (2003). Function prediction and protein networks. Curr. Opin. Cell Biol. 15, 191–198.

2. Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. & Yeates, T. O. (1999). Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96, 4285–4288.

3. Valencia, A. & Pazos, F. (2002). Computational methods for the prediction of protein interactions. Curr. Opin. Struct. Biol. 12, 368–373.

4. Pagel, P., Mewes, H.-W. & Frishman, D. (2004). Conservation of protein–protein interactions – lessons from ascomycota. Trends Genet. 20, 72–76.

5. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389–3402.

6. Frishman, D., Mokrejs, M., Kosykh, D., Kastenmuller, G., Kolesov, G., Zubrzycki, I. et al. (2003). The PEDANT genome database. Nucl. Acids Res. 31, 207–211.

7. von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P. & Snel, B. (2003). STRING: a database of predicted functional associations between proteins. Nucl. Acids Res. 31, 258–261.

8. Wong, P., Kolesov, G., Frishman, D. & Houry, W. A. (2003). Phylogenetic web profiler. Bioinformatics, 19, 782–783.

9. Pawson, T. & Nash, P. (2003). Assembly of cell regulatory systems through protein interaction domains. Science, 300, 445–452.

10. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S. R. et al. (2002). The Pfam protein families database. Nucl. Acids Res. 30, 276–280.

11. Letunic, I., Goodstadt, L., Dickens, N. J., Doerks, T., Schultz, J., Mott, R. et al. (2002). Recent improvements to the SMART domain-based sequence annotation resource. Nucl. Acids Res. 30, 242–244.

12. Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Barrell, D., Bateman, A. et al. (2003). The InterPro Database, 2003 brings increased coverage and new features. Nucl. Acids Res. 31, 315–318.

13. Gaasterland, T. & Ragan, M. A. (1998). Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb. Comp. Genomics, 3, 199–217.

14. Pazos, F. & Valencia, A. (2001). Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein Eng. 14, 609–614.

15. Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O. & Eisenberg, D. (1999). Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753.

16. Enright, A. J., Iliopoulos, I., Kyrpides, N. C. & Ouzounis, C. A. (1999). Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86–90.


17. Geer, L. Y., Domrachev, M., Lipman, D. J. & Bryant,S. H. (2002). CDART: protein homology by domainarchitecture. Genome Res. 12, 1619–1623.

18. Apic, G., Huber, W. & Teichmann, S. A. (2003). Multi-domain protein families and domain pairs: compari-son with known structures and a random model ofdomain recombination. J. Struct. Funct. Genomics, 4,67–78.

19. Hegyi, H., Lin, J., Greenbaum, D. & Gerstein, M.(2002). Structural genomics analysis: characteristics ofatypical, common, and horizontally transferred folds.Proteins: Struct. Funct. Genet. 47, 126–141.

20. Andreeva, A., Howorth, D., Brenner, S. E., Hubbard,T. J., Chothia, C. & Murzin, A. G. (2004). SCOPdatabase in 2004: refinements integrate structure andsequence family data. Nucl. Acids Res. 32, D226–D229.

21. Tatusov, R. L., Natale, D. A., Garkavtsev, I. V.,Tatusova, T. A., Shankavaram, U. T., Rao, B. S. et al.(2001). The COG database: new developments inphylogenetic classification of proteins from completegenomes. Nucl. Acids Res. 29, 22–28.

22. Deng, M., Mehta, S., Sun, F. & Chen, T. (2002).Inferring domain–domain interactions from protein–protein interactions. Genome Res. 12, 1540–1548.

23. Shannon, C. E. (1948). A mathematical theory ofcommunication. Bell Syst. Tech. J. 27, 379–423.

24. Pufahl, R. A., Singer, C. P., Peariso, K. L., Lin, S. J.,Schmidt, P. J., Fahrni, C. J. et al. (1997). Metal ionchaperone function of the soluble Cu (I) receptorAtx1. Science, 278, 853–856.

25. Wernimont, A. K., Huffman, D. L., Lamb, A. L.,O’Halloran, T. V. & Rosenzweig, A. C. (2000).Structural basis for copper transfer by the metallo-chaperone for the Menkes/Wilson disease proteins.Nature Struct. Biol. 7, 766–771.

26. Habraken, Y., Sung, P., Prakash, L. & Prakash, S.(1998). ATP-dependent assembly of a ternary complexconsisting of a DNA mismatch and the yeast MSH2-MSH6 and MLH1-PMS1 protein complexes. J. Biol.Chem. 273, 9837–9841.

27. Rodionov, D. A., Vitreschak, A. G., Mironov, A. A. &Gelfand, M. S. (2002). Comparative genomics ofthiamin biosynthesis in prokaryotes. New genesand regulatory mechanisms. J. Biol. Chem. 277,48949–48959.

28. Allen, S., Zilles, J. L. & Downs, D. M. (2002). Metabolicflux in both the purine mononucleotide and histidinebiosynthetic pathways can influence synthesis of thehydroxymethyl pyrimidine moiety of thiamine inSalmonella enterica. J. Bacteriol. 184, 6130–6137.

29. Zeidler, J., Sayer, B. G. & Spenser, I. D. (2003). Biosynthesis of vitamin B1 in yeast. Derivation of the pyrimidine unit from pyridoxine and histidine. Intermediacy of urocanic acid. J. Am. Chem. Soc. 125, 13094–13105.

30. Mewes, H. W., Albermann, K., Bahr, M., Frishman, D., Gleissner, A., Hani, J. et al. (1997). Overview of the yeast genome. Nature, 387, 7–65.

31. Frishman, D., Albermann, K., Hani, J., Heumann, K., Metanomski, A., Zollner, A. & Mewes, H. W. (2001). Functional and structural genomics using PEDANT. Bioinformatics, 17, 44–57.

32. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14, 755–763.

33. Kanehisa, M., Goto, S., Kawashima, S. & Nakaya, A. (2002). The KEGG databases at GenomeNet. Nucl. Acids Res. 30, 42–46.

34. Mewes, H. W., Amid, C., Arnold, R., Frishman, D., Guldener, U., Mannhaupt, G. et al. (2004). MIPS: analysis and annotation of proteins from whole genomes. Nucl. Acids Res. 32, D41–D44.

35. Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R. et al. (2000). A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–627.

36. Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. & Sakaki, Y. (2001). A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569–4574.

37. Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K., Bowie, J. U. & Eisenberg, D. (2004). The database of interacting proteins: 2004 update. Nucl. Acids Res. 32, D449–D451.

38. Deane, C. M., Salwinski, L., Xenarios, I. & Eisenberg, D. (2002). Protein interactions: two methods for assessment of the reliability of high-throughput observations. Mol. Cell. Prot. 1, 349–356.

39. von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S. G., Fields, S. & Bork, P. (2002). Comparative assessment of large-scale data sets of protein–protein interactions. Nature, 417, 399–403.

40. Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L., Adams, S. L. et al. (2002). Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180–183.

41. Gavin, A. C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A. et al. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147.

Edited by J. Thornton

(Received in revised form 20 July 2004; accepted 12 October 2004)