panditplus: toward better integration of evolutionary view on molecular sequences with supplementary...

6
[Trends in Evolutionary Biology 2010; 2:e1] [page 1] PANDITplus: toward better integration of evolutionary view on molecular sequences with supplementary bioinformatics resources Slavica Dimitrieva, 1 Maria Anisimova Department of Computer Science, Swiss Federal Institute of Technology (ETH Zurich) and Swiss Institute of Bioinformatics (SIB), Switzerland 1 current address: Swiss Institute for Experimental Cancer Research (ISREC) and Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland Abstract Recent comparative genomic and other large-scale bioinformatics studies increasing- ly have been using gene annotations, func- tional classifications, and complementary data from the emerging “-omics” disciplines. Indeed, such analyses have better chances to uncover hidden patterns in complex multidi- mensional and heterogeneous biological sys- tems data. On the other hand, inferences from such studies are extremely sensitive to data samples and quality, and are more diffi- cult to compare or replicate owing to differ- ences in supplementary data sources at times not publicly available. As a contribution toward the unification and integration of good quality data from heterogeneous bioin- formatics resources, we present here an inte- grated data bank PANDITplus. It is built as an extension of PANDIT, the database of PFAM alignments and phylogenetic trees for known protein domains and families spanning line- ages from the three domains of life. PANDITplus is a relational database contain- ing information on functional categories, metabolic pathways, protein–protein interac- tions, disease associations, gene expression, three-dimensional structure, as well as esti- mates from evolutionary analyses of selective pressures. User-friendly interface enables customized queries and fast data access. We recommend PANDITplus as a common bioin- formatics platform for testing evolutionary hypotheses, which go beyond the mere infer- ences from molecular data by incorporating supplementary gene information. Equally, PANDITplus provides an excellent resource for the development, testing, and comparison of statistical models of substitution and prob- abilistic dependencies between a molecular sequence and its various attributes. The data- base may be accessed via http://www.pandit- plus.org. Introduction With advances in experimental techniques, recent years observed a rapid growth not only in molecular sequence data but also in com- plementary gene and protein information such as gene expression, numbers of protein- protein interactions, three-dimensional structure, etc. Gene annotation equally is gaining accuracy and speed. Large-scale availability of such multi-facet data led to a common trend to incorporate the supplemen- tary gene information with more convention- al analyses of molecular sequences in order to reach more insightful conclusions. However, the results of many such studies are not easy to compare owing to their use of different data sources, produced by different laborato- ries, not always available to the public. Data quality and biases in gene or species samples may influence the inference. Whenever possi- ble, it is desirable to test biological hypothe- ses using the same well-maintained and well- structured integrated database solution. Here we present a relational database PANDITplus that makes a step towards the integration of data from a variety of reliable and curated bioinformatics sources. Along with DNA and amino acid sequence data for homologs, PANDITplus provides access to precomputed estimates from evolutionary codon models, data on protein interactions, functional and chemical pathway annotation, gene expression, and association with dis- ease. The underlying database PANDIT 1,2 con- tains homologous amino acid and protein- coding sequence alignments from Pfam, 3,4 a comprehensive and accurate collection of pro- tein domains and families. The idea behind PANDIT database was to encourage the “evolution-centric” analyses of protein domains and families, based on reli- able sets of HMM-based alignments and asso- ciated phylogenetics trees. Both Pfam and PANDIT have been updated since their first publications and became popular for large- scale studies of protein-coding genes or as testing platforms. 5-11 Among some classic examples of using Pfam/PANDIT for evolu- tionary model development are studies pre- senting novel DNA, amino acid and codon substitution models, such as WAG, 12 LG, 13 ECM, 10 SDT, 14 and THMM. 15 Recently, accuracy of the multiple protein-coding alignment method (implemented in MAGNOLIA) was tested on PANDIT data. 16 Searching for data biases and universal trends also requires large well-maintained collections of data. Multiple alignments from PANDIT and func- tional classification from Gene Ontology (GO) have been used to study trends relating to positive selection, and to verify and extend the complexity hypothesis. 17 Bofkin and Goldman 18 used PANDIT to demonstrate that substitution patterns at three codon positions often are strikingly different, thus necessitat- ing suitable statistical treatment during the phylogenetic analyses. Several authors developing statistical methodology and software for bioinformatics recently have stressed the importance of maintaining the resources like PANDIT. 11,19-21 For example, PANDIT is included as a test database by the xREI tool for phylo-grammar visualization and development, 11 which is cur- rently a promising area in evolutionary methodology. Indeed, issues of model devel- opment, validation, and comparison may be better assessed based on a standardized data collection such as that jointly provided by PANDIT and PANDITplus. The inclusion of supplementary gene information allows for better classification, filtering, and pattern discovery, contributing to the development of better statistical models. This potentially leads to greater predictive power and a better understanding of underlying evolutionary processes. Besides automatic large-scale scans, together PANDIT and PANDITplus provide a unique integrated resource for empirical case studies of sets of selected proteins. While PANDIT already has been used to study select- ed genes, 22-24 a wealth of supplementary gene information is readily available now via PANDITplus, saving valuable time required for data mining (and, as a bonus, contribut- ing to the reduction of web traffic). Anyone Trends in Evolutionary Biology 2010; volume 2:e1 Correspondence: Maria Anisimova, CAB H 82.1, Institute of Computational Science, ETH Zurich, Universitaetsstrasse 6, 8092 Zurich, Switzerland. E-mail: [email protected] Key words: comparative genomics, evolution, phylogenetics, gene family, protein domain, func- tional annotation. Acknowledgements: SD was supported by the Swiss State Secretariat for Education and Research. MA was supported by the Swiss Federal Institute of Technology (ETH Zurich). We are grateful to Steven Armstrong at ETH Zurich for help with setting up the database. Received for publication: 11 July 2009. Revision received: 26 August 2009. Accepted for publication: 26 August 2009. This work is licensed under a Creative Commons Attribution 3.0 License (by-nc 3.0). ©Copyright S. Dimitrieva and M. Anisimova, 2010 Licensee PAGEPress, Italy Trends in Evolutionary Biology 2010; 2:e1 doi:10.4081/eb.2010.e1 Non-commercial use only

Upload: zhaw

Post on 22-Nov-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

[Trends in Evolutionary Biology 2010; 2:e1] [page 1]

PANDITplus: toward betterintegration of evolutionaryview on molecular sequenceswith supplementarybioinformatics resourcesSlavica Dimitrieva,1 Maria AnisimovaDepartment of Computer Science, SwissFederal Institute of Technology (ETHZurich) and Swiss Institute ofBioinformatics (SIB), Switzerland1current address: Swiss Institute forExperimental Cancer Research (ISREC)and Swiss Federal Institute of TechnologyLausanne (EPFL), Switzerland

Abstract

Recent comparative genomic and otherlarge-scale bioinformatics studies increasing-ly have been using gene annotations, func-tional classifications, and complementarydata from the emerging “-omics” disciplines.Indeed, such analyses have better chances touncover hidden patterns in complex multidi-mensional and heterogeneous biological sys-tems data. On the other hand, inferencesfrom such studies are extremely sensitive todata samples and quality, and are more diffi-cult to compare or replicate owing to differ-ences in supplementary data sources at timesnot publicly available. As a contributiontoward the unification and integration ofgood quality data from heterogeneous bioin-formatics resources, we present here an inte-grated data bank PANDITplus. It is built as anextension of PANDIT, the database of PFAMalignments and phylogenetic trees for knownprotein domains and families spanning line-ages from the three domains of life.PANDITplus is a relational database contain-ing information on functional categories,metabolic pathways, protein–protein interac-tions, disease associations, gene expression,three-dimensional structure, as well as esti-mates from evolutionary analyses of selectivepressures. User-friendly interface enablescustomized queries and fast data access. Werecommend PANDITplus as a common bioin-formatics platform for testing evolutionaryhypotheses, which go beyond the mere infer-ences from molecular data by incorporatingsupplementary gene information. Equally,PANDITplus provides an excellent resourcefor the development, testing, and comparisonof statistical models of substitution and prob-abilistic dependencies between a molecularsequence and its various attributes. The data-base may be accessed via http://www.pandit-plus.org.

Introduction

With advances in experimental techniques,recent years observed a rapid growth not onlyin molecular sequence data but also in com-plementary gene and protein informationsuch as gene expression, numbers of protein-protein interactions, three-dimensionalstructure, etc. Gene annotation equally isgaining accuracy and speed. Large-scaleavailability of such multi-facet data led to acommon trend to incorporate the supplemen-tary gene information with more convention-al analyses of molecular sequences in order toreach more insightful conclusions. However,the results of many such studies are not easyto compare owing to their use of differentdata sources, produced by different laborato-ries, not always available to the public. Dataquality and biases in gene or species samplesmay influence the inference. Whenever possi-ble, it is desirable to test biological hypothe-ses using the same well-maintained and well-structured integrated database solution.Here we present a relational database

PANDITplus that makes a step towards theintegration of data from a variety of reliableand curated bioinformatics sources. Alongwith DNA and amino acid sequence data forhomologs, PANDITplus provides access toprecomputed estimates from evolutionarycodon models, data on protein interactions,functional and chemical pathway annotation,gene expression, and association with dis-ease. The underlying database PANDIT1,2 con-tains homologous amino acid and protein-coding sequence alignments from Pfam,3,4 acomprehensive and accurate collection of pro-tein domains and families.The idea behind PANDIT database was to

encourage the “evolution-centric” analyses ofprotein domains and families, based on reli-able sets of HMM-based alignments and asso-ciated phylogenetics trees. Both Pfam andPANDIT have been updated since their firstpublications and became popular for large-scale studies of protein-coding genes or astesting platforms.5-11 Among some classicexamples of using Pfam/PANDIT for evolu-tionary model development are studies pre-senting novel DNA, amino acid and codonsubstitution models, such as WAG,12 LG,13

ECM,10 SDT,14 and THMM.15 Recently, accuracyof the multiple protein-coding alignmentmethod (implemented in MAGNOLIA) wastested on PANDIT data.16 Searching for databiases and universal trends also requireslarge well-maintained collections of data.Multiple alignments from PANDIT and func-tional classification from Gene Ontology (GO)have been used to study trends relating topositive selection, and to verify and extendthe complexity hypothesis.17 Bofkin and

Goldman18 used PANDIT to demonstrate thatsubstitution patterns at three codon positionsoften are strikingly different, thus necessitat-ing suitable statistical treatment during thephylogenetic analyses.Several authors developing statistical

methodology and software for bioinformaticsrecently have stressed the importance ofmaintaining the resources like PANDIT.11,19-21

For example, PANDIT is included as a testdatabase by the xREI tool for phylo-grammarvisualization and development,11 which is cur-rently a promising area in evolutionarymethodology. Indeed, issues of model devel-opment, validation, and comparison may bebetter assessed based on a standardized datacollection such as that jointly provided byPANDIT and PANDITplus. The inclusion ofsupplementary gene information allows forbetter classification, filtering, and patterndiscovery, contributing to the development ofbetter statistical models. This potentiallyleads to greater predictive power and a betterunderstanding of underlying evolutionaryprocesses.Besides automatic large-scale scans,

together PANDIT and PANDITplus provide aunique integrated resource for empirical casestudies of sets of selected proteins. WhilePANDIT already has been used to study select-ed genes,22-24 a wealth of supplementary geneinformation is readily available now viaPANDITplus, saving valuable time requiredfor data mining (and, as a bonus, contribut-ing to the reduction of web traffic). Anyone

Trends in Evolutionary Biology 2010; volume 2:e1

Correspondence: Maria Anisimova, CAB H 82.1,Institute of Computational Science, ETH Zurich,Universitaetsstrasse 6, 8092 Zurich, Switzerland.E-mail: [email protected]

Key words: comparative genomics, evolution,phylogenetics, gene family, protein domain, func-tional annotation.

Acknowledgements: SD was supported by theSwiss State Secretariat for Education andResearch. MA was supported by the Swiss FederalInstitute of Technology (ETH Zurich). We aregrateful to Steven Armstrong at ETH Zurich forhelp with setting up the database.

Received for publication: 11 July 2009.Revision received: 26 August 2009.Accepted for publication: 26 August 2009.

This work is licensed under a Creative CommonsAttribution 3.0 License (by-nc 3.0).

©Copyright S. Dimitrieva and M. Anisimova, 2010Licensee PAGEPress, ItalyTrends in Evolutionary Biology 2010; 2:e1doi:10.4081/eb.2010.e1

Non-co

mmercial

use o

nly

[page 2] [Trends in Evolutionary Biology 2010; 2:e1]

looking for real data examples of protein-cod-ing genes will benefit equally from usingPANDITplus. Using PANDITplus, subsets ofgenes or proteins may be selected accordingto defined criteria and then examined sepa-rately for a particular trend.

Structure of PANDITplus

PANDITplus is a MySQL database builtupon the PANDIT version 17.0,2 Protein andAssociated Nucleotide Domains with InferredTrees derived from the seed alignments ofPfam-A. Each PANDIT entry includes aminoacid and codon based alignments (with relia-bility estimates for each column) and esti-mates of phylogenetic trees inferred fromthese alignments. The overview of biologicaldata resources collected for each PANDITalignment is presented in Figure 1.

Pfam at the coreAnnotations for each PANDIT entry were

taken from the Pfam database version 24.0.4

Pfam is a large collection of protein families,each represented by multiple sequence align-ments and profile hidden Markov models. Weused Pfam-A families that each consist of acurated seed alignment containing a small setof representative members of the family, pro-file hidden Markov models (profile HMMs)built from the seed alignment, and an auto-matically generated full alignment, whichcontains all detectable protein sequencesfrom the family as defined by profile HMMsearches of primary sequence databases. Thesequences in the Pfam-A entries are takenfrom the most recent releases of UniProtKB25

and NCBI GenPept.26 PANDITplus includesgeneral information from the Pfam databaseabout protein families and domains with cor-responding literature references, and group-ings of related families and domains intoclans. Information on family interactions andstructural complexes that Pfam members canform as well as links to the correspondingPDB structures are taken from the Pfam data-base. More detailed presentation of the 3Dstructural information will be made in futureupdates.

Precomputed estimates from Markovcodon substitution modelsPositive or negative selection studies

based on codon substitution models are pow-erful means of studying the evolution ofgenes and gene families (for review seeAnisimova and Kosiol 2009).27 Thus, webelieve that maximum likelihood estimatesfrom several standard codon models may pro-vide illuminating details about the mode ofevolution of a protein family or domain.Positive selection pressure is measured bythe ratio ω of nonsynonymous to synony-mous substitution rates dN/dS. PANDITplusincludes the estimates and log-likelihoodscores from models M0, M1, M2, M7, and M8,28

implemented in PAML v.4.129 as well as fromcodon models with constant and variable syn-onymous rates,30 implemented in HYPHY.31

Family entries in PANDITplus are classifiedas positively selected or conserved based onmaximum likelihood (ML) estimations andusing likelihood ratio tests (LRTs) comparingcouples of nested models, one of which (thenull hypothesis) does not allow for positiveselection (so ω≤1), whereas another onedoes (the alternative hypothesis). Positive

selection is detected if a model allowing sitesor lineages under positive selection fits datasignificantly better than the model restrictingselective pressure to ω≤1 at all sites and lin-eages. Positions identified to be under posi-tive selection with different models are storedin the database also. All optimizations weredone assuming PANDIT trees and branchlengths were fixed to their ML estimatesunder the constant selective pressure modelM0. Such practice commonly is used to reducecomputational times, since M0 typically pro-vides robust estimates of branch lengths. Tominimize the effect of inaccuracies in auto-matic alignments, the analyses of selectivepressure were performed only for sites wherealignment was deemed reliable based onHMM analyses.2

The availability of precomputed resultsaims to increase the transparency and repro-ducibility of statistical analyses, and the con-tinuity of biological studies. Note that someconcerns were expressed in the literatureabout the suitability of codon models for evo-lutionary estimation on alignments of distantsequences,32 as for deep divergences synony-mous substitutions may become saturated.We therefore recommend users to discountthe entries where divergence is too large; forexample, average branch length >2 expectedamino acid substitutions per branch.However, such high divergences are encoun-tered rarely in PANDITplus: only eight entriesfall beyond the above-mentioned threshold.Equally, we recommend a cautious use of ωestimates when either dN or dS are close tozero (see also PAML manual). Note thatAnisimova et al.33 found that likelihood ratiotest for positive selection remained accuratefor both saturated and very similar data,although the power of the test decreased.Moreover, Seo and Kishino34 investigated theeffect of applying a codon model to divergentdata sets. Surprisingly, they found that thecodon model never performed worse than theamino acid model, despite large saturation ofsynonymous substitutions in their data.

Gene annotationPANDITplus includes gene annotation from

GO and from Kyoto Encyclopedia of Genes andGenomes (KEGG). GO is a database of stan-dardized biological terms for gene products.35

GO contains more than 26,000 terms, dividedin three branches: Molecular Function,Biological Process, and Cellular Component.Each branch can be represented as a directedacyclic graph relating terms of differentdegrees of specificity, with directed linksfrom less specific to more specific terms.Each node in the graph can have several par-ents (broader related terms) and children(more specific related terms). GO annota-tions for each PANDIT member were taken

Slavica Dimitrieva and Maria Anisimova

Figure 1. The mapping schema describing the integration of biological resources inPANDITplus. Arrows represent the direction of the mapping between different sources.For example, an arrow leading from KEGG to Human Proteinpedia means that the map-ping uses gene ID from KEGG to find associated gene information in HumanProteinpedia. For more detail on the design see the ER diagram in the Online Suppl -ement ary figure.

Non-co

mmercial

use o

nly

[Trends in Evolutionary Biology 2010; 2:e1] [page 3]

from the Pfam database.KEGG is a knowledge base for systematic

analysis of gene functions, linking genomicinformation with higher order functionalinformation.36,37 PANDITplus integrates infor-mation from the KEGG PATHWAY database,which contains manually drawn pathwaymaps representing the knowledge on themolecular interaction and reaction networks.KEGG pathways are structured as a directedacyclic graph hierarchy of three flat levels.The top level consists of the following five cat-egories: Metabolism, Genetic InformationProcessing, Environmental InformationProcessing, Cellul ar Processes, and HumanDiseases. The next levels divide the five func-tional categories into finer subcategories.The KEGG database provides direct mappingfrom KEGG gene IDs to Pfam entries as wellas to many other biological data sources. OneKEGG gene could be mapped to many Pfamentries. An example of such mapping is illus-trated in Figure 2. Each PANDITplus familyrecord contains pathway information of allthe genes that are associated with the family.

Expression dataGene expression data provide further

insight into the dynamics of protein functionacross different species or over a series of tis-sues and conditions (for examples, seeAkashi 2001; Khaitovich, et al. 2006; Gilad, etal. 2006; Yang, et al. 2009).38-41 Information onexpression patterns of all the genes that codefor one protein or protein domain could shedlight on the function of the domain in specif-ic tissues. For example, one can evaluate thefunctional importance of the domain of inter-est in a specific tissue by observing the num-ber of coding genes with this domain, whichare expressed in that tissue. Furthermore,when species-specific expression data areavailable, they can be useful for studies ofevolutionary patterns in genes and gene fam-ilies. Together with the evolutionary modelestimation, contrasting gene expression overseveral species or among tissues also maygive insights into functional variation andevolutionary forces acting upon a gene. PANDITplus contains expression data from

two recent resources, Bgee database andHuman Proteinpedia, together with an

overview of the number of genes expressed invarious tissues. Gene expression data for fivespecies (human, zebrafish, drosophila,mouse, and xenopus) were extracted fromBgee release 6.0.42 Bgee focuses on the devel-opmental aspect of gene expression and con-tains EST data from UniGene, Affymetrix datafrom ArrayExpress, and in situ hybridizationdata from ZFIN and MGI. Each datum entry isannotated manually by Bgee curators andmapped onto anatomical and developmentalontologies. The mapping to the Bgee databasewas done via UniProt, such that individualsequences from the seed alignment fromPANDIT were mapped to UniProt IDs, and theUniProt IDs were mapped to the IDs used inBgee, with a full path PANDIT–>UniProt–>Bgee (Figure 1). Each PANDITplus familyrecord presents tissue expression data fromBgee of the genes that code for sequencesfrom the PANDIT seed alignment, which alsowas used to obtain the ML estimates fromcodon models (see above). Therefore, Bgeeexpression data are displayed only for thespecies that contribute a sequence in the cor-responding PANDIT alignment. These expres-sion data may be used in combination withML estimates from phylogenetics analyses.Because one Pfam domain can be present inseveral genes, each with their own expres-sion patterns, PANDITplus counts the numberof these genes expressed in each tissue, andsorts the tissues by the number of associatedgenes expressed in it. Note that interpreta-tion of inferences based on links of gene-wiseinformation to protein domains may not bestraightforward. However, we believe that cer-tain observations from such links may be veryuseful (e.g. prevalence of a domain in genesexpressed in a particular tissue). Gene expression data for healthy and dis-

ease human tissues were extracted fromHuman Proteinpedia, a community portal forsharing and integration of human proteindata.43,44 Human Proteinpedia allows researchlaboratories to contribute and maintain pro-tein annotations. All human data in HumanProteinpedia are derived experimentally andcome from different laboratories. The map-ping to Human Proteinpedia gene IDs wasdone for the sequences from the full Pfamalignment via the KEGG database with the

whole mapping path being PANDITÆP-FAMÆKEGGÆ Human Proteinpedia (Figure1). Each PANDITplus family record presentstissue expression data from HumanProteinpedia of all the genes that are associ-ated with the full alignment of the Pfam fam-ily. For each tissue the number of genesexpressed in that tissue is displayed also. Theexpression data in PANDITplus derived fromHuman Proteinpedia is only for human tis-sues. Since the mapping of the expressiondata from this resource is based on thesequences from the full alignment, such dataare shown when the full alignment containsat least one human sequence, even if the seedalignment does not contain any. Owing to thefact that our Markov codon model estimatesare based on the PANDIT seed alignment, werecommend that Human Proteinpedia expres-sion data are used only when a humansequence is contained in the PANDIT seedalignment. The families that do not containany human sequence in the PANDIT seedalignment are marked in PANDITplus.

Associations with diseasePANDITplus also incorporates information

on associations of human genes with geneticdiseases and disorders extracted from theGenetic Association Database.45 The data inthis database come from published scientificpapers and this database serves as an archiveof human genetic association studies of com-plex diseases and disorders. Note that infor-mation on associations with human diseasesis available also via Online MendelianInheritance in Man (OMIM).46 Links to OMIMrecords were taken from the KEGG database.Interpreting the disease information shouldbe taken with caution, since one Pfamdomain can be present in several genes, eachlinked to different diseases and disorders.Owing to this fact PANDITplus counts thenumber of these genes per disease, and sortsthe disease records by the number of domaincoding genes linked with them. Furthermore,PANDITplus includes literature referencesproviding information for each gene linked toa certain disease record.

User-interface and website

PANDITplus is developed with MySQL.Precomputed estimates from Markov substi-tution models together with other relevantdata from eight different databases are organ-ized into a compact relational database struc-ture (See Online Supplementary material forER diagram). PANDITplus currently includesa total of 7738 families, corresponding to therecords in the current version of PANDIT.

Article

Figure 2. One-to-many mapping between a KEGG gene hsa: 100101467 and correspon-ding Pfam and PANDIT entries.

Non-co

mmercial

use o

nly

[page 4] [Trends in Evolutionary Biology 2010; 2:e1]

Table 1 shows statistics about the number ofrecords in the current version of PANDITplus.The web interface of PANDITplus was

developed with PHP and JavaScript. It pro-vides free access to the database to all aca-demic and commercial organizations. Thewebsite of PANDITplus proposes several waysto retrieve data easily. Data may be queriedfor families, clans, genes, gene ontologies,pathways, tissue of expression, or diseaseassociation, based on their names, identi-fiers, or descriptions. Quick searches forPfam entries or clans may be done based ontheir accession identifiers and names,according to the nomenclature in PANDIT andPfam. Quick search for genes may be donebased on KEGG or NCBI identifiers. Proteinfamily and clan records may be browsed basedon their name. Furthermore, the web inter-face offers a possibility for browsing only thepositively selected or conserved familiesunder the corresponding Markov codonmodel, or browsing the twenty largest fami-lies in terms of number of sequences.PANDITplus is accessible freely fromhttp://www.panditplus.org/.

The potential of PANDITplusand the outlook

The database PANDITplus is a powerfulresource for evolutionary studies of proteindomains and gene families. Clearly, the PAN-DIT database, once described as “apt” and“remarkably creative”,47 remains a valuableresource for evolutionary studies.PANDITplus extends its potential by offeringfurther gene and protein product informationfor integrated bioinformatics and molecularevolution studies. Indeed, such data becomeessential for the deeper exploration of therelationship between the molecular sequenceand its attributes, such as sequence–func-tion–structure, genotype–phenotype, andevolutionary rate vs. expression studies. PANDITplus has proved a useful resource

for our studies of evolutionary patterns ingenes and gene families, such as the investi-gating prevalence of positive selection, thesynonymous rate variation (Dimitrieva andAnisimova, in preparation), and evolutionarypatterns in disordered protein regions(Szalkowski and Anisimova, in preparation).Precomputed results from these studies willbe available also from PANDITplus on theirpublication. Expert biologists studying singlegenes and proteins or subclasses of proteinsmay find PANDITplus very helpful. Extractingsubsets of protein families according todefined criteria is easy with PANDITplusqueries. Precomputed estimates of rates fromMarkov substitution models may help to for-

mulate some important selection criteria, andwill be equally useful for cross comparison ofresults from different methods. While our database is a useful bioinformat-

ics resource for discovery of universal pat-terns and trends in protein domains and genefamilies, it is also an excellent test-base forthe development of statistical models andmethods. Soon evolutionary models will beprogressing toward accommodating knowl-edge from the emerging “-omics” disciplines(e.g. transcriptomics, metabolomics, pro-teomics). Probabilistic machine learningapproaches may be used to test various ideasabout complex networks of interacting pro-teins, or to discover complex patterns.PANDITplus provides an excellent ground forvalidating such new methods. We see furtherdevelopment of PANDITplus to be synchro-nized with updates of PANDIT. We are plan-ning to continue extending PANDITplus withnew estimates from statistical inferences(e.g. ML trees, ancestral state reconstruction,etc.) and new supplementary data from vari-ous sources deemed robust. In the future weaim to dedicate time and effort to the designof the automatic upgrading module, so thatboth PANDIT and PANDITplus keep up with

updates in Pfam and other supplementarydatabases. We actively encourage user feed-back and contributions so that we can expandthe range and the quality of the informationcurrently presented in PANDITplus.

Other resources that attempt to unifydata from difference sources Many recognized the need for the integra-

tion of biological resources. In fact, alreadysome of the key database players have madesteps to widen the spectrum of informationcovered by the repository (among these areNCBI, Pfam, UniProt, Ensembl, KEGG).However, these resources still are too special-ized to contain sufficient information neces-sary to unify diverse molecular informationwith evolutionary modeling at the sequencelevel. Other smaller databases also attemptedto integrate biological data, yet these tend torelate to particular questions or specific(smaller) samples of molecular sequences.For example, the MAGNUM database wasdesigned to aid studies focusing on the evolu-tion of protein structure:48 it includes align-ments mapped to crystal structure, estimatesof phylogenetic trees and ancestral sequencesat internal tree nodes. This information is

Slavica Dimitrieva and Maria Anisimova

Table 1. PANDITplus database statistics.

PANDIT alignments 7738Domains 1800Protein families 5637Motifs 56Repeats 148

Aligned sequences 181448Families with estimates from Markov codon models

PAML (M0, M1, M2, M7, M8) 7712HYPHY (Dual and Non-synonymous) 7348

Interactions 5959PDB structures 156803Gene Ontology (GO)

GO associations 10277Families with at least one GO association 4286

KEGGKEGG pathways associations 124748Families with at least one KEGG pathway association 4621Number of associated human genes from KEGG 18041KEGG pathways 310

Human Proteinpedia expression records (family – tissue)healthy tissues 101626disease tissues 22611

Bgee expression records (family – tissue)Homo sapiens 477544Mus musculus 298494Drosophila melanogaster 3007Danio rerio 2178Xenopus tropicalis 1281

Disease information (family – disease)Genetic Association Database 11490Links to OMIM 31769

Non-co

mmercial

use o

nly

[Trends in Evolutionary Biology 2010; 2:e1] [page 5]

collected only for protein families with atleast one crystal structure (around 1800).Another example is the SOURCE databasethat contains information on human, mouse,and rat genes, including gene ontology, func-tional annotations, and gene expressiondata.49

We anticipate that the number of databasesolutions offering the integration ofresources and tools will grow rapidly in com-ing years. This trend is a healthy and impor-tant process for current bioinformatics and“–omics” fields, and will eventually lead tothe creation of the best integrated solutionsfacilitating efficient data mining, data classi-fication, and normalization.

Availability

The database is hosted on theComputational Biochemistry Research Groupserver and is accessible for browsing anddownloads via http://www.panditplus.org/.

References

1. Whelan S, de Bakker PI, Goldman N.Pandit: a database of protein and associ-ated nucleotide domains with inferredtrees. Bioinformatics 2003;19:1556-63.

2. Whelan S, de Bakker PI, Quevillon E, et al.PANDIT: an evolution-centric database ofprotein and associated nucleotidedomains with inferred trees. NucleicAcids Res 2006;34:D327-31.

3. Sonnhammer EL, Eddy SR, Durbin R.Pfam: a comprehensive database of pro-tein domain families based on seed align-ments. Proteins 1997;28:405-20.

4. Finn RD, Tate J, Mistry J, et al. The Pfamprotein families database. Nucleic AcidsRes 2008;36:D281-8.

5. Brenner SE. A tour of structuralgenomics. Nat Rev Genet 2001;2:801-9.

6. Babu MM, Luscombe NM, Aravind L, et al.Structure and evolution of transcriptionalregulatory networks. Curr Opin StructBiol 2004;14:283-91.

7. Mosavi LK, Cammett TJ, Desrosiers DC,et al. The ankyrin repeat as moleculararchitecture for protein recognition.Protein Sci 2004;13:1435-48.

8. Bortolussi N, Durand E, Blum M, et al.apTreeshape: statistical analysis of phylo-genetic tree shape. Bioinformatics 2006;22:363-4.

9. Gopalan V, Qiu WG, Chen MZ, et al.Nexplorer: phylogeny-based explorationof sequence family data. Bioinformatics2006;22:120-1.

10. Kosiol C, Holmes I, Goldman N. An empir-ical codon model for protein sequenceevolution. Mol Biol Evol 2007;24:1464-79.

11. Barquist L, Holmes I. xREI: a phylo-gram-mar visualization webserver. NucleicAcids Res 2008;36:W65-9.

12. Whelan S, Goldman N. A general empiri-cal model of protein evolution derivedfrom multiple protein families using amaximum-likelihood approach. Mol BiolEvol 2001;18:691-9.

13. Le SQ, Gascuel O. An improved generalamino acid replacement matrix. Mol BiolEvol 2008;25:1307-20.

14. Whelan S, Goldman N. Estimating the fre-quency of events that cause multiple-nucleotide changes. Genetics 2004;167:2027-43.

15. Whelan S. Spatial and temporal hetero-geneity in nucleotide sequence evolution.Mol Biol Evol 2008;25:1683-94.

16. Fontaine A, de Monte A, Touzet H. MAG-NOLIA: multiple alignment of protein-coding and structural RNA sequences.Nucleic Acids Res 2008;36:W14-8.

17. Aris-Brosou S. Determinants of adaptiveevolution at the molecular level: theextended complexity hypothesis. Mol BiolEvol 2005;22:200-9.

18. Bofkin L, Goldman N. Variation in evolu-tionary processes at different codon posi-tions. Mol Biol Evol 2007;24:513-21.

19. Kidd DM, Ritchie MG. Phylogeographicinformation systems: putting the geogra-phy into phylogeography. J Biogeogr 2006;33:1851-65.

20. Hartmann K. Biodiversity conservationand evolutionary models. University ofCanterbury, Christchurch, New Zealand,2008.

21. Huelsenbeck JP, Joyce P, Lakner C, et al.Bayesian analysis of amino acid substitu-tion models. Philos Trans R Soc Lond BBiol Sci 2008;363:3941-53.

22. Massingham T, Goldman N. Detectingamino acid sites under positive selectionand purifying selection. Genetics 2005;169:1753-62.

23. Collins K, Gu H, Field C. Examining pro-tein structure and similarities by spectralanalysis technique. Stat Appl Genet MolBiol 2006;5:Article23.

24. Morrison DA. A framework for phyloge-netic sequence alignment. Plant Syst Evol2008. DOI 10.1007/s00606-008-0072-5.

25. The UniProt Consortium. The UniversalProtein Resource (UniProt). NucleicAcids Res 2008;36:D190-5.

26. Burks C, Cassidy M, Cinkosky MJ, et al.GenBank. Nucleic Acids Res 1991;19:S2221-5.

27. Anisimova M, Kosiol C. Investigating pro-tein-coding sequence evolution withprobabilistic codon substitution models.

Mol Biol Evol 2009;26:255-71.28. Yang Z, Nielsen R, Goldman N, et al.

Codon-substitution models for heteroge-neous selection pressure at amino acidsites. Genetics 2000;155:431-49.

29. Yang Z. PAML 4: phylogenetic analysis bymaximum likelihood. Mol Biol Evol 2007;24:1586-91.

30. Kosakovsky Pond SL, Muse SV. Site-to-site variation of synonymous substitutionrates. Mol Biol Evol 2005;22:2375-85.

31. Kosakovsky Pond SL, Frost SD, Muse SV.HyPhy: hypothesis testing using phyloge-nies. Bioinformatics 2005;21:676-9.

32. Maynard Smith J, Smith NH. Synonymousnucleotide divergence: what is satura-tion? Genetics 1996;142:1033-6.

33. Anisimova M, Bielawski JP, Yang Z.Accuracy and power of the likelihood ratiotest to detect adaptive molecular evolu-tion. Mol Biol Evol 2001;18:1585-92.

34. Seo T-K, Kishino H. Synonymous substi-tutions substantially improve evolution-ary inference from highly diverged pro-teins. Syst Biol 2008;57:367-77.

35. The Gene Ontology Consortium. Geneontology: tool for the unification of biolo-gy. Nat Genet 2000;25:25-9.

36. Kanehisa M, Goto S. KEGG: kyoto encyclo-pedia of genes and genomes. NucleicAcids Res 2000;28:27-30.

37. Kanehisa M, Goto S, Hattori M, et al.From genomics to chemical genomics:new developments in KEGG. NucleicAcids Res 2006;34:D354-7.

38. Akashi H. Gene expression and molecularevolution. Curr Opin Genet Dev 2001;11:660-6.

39. Khaitovich P, Enard W, Lachmann M, et al.Evolution of primate gene expression.Nat Rev Genet 2006;7:693-702.

40. Gilad Y, Oshlack A, Smyth GK, et al.Expression profiling in primates reveals arapid evolution of human transcriptionfactors. Nature 2006;440:242-5.

41. Yang D, Jiang Y, He F. An integrated viewof the correlations between genomic andphenomic variables. J Genet Genomics2009;36:645-51.

42. Bastian F, Parmentier G, Roux J, et al.Bgee: Integrating and Comparing Hetero -geneous Transcriptome Data AmongSpecies. Data Integration in the LifeSciences, 2008.

43. Mathivanan SM, Ahmed NG, Ahn H, et al.Human Proteinpedia enables sharing ofhuman protein data. Nat Biotechnol2008;26:164-7.

44. Kandasamy K, Keerthikumar S, Goel R, etal. Human Proteinpedia: a unified discov-ery resource for proteomics research.Nucleic Acids Res 2009;37:D773-81.

45. Becker KG, Barnes KC, Bright TJ, et al.The genetic association database. Nat

Article

Non-co

mmercial

use o

nly

[page 6] [Trends in Evolutionary Biology 2010; 2:e1]

Genet 2004;36:431-2.46. McKusick VA. Mendelian Inheritance in

Man and its online version, OMIM.Baltimore: Johns Hopkins UniversityPress, 1998.

47. Galperin MY. The Molecular BiologyDatabase Collection: 2005 update.Nucleic Acids Res 2005;33:D5-24.

48. Bradley ME, Benner SA. Phylogenomic

approaches to common problems encoun-tered in the analysis of low copy repeats:the sulfotransferase 1A gene familyexample. BMC Evol Biol 2005;5:22.

49. Diehn M, Sherlock G, Binkley G, et al.SOURCE: a unified genomic resource offunctional annotations, ontologies, andgene expression data. Nucleic Acids Res2003;31:219-23.

Article

Non-co

mmercial

use o

nly