IMG 4 version of the integrated microbial genomes comparative analysis system

Victor M. Markowitz 1,*, I-Min A. Chen 1, Krishna Palaniappan 1, Ken Chu 1, Ernest Szeto 1, Manoj Pillay 1, Anna Ratner 1, Jinghua Huang 1, Tanja Woyke 2, Marcel Huntemann 2, Iain Anderson 2, Konstantinos Billis 2, Neha Varghese 2, Konstantinos Mavromatis 2, Amrita Pati 2, Natalia N. Ivanova 2 and Nikos C. Kyrpides 2,*

1 Biological Data Management and Technology Center, Computational Research Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, 94720 USA and 2 Department of Energy, Microbial Genome and Metagenome Program, Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, 94598 USA

Received September 15, 2013; Accepted September 30, 2013

*To whom correspondence should be addressed. Tel: +1 925 296 5718; Fax: +1 925 296 5666; Email: nckyrpides@lbl.gov
Correspondence may also be addressed to Victor M. Markowitz. Tel: +1 510 486 7073; Fax: +1 510 486 5812; Email: VMMarkowitz@lbl.gov

Nucleic Acids Research, 2014, Vol. 42, Database issue, D560–D567. Published online 27 October 2013. doi:10.1093/nar/gkt963

© The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

Downloaded from http://nar.oxfordjournals.org/ by guest on June 5, 2015

ABSTRACT

The Integrated Microbial Genomes (IMG) data warehouse integrates genomes from all three domains of life, as well as plasmids, viruses and genome fragments. IMG provides tools for analyzing and reviewing the structural and functional annotations of genomes in a comparative context. IMG’s data content and analytical capabilities have increased continuously since its first version released in 2005. Since the last report published in the 2012 NAR Database Issue, IMG’s annotation and data integration pipelines have evolved while new tools have been added for recording and analyzing single cell genomes, RNA Seq and biosynthetic cluster data. Different IMG datamarts provide support for the analysis of publicly available genomes (IMG/W: http://img.jgi.doe.gov/w), expert review of genome annotations (IMG/ER: http://img.jgi.doe.gov/er) and teaching and training in the area of microbial genome analysis (IMG/EDU: http://img.jgi.doe.gov/edu).

DATA SOURCES AND PROCESSING

The Integrated Microbial Genomes (IMG) system integrates genomes from all three domains of life, as well as viruses, plasmids and genome fragments (partial sequences of genomic regions of interest, such as biosynthetic clusters). Until 2012, IMG used NCBI’s RefSeq resource (1) as its main source of public genome sequence data and annotations consisting of predicted genes and protein products, with a RefSeq-specific pipeline used for retrieving new genomes from RefSeq’s ftp site. For non-public (i.e. ‘private’) datasets, the IMG ER Submission system allowed scientists to select their sequencing projects in GOLD (2) and then submit their genome sequence data for annotation and integration into the ‘Expert Review’ version of IMG, IMG/ER (http://img.jgi.doe.gov/er). Public and private genomes were processed using different annotation and data integration pipelines, and recorded in different databases.

In an effort to improve the efficiency of data processing and tracking, IMG’s genome submission, annotation and integration pipelines were consolidated in November 2012. The IMG ER Submission system (http://img.jgi.doe.gov/submit) and associated (submission, gene prediction, functional annotation and data integration) data processing pipelines were extended to handle both public and private genomes in a uniform manner. The pipelines use a common mechanism for tracking the processing status of genome datasets, GOLD provides the information needed for retrieving new public genomes from RefSeq or GenBank (3) and both public and private genomes are recorded in a common IMG data warehouse.

For every genome, the IMG data warehouse records primary genome sequence information including its organization into chromosomal replicons (for finished genomes) and scaffolds and/or contigs (for draft genomes), together with predicted protein-coding sequences, some RNA-coding genes and protein product names that are provided by the genome sequence centers or generated by IMG’s functional annotation pipeline.

Public and private genomes submitted for annotation and integration by IMG’s pipelines are first associated with sequencing projects in GOLD. Custom tools and metadata about the topology of contigs and scaffolds are used to identify the origin of replication of circular replicons and permute the corresponding scaffold or contig if necessary. To ensure accurate identification of partial genes bordering the gaps, gene models and other

features are initially predicted on individual contigs and combined thereafter to generate scaffold-level structural annotation. CRISPR elements are detected using CRT (4) and PILER-CR (5). Predictions from both methods are concatenated and, in case of overlapping elements, the shorter one is removed. Identification of tRNAs is performed using tRNAScan-SE-1.23 (6). Ribosomal RNA genes (5S, 16S and 23S) are predicted using hmmsearch against the custom models generated for each type of rRNA in bacteria and archaea (7,8). With the exception of tRNA and rRNA, all models from Rfam (9) are used to search the genome sequence. Sequences are first compared with a database containing all the non-coding RNA genes in the Rfam database using BLAST; then sequences that have hits to genes belonging to an Rfam model are searched using the program INFERNAL (10). Signal peptides are computed using SignalP (11), whereas transmembrane helices are computed using TMHMM (12). Protein-coding genes are predicted using Prodigal (13); models overlapping with CRISPRs and certain types of RNAs (e.g. rRNAs) are removed.
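The rule for combining CRISPR predictions described above (concatenate the outputs of both detectors and, where elements overlap, drop the shorter one) can be sketched as follows. This is a minimal illustration under stated assumptions, not IMG's pipeline code: the `Element` type, the function names and the greedy longest-first strategy are hypothetical, and coordinates are taken to be 1-based inclusive positions on a single contig.

```python
from typing import List, NamedTuple


class Element(NamedTuple):
    """A predicted CRISPR element (hypothetical record layout)."""
    start: int   # 1-based inclusive start on the contig
    end: int     # 1-based inclusive end on the contig
    source: str  # which predictor reported it, e.g. "CRT" or "PILER-CR"

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: Element, b: Element) -> bool:
    """True if the two closed intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def merge_predictions(crt: List[Element], piler: List[Element]) -> List[Element]:
    """Concatenate both prediction lists; among overlapping elements keep the longer.

    Assumption: ties in length are resolved in favour of the element that
    appears first in the concatenated list (Python's sort is stable).
    """
    kept: List[Element] = []
    # Visit the longest elements first, so the shorter member of any
    # overlapping pair is the one that gets discarded.
    for candidate in sorted(crt + piler, key=lambda e: e.length, reverse=True):
        if not any(overlaps(candidate, k) for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda e: e.start)
```

Processing the candidates longest-first keeps the logic short; an interval-tree based version would scale better on contigs with very many predictions.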

After a new genome is processed, protein-coding genes are compared with protein families and the proteome of selected publicly available ‘core’ genomes, with product names assigned based on the results of these comparisons. First, protein sequences are compared with COG (14) using RPS-BLAST, Pfam-A (15) using HMMER 3.0b2 executed inside Sanger’s pfam_scan.pl wrapper script and TIGRfam (16) databases using HMMER 3.0 (8), and associated with KEGG Orthology (KO) terms (17) using USEARCH (18). Genomes in IMG are associated with KEGG pathways using the assignment of KO terms to protein-coding genes, while their association with MetaCyc pathways (19) is based on correlating enzyme EC numbers in MetaCyc reactions with EC numbers associated with protein-coding genes via KO terms. Genes are further characterized using an IMG native collection of generic (protein cluster-independent) functional roles called IMG terms that are defined by their association with generic (organism-independent) functional hierarchies called IMG pathways (20). IMG terms and pathways are specified by domain experts at DOE-JGI as part of the process of annotating specific genomes of interest, and are subsequently propagated to all the genomes in IMG using a rule-based methodology. Transporter genes are linked to the Transport Classification Database (21) based on their assignment to COG, Pfam or TIGRfam domains or IMG terms that correspond to transporter families.
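The EC-number correlation used for the MetaCyc association can be pictured as a two-step lookup: genes lead to EC numbers through their KO terms, and a pathway reaction is matched when it shares an EC number with the genome. The sketch below only illustrates that idea; the dictionary layouts, function names and identifiers are hypothetical, and any further criteria IMG may apply before reporting a pathway are not modeled.

```python
from typing import Dict, Iterable, Set


def genome_ec_numbers(gene_to_kos: Dict[str, Iterable[str]],
                      ko_to_ecs: Dict[str, Iterable[str]]) -> Set[str]:
    """Collect the EC numbers reachable from a genome's genes via their KO terms."""
    return {
        ec
        for kos in gene_to_kos.values()
        for ko in kos
        for ec in ko_to_ecs.get(ko, ())
    }


def matched_reactions(genome_ecs: Set[str],
                      pathway_reactions: Dict[str, Dict[str, Iterable[str]]]
                      ) -> Dict[str, Set[str]]:
    """For each pathway, list the reactions whose EC numbers occur in the genome.

    `pathway_reactions` maps pathway id -> {reaction id -> EC numbers}
    (a made-up layout chosen for this example).
    """
    hits: Dict[str, Set[str]] = {}
    for pathway, reactions in pathway_reactions.items():
        found = {rxn for rxn, ecs in reactions.items() if genome_ecs & set(ecs)}
        if found:
            hits[pathway] = found
    return hits
```

Pathways with no matched reaction are simply omitted from the result, so the caller can decide separately how many matched reactions are enough to call a pathway present.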

The integration of new genomes into IMG involves computing protein sequence similarities between their genes and genes of all other (new or existing) genomes in the system, assigning IMG terms and protein product names to the genes of the new genomes, identifying fusions and computing conserved gene cassettes (putative operons). For each gene, IMG provides lists of related (e.g. homolog, paralog and ortholog) genes that are based on sequence similarities computed using USEARCH for protein-coding and RNA genes. A fused gene (fusion) is defined as a gene that is formed from the composition (fusion) of two or more previously separate genes (22). Fusions are identified based on computing USEARCH similarities between genes. Only genes from finished genomes are considered as putative components, to avoid false predictions from fragmented genes in draft genomes. Furthermore, genes that frequently appear as fragmented in finished genomes, such as ‘transposases’ and ‘integrases’, as well as ‘pseudogenes’, are excluded from fusion calculations. Putative horizontally transferred genes are identified from the sequence similarity data. The phylogenetic distribution of best hits against a set of reference isolate genomes also provides additional information on possible horizontal gene transfers for isolates. A ‘chromosomal cassette’ is defined as a stretch of genes with intergenic distance ≤300 bp, whereby the genes can be on the same or different strands of the chromosome. Chromosomal cassettes with a minimum size of two genes common in at least two separate genomes are defined as ‘conserved chromosomal cassettes’. The identification of common genes across organisms is based on two gene clustering methods, namely participation in COG and Pfam clusters (23).

Note that for public and private genomes that are already associated with genes and/or protein product names, the native gene and/or product names are preserved in IMG unless their replacement is explicitly requested at the time they are submitted for annotation and integration into IMG.
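The chromosomal cassette definition given earlier in this section (a stretch of genes separated by short intergenic distances, regardless of strand) lends itself to a short sketch. The code below is an illustration only: the `Gene` record, the function name and the default parameters are hypothetical, the 300 bp value is treated as an inclusive upper bound, and wrap-around on circular replicons is ignored. Detecting conserved cassettes would additionally require comparing COG or Pfam memberships across genomes, which is not shown.

```python
from typing import Iterable, List, NamedTuple


class Gene(NamedTuple):
    """Minimal gene record (hypothetical layout); coordinates are 1-based inclusive."""
    gene_id: str
    start: int
    end: int
    strand: str  # "+" or "-"; deliberately ignored when grouping


def chromosomal_cassettes(genes: Iterable[Gene],
                          max_gap: int = 300,
                          min_size: int = 2) -> List[List[Gene]]:
    """Group genes on one scaffold into stretches with intergenic distance <= max_gap."""
    cassettes: List[List[Gene]] = []
    current: List[Gene] = []
    current_end = 0  # right-most coordinate covered by the current stretch
    for gene in sorted(genes, key=lambda g: (g.start, g.end)):
        # Bases strictly between the current stretch and the next gene.
        if current and gene.start - current_end - 1 > max_gap:
            if len(current) >= min_size:
                cassettes.append(current)
            current = []
        current.append(gene)
        # Track the furthest end so nested or overlapping genes do not open a false gap.
        current_end = gene.end if len(current) == 1 else max(current_end, gene.end)
    if len(current) >= min_size:
        cassettes.append(current)
    return cassettes
```

Strand is carried in the record but never consulted, mirroring the statement that cassette genes may lie on the same or different strands.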

DATA CONTENT

Genomics data

The content of IMG has grown steadily since the first version released in March 2005, with the current version of IMG (as of 10 September 2013) containing 11 568 bacterial, archaeal and eukaryotic genomes, an increase of >300% since August 2011 (24). IMG also includes 2848 viral genomes, 1198 plasmids that did not come from a specific microbial genome sequencing project and 581 genome fragments, bringing its total content to 16 195 genome datasets with >42 million protein-coding genes.

The number of single cell genomes included into IMG has increased substantially: there are 1341 single cell genomes in the current version of IMG, compared with only 21 in August 2011. Approximately 240 single cell genomes are part of the Microbial Dark Matter project that aims to expand the Genomic Encyclopedia of Bacteria and Archaea by targeting 100 single cell representatives of uncultured candidate phyla (25).

IMG has 13 342 genome datasets that are publicly available to all users without restrictions via the IMG/W datamart (http://img.jgi.doe.gov/w). Genomes that have not yet been published (also known as ‘private’) are password-protected and available only to the scientists who study (‘own’) them through the IMG/ER (‘Expert Review’) datamart (http://img.jgi.doe.gov/er). Private genomes are usually publicly released 6 months after the dataset becomes available in IMG.

IMG/ER allows individual scientists or groups of scientists to review and curate the functional annotation of microbial genomes in the context of IMG’s public genomes (26). Since August 2011, hundreds of private genomes have been reviewed and curated using IMG/ER, a relatively small fraction of the 9000 genomes that were processed by IMG’s data annotation and integration pipelines, as genome curation is a time-consuming process. Genome curation is usually carried out for identifying missing genes or for correcting functional annotations, e.g. as part of the process of curating IMG native terms and pathways.

Omics data

Proteomics datasets have been gradually included into IMG starting in 2009. Since August 2011, 64 new protein expression datasets (samples) that are part of two studies were included into IMG, bringing the total to 90 samples across five studies. The organization and analysis of proteomic data in IMG are discussed in (24).

The first RNA-Seq (transcriptomic) datasets included into IMG in 2011 are part of the Synechococcus PCC study consisting of 40 samples (Billis,K., Billini,M., Kyrpides,N.C. and Mavromatis,K., submitted for publication). As of August 2013, IMG contains 99 samples across 10 RNA-Seq studies. A typical RNA-Seq study involves the sequencing of cDNA from a genome under different experimental conditions, with the effect of each experimental condition being captured by a sample. As part of RNA-Seq sequencing analysis, reads are mapped to the reference genome involved in the study and the expressed genes in each sample are recorded with their observed read counts, mean, median and strand. RNA reads are mapped to reference genomes using Bowtie2 (27). The scope of mapping is determined by the type of cDNA sample (sscDNA/dscDNA) and the directionality of the libraries, whereby reads may map to a single strand or both strands of the reference sequence. Expression levels are normalized by computing RPKM (reads per kilobase per million), Quantile or Affine transformations and may need to be interpreted based on the type of cDNA in the sample. For genomes involved in RNA-Seq studies, the experiments/samples are recorded in IMG together with experimental conditions, and the read counts are organized per expressed gene, as illustrated in Figure 1.
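Of the normalizations mentioned above, RPKM is the simplest to write down: a gene's read count is scaled by the gene length in kilobases and by the number of mapped reads in millions. The sketch below uses the standard RPKM formula; the function names are hypothetical, and taking the per-sample total as the sum of the per-gene read counts is an assumption made for the example rather than a description of IMG's implementation.

```python
from typing import Dict


def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length / 1e3) / (total / 1e6)  ==  reads * 1e9 / (length * total)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def sample_rpkm(read_counts: Dict[str, int],
                gene_lengths: Dict[str, int]) -> Dict[str, float]:
    """RPKM for every expressed gene in one sample.

    Assumption: the sample total is the sum of the per-gene read counts.
    """
    total = sum(read_counts.values())
    return {
        gene: rpkm(count, gene_lengths[gene], total)
        for gene, count in read_counts.items()
    }
```

Because RPKM divides by gene length, two genes with the same read density receive the same value even when one is several times longer than the other.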

Figure 1. RNA-Seq data organization. (i) ‘Omics’ datasets generated can be accessed from ‘IMG Statistics’ on IMG’s front page, following the Experiments link available on the ‘IMG Statistics’ page. (ii) An RNA-Seq study is associated with samples and the number of genes expressed across all samples. (iii) Each sample is associated with the number of expressed genes, the total number of reads and the average number of reads per gene. (iv) An expressed gene is associated with a read count (total number of reads divided by the size of the gene) and normalized coverage (coverage for a gene in the experiment divided by the total number of reads in that experiment).


Biosynthetic clusters

IMG contains biosynthetic clusters of genes associated with pathways involved in the generation of secondary metabolites in isolate prokaryotic genomes. Experimentally validated biosynthetic clusters were identified by searching NCBI’s nucleotide database for genome fragments (partially sequenced genomes) containing gene clusters associated with secondary metabolites/natural products (28). Additional biosynthetic clusters were predicted using ClusterFinder (Fischbach, submitted for publication). Biosynthetic clusters in IMG are associated with IMG, MetaCyc and KEGG pathways, as well as information available in GOLD on their natural products.

Genomes associated with biosynthetic clusters can be examined as illustrated in Figure 2, where these genomes are listed in descending order of the number of biosynthetic clusters present in them. Alternatively, IMG can be used to find genomes associated with natural products associated with genome fragments but not with biosynthetic clusters, as illustrated in Figure 2(v). Natural products are small metabolites found in nature and, although the biosynthetic clusters associated with the generation of natural products have been identified, there are still natural products whose production mechanisms in prokaryotes remain unknown.

ANALYSIS TOOLS

Browsers and search tools allow finding and selecting genomes, genes and functions of interest, which can then be examined individually or analyzed in a comparative context. Gene content-based comparison of genomes is provided by the ‘Phylogenetic Profiler’ and the ‘Phylogenetic Profiler for Gene Cassettes’ tools that allow identifying genes in a query genome in terms of presence or absence of homologs in other genomes, or participation in conserved gene cassettes across other genomes (29,30). Function-based comparison of genomes is provided by the ‘Abundance Profile Overview’ and ‘Function Profile’ tools that allow comparing the relative abundance of protein families (COGs, Pfams, TIGRfams) and functional families (enzymes) across genomes. The composition of analysis operations is facilitated by genome, scaffold, gene and function ‘carts’ that handle lists of genomes, scaffolds, genes and functions, respectively. IMG analysis tools have been discussed in (24). Tools for identifying and correcting annotation anomalies, such as dubious protein product names, and for filling annotation gaps, such as genes that may have been missed by gene prediction tools or genes without predicted functions, are discussed in (26). IMG analysis tool extensions have addressed performance (30), data quality control, such as single cell data decontamination (31), and new data types, such as RNA-Seq and biosynthetic cluster data.

Figure 2. Biosynthetic clusters. (i) Genomes associated with biosynthetic clusters can be retrieved and examined using the ‘Genome Browser’. (ii) The number of biosynthetic clusters is provided in the ‘Genome Statistics’ section of the ‘Organism Detail’ page of a genome, together with a hyperlink to (iii) the list of biosynthetic clusters, whereby for each cluster the number of associated genes, the evidence type and the corresponding natural product are provided. (iv) A biosynthetic cluster can be examined using the ‘Biosynthetic Cluster Detail’ page, which includes information about the cluster. (v) ‘Natural Product List’ provides the list of the IMG genomes associated with natural products.

RNA-Seq studies can be accessed from the ‘Experiments Statistics’ section of the ‘IMG Statistics’ page or the ‘Organism Details’ pages of their associated genomes. For example, the ‘Expression Studies’ link in the ‘Organism Details’ page, such as that shown in Figure 3(i), leads to the list of associated RNA-Seq samples, such as the list shown in Figure 3(ii).

Figure 3. RNA-Seq data exploration. (i) The list of RNA-Seq studies associated with a genome can be accessed from its ‘Organism Details’, with each study associated with (ii) a list of RNA-Seq experiments (samples). Individual samples can be selected for further analysis, such as (iii) examining its expressed genes as a list or using the (iv) chromosome viewer. A sample can also be examined in the context of (v) pathways that have at least one enzyme associated with an expressed gene in the sample, whereby for each pathway (vi) enzymes are displayed with colors representing the level of expression for the associated genes; mousing over an enzyme shows the number of expressed genes associated with the enzyme.

RNA-Seq studies associated with a genome can be compared using pairwise or multiple sample analysis tools, as illustrated in Figure 4. After samples are selected for comparison (Figure 4(i)), pairs of samples can be compared in terms of up- or downregulation of genes, as illustrated in Figure 4(ii), with a threshold specified for the difference in gene expression. The difference in expression is computed using the logR = log2(query/reference) or the RelDiff = 2(query - reference)/(query + reference) metric. The comparison can be first previewed using a histogram, as illustrated in Figure 4(iii), which can help set the thresholds for the search of over-expressed or under-expressed genes between a pair of samples. The result of the comparison can be examined at the level of individual up- and downregulated genes, which can be selected for inclusion in the ‘Gene Cart’ for further analysis. Alternatively, the result of the comparison can be examined in terms of functions, as illustrated in Figure 4(iv), with genes associated with KEGG pathways or COG functions grouped together. Genes associated with a specific KEGG pathway can be examined in the context of the pathway, similar to the example shown in Figure 3(vi) earlier. The strength of the association between pairs of samples can be examined using ‘Spearman’s Rank Correlation’, as illustrated in Figure 4(v), whereas ‘Linear Regression’ analysis, illustrated in Figure 4(vi), helps determine whether two samples are technical replicates.
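The two difference metrics named above, logR and RelDiff, can be written out directly from their definitions. The sketch below is illustrative: the function names, the symmetric threshold rule in `classify` and the requirement that expression values be positive are assumptions of this example, not details taken from IMG.

```python
import math
from typing import Callable


def log_r(query: float, reference: float) -> float:
    """logR = log2(query / reference); both values must be positive."""
    if query <= 0 or reference <= 0:
        raise ValueError("logR is only defined for positive expression values")
    return math.log2(query / reference)


def rel_diff(query: float, reference: float) -> float:
    """RelDiff = 2 * (query - reference) / (query + reference)."""
    if query + reference == 0:
        raise ValueError("RelDiff is undefined when both values are zero")
    return 2 * (query - reference) / (query + reference)


def classify(query: float, reference: float, threshold: float,
             metric: Callable[[float, float], float] = log_r) -> str:
    """Label a gene as up- or downregulated in the query sample (hypothetical rule)."""
    value = metric(query, reference)
    if value >= threshold:
        return "up"
    if value <= -threshold:
        return "down"
    return "unchanged"
```

For non-negative inputs RelDiff is bounded between -2 and 2, whereas logR is unbounded, so a threshold chosen for one metric does not carry over to the other.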

Multiple RNA-Seq sample analysis usually involves clustering based on the abundance of expressed genes, where the proximity of grouping indicates the relative degree of similarity of samples to each other. There is a choice of clustering methods, such as pairwise complete linkage and pairwise single linkage, and distance measures, such as Pearson correlation, Spearman’s rank correlation and Euclidean distance, as illustrated in Figure 4(vii). The result of clustering is displayed as a hierarchical tree of samples and a normalized heat map of coverage values for each gene for each sample. Clusters of multiple samples can also be examined in the context of pathways, as illustrated in Figure 4(viii), whereby enzymes are displayed with colors representing the cluster.
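The sample clustering described above can be illustrated with a small agglomerative procedure. The sketch below assumes each sample is a vector of per-gene abundance values in a fixed gene order, uses Euclidean distance, and switches between complete and single linkage through the `link` argument (`max` or `min`). All names are hypothetical, and the brute-force search is meant for a handful of samples rather than as a picture of IMG's implementation.

```python
import math
from itertools import combinations
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Cluster = Tuple[str, ...]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_samples(profiles: Dict[str, Sequence[float]],
                    link: Callable[[Iterable[float]], float] = max
                    ) -> List[Tuple[Cluster, Cluster, float]]:
    """Agglomerative clustering of samples; returns the merges in order.

    link=max gives complete linkage, link=min gives single linkage.
    Each merge is reported as (left cluster, right cluster, linkage distance).
    """
    def linkage(pair: Tuple[Cluster, Cluster]) -> float:
        left, right = pair
        return link(euclidean(profiles[a], profiles[b]) for a in left for b in right)

    clusters: List[Cluster] = [(name,) for name in sorted(profiles)]
    merges: List[Tuple[Cluster, Cluster, float]] = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters under the chosen linkage rule.
        left, right = min(combinations(clusters, 2), key=linkage)
        merges.append((left, right, linkage((left, right))))
        clusters = [c for c in clusters if c not in (left, right)] + [left + right]
    return merges
```

The ordered list of merges is enough to draw the hierarchical tree: early merges with small distances correspond to samples that group closely, later merges to more distant groups.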

Figure 4. RNA-Seq data comparison. (i) RNA-Seq sample comparison starts with the selection of samples of interest. (ii) ‘Pairwise Sample Analysis’ supports comparing samples in terms of up/downregulated genes, with (iii) a histogram preview helping to set the thresholds for comparison. (iv) The result of the comparison can be examined in terms of functions, whereby genes associated with KEGG pathways or COG functions are grouped together. (v) The strength of the association of gene expression between pairs of samples can be examined using ‘Spearman’s Rank Correlation’. (vi) ‘Linear Regression’ analysis helps estimate whether two samples are technical replicates. (vii) ‘Multiple Sample Analysis’ consists of clustering samples based on the abundance of expressed genes using a variety of clustering methods. (viii) Clusters of samples can be examined in the context of pathways, whereby enzymes are displayed with colors representing the cluster.

FUTURE PLANS

IMG’s genome sequence data content is maintained through regular updates managed by the IMG submission system and involving new genomes sequenced at JGI, genomes sequenced at other organizations and submitted for inclusion into IMG by scientists worldwide, and genomes from GenBank. For genomes with multiple submissions, only the latest version is kept in IMG. IMG genome data are distributed through genome data portals available at http://genome.jgi.doe.gov. IMG’s data annotation and integration pipelines have been automated, thus improving their ability to keep pace with the rapidly increasing number of sequenced genomes.

IMG’s integrated data framework allows assessing and improving the quality of genome annotations. Thus, the quality of gene models for genomes available in public resources is known to vary greatly depending on the quality of sequence and the software used for annotation. An analysis conducted at JGI of the protein-coding genes of microbial genomes in GenBank indicates that 10% (>1 million) of predicted protein-coding genes are erroneous: they are false-positive genes, unidentified pseudogene fragments or genes with translational exceptions, or have incorrectly predicted start sites. To improve the consistency of annotation and the quality of predicted genes, all public microbial genomes in IMG will be re-annotated using IMG’s annotation pipeline.

A rapidly increasing number of single cell genomes are included into IMG. Typically, the first version of a single cell genome is analyzed for identifying contigs that may come from contaminant (e.g. Pseudomonas, Ralstonia) organisms. The sequence of analysis steps needed to identify and remove contaminated contigs is described in (31).

The importance of functional genomics in validating gene function in an integrated comparative genomics context is also being underscored, pushing experimental data from methylomics and transposon mutagenesis experiments into IMG. Systematic paradigms for associating computationally predicted gene structural and functional information with experimental functional genomics are being constructed. Tools are being developed for mining and visualizing different types of Omics datasets in an integrated genomic context.

IMG’s users are faced with the increasing burden of analyzing a rapidly growing number of genomic datasets. This analytical challenge can be alleviated by synthesizing genomic data using the ‘pangenome’ conceptual abstraction (32). A pangenome consists of the core part of a species (i.e. the genes present in all of the sequenced strains or of all samples of a microbial community) and the variable part (the genes present in some but not all of the strains or samples). An experimental version of IMG has been extended with five pangenomes, as well as analysis tools and viewers that allow users to explore individual pangenomes and compare pangenomes and genomes. A public version of IMG containing pangenome data and analysis tools is expected to be released in the near future.
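The core/variable split that defines a pangenome reduces to set operations once genes from different strains can be recognized as the same gene. The sketch below assumes this has already been done, so that each strain (or sample) is represented by a set of gene family identifiers; that representation and the function name are hypothetical.

```python
from typing import Dict, Set, Tuple


def split_pangenome(strain_families: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, variable) gene families for a group of strains or samples.

    core     = families present in every strain
    variable = families present in some, but not all, strains
    """
    if not strain_families:
        raise ValueError("at least one strain or sample is required")
    gene_sets = list(strain_families.values())
    core = set.intersection(*gene_sets)
    variable = set.union(*gene_sets) - core
    return core, variable
```

With a single strain every family is core and the variable part is empty; the variable part can only grow as further strains are added.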

FUNDING

The Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, US Department of Energy under Contract No. [DE-AC02-05CH11231]. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the US Department of Energy under Contract No. [DE-AC02-05CH11231]. Funding for open access charge: University of California.

Conflict of interest statement. None declared.

REFERENCES

1. Pruitt,K.D., Tatusova,T., Garth,R.B. and Maglott,D.R. (2012) NCBI Reference Sequences (RefSeq): current status, new features and genome annotation. Nucleic Acids Res., 40, D130–D135.

2. Pagani,I., Liolios,K., Jansson,J., Chen,I.M., Smirnova,T., Nosrat,B., Markowitz,V.M. and Kyrpides,N.C. (2012) The Genomes on Line Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res., 40, D571–D579.

3. Benson,D.A., Cavanaugh,M., Clark,K., Karsch-Mizrahi,I., Lipman,D.J., Ostell,J. and Sayers,E.W. (2013) GenBank. Nucleic Acids Res., 41, D36–D42.

4. Bland,C., Ramsey,T.L., Sabree,F., Lowe,M., Brown,K., Kyrpides,N.C. and Hugenholtz,P. (2007) CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics, 8, 209.

5. Edgar,R.C. (2007) PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics, 8, 18.

6. Lowe,T.M. and Eddy,S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res., 25, 955–964.

7. Lagesen,K., Hallin,P., Rodland,E.A., Staerfeldt,H.H., Rognes,T. and Ussery,D.W. (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res., 35, 3100–3108.

8. Eddy,S.R. (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform., 23, 205–211.

9. Burge,S.W., Daub,J., Eberhardt,R., Tate,J., Barquist,L., Nawrocki,E.P., Eddy,S.R., Gardner,P.P. and Bateman,A. (2012) Rfam 11.0: 10 years of RNA families. Nucleic Acids Res., 41, D226–D232.

10. Nawrocki,E.P., Kolbe,D.L. and Eddy,S.R. (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics, 25, 1335–1337.

11. Emanuelsson,O., Brunak,S., von Heijne,G. and Nielsen,H. (2007) Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc., 2, 953–971.

12. Moller,S., Croning,M.D.R. and Apweiler,R. (2001) Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics, 17, 646–653.

13. Hyatt,D., Chen,G.L., Locascio,P.F., Land,M.L., Larimer,F.W. and Hauser,L.J. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 119.

14. Tatusov,R.L., Fedorova,N.D., Jackson,J.D., Jacobs,A.R., Kiryutin,B., Koonin,E.V., Krylov,D.M., Mazumder,R., Mekhedov,S.L., Nikolskaya,A.N. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.

15. Punta,M., Coggill,P.C., Eberhardt,R.Y., Mistry,J., Tate,J., Boursnell,C., Pang,N., Forslund,K., Ceric,G., Clements,J. et al. (2012) The Pfam protein families database. Nucleic Acids Res., 40, D290–D301.

16. Selengut,J.D., Haft,D.H., Davidsen,T., Ganapathy,A., Gwinn-Giglio,M., Nelson,W.C., Richter,A.R. and White,O. (2007) TIGRFAMs and genome properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res., 35, D260–D264.

17. Kanehisa,M., Goto,S., Sato,Y., Furumichi,M. and Tanabe,M. (2012) KEGG for integration and interpretation of large scale molecular data sets. Nucleic Acids Res., 40, D109–D114.

18. Edgar,R.C. (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461.

19. Caspi,R., Altman,T., Dreher,K., Fulcher,A.C., Subhraveti,P., Keseler,I.M., Kothari,A., Krummenacker,M., Latendresse,M., Mueller,L.A. et al. (2012) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res., 40, D742–D753.

20. Ivanova,N.N., Anderson,I., Lykidis,A., Mavrommatis,K., Mikhailova,N., Chen,I.A., Szeto,E., Palaniappan,K., Markowitz,V.M. and Kyrpides,N.C. (2007) Metabolic Reconstruction of Microbial Genomes and Microbial Community Metagenomes. Technical Report 62292, Lawrence Berkeley National Laboratory. http://img.jgi.doe.gov/w/doc/imgterms.html

D566 Nucleic Acids Research 2014 Vol 42 Database issue


21. Saier,M.H. Jr, Yen,M.R., Noto,K., Tamang,D.G. and Elkan,C. (2009) The transporter classification database: recent advances. Nucleic Acids Res., 37, D274–D278.

22. Enright,A.J., Iliopoulos,I., Kyrpides,N.C. and Ouzounis,C.A. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86–90.

23. Mavromatis,K., Chu,K., Ivanova,N., Hooper,S.D., Markowitz,V.M. and Kyrpides,N.C. (2009) Gene context analysis in the Integrated Microbial Genomes (IMG) data management system, accepted for publication. PLoS One, 4, e7979.

24. Markowitz,V.M., Chen,I.A., Palaniappan,K., Chu,K., Szeto,E., Grechkin,Y., Ratner,A., Biju,J., Huang,J., Williams,P. et al. (2012) IMG: the integrated microbial genomes database and comparative analysis system. Nucleic Acids Res., 40, D115–D122.

25. Rinke,C., Schwientek,P., Sczyrba,A., Ivanova,N.N., Anderson,I.J., Cheng,J.F., Darling,A., Malfatti,S., Swan,B.K., Gies,E.A. et al. (2013) Insights into the phylogeny and coding potential of microbial dark matter. Nature, 499, 431–437.

26. Markowitz,V.M., Mavromatis,K., Ivanova,N.N., Chen,I.A., Chu,K. and Kyrpides,N.C. (2009) IMG ER: a system for microbial annotation expert review and curation. Bioinformatics, 25, 2271–2278.

27. Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.

28. Walsh,C.T. and Fischbach,M.A. (2010) Natural products version 2.0: connecting genes to molecules. J. Am. Chem. Soc., 132, 2469–2493.

29. Mavromatis,K., Chu,K., Ivanova,N., Hooper,S.D., Markowitz,V.M. and Kyrpides,N.C. (2009) Gene context analysis in the Integrated Microbial Genomes (IMG) data management system. PLoS One, 4, e7979.

30. Romosan,A., Shoshani,A., Wu,K., Markowitz,V.M. and Mavromatis,K. (2013) Accelerating gene context analysis using bitmaps. Proceedings of the 25th International Conference on Scientific and Statistical Database Management (SSDBM 2013).

31. Clingenpeel,S. (2012) JGI microbial single cell program single cell data decontamination. http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf

32. Tettelin,H., Masignani,V., Cieslewicz,M.J., Donati,C., Medini,D., Ward,N.L., Angiuoli,S.V., Crabtree,J., Jones,A.L., Durkin,A.S. et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc. Natl Acad. Sci. USA, 102, 13950–13955.


features are initially predicted on individual contigs and combined thereafter to generate scaffold-level structural annotation. CRISPR elements are detected using CRT (4) and PILER-CR (5). Predictions from both methods are concatenated and, in case of overlapping elements, the shorter one is removed. Identification of tRNAs is performed using tRNAScan-SE-1.23 (6). Ribosomal RNA genes (5S, 16S and 23S) are predicted using hmmsearch against the custom models generated for each type of rRNA in bacteria and archaea (7,8). With the exception of tRNA and rRNA, all models from Rfam (9) are used to search the genome sequence. Sequences are first compared with a database containing all the non-coding RNA genes in the Rfam database using BLAST; then sequences that have hits to genes belonging to an Rfam model are searched using the program INFERNAL (10). Signal peptides are computed using SignalP (11), whereas transmembrane helices are computed using TMHMM (12). Protein-coding genes are predicted using Prodigal (13); models overlapping with CRISPRs and certain types of RNAs (e.g. rRNAs) are removed.
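The rule for combining CRISPR predictions (concatenate the output of both tools, then drop the shorter of any overlapping pair) can be sketched as follows. This is a minimal illustration and not IMG's pipeline code: the `Element` record, the function name and the coordinates are hypothetical, and processing candidates longest-first is an assumption about how chains of overlaps are resolved.

```python
from typing import List, NamedTuple


class Element(NamedTuple):
    start: int   # inclusive start coordinate on the scaffold
    end: int     # inclusive end coordinate
    source: str  # tool that produced the prediction


def merge_crispr_predictions(crt: List[Element], piler: List[Element]) -> List[Element]:
    """Concatenate both prediction sets; among overlapping elements keep the longer one."""
    # Visit longer elements first, so the shorter member of an overlap is the one dropped.
    candidates = sorted(crt + piler, key=lambda e: e.end - e.start, reverse=True)
    kept: List[Element] = []
    for cand in candidates:
        overlaps = any(cand.start <= k.end and k.start <= cand.end for k in kept)
        if not overlaps:
            kept.append(cand)
    return sorted(kept, key=lambda e: e.start)


# Hypothetical coordinates, for illustration only.
crt_hits = [Element(100, 500, "CRT"), Element(2000, 2300, "CRT")]
piler_hits = [Element(120, 450, "PILER-CR"), Element(5000, 5400, "PILER-CR")]
merged = merge_crispr_predictions(crt_hits, piler_hits)
```

Here the PILER-CR element at 120 to 450 overlaps the longer CRT element at 100 to 500 and is discarded, while the two non-overlapping elements are kept.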

After a new genome is processed, protein-coding genes are compared with protein families and the proteome of selected publicly available ‘core’ genomes, with product names assigned based on the results of these comparisons. First, protein sequences are compared with COG (14) using RPS-BLAST, Pfam-A (15) using HMMER 3.0b2 executed inside Sanger’s pfam_scan.pl wrapper script and TIGRfam (16) databases using HMMER 3.0 (8), and associated with KEGG Orthology (KO) terms (17) using USEARCH (18). Genomes in IMG are associated with KEGG pathways using the assignment of KO terms to protein-coding genes, while their association with MetaCyc pathways (19) is based on correlating enzyme EC numbers in MetaCyc reactions with EC numbers associated with protein-coding genes via KO terms. Genes are further characterized using an IMG native collection of generic (protein cluster-independent) functional roles called IMG terms that are defined by their association with generic (organism-independent) functional hierarchies called IMG pathways (20). IMG terms and pathways are specified by domain experts at DOE-JGI as part of the process of annotating specific genomes of interest, and are subsequently propagated to all the genomes in IMG using a rule-based methodology. Transporter genes are linked to the Transport Classification Database (21) based on their assignment to COG, Pfam or TIGRfam domains, or IMG terms that correspond to transporter families.
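The EC-number correlation used for the MetaCyc association can be pictured as a set intersection: collect the EC numbers reachable from a genome's genes through their KO terms, then report each pathway that shares at least one EC number with that set. The sketch below assumes this simple "at least one shared EC number" criterion, which the text does not specify; all identifiers (`gene_1`, `KO_a`, `EC_x`, `pathway_P`) are placeholders rather than real database entries.

```python
from typing import Dict, Set


def metacyc_pathways_for_genome(
    gene_to_ko: Dict[str, Set[str]],
    ko_to_ec: Dict[str, Set[str]],
    pathway_to_ec: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Map each pathway to the EC numbers the genome supports via KO terms."""
    # EC numbers attached to any gene of the genome through its KO terms.
    genome_ecs: Set[str] = set()
    for kos in gene_to_ko.values():
        for ko in kos:
            genome_ecs |= ko_to_ec.get(ko, set())
    # Keep only pathways with at least one supported EC number (assumed criterion).
    hits: Dict[str, Set[str]] = {}
    for pathway, ecs in pathway_to_ec.items():
        shared = ecs & genome_ecs
        if shared:
            hits[pathway] = shared
    return hits


# Placeholder identifiers, for illustration only.
gene_to_ko = {"gene_1": {"KO_a"}, "gene_2": {"KO_b"}, "gene_3": set()}
ko_to_ec = {"KO_a": {"EC_x"}, "KO_b": {"EC_y", "EC_z"}}
pathway_to_ec = {"pathway_P": {"EC_x", "EC_y"}, "pathway_Q": {"EC_w"}}
associated = metacyc_pathways_for_genome(gene_to_ko, ko_to_ec, pathway_to_ec)
```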

The integration of new genomes into IMG involves computing protein sequence similarities between their genes and genes of all other (new or existing) genomes in the system, assigning IMG terms and protein product names to the genes of the new genomes, identifying fusions and computing conserved gene cassettes (putative operons). For each gene, IMG provides lists of related (e.g. homolog, paralog and ortholog) genes that are based on sequence similarities computed using USEARCH for protein-coding and RNA genes. A fused gene (fusion) is defined as a gene that is formed from the composition (fusion) of two or more previously separate genes (22). Fusions are identified based on computing USEARCH similarities between genes. Only genes from finished genomes are considered as putative components, to avoid false predictions from fragmented genes in draft genomes. Furthermore, genes that frequently appear as fragmented in finished genomes, such as ‘transposases’ and ‘integrases’, as well as ‘pseudogenes’, are excluded from fusion calculations. Putative horizontally transferred genes are identified from the sequence similarity data. The phylogenetic distribution of best hits against a set of reference isolate genomes also provides additional information on possible horizontal gene transfers for isolates. A ‘chromosomal cassette’ is defined as a stretch of genes with intergenic distance ≤300 bp, whereby the genes can be on the same or different strands of the chromosome. Chromosomal cassettes with a minimum size of two genes common in at least two separate genomes are defined as ‘conserved chromosomal cassettes’. The identification of common genes across organisms is based on two gene clustering methods, namely participation in COG and Pfam clusters (23).

Note that for public and private genomes that are already associated with genes and/or protein product names, the native gene and/or product names are preserved in IMG unless their replacement is explicitly requested at the time they are submitted for annotation and integration into IMG.
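The chromosomal cassette definition given earlier lends itself to a short sketch: walk the genes of a scaffold in coordinate order and start a new cassette whenever the gap to the previous gene exceeds the threshold, ignoring strand. The code below is an illustration under stated assumptions, not IMG's implementation: 300 bp is treated as an upper bound, the intergenic distance is taken as the next gene's start minus the furthest end seen so far, stretches of fewer than two genes are discarded, and the gene identifiers and coordinates are invented.

```python
from typing import List, Optional, Tuple

Gene = Tuple[str, int, int]  # (gene id, start, end); strand is deliberately ignored


def chromosomal_cassettes(
    genes: List[Gene], max_gap: int = 300, min_size: int = 2
) -> List[List[str]]:
    """Group genes of one scaffold into stretches whose intergenic gaps are <= max_gap."""
    cassettes: List[List[str]] = []
    current: List[str] = []
    prev_end: Optional[int] = None
    for gene_id, start, end in sorted(genes, key=lambda g: g[1]):
        if prev_end is not None and start - prev_end > max_gap:
            # Gap too large: close the current stretch and start a new one.
            if len(current) >= min_size:
                cassettes.append(current)
            current = []
        current.append(gene_id)
        prev_end = end if prev_end is None else max(prev_end, end)
    if len(current) >= min_size:
        cassettes.append(current)
    return cassettes


# Invented coordinates, for illustration only.
scaffold_genes = [
    ("g1", 1, 900), ("g2", 1000, 1800),     # gap of 100 bp: same cassette
    ("g3", 2500, 3000), ("g4", 3100, 3500),  # 700 bp after g2: new cassette
    ("g5", 9000, 9500),                      # isolated gene: not a cassette
]
cassettes = chromosomal_cassettes(scaffold_genes)
```

Finding conserved cassettes would additionally require matching genes across genomes through shared COG or Pfam membership, which this sketch does not attempt.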

DATA CONTENT

Genomics data

The content of IMG has grown steadily since the first version released in March 2005, with the current version of IMG (as of 10 September 2013) containing 11 568 bacterial, archaeal and eukaryotic genomes, an increase of >300% since August 2011 (24). IMG also includes 2848 viral genomes, 1198 plasmids that did not come from a specific microbial genome sequencing project and 581 genome fragments, bringing its total content to 16 195 genome datasets with >42 million protein-coding genes.

The number of single cell genomes included into IMG has increased substantially: there are 1341 single cell genomes in the current version of IMG, compared with only 21 in August 2011. Approximately 240 single cell genomes are part of the Microbial Dark Matter project that aims to expand the Genomic Encyclopedia of Bacteria and Archaea by targeting 100 single cell representatives of uncultured candidate phyla (25).

IMG has 13 342 genome datasets that are publicly available to all users without restrictions via the IMG/W datamart (http://img.jgi.doe.gov/w). Genomes that have not yet been published (also known as ‘private’) are password-protected and available only to the scientists who study (‘own’) them through the IMG/ER (‘Expert Review’) datamart (http://img.jgi.doe.gov/er). Private genomes are usually publicly released 6 months after the dataset becomes available in IMG.

IMG/ER allows individual scientists or groups of scientists to review and curate the functional annotation of microbial genomes in the context of IMG’s public genomes (26). Since August 2011, hundreds of private genomes have been reviewed and curated using IMG/ER, a relatively small fraction of the 9000 genomes that were processed by IMG’s data annotation and integration pipelines, as genome curation is a time-consuming process. Genome curation is usually carried out for identifying missing genes or for correcting functional annotations, e.g. as part of the process of curating IMG native terms and pathways.

Omics data

Proteomics datasets have been gradually included into IMG starting in 2009. Since August 2011, 64 new protein expression datasets (samples) that are part of two studies were included into IMG, bringing the total to 90 samples across five studies. The organization and analysis of proteomic data in IMG are discussed in (24).

The first RNA-Seq (transcriptomic) datasets included into IMG in 2011 are part of the Synechococcus PCC study consisting of 40 samples (Billis,K., Billini,M., Kyrpides,N.C. and Mavromatis,K., submitted for publication). As of August 2013, IMG contains 99 samples across 10 RNA-Seq studies. A typical RNA-Seq study involves the sequencing of cDNA from a genome under different experimental conditions, with the effect of each experimental condition being captured by a sample. As part of RNA-Seq sequencing analysis, reads are mapped to the reference genome involved in the study and the expressed genes in each sample are recorded with their observed read counts, mean, median and strand. RNA reads are mapped to reference genomes using Bowtie2 (27). The scope of mapping is determined by the type of cDNA sample (sscDNA/dscDNA) and the directionality of the libraries, whereby reads may map to a single strand or both strands of the reference sequence. Expression levels are normalized by computing RPKM (reads per kilobase per million), Quantile or Affine transformations, and may need to be interpreted based on the type of cDNA in the sample. For genomes involved in RNA-Seq studies, the experiments/samples are recorded in IMG together with experimental conditions, and the read counts are organized per expressed gene, as illustrated in Figure 1.
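Of the normalizations named above, RPKM is the simplest to write down: the read count of a gene is scaled by the gene length in kilobases and by the number of mapped reads in the sample in millions. The sketch below uses the standard RPKM formula; the function name, the per-gene numbers and the library size are illustrative assumptions, and nothing here is taken from IMG's own code.

```python
from typing import Dict, Tuple


def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length / 1e3) / (total / 1e6) == reads * 1e9 / (length * total)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical sample: gene -> (read count, gene length in bp).
sample: Dict[str, Tuple[int, int]] = {"geneA": (500, 2000), "geneB": (50, 1000)}
library_size = 10_000_000  # assumed number of mapped reads in the sample

normalized = {
    gene: rpkm(count, length, library_size)
    for gene, (count, length) in sample.items()
}
```

With these numbers, geneA comes out at 25 RPKM and geneB at 5 RPKM, so the two genes can be compared despite their different lengths.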

Figure 1. RNA-Seq data organization. (i) ‘Omics’ datasets generated can be accessed from ‘IMG Statistics’ on IMG’s front page, following the Experiments link available on the ‘IMG Statistics’ page. (ii) An RNA-Seq study is associated with samples and the number of genes expressed across all samples. (iii) Each sample is associated with the number of expressed genes, the total number of reads and the average number of reads per gene. (iv) An expressed gene is associated with a read count (total number of reads divided by the size of the gene) and normalized coverage (coverage for a gene in the experiment divided by the total number of reads in that experiment).


Biosynthetic clusters

IMG contains biosynthetic clusters of genes associated with pathways involved in the generation of secondary metabolites in isolate prokaryotic genomes. Experimentally validated biosynthetic clusters were identified by searching NCBI’s nucleotide database for genome fragments (partially sequenced genomes) containing gene clusters associated with secondary metabolites/natural products (28). Additional biosynthetic clusters were predicted using ClusterFinder (Fischbach, submitted for publication). Biosynthetic clusters in IMG are associated with IMG, MetaCyc and KEGG pathways, as well as information available in GOLD on their natural products.

Genomes associated with biosynthetic clusters can be examined as illustrated in Figure 2, where these genomes are listed in descending order of the number of biosynthetic clusters present in them. Alternatively, IMG can be used to find genomes associated with natural products associated with genome fragments but not with biosynthetic clusters, as illustrated in Figure 2(v). Natural products are small metabolites found in nature, and although the biosynthetic clusters associated with the generation of natural products have been identified, there are still natural products whose production mechanisms in prokaryotes remain unknown.

ANALYSIS TOOLS

Browsers and search tools allow finding and selecting genomes, genes and functions of interest, which can then be examined individually or analyzed in a comparative context. Gene content-based comparison of genomes is provided by the ‘Phylogenetic Profiler’ and the ‘Phylogenetic Profiler for Gene Cassettes’ tools that allow identifying genes in a query genome in terms of presence or absence of homologs in other genomes, or participation in conserved gene cassettes across other genomes (29,30). Function-based comparison of genomes is provided by the ‘Abundance Profile Overview’ and ‘Function Profile’ tools that allow comparing the relative abundance of protein families (COGs, Pfams, TIGRfams) and functional families (enzymes) across genomes. The composition of analysis operations is facilitated by genome, scaffold, gene and function ‘carts’ that handle lists of genomes, scaffolds, genes and functions, respectively. IMG analysis tools have been discussed in (24). Tools for identifying and correcting annotation anomalies, such as dubious protein product names, and for filling annotation gaps, such as genes that may have been missed by gene prediction tools or genes without predicted functions, are discussed in (26). IMG analysis tool extensions have addressed performance (30), data quality control, such as single cell data decontamination (31), and new data types, such as RNA-Seq and biosynthetic cluster data.

Figure 2. Biosynthetic clusters. (i) Genomes associated with biosynthetic clusters can be retrieved and examined using the ‘Genome Browser’. (ii) The number of biosynthetic clusters is provided in the ‘Genome Statistics’ section of the ‘Organism Detail’ page of a genome, together with a hyperlink to (iii) the list of biosynthetic clusters, whereby for each cluster the number of associated genes, the evidence type and the corresponding natural product are provided. (iv) A biosynthetic cluster can be examined using the ‘Biosynthetic Cluster Detail’ page, which includes information about the cluster. (v) ‘Natural Product List’ provides the list of the IMG genomes associated with natural products.

RNA-Seq studies can be accessed from the ‘Experiments Statistics’ section of the ‘IMG Statistics’ page or the ‘Organism Details’ pages of their associated genomes. For example, the ‘Expression Studies’ link in the ‘Organism Details’ page, such as that shown in Figure 3(i), leads to the list of associated RNA-Seq samples, such as the list shown in Figure 3(ii).

Figure 3. RNA-Seq data exploration. (i) The list of RNA-Seq studies associated with a genome can be accessed from its ‘Organism Details’, with each study associated with (ii) a list of RNA-Seq experiments (samples). Individual samples can be selected for further analysis, such as (iii) examining its expressed genes as a list or using the (iv) chromosome viewer. A sample can be also examined in the context of (v) pathways that have at least one enzyme associated with an expressed gene in the sample, whereby for each pathway (vi) enzymes are displayed with colors representing the level of expression for the associated genes; mousing over an enzyme shows the number of expressed genes associated with the enzyme.

RNA-Seq studies associated with a genome can be compared using pairwise or multiple sample analysis tools, as illustrated in Figure 4. After samples are selected for comparison (Figure 4(i)), pairs of samples can be compared in terms of up- or downregulation of genes, as illustrated in Figure 4(ii), with a threshold specified for the difference in gene expression. The difference in expression is computed using the logR = log2(query/reference) or the RelDiff = 2(query - reference)/(query + reference) metric. The comparison can be first previewed using a histogram, as illustrated in Figure 4(iii), which can help set the thresholds for the search of over-expressed or under-expressed genes between a pair of samples. The result of the comparison can be examined at the level of individual up- and downregulated genes, which can be selected for inclusion in the ‘Gene Cart’ for further analysis. Alternatively, the result of the comparison can be examined in terms of functions, as illustrated in Figure 4(iv), with genes associated with KEGG pathways or COG functions grouped together. Genes associated with a specific KEGG pathway can be examined in the context of the pathway, similar to the example shown in Figure 3(vi) earlier. The strength of the association between pairs of samples can be examined using ‘Spearman’s Rank Correlation’, as illustrated in Figure 4(v), whereas ‘Linear Regression’ analysis, illustrated in Figure 4(vi), helps determine whether two samples are technical replicates.
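The two pairwise metrics named in the text, logR as the base-2 logarithm of query over reference and RelDiff as twice the difference divided by the sum, can be computed per gene and thresholded as in the sketch below. The handling of zero values (such genes are skipped for logR), the default threshold of 1.0 and all gene names and expression values are assumptions made for illustration; the text does not say how IMG treats these cases.

```python
import math
from typing import Dict, List, Optional, Tuple


def log_ratio(query: float, reference: float) -> Optional[float]:
    """logR = log2(query / reference); undefined unless both values are positive."""
    if query <= 0 or reference <= 0:
        return None
    return math.log2(query / reference)


def rel_diff(query: float, reference: float) -> Optional[float]:
    """RelDiff = 2 * (query - reference) / (query + reference)."""
    if query + reference == 0:
        return None
    return 2 * (query - reference) / (query + reference)


def classify_by_log_ratio(
    query: Dict[str, float], reference: Dict[str, float], threshold: float = 1.0
) -> Tuple[List[str], List[str]]:
    """Return (upregulated, downregulated) genes of the query relative to the reference."""
    up: List[str] = []
    down: List[str] = []
    for gene in sorted(query.keys() & reference.keys()):
        score = log_ratio(query[gene], reference[gene])
        if score is None:
            continue  # assumption: genes with a zero value are left unclassified
        if score >= threshold:
            up.append(gene)
        elif score <= -threshold:
            down.append(gene)
    return up, down


# Invented normalized expression values for two samples.
query_sample = {"geneA": 40.0, "geneB": 5.0, "geneC": 12.0, "geneD": 0.0}
reference_sample = {"geneA": 10.0, "geneB": 20.0, "geneC": 10.0, "geneD": 3.0}
up_genes, down_genes = classify_by_log_ratio(query_sample, reference_sample)
```

Unlike logR, RelDiff is bounded between -2 and 2 for non-negative inputs and stays defined when one of the two values is zero, which is one reason a tool might offer both.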

Multiple RNA-Seq sample analysis usually involves clustering based on the abundance of expressed genes, where the proximity of grouping indicates the relative degree of similarity of samples to each other. There is a choice of clustering methods, such as pairwise complete linkage and pairwise single linkage, and distance measures, such as Pearson correlation, Spearman’s rank correlation and Euclidean distance, as illustrated in Figure 4(vii). The result of clustering is displayed as a hierarchical tree of samples and a normalized heat map of coverage values for each gene for each sample. Clusters of multiple samples can be also examined in the context of pathways, as illustrated in Figure 4(viii), whereby enzymes are displayed with colors representing the cluster.
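To make the linkage choice concrete, the sketch below runs a naive agglomerative clustering of samples with Euclidean distance and either complete or single linkage, recording the order in which clusters are merged (the information a hierarchical tree displays). It is a teaching example written from the general definition of these methods, not the clustering code used by IMG, and the sample names and expression profiles are invented.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Cluster = Tuple[str, ...]


def euclidean(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_samples(
    profiles: Dict[str, List[float]], linkage: str = "complete"
) -> List[Tuple[Cluster, Cluster, float]]:
    """Agglomerative clustering; returns the merges in the order they are performed."""
    if linkage not in ("complete", "single"):
        raise ValueError("linkage must be 'complete' or 'single'")
    # Complete linkage uses the largest pairwise distance, single linkage the smallest.
    pick = max if linkage == "complete" else min
    clusters: List[Cluster] = [(name,) for name in sorted(profiles)]
    merges: List[Tuple[Cluster, Cluster, float]] = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            d = pick(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, d))
    return merges


# Invented per-gene expression profiles for three samples.
profiles = {"s1": [1.0, 2.0], "s2": [1.0, 3.0], "s3": [10.0, 10.0]}
merges = cluster_samples(profiles, linkage="complete")
```

In this toy input, s1 and s2 merge first at distance 1.0 and s3 joins last. Swapping `euclidean` for a correlation-based distance would change only the distance function, not the merging loop.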

FUTURE PLANS

IMG’s genome sequence data content is maintained through regular updates managed by the IMG submission system and involving new genomes sequenced at JGI, genomes sequenced at other organizations and submitted for inclusion into IMG by scientists worldwide, and genomes from GenBank. For genomes with multiple submissions, only the latest version is kept in IMG. IMG genome data are distributed through genome data portals available at http://genome.jgi.doe.gov. IMG’s data annotation and integration pipelines have been automated, thus improving their ability to keep pace with the rapidly increasing number of sequenced genomes.

Figure 4. RNA-Seq data comparison. (i) RNA-Seq sample comparison starts with the selection of samples of interest. (ii) ‘Pairwise Sample Analysis’ supports comparing samples in terms of up/downregulated genes, with (iii) a histogram preview helping to set the thresholds for comparison. (iv) The result of the comparison can be examined in terms of functions, whereby genes associated with KEGG pathways or COG functions are grouped together. (v) The strength of the association of gene expression between pairs of samples can be examined using ‘Spearman’s Rank Correlation’. (vi) ‘Linear Regression’ analysis helps estimate whether two samples are technical replicates. (vii) ‘Multiple Sample Analysis’ consists of clustering samples based on the abundance of expressed genes using a variety of clustering methods. (viii) Clusters of samples can be examined in the context of pathways, whereby enzymes are displayed with colors representing the cluster.

IMG’s integrated data framework allows assessing and improving the quality of genome annotations. Thus, the quality of gene models for genomes available in public resources is known to vary greatly depending on the quality of sequence and the software used for annotation. An analysis conducted at JGI of the protein-coding genes of microbial genomes in GenBank indicates that 10% (>1 million) of predicted protein-coding genes are erroneous: they are false-positive genes, unidentified pseudogene fragments or genes with translational exceptions, or have incorrectly predicted start sites. To improve the consistency of annotation and the quality of predicted genes, all public microbial genomes in IMG will be re-annotated using IMG’s annotation pipeline.

A rapidly increasing number of single cell genomes are included into IMG. Typically, the first version of a single cell genome is analyzed for identifying contigs that may come from contaminant (e.g. Pseudomonas, Ralstonia) organisms. The sequence of analysis steps needed to identify and remove contaminated contigs is described in (31).

The importance of functional genomics in validating gene function in an integrated comparative genomics context is also being underscored, pushing experimental data from methylomics and transposon mutagenesis experiments into IMG. Systematic paradigms for associating computationally predicted gene structural and functional information with experimental functional genomics are being constructed. Tools are being developed for mining and visualizing different types of Omics datasets in an integrated genomic context.

IMG’s users are faced with the increasing burden of analyzing a rapidly growing number of genomic datasets. This analytical challenge can be alleviated by synthesizing genomic data using the ‘pangenome’ conceptual abstraction (32). A pangenome consists of the core part of a species (i.e. the genes present in all of the sequenced strains or of all samples of a microbial community) and the variable part (the genes present in some but not all of the strains or samples). An experimental version of IMG has been extended with five pangenomes, as well as analysis tools and viewers that allow users to explore individual pangenomes and compare pangenomes and genomes. A public version of IMG containing pangenome data and analysis tools is expected to be released in the near future.
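The core/variable split of a pangenome reduces to two set operations once genes from different strains have been assigned to shared families: the core is the intersection of the per-strain sets and the variable part is the union minus the core. The sketch below assumes such family identifiers are already available (how IMG groups genes across strains for this purpose is not described here), and the strain and family names are placeholders.

```python
from typing import Dict, Set, Tuple


def split_pangenome(strain_genes: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, variable) gene families for a set of strains or samples."""
    if not strain_genes:
        return set(), set()
    gene_sets = list(strain_genes.values())
    core = set.intersection(*gene_sets)        # present in every strain
    variable = set.union(*gene_sets) - core    # present in some but not all strains
    return core, variable


# Placeholder gene family identifiers per strain.
strains = {
    "strain_1": {"famA", "famB", "famC"},
    "strain_2": {"famA", "famB", "famD"},
    "strain_3": {"famA", "famB"},
}
core_families, variable_families = split_pangenome(strains)
```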

FUNDING

The Director Office of Science Office of Biological andEnvironmental Research Life Sciences Division USDepartment of Energy under Contract No [DE-AC02-05CH11231] This research used resources of theNational Energy Research Scientific Computing Centerwhich is supported by the Office of Science of the USDepartment of Energy under Contract No [DE-AC02-05CH11231] Funding for open access charge Universityof California

Conflict of interest statement None declared

REFERENCES

1 PruittKD TatusovaT GarthRB and MaglottDR (2012)NCBI Reference Sequences (RefSeq) current status new featuresand genome annotation Nucleic Acids Res 40 D130ndashD135

2 PaganiI LioliosK JanssonJ ChenIM SmirnovaTNosratB MarkowitzVM and KyrpidesNC (2012) TheGenomes on Line Database (GOLD) v4 status of genomic andmetagenomic projects and their associated metadata NucleicAcids Res 40 D571ndashD579

3 BensonDA CavanaughM ClarkK Karsch-MizrahiILipmanDJ OstellJ and SayersEW (2013) GenBank NucleicAcids Res 41 D36ndashD42

4 BlandC RamseyTL SabreeF LoweM BrownKKyrpidesNC and HugenholtzP (2007) CRISPR recognitiontool (CRT) a tool for automatic detection of clustered regularlyinterspaced palindromic repeats BMC Bioinformatics 8 209

5 EdgarRC (2007) PILER-CR fast and accurate identification ofCRISPR repeats BMC Bioinformatics 8 18

6 LoweTM and EddySR (1997) tRNAscan-SE a program forimproved detection of transfer RNA genes in genomic sequenceNucleic Acids Res 25 955ndash964

7 LagesenK HallinP RodlandEA StaerfeldtHH RognesTand UsseryDW (2007) RNAmmer consistent and rapidannotation of ribosomal RNA genes Nucleic Acids Res 353100ndash3108

8 EddySR (2009) A new generation of homology search toolsbased on probabilistic inference Genome Inform 23 205ndash211

9 BurgeSW DaubJ EberhardtR TateJ BarquistLNawrockiEP EddySR GardnerPP and BatemanA (2012)Rfam 110 10 years of RNA families Nucleic Acids Res 41D226ndashD232

10 NawrockiEP KolbeDL and EddySR (2009) Infernal 10inference of RNA alignments Bioinformatics 25 1335ndash1337

11 EmanuelssonO BrunakS von HeijneG and NielsenH (2007)Locating proteins in the cell using TargetP SignalP and relatedtools Nat Protoc 2 953ndash971

12 MollerS CroningMDR and ApweilerR (2001) Evaluation ofmethods for the prediction of membrane spanning regionsBioinformatics 17 646ndash653

13 HyattD ChenGL LocascioPF LandML LarimerFWand HauserLJ (2010) Prodigal prokaryotic gene recognitionand translation initiation site identification BMC Bioinformatics11 119

14 TatusovRL FedorovaND JacksonJD JacobsARKiryutinB KooninEV KrylovDM MazumderRMekhedovSL NikolskayaAN et al (2003) The COGdatabase an updated version includes eukaryotesBMC Bioinformatics 4 41

15 PuntaM CoggillPC EberhardtRY MistryJ TateJBoursnellC PangN ForslundK CericG CelmentsJ et al(2012) The Pfam protein families database Nucleic Acids Res40 D290ndashD301

16 SelengutJD HaftDH DavidsenT GanapathyAGwinn-GiglioM NelsonWC RichterAR and WhiteO (2007)TIGRFAMs and genome properties tools for the assignment ofmolecular function and biological process in prokaryotic genomesNucleic Acids Res 35 D260ndashD264

17 KanehisaM GotoS SatoY FurumichiM and TanabeM(2012) KEGG for integration and interpretation of large scalemolecular data sets Nucleic Acids Res 40 D109ndashD114

18 EdgarRC (2010) Search and clustering orders of magnitudefaster than BLAST Bioinformatics 26 2460ndash2461

19 CaspiR AltmanT DreherK FulcherAC SubhravetiPKeselerIM KothariA KrummenackerM LatendresseMMuellerLA et al (2012) The MetaCyc database of metabolicpathways and enzymes and the BioCyc collection of pathwaygenome databases Nucleic Acids Res 40 D742ndashD753

20 IvanovaNN AndersonI LykidisA MavrommatisKMikhailovaN ChenIA SzetoE PalaniappanKMarkowitzVM and KyrpidesNC (2007) MetabolicReconstruction of Microbial Genomes and Microbial CommunityMetagenomes Technical Report 62292 Lawrence BerkeleyNational Laboratory httpimgjgidoegovwdocimgtermshtml

D566 Nucleic Acids Research 2014 Vol 42 Database issue

by guest on June 5 2015httpnaroxfordjournalsorg

Dow

nloaded from

21 SaierMH Jr YenMR NotoK TamangDG and ElkanC(2009) The transporter classification database recent advancesNucleic Acids Res 37 D274ndashD278

22 EnrightAJ IliopoulosI KyrpidesNC and OuzounisCA(1999) Protein interaction maps for complete genomes based ongene fusion events Nature 402 86ndash90

23 MavromatisK ChuK IvanovaN HooperSDMarkowitzVM and KyrpidesNC (2009) Gene context analysisin the Integrated Microbial Genomes (IMG) data managementsystem accepted for publication PLoS One 4 e7979

24 MarkowitzVM ChenIA PalaniappanK ChuK SzetoEGrechkinY RatnerA BijuJ HuangJ WilliamsP et al(2012) IMG the integrated microbial genomes database andcomparative analysis system Nucleic Acids Res 40 D115ndashD122

25 RinkeC SchwientekP SczyrbaA IvanovaNN AndersonIJChengJF DarlingA MalfattiS SwanBK GiesEA et al(2013) Insights into the phylogeny and coding potential ofmicrobial dark matter Nature 499 431ndash437

26 MarkowitzVM MavromatisK IvanovaNN ChenIAChuK and KyrpidesNC (2009) IMG ER a system formicrobial annotation expert review and curation Bioinformatics25 2271ndash2278

27 LangmeadB and SalzbergSL (2012) Fast gapped-readalignment with Bowtie 2 Nat Methods 9 357ndash359

28 WalshCT and FischbachMA (2010) Natural products version20 connecting genes to molecules J Am Chem Soc 1322469ndash2493

29 MavromatisK ChuK IvanovaN HooperSDMarkowitzVM and KyrpidesNC (2009) Gene context analysisin the Integrated Microbial Genomes (IMG) data managementsystem PLoS One 4 e7979

30 RomosanA ShoshaniA WuK MarkowitzVM andMavromatisK (2013) Accelerating gene context analysis usingbitmaps Proceedings of the 25th International Conference onScientific and Statistical Database Management (SSDBM 2013)

31 ClingenpeelS (2012) JGI microbial single cell program single celldata decontamination httpimgjgidoegovwdocSingleCellDataDecontaminationpdf

32 TettelinH MasignaniV CieslewiczMJ DonatiC MediniDWardNL AngiuoliSV CrabtreeJ JonesAL DurkinASet al (2005) Genome analysis of multiple pathogenicisolates of Streptococcus agalactiae implications for themicrobial lsquolsquopan-genomersquorsquo Proc Natl Acad Sci USA 10213950ndash13955

Nucleic Acids Research 2014 Vol 42 Database issue D567

by guest on June 5 2015httpnaroxfordjournalsorg

Dow

nloaded from

Page 3: IMG 4 version of the integrated microbial genomes ...cmore.soest.hawaii.edu/summercourse/2015/documents...Genome curation is usually carried out for identifying missing genes or for

microbial genomes in the context of IMGrsquos publicgenomes (26) Since August 2011 hundreds of privategenomes have been reviewed and curated using IMGER a relatively small fraction of the 9000 genomes thatwere processed by IMGrsquos data annotation and integrationpipelines as genome curation is a time-consuming processGenome curation is usually carried out for identifyingmissing genes or for correcting functional annotationseg as part of the process of curating IMG native termsand pathways

Omics data

Proteomics datasets have been gradually included into IMG starting in 2009. Since August 2011, 64 new protein expression datasets (samples) that are part of two studies were included into IMG, bringing the total to 90 samples across five studies. The organization and analysis of proteomic data in IMG are discussed in (24).

The first RNA-Seq (transcriptomic) datasets included into IMG in 2011 are part of the Synechococcus PCC study consisting of 40 samples (Billis,K., Billini,M., Kyrpides,N.C. and Mavromatis,K., submitted for publication). As of August 2013, IMG contains 99 samples across 10 RNA-Seq studies. A typical RNA-Seq study involves the sequencing of cDNA from a genome under different experimental conditions, with the effect of each experimental condition being captured by a sample. As part of RNA-Seq sequencing analysis, reads are mapped to the reference genome involved in the study, and the expressed genes in each sample are recorded with their observed read counts, mean, median and strand. RNA reads are mapped to reference genomes using Bowtie2 (27). The scope of mapping is determined by the type of cDNA sample (sscDNA/dscDNA) and the directionality of the libraries, whereby reads may map to a single strand or both strands of the reference sequence. Expression levels are normalized by computing RPKM (reads per kilobase per million), Quantile or Affine transformations, and may need to be interpreted based on the type of cDNA in the sample. For genomes involved in RNA-Seq studies, the experiments/samples are recorded in IMG together with experimental conditions, and the read counts are organized per expressed gene, as illustrated in Figure 1.
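As a rough illustration of the RPKM normalization mentioned above, the following Python sketch computes reads per kilobase per million mapped reads for a few genes. The gene identifiers, gene lengths, read counts and the function name `rpkm` are made-up values for this example; this is the textbook RPKM formula, not IMG's pipeline code.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length in kb) / (total reads in millions)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical sample: gene id -> (read count, gene length in bp)
sample = {"geneA": (500, 1000), "geneB": (250, 2000), "geneC": (0, 1500)}
total_reads = 2_000_000  # assumed total number of mapped reads in the sample

normalized = {
    gene: rpkm(count, length, total_reads)
    for gene, (count, length) in sample.items()
}
```

Because RPKM divides by gene length, a long gene and a short gene with the same raw read count receive different normalized values, which is why raw counts alone are not comparable across genes.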

Figure 1. RNA-Seq data organization. (i) ‘Omics’ datasets generated can be accessed from ‘IMG Statistics’ on IMG’s front page, following the Experiments link available on the ‘IMG Statistics’ page. (ii) An RNA-Seq study is associated with samples and the number of genes expressed across all samples. (iii) Each sample is associated with the number of expressed genes, the total number of reads and the average number of reads per gene. (iv) An expressed gene is associated with a read count (total number of reads divided by the size of the gene) and normalized coverage (coverage for a gene in the experiment divided by the total number of reads in that experiment).


Biosynthetic clusters

IMG contains biosynthetic clusters of genes associated with pathways involved in the generation of secondary metabolites in isolate prokaryotic genomes. Experimentally validated biosynthetic clusters were identified by searching NCBI’s nucleotide database for genome fragments (partially sequenced genomes) containing gene clusters associated with secondary metabolites/natural products (28). Additional biosynthetic clusters were predicted using ClusterFinder (Fischbach, submitted for publication). Biosynthetic clusters in IMG are associated with IMG, MetaCyc and KEGG pathways, as well as information available in GOLD on their natural products.

Genomes associated with biosynthetic clusters can be examined as illustrated in Figure 2, where these genomes are listed in descending order of the number of biosynthetic clusters present in them. Alternatively, IMG can be used to find genomes associated with natural products associated with genome fragments, but not with biosynthetic clusters, as illustrated in Figure 2(v). Natural products are small metabolites found in nature, and although the biosynthetic clusters associated with the generation of natural products have been identified, there are still natural products whose production mechanisms in prokaryotes remain unknown.

ANALYSIS TOOLS

Browsers and search tools allow finding and selecting genomes, genes and functions of interest, which can then be examined individually or analyzed in a comparative context. Gene content-based comparison of genomes is provided by the ‘Phylogenetic Profiler’ and the ‘Phylogenetic Profiler for Gene Cassettes’ tools that allow identifying genes in a query genome in terms of presence or absence of homologs in other genomes, or participation in conserved gene cassettes across other genomes (29,30). Function-based comparison of genomes is provided by the ‘Abundance Profile Overview’ and ‘Function Profile’ tools that allow comparing the relative abundance of protein families (COGs, Pfams, TIGRfams) and functional families (enzymes) across genomes. The composition of analysis operations is facilitated by genome, scaffold, gene and function ‘carts’ that handle lists of genomes, scaffolds, genes and functions, respectively. IMG analysis tools have been discussed in (24). Tools for identifying and correcting annotation anomalies, such as dubious protein product names, and for filling annotation gaps, such as genes that may have been missed by gene prediction tools or genes without predicted functions, are discussed in (26). IMG analysis tool extensions have addressed performance (30), data quality control such as single cell data decontamination (31), and new data types such as RNA-Seq and biosynthetic cluster data.

Figure 2. Biosynthetic clusters. (i) Genomes associated with biosynthetic clusters can be retrieved and examined using the ‘Genome Browser’. (ii) The number of biosynthetic clusters is provided in the ‘Genome Statistics’ section of the ‘Organism Detail’ page of a genome, together with a hyperlink to (iii) the list of biosynthetic clusters, whereby for each cluster the number of associated genes, the evidence type and the corresponding natural product are provided. (iv) A biosynthetic cluster can be examined using the ‘Biosynthetic Cluster Detail’ page, which includes information about the cluster. (v) ‘Natural Product List’ provides the list of the IMG genomes associated with natural products.

RNA-Seq studies can be accessed from the ‘Experiments Statistics’ section of the ‘IMG Statistics’ page or the ‘Organism Details’ pages of their associated genomes. For example, the ‘Expression Studies’ link in the ‘Organism Details’ page, such as that shown in Figure 3(i), leads to the list of associated RNA-Seq samples, such as the list shown in Figure 3(ii).
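The presence/absence query behind the gene content-based comparison described earlier in this section can be sketched in a few lines of Python. This is only an illustration of the idea: the gene and genome identifiers, the precomputed `homologs` mapping and the function name `phylogenetic_profile` are hypothetical, and the sketch says nothing about how IMG computes homologs or implements its ‘Phylogenetic Profiler’.

```python
def phylogenetic_profile(query_genes, homologs, present_in, absent_from):
    """Select query genes with homologs in every genome of `present_in`
    and in no genome of `absent_from`."""
    required = set(present_in)
    excluded = set(absent_from)
    selected = []
    for gene in query_genes:
        genomes_with_homolog = homologs.get(gene, set())
        if required <= genomes_with_homolog and not (excluded & genomes_with_homolog):
            selected.append(gene)
    return selected


# Hypothetical input: query gene -> genomes that contain a homolog of it
homologs = {
    "q1": {"G1", "G2"},
    "q2": {"G1"},
    "q3": {"G1", "G2", "G3"},
}

hits = phylogenetic_profile(
    ["q1", "q2", "q3"], homologs, present_in=["G1", "G2"], absent_from=["G3"]
)
```

Here only `q1` satisfies both conditions: `q2` lacks a homolog in `G2`, and `q3` has a homolog in the excluded genome `G3`.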

Figure 3. RNA-Seq data exploration. (i) The list of RNA-Seq studies associated with a genome can be accessed from its ‘Organism Details’, with each study associated with (ii) a list of RNA-Seq experiments (samples). Individual samples can be selected for further analysis, such as (iii) examining its expressed genes as a list or using the (iv) chromosome viewer. A sample can be also examined in the context of (v) pathways that have at least one enzyme associated with an expressed gene in the sample, whereby for each pathway (vi) enzymes are displayed with colors representing the level of expression for the associated genes; mousing over an enzyme shows the number of expressed genes associated with the enzyme.

RNA-Seq studies associated with a genome can be compared using pairwise or multiple sample analysis tools, as illustrated in Figure 4. After samples are selected for comparison (Figure 4(i)), pairs of samples can be compared in terms of up- or downregulation of genes, as illustrated in Figure 4(ii), with a threshold specified for the difference in gene expression. The difference in expression is computed using the logR = log2(query/reference) or the RelDiff = 2(query − reference)/(query + reference) metric. The comparison can be first previewed using a histogram, as illustrated in Figure 4(iii), which can help set the thresholds for the search of over-expressed or under-expressed genes between a pair of samples. The result of the comparison can be examined at the level of individual up- and downregulated genes, which can be selected for inclusion in the ‘Gene Cart’ for further analysis. Alternatively, the result of the comparison can be examined in terms of functions, as illustrated in Figure 4(iv), with genes associated with KEGG pathways or COG functions grouped together. Genes associated with a specific KEGG pathway can be examined in the context of the pathway, similar to the example shown in Figure 3(vi) earlier. The strength of the association between pairs of samples can be examined using ‘Spearman’s Rank Correlation’, as illustrated in Figure 4(v), whereas ‘Linear Regression’ analysis, illustrated in Figure 4(vi), helps determine whether two samples are technical replicates.
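The two difference metrics used in pairwise sample comparison are simple enough to write out directly. The Python sketch below implements logR and RelDiff and applies a symmetric threshold to label genes; the expression values, the threshold of 1.0 and the helper names `log_r`, `rel_diff` and `classify` are assumptions made for illustration and do not reflect IMG's defaults or code.

```python
import math


def log_r(query, reference):
    """logR = log2(query / reference); both values must be positive."""
    return math.log2(query / reference)


def rel_diff(query, reference):
    """RelDiff = 2 * (query - reference) / (query + reference)."""
    return 2 * (query - reference) / (query + reference)


def classify(query, reference, threshold=1.0, metric=log_r):
    """Label a gene as up- or downregulated in the query sample."""
    value = metric(query, reference)
    if value >= threshold:
        return "up"
    if value <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical normalized expression values: gene -> (query, reference)
expression = {
    "geneA": (80.0, 20.0),
    "geneB": (10.0, 40.0),
    "geneC": (30.0, 30.0),
}
calls = {gene: classify(q, r) for gene, (q, r) in expression.items()}
```

One practical difference between the metrics: logR is undefined when either value is zero, whereas RelDiff stays defined as long as the two values are not both zero and is bounded between −2 and 2 for non-negative inputs.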

Multiple RNA-Seq sample analysis usually involves clustering based on the abundance of expressed genes, where the proximity of grouping indicates the relative degree of similarity of samples to each other. There is a choice of clustering methods, such as pairwise complete linkage and pairwise single linkage, and distance measures, such as Pearson correlation, Spearman’s rank correlation and Euclidean distance, as illustrated in Figure 4(vii). The result of clustering is displayed as a hierarchical tree of samples and a normalized heat map of coverage values for each gene for each sample. Clusters of multiple samples can be also examined in the context of pathways, as illustrated in Figure 4(viii), whereby enzymes are displayed with colors representing the cluster.
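To make the choice of linkage method and distance measure concrete, here is a minimal pure-Python sketch of agglomerative clustering of samples. It assumes a small dictionary of hypothetical per-gene coverage values; the function names and data are invented for the example, and the naive algorithm shown is a generic textbook procedure rather than the clustering implementation used by IMG.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation; undefined for constant profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)


def cluster(profiles, distance=euclidean, linkage=max):
    """Agglomerative clustering of samples.

    linkage=max gives complete linkage, linkage=min gives single linkage.
    Returns the merges in order as (cluster_1, cluster_2, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        def cluster_distance(pair):
            left, right = pair
            return linkage(
                distance(profiles[a], profiles[b]) for a in left for b in right
            )

        # Merge the two closest clusters under the chosen linkage
        c1, c2 = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((c1, c2, cluster_distance((c1, c2))))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical coverage values for three genes in four samples
profiles = {
    "sample1": [10.0, 200.0, 5.0],
    "sample2": [12.0, 190.0, 6.0],
    "sample3": [150.0, 20.0, 80.0],
    "sample4": [140.0, 25.0, 90.0],
}
merges = cluster(profiles, distance=euclidean, linkage=max)
```

The order of the merges corresponds to the hierarchical tree: samples joined early (at small distances) sit close together in the tree, while the final merge joins the two most dissimilar groups.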

FUTURE PLANS

IMG’s genome sequence data content is maintained through regular updates managed by the IMG submission system and involving new genomes sequenced at JGI, genomes sequenced at other organizations and submitted for inclusion into IMG by scientists worldwide, and genomes from GenBank. For genomes with multiple submissions, only the latest version is kept in IMG. IMG genome data are distributed through genome data portals available at http://genome.jgi.doe.gov. IMG’s data annotation and integration pipelines have been automated, thus improving their ability to keep pace with the rapidly increasing number of sequenced genomes.

Figure 4. RNA-Seq data comparison. (i) RNA-Seq sample comparison starts with the selection of samples of interest. (ii) ‘Pairwise Sample Analysis’ supports comparing samples in terms of up/downregulated genes, with (iii) a histogram preview helping set the thresholds for comparison. (iv) The result of the comparison can be examined in terms of functions, whereby genes associated with KEGG pathways or COG functions are grouped together. (v) The strength of the association of gene expression between pairs of samples can be examined using ‘Spearman’s Rank Correlation’. (vi) ‘Linear Regression’ analysis helps estimate whether two samples are technical replicates. (vii) ‘Multiple Sample Analysis’ consists of clustering samples based on the abundance of expressed genes using a variety of clustering methods. (viii) Clusters of samples can be examined in the context of pathways, whereby enzymes are displayed with colors representing the cluster.

IMG’s integrated data framework allows assessing and improving the quality of genome annotations. Thus, the quality of gene models for genomes available in public resources is known to vary greatly depending on the quality of sequence and the software used for annotation. An analysis conducted at JGI of the protein-coding genes of microbial genomes in GenBank indicates that 10% (>1 million) of predicted protein-coding genes are erroneous: they are false-positive genes, unidentified pseudogene fragments or genes with translational exceptions, or have incorrectly predicted start sites. To improve the consistency of annotation and the quality of predicted genes, all public microbial genomes in IMG will be re-annotated using IMG’s annotation pipeline.

A rapidly increasing number of single cell genomes are included into IMG. Typically, the first version of a single cell genome is analyzed for identifying contigs that may come from contaminant (e.g. Pseudomonas, Ralstonia) organisms. The sequence of analysis steps needed to identify and remove contaminated contigs is described in (31).

The importance of functional genomics in validating gene function in an integrated comparative genomics context is also being underscored, pushing experimental data from methylomics and transposon mutagenesis experiments into IMG. Systematic paradigms for associating computationally predicted gene structural and functional information with experimental functional genomics are being constructed. Tools are being developed for mining and visualizing different types of Omics datasets in an integrated genomic context.

IMG’s users are faced with the increasing burden of analyzing a rapidly growing number of genomic datasets. This analytical challenge can be alleviated by synthesizing genomic data using the ‘pangenome’ conceptual abstraction (32). A pangenome consists of the core part of a species (i.e. the genes present in all of the sequenced strains or of all samples of a microbial community) and the variable part (the genes present in some but not all of the strains or samples). An experimental version of IMG has been extended with five pangenomes, as well as analysis tools and viewers that allow users to explore individual pangenomes and compare pangenomes and genomes. A public version of IMG containing pangenome data and analysis tools is expected to be released in the near future.
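The core/variable split in the pangenome definition above maps directly onto set operations. The following Python sketch assumes that genes have already been grouped so that the same label in two strains denotes the same gene family; the strain names, the labels `toxX` and `capY`, and the function name `pangenome` are hypothetical and serve only to illustrate the definition.

```python
def pangenome(strain_genes):
    """Split genes into core (present in every strain) and
    variable (present in some but not all strains)."""
    gene_sets = [set(genes) for genes in strain_genes.values()]
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)
    variable = set.union(*gene_sets) - core
    return core, variable


# Hypothetical strains with gene family labels
strains = {
    "strain1": {"dnaA", "gyrB", "recA", "toxX"},
    "strain2": {"dnaA", "gyrB", "recA"},
    "strain3": {"dnaA", "gyrB", "capY"},
}
core, variable = pangenome(strains)
```

With these inputs, `dnaA` and `gyrB` form the core part, while `recA`, `toxX` and `capY` fall into the variable part because each is missing from at least one strain.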

FUNDING

The Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, U.S. Department of Energy under Contract No. [DE-AC02-05CH11231]. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. [DE-AC02-05CH11231]. Funding for open access charge: University of California.

Conflict of interest statement. None declared.

REFERENCES

1. Pruitt,K.D., Tatusova,T., Garth,R.B. and Maglott,D.R. (2012) NCBI Reference Sequences (RefSeq): current status, new features and genome annotation. Nucleic Acids Res., 40, D130–D135.

2. Pagani,I., Liolios,K., Jansson,J., Chen,I.M., Smirnova,T., Nosrat,B., Markowitz,V.M. and Kyrpides,N.C. (2012) The Genomes on Line Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res., 40, D571–D579.

3. Benson,D.A., Cavanaugh,M., Clark,K., Karsch-Mizrahi,I., Lipman,D.J., Ostell,J. and Sayers,E.W. (2013) GenBank. Nucleic Acids Res., 41, D36–D42.

4. Bland,C., Ramsey,T.L., Sabree,F., Lowe,M., Brown,K., Kyrpides,N.C. and Hugenholtz,P. (2007) CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics, 8, 209.

5. Edgar,R.C. (2007) PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics, 8, 18.

6. Lowe,T.M. and Eddy,S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res., 25, 955–964.

7. Lagesen,K., Hallin,P., Rodland,E.A., Staerfeldt,H.H., Rognes,T. and Ussery,D.W. (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res., 35, 3100–3108.

8. Eddy,S.R. (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform., 23, 205–211.

9. Burge,S.W., Daub,J., Eberhardt,R., Tate,J., Barquist,L., Nawrocki,E.P., Eddy,S.R., Gardner,P.P. and Bateman,A. (2012) Rfam 11.0: 10 years of RNA families. Nucleic Acids Res., 41, D226–D232.

10. Nawrocki,E.P., Kolbe,D.L. and Eddy,S.R. (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics, 25, 1335–1337.

11. Emanuelsson,O., Brunak,S., von Heijne,G. and Nielsen,H. (2007) Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc., 2, 953–971.

12. Moller,S., Croning,M.D.R. and Apweiler,R. (2001) Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics, 17, 646–653.

13. Hyatt,D., Chen,G.L., Locascio,P.F., Land,M.L., Larimer,F.W. and Hauser,L.J. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 119.

14. Tatusov,R.L., Fedorova,N.D., Jackson,J.D., Jacobs,A.R., Kiryutin,B., Koonin,E.V., Krylov,D.M., Mazumder,R., Mekhedov,S.L., Nikolskaya,A.N. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.

15. Punta,M., Coggill,P.C., Eberhardt,R.Y., Mistry,J., Tate,J., Boursnell,C., Pang,N., Forslund,K., Ceric,G., Clements,J. et al. (2012) The Pfam protein families database. Nucleic Acids Res., 40, D290–D301.

16. Selengut,J.D., Haft,D.H., Davidsen,T., Ganapathy,A., Gwinn-Giglio,M., Nelson,W.C., Richter,A.R. and White,O. (2007) TIGRFAMs and genome properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res., 35, D260–D264.

17. Kanehisa,M., Goto,S., Sato,Y., Furumichi,M. and Tanabe,M. (2012) KEGG for integration and interpretation of large scale molecular data sets. Nucleic Acids Res., 40, D109–D114.

18. Edgar,R.C. (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461.

19. Caspi,R., Altman,T., Dreher,K., Fulcher,A.C., Subhraveti,P., Keseler,I.M., Kothari,A., Krummenacker,M., Latendresse,M., Mueller,L.A. et al. (2012) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res., 40, D742–D753.

20. Ivanova,N.N., Anderson,I., Lykidis,A., Mavrommatis,K., Mikhailova,N., Chen,I.A., Szeto,E., Palaniappan,K., Markowitz,V.M. and Kyrpides,N.C. (2007) Metabolic Reconstruction of Microbial Genomes and Microbial Community Metagenomes. Technical Report 62292, Lawrence Berkeley National Laboratory. http://img.jgi.doe.gov/w/doc/imgterms.html.


21. Saier,M.H. Jr, Yen,M.R., Noto,K., Tamang,D.G. and Elkan,C. (2009) The transporter classification database: recent advances. Nucleic Acids Res., 37, D274–D278.

22. Enright,A.J., Iliopoulos,I., Kyrpides,N.C. and Ouzounis,C.A. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86–90.

23. Mavromatis,K., Chu,K., Ivanova,N., Hooper,S.D., Markowitz,V.M. and Kyrpides,N.C. (2009) Gene context analysis in the Integrated Microbial Genomes (IMG) data management system, accepted for publication. PLoS One, 4, e7979.

24. Markowitz,V.M., Chen,I.A., Palaniappan,K., Chu,K., Szeto,E., Grechkin,Y., Ratner,A., Biju,J., Huang,J., Williams,P. et al. (2012) IMG: the integrated microbial genomes database and comparative analysis system. Nucleic Acids Res., 40, D115–D122.

25. Rinke,C., Schwientek,P., Sczyrba,A., Ivanova,N.N., Anderson,I.J., Cheng,J.F., Darling,A., Malfatti,S., Swan,B.K., Gies,E.A. et al. (2013) Insights into the phylogeny and coding potential of microbial dark matter. Nature, 499, 431–437.

26. Markowitz,V.M., Mavromatis,K., Ivanova,N.N., Chen,I.A., Chu,K. and Kyrpides,N.C. (2009) IMG ER: a system for microbial annotation expert review and curation. Bioinformatics, 25, 2271–2278.

27. Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.

28. Walsh,C.T. and Fischbach,M.A. (2010) Natural products version 2.0: connecting genes to molecules. J. Am. Chem. Soc., 132, 2469–2493.

29. Mavromatis,K., Chu,K., Ivanova,N., Hooper,S.D., Markowitz,V.M. and Kyrpides,N.C. (2009) Gene context analysis in the Integrated Microbial Genomes (IMG) data management system. PLoS One, 4, e7979.

30. Romosan,A., Shoshani,A., Wu,K., Markowitz,V.M. and Mavromatis,K. (2013) Accelerating gene context analysis using bitmaps. Proceedings of the 25th International Conference on Scientific and Statistical Database Management (SSDBM 2013).

31. Clingenpeel,S. (2012) JGI microbial single cell program single cell data decontamination. http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf.

32. Tettelin,H., Masignani,V., Cieslewicz,M.J., Donati,C., Medini,D., Ward,N.L., Angiuoli,S.V., Crabtree,J., Jones,A.L., Durkin,A.S. et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc. Natl Acad. Sci. USA, 102, 13950–13955.

Nucleic Acids Research 2014 Vol 42 Database issue D567

by guest on June 5 2015httpnaroxfordjournalsorg

Dow

nloaded from

Page 4: IMG 4 version of the integrated microbial genomes ...cmore.soest.hawaii.edu/summercourse/2015/documents...Genome curation is usually carried out for identifying missing genes or for

Biosynthetic clusters

IMG contains biosynthetic clusters of genes associatedwith pathways involved in the generation of secondary me-tabolites in isolate prokaryotic genomes Experimentallyvalidated biosynthetic clusters were identified by searchingNCBIrsquos nucleotide database for genome fragments(partially sequenced genomes) containing gene clustersassociated with secondary metabolitesnatural products(28) Additional biosynthetic clusters were predictedusing ClusterFinder (Fischbach submitted for publica-tion) Biosynthetic clusters in IMG are associated withIMG Metacyc and KEGG pathways as well as informa-tion available in GOLD on their natural products

Genomes associated with biosynthetic clusters can beexamined as illustrated in Figure 2 where these genomesare listed in descending order of the number of biosyn-thetic clusters present in them Alternatively IMG can beused to find genomes associated with natural productsassociated with genome fragments but not with biosyn-thetic clusters as illustrated in Figure 2(v) Naturalproducts are small metabolites found in nature and

although the biosynthetic clusters associated with thegeneration of natural products have been identifiedthere are still natural products whose production mechan-isms in prokaryotes remain unknown

ANALYSIS TOOLS

Browsers and search tools allow finding and selectinggenomes genes and functions of interest which can thenbe examined individually or analyzed in a comparativecontext Gene content-based comparison of genomes isprovided by the lsquoPhylogenetic Profilerrsquo and thelsquoPhylogenetic Profiler for Gene Cassettesrsquo tools thatallow identifying genes in a query genome in terms ofpresence or absence of homologs in other genomes orparticipation in conserved gene cassettes across othergenomes (2930) Function-based comparison ofgenomes is provided by the lsquoAbundance ProfileOverviewrsquo and lsquoFunction Profilersquo tools that allowcomparing the relative abundance of protein families(COGs Pfams TIGRfams) and functional families(enzymes) across genomes The composition of analysis

Figure 2 Biosynthetic clusters (i) Genomes associated with biosynthetic clusters can be retrieved and examined using the lsquoGenome Browserrsquo (ii) Thenumber of biosynthetic clusters is provided in the lsquoGenome Statisticsrsquo section of the lsquoOrganism Detailrsquo page of a genome together with a hyperlink to(iii) the list of biosynthetic clusters whereby for each cluster the number of associated genes the evidence type and the corresponding natural productare provided (iv) A biosynthetic cluster can be examined using the lsquoBiosynthetic Cluster Detailrsquo page which includes information about the cluster(v) lsquoNatural Product Listrsquo provides the list of the IMG genomes associated with natural products

Nucleic Acids Research 2014 Vol 42 Database issue D563

by guest on June 5 2015httpnaroxfordjournalsorg

Dow

nloaded from

operations is facilitated by genome scaffold gene andfunction lsquocartsrsquo that handle lists of genomes scaffoldsgenes and functions respectively IMG analysis toolshave been discussed in (24) Tools for identifying and cor-recting annotation anomalies such as dubious proteinproduct names and for filling annotation gaps such asgenes that may have been missed by gene prediction toolsor genes without predicted functions are discussed in (26)IMG analysis tool extensions have addressed performance(30) data quality control such as single cell data decon-tamination (31) and new data types such as RNA-Seqand biosynthetic cluster dataRNA-Seq studies can be accessed from the

lsquoExperiments Statisticsrsquo section of the lsquoIMG Statisticsrsquopage or the lsquoOrganism Detailsrsquo pages of their associatedgenomes For example the lsquoExpression Studiesrsquo link in thelsquoOrganism Detailsrsquo page such as that shown in Figure 3(i)leads to the list of associated RNA-Seq samples as the listshown in Figure 3(ii)

RNA-Seq studies associated with a genome can becompared using pairwise or multiple sample analysistools as illustrated in Figure 4 After samples areselected for comparison (Figure 4(i)) pairs of samplescan be compared in terms of up- or downregulationof genes as illustrated in Figure 4(ii) with a thresh-old specified for the difference in gene expression Thedifference in expression is computed using thelogR= log2(queryreference) or the RelDiff=2(query-reference)(query+reference) metric The comparisoncan be first previewed using a histogram as illustrated inFigure 4(iii) which can help set the thresholds for thesearch of over-expressed or under-expressed genesbetween a pair of samples The result of the comparisoncan be examined at the level of individual up- anddownregulated genes which can be selected for inclusionin the lsquoGene Cartrsquo for further analysis Alternatively theresult of the comparison can be examined in terms of func-tions as illustrated in Figure 4(iv) with genes associated

Figure 3 RNA-Seq data exploration (i) The list of RNA-Seq studies associated with a genome can be accessed from its lsquoOrganism Detailsrsquo witheach study associated with (ii) a list of RNA-Seq experiments (samples) Individual samples can be selected for further analysis such as(iii) examining its expressed genes as a list or using the (iv) chromosome viewer A sample can be also examined in the context of (v) pathwaysthat have at least one enzyme associated with an expressed gene in the sample whereby for each pathway (vi) enzymes are displayed with colorsrepresenting the level of expression for the associated genes mousing over an enzyme shows the number of expressed genes associated with theenzyme

D564 Nucleic Acids Research 2014 Vol 42 Database issue

by guest on June 5 2015httpnaroxfordjournalsorg

Dow

nloaded from

with KEGG pathways or COG functions groupedtogether Genes associated with a specific KEGGpathway can be examined in the context of the pathwaysimilar to the example shown in Figure 3(vi) earlier Thestrength of the association between pairs of samples canbe examined using lsquoSpearmanrsquos Rank Correlationrsquo asillustrated in Figure 4(v) whereas lsquoLinear Regressionrsquoanalysis illustrated in Figure 4(vi) helps determinewhether two samples are technical replicates

Multiple RNA-Seq sample analysis usually involvesclustering based on the abundance of expressed geneswhere the proximity of grouping indicates the relativedegree of similarity of samples to each other There is achoice of clustering methods such as pairwise completelinkage and pairwise single linkage and distance measuresuch as Pearson correlation Spearmanrsquos rank correlationand Euclidean distance as illustrated in Figure 4(vii) Theresult of clustering is displayed as a hierarchical tree of

samples and a normalized heat map of coverage values foreach gene for each sample Clusters of multiple samplescan be also examined in the context of pathways asillustrated in Figure 4(viii) whereby enzymes aredisplayed with colors representing the cluster

FUTURE PLANS

IMGrsquos genome sequence data content is maintainedthrough regular updates managed by the IMG submissionsystem and involving new genomes sequenced at JGIgenomes sequenced at other organizations and submittedfor inclusion into IMG by scientists worldwide andgenomes from Genbank For genomes with multiple sub-missions only the latest version is kept in IMG IMGgenome data are distributed through genome dataportals available at httpgenomejgidoegov IMGrsquosdata annotation and integration pipelines have been

Figure 4 RNA-Seq data comparison (i) RNA-Seq sample comparison starts with the selection of samples of interest (ii) lsquoPairwise Sample Analysisrsquosupports comparing samples in terms of updownregulated genes with (iii) a histogram preview helping setting the thresholds for comparison (iv)The result of the comparison can be examined in terms of functions whereby genes associated with KEGG pathways or COG functions are groupedtogether (v) The strength of the association of gene expression between pairs of samples can be examined using lsquoSpearmanrsquos Rank Correlationrsquo (vi)lsquoLinear Regressionrsquo analysis helps estimate whether two samples are technical replicates (vii) lsquoMultiple Sample Analysisrsquo consists of clusteringsamples based on the abundance of expressed genes using a variety of clustering methods (viii) Clusters of samples can be examined in thecontext of pathways whereby enzymes are displayed with colors representing the cluster

Nucleic Acids Research 2014 Vol 42 Database issue D565

by guest on June 5 2015httpnaroxfordjournalsorg

Dow

nloaded from

automated thus improving their ability to keep pace withthe rapidly increasing number of sequenced genomesIMGrsquos integrated data framework allows assessing and

improving the quality of genome annotations Thus thequality of gene models for genomes available in publicresources is known to vary greatly depending on thequality of sequence and the software used for annotationAn analysis conducted at JGI of the protein-coding genesof microbial genes in Genbank indicates that 10 (gt1million) of predicted protein-coding are erroneous theyare false-positive genes unidentified pseudogene frag-ments or genes with translational exceptions or haveincorrectly predicted start sites To improve the consist-ency of annotation and the quality of predicted genes allpublic microbial genomes in IMG will be re-annotatedusing IMGrsquos annotation pipelineA rapidly increasing number of single cell genomes are

included into IMG Typically the first version of a singlecell genome is analyzed for identifying contigs that maycome from contaminant (eg Pseudomonas Ralstonia) or-ganisms The sequence of analysis steps needed to identifyand remove contaminated contigs is described in (31)The importance of functional genomics in validating

gene function in an integrated comparative genomicscontext is also being underscored pushing experimentaldata from methylomics and transposon mutagenesisexperiments into IMG Systematic paradigms forassociating computationally predicted gene structuraland functional information with experimental functionalgenomics are being constructed Tools are beingdeveloped for mining and visualizing different types ofOmics datasets in an integrated genomic contextIMGrsquos users are faced with the increasing burden of

analyzing a rapidly growing number of genomicdatasets This analytical challenge can be alleviated bysynthesizing genomic data using the lsquopangenomersquo concep-tual abstraction (32) A pangenome consists of the corepart of a species (ie the genes present in all of thesequenced strains or of all samples of a microbial commu-nity) and the variable part (the genes present in some butnot all of the strains or samples) An experimental versionof IMG has been extended with five pangenomes as wellas analysis tools and viewers that allow users to exploreindividual pangenomes and compare pangenomes andgenomes A public version of IMG containing pangenomedata and analysis tools is expected to be released in thenear future

FUNDING

The Director Office of Science Office of Biological andEnvironmental Research Life Sciences Division USDepartment of Energy under Contract No [DE-AC02-05CH11231] This research used resources of theNational Energy Research Scientific Computing Centerwhich is supported by the Office of Science of the USDepartment of Energy under Contract No [DE-AC02-05CH11231] Funding for open access charge Universityof California

Conflict of interest statement None declared

REFERENCES

1. Pruitt, K.D., Tatusova, T., Garth, R.B. and Maglott, D.R. (2012) NCBI Reference Sequences (RefSeq): current status, new features and genome annotation. Nucleic Acids Res., 40, D130–D135.

2. Pagani, I., Liolios, K., Jansson, J., Chen, I.M., Smirnova, T., Nosrat, B., Markowitz, V.M. and Kyrpides, N.C. (2012) The Genomes on Line Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res., 40, D571–D579.

3. Benson, D.A., Cavanaugh, M., Clark, K., Karsch-Mizrahi, I., Lipman, D.J., Ostell, J. and Sayers, E.W. (2013) GenBank. Nucleic Acids Res., 41, D36–D42.

4. Bland, C., Ramsey, T.L., Sabree, F., Lowe, M., Brown, K., Kyrpides, N.C. and Hugenholtz, P. (2007) CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics, 8, 209.

5. Edgar, R.C. (2007) PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics, 8, 18.

6. Lowe, T.M. and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res., 25, 955–964.

7. Lagesen, K., Hallin, P., Rodland, E.A., Staerfeldt, H.H., Rognes, T. and Ussery, D.W. (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res., 35, 3100–3108.

8. Eddy, S.R. (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform., 23, 205–211.

9. Burge, S.W., Daub, J., Eberhardt, R., Tate, J., Barquist, L., Nawrocki, E.P., Eddy, S.R., Gardner, P.P. and Bateman, A. (2012) Rfam 11.0: 10 years of RNA families. Nucleic Acids Res., 41, D226–D232.

10. Nawrocki, E.P., Kolbe, D.L. and Eddy, S.R. (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics, 25, 1335–1337.

11. Emanuelsson, O., Brunak, S., von Heijne, G. and Nielsen, H. (2007) Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc., 2, 953–971.

12. Moller, S., Croning, M.D.R. and Apweiler, R. (2001) Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics, 17, 646–653.

13. Hyatt, D., Chen, G.L., Locascio, P.F., Land, M.L., Larimer, F.W. and Hauser, L.J. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 119.

14. Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.

15. Punta, M., Coggill, P.C., Eberhardt, R.Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J. et al. (2012) The Pfam protein families database. Nucleic Acids Res., 40, D290–D301.

16. Selengut, J.D., Haft, D.H., Davidsen, T., Ganapathy, A., Gwinn-Giglio, M., Nelson, W.C., Richter, A.R. and White, O. (2007) TIGRFAMs and genome properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res., 35, D260–D264.

17. Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. and Tanabe, M. (2012) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res., 40, D109–D114.

18. Edgar, R.C. (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461.

19. Caspi, R., Altman, T., Dreher, K., Fulcher, A.C., Subhraveti, P., Keseler, I.M., Kothari, A., Krummenacker, M., Latendresse, M., Mueller, L.A. et al. (2012) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res., 40, D742–D753.

20. Ivanova, N.N., Anderson, I., Lykidis, A., Mavrommatis, K., Mikhailova, N., Chen, I.A., Szeto, E., Palaniappan, K., Markowitz, V.M. and Kyrpides, N.C. (2007) Metabolic Reconstruction of Microbial Genomes and Microbial Community Metagenomes. Technical Report 62292, Lawrence Berkeley National Laboratory. http://img.jgi.doe.gov/w/doc/imgterms.html.

Nucleic Acids Research, 2014, Vol. 42, Database issue


21. Saier, M.H. Jr, Yen, M.R., Noto, K., Tamang, D.G. and Elkan, C. (2009) The transporter classification database: recent advances. Nucleic Acids Res., 37, D274–D278.

22. Enright, A.J., Iliopoulos, I., Kyrpides, N.C. and Ouzounis, C.A. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86–90.

23. Mavromatis, K., Chu, K., Ivanova, N., Hooper, S.D., Markowitz, V.M. and Kyrpides, N.C. (2009) Gene context analysis in the Integrated Microbial Genomes (IMG) data management system, accepted for publication. PLoS One, 4, e7979.

24. Markowitz, V.M., Chen, I.A., Palaniappan, K., Chu, K., Szeto, E., Grechkin, Y., Ratner, A., Biju, J., Huang, J., Williams, P. et al. (2012) IMG: the integrated microbial genomes database and comparative analysis system. Nucleic Acids Res., 40, D115–D122.

25. Rinke, C., Schwientek, P., Sczyrba, A., Ivanova, N.N., Anderson, I.J., Cheng, J.F., Darling, A., Malfatti, S., Swan, B.K., Gies, E.A. et al. (2013) Insights into the phylogeny and coding potential of microbial dark matter. Nature, 499, 431–437.

26. Markowitz, V.M., Mavromatis, K., Ivanova, N.N., Chen, I.A., Chu, K. and Kyrpides, N.C. (2009) IMG ER: a system for microbial annotation expert review and curation. Bioinformatics, 25, 2271–2278.

27. Langmead, B. and Salzberg, S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.

28. Walsh, C.T. and Fischbach, M.A. (2010) Natural products version 2.0: connecting genes to molecules. J. Am. Chem. Soc., 132, 2469–2493.

29. Mavromatis, K., Chu, K., Ivanova, N., Hooper, S.D., Markowitz, V.M. and Kyrpides, N.C. (2009) Gene context analysis in the Integrated Microbial Genomes (IMG) data management system. PLoS One, 4, e7979.

30. Romosan, A., Shoshani, A., Wu, K., Markowitz, V.M. and Mavromatis, K. (2013) Accelerating gene context analysis using bitmaps. Proceedings of the 25th International Conference on Scientific and Statistical Database Management (SSDBM 2013).

31. Clingenpeel, S. (2012) JGI microbial single cell program single cell data decontamination. http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf.

32. Tettelin, H., Masignani, V., Cieslewicz, M.J., Donati, C., Medini, D., Ward, N.L., Angiuoli, S.V., Crabtree, J., Jones, A.L., Durkin, A.S. et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc. Natl Acad. Sci. USA, 102, 13950–13955.


operations is facilitated by genome, scaffold, gene and function ‘carts’ that handle lists of genomes, scaffolds, genes and functions, respectively. IMG analysis tools have been discussed in (24). Tools for identifying and correcting annotation anomalies, such as dubious protein product names, and for filling annotation gaps, such as genes that may have been missed by gene prediction tools or genes without predicted functions, are discussed in (26). IMG analysis tool extensions have addressed performance (30), data quality control, such as single cell data decontamination (31), and new data types, such as RNA-Seq and biosynthetic cluster data.

RNA-Seq studies can be accessed from the ‘Experiments Statistics’ section of the ‘IMG Statistics’ page or the ‘Organism Details’ pages of their associated genomes. For example, the ‘Expression Studies’ link in the ‘Organism Details’ page, such as that shown in Figure 3(i), leads to the list of associated RNA-Seq samples, such as the list shown in Figure 3(ii).

Figure 3. RNA-Seq data exploration. (i) The list of RNA-Seq studies associated with a genome can be accessed from its ‘Organism Details’, with each study associated with (ii) a list of RNA-Seq experiments (samples). Individual samples can be selected for further analysis, such as (iii) examining its expressed genes as a list or using the (iv) chromosome viewer. A sample can also be examined in the context of (v) pathways that have at least one enzyme associated with an expressed gene in the sample, whereby for each pathway (vi) enzymes are displayed with colors representing the level of expression for the associated genes; mousing over an enzyme shows the number of expressed genes associated with the enzyme.

RNA-Seq studies associated with a genome can be compared using pairwise or multiple sample analysis tools, as illustrated in Figure 4. After samples are selected for comparison (Figure 4(i)), pairs of samples can be compared in terms of up- or downregulation of genes, as illustrated in Figure 4(ii), with a threshold specified for the difference in gene expression. The difference in expression is computed using the logR = log2(query/reference) or the RelDiff = 2(query - reference)/(query + reference) metric. The comparison can first be previewed using a histogram, as illustrated in Figure 4(iii), which can help set the thresholds for the search of over-expressed or under-expressed genes between a pair of samples. The result of the comparison can be examined at the level of individual up- and downregulated genes, which can be selected for inclusion in the ‘Gene Cart’ for further analysis. Alternatively, the result of the comparison can be examined in terms of functions, as illustrated in Figure 4(iv), with genes associated with KEGG pathways or COG functions grouped together. Genes associated with a specific KEGG pathway can be examined in the context of the pathway, similar to the example shown in Figure 3(vi) earlier. The strength of the association between pairs of samples can be examined using ‘Spearman’s Rank Correlation’, as illustrated in Figure 4(v), whereas ‘Linear Regression’ analysis, illustrated in Figure 4(vi), helps determine whether two samples are technical replicates.
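The two difference metrics named above, logR and RelDiff, are simple to compute per gene. The sketch below is illustrative only and is not taken from IMG: the function names, the default threshold of 1.0, the toy expression values and the handling of zero expression are all assumptions made for this example.

```python
import math
from typing import Callable, Dict, Optional


def log_r(query: float, reference: float) -> Optional[float]:
    """logR = log2(query / reference); None when either value is not positive."""
    if query <= 0 or reference <= 0:
        return None  # assumption: the ratio is treated as undefined here
    return math.log2(query / reference)


def rel_diff(query: float, reference: float) -> Optional[float]:
    """RelDiff = 2 * (query - reference) / (query + reference)."""
    total = query + reference
    if total == 0:
        return None  # assumption: no expression in either sample
    return 2.0 * (query - reference) / total


def classify(query: float, reference: float, threshold: float = 1.0,
             metric: Callable[[float, float], Optional[float]] = log_r) -> str:
    """Label a gene as 'up', 'down', 'unchanged' or 'undefined' in the query sample."""
    value = metric(query, reference)
    if value is None:
        return "undefined"
    if value >= threshold:
        return "up"
    if value <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical normalized expression values for three genes in two samples.
reference = {"geneA": 10.0, "geneB": 40.0, "geneC": 20.0}
query = {"geneA": 40.0, "geneB": 10.0, "geneC": 22.0}

calls: Dict[str, str] = {g: classify(query[g], reference[g]) for g in reference}
print(calls)  # {'geneA': 'up', 'geneB': 'down', 'geneC': 'unchanged'}
```

RelDiff is bounded between -2 and 2 for non-negative inputs and stays defined when one of the two samples has zero expression, which is one reason to prefer it over logR for sparsely expressed genes.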

Multiple RNA-Seq sample analysis usually involves clustering based on the abundance of expressed genes, where the proximity of grouping indicates the relative degree of similarity of samples to each other. There is a choice of clustering methods, such as pairwise complete linkage and pairwise single linkage, and distance measures, such as Pearson correlation, Spearman’s rank correlation and Euclidean distance, as illustrated in Figure 4(vii). The result of clustering is displayed as a hierarchical tree of samples and a normalized heat map of coverage values for each gene for each sample. Clusters of multiple samples can also be examined in the context of pathways, as illustrated in Figure 4(viii), whereby enzymes are displayed with colors representing the cluster.
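To make the choice of linkage method and distance measure concrete, the following sketch clusters a few samples agglomeratively in plain Python. It is a generic illustration of the technique, not the code behind IMG's ‘Multiple Sample Analysis’: all function names and the toy coverage vectors are hypothetical, and the correlation-based distances are defined here as 1 minus the correlation coefficient, which is an assumption.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(x: Vector, y: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x: Vector, y: Vector) -> float:
    """1 - Pearson correlation (0 = perfectly correlated, 2 = anti-correlated)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # assumption: a flat profile is treated as uncorrelated
    return 1.0 - cov / (sx * sy)


def ranks(values: Vector) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman_distance(x: Vector, y: Vector) -> float:
    """1 - Spearman's rank correlation, i.e. Pearson distance on the ranks."""
    return pearson_distance(ranks(x), ranks(y))


def cluster(samples: Dict[str, Vector],
            distance: Callable[[Vector, Vector], float],
            linkage: str = "complete") -> List[Tuple[tuple, tuple, float]]:
    """Agglomerative clustering; returns the merge steps in order."""
    pick = max if linkage == "complete" else min  # "single" uses the minimum
    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = pick(distance(samples[p], samples[q]) for p in a for q in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Hypothetical per-gene coverage values for four samples.
samples = {
    "ctrl_1": [10.0, 20.0, 30.0, 40.0],
    "ctrl_2": [11.0, 19.0, 31.0, 39.0],
    "heat_1": [40.0, 30.0, 20.0, 10.0],
    "heat_2": [42.0, 29.0, 18.0, 11.0],
}
merges = cluster(samples, euclidean, linkage="complete")
for left, right, dist in merges:
    print(left, right, round(dist, 2))
```

The list of merges is the information a hierarchical tree is drawn from: the two control samples join first, the two heat samples second, and the two pairs are merged last.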

Figure 4. RNA-Seq data comparison. (i) RNA-Seq sample comparison starts with the selection of samples of interest. (ii) ‘Pairwise Sample Analysis’ supports comparing samples in terms of up/downregulated genes, with (iii) a histogram preview helping to set the thresholds for comparison. (iv) The result of the comparison can be examined in terms of functions, whereby genes associated with KEGG pathways or COG functions are grouped together. (v) The strength of the association of gene expression between pairs of samples can be examined using ‘Spearman’s Rank Correlation’. (vi) ‘Linear Regression’ analysis helps estimate whether two samples are technical replicates. (vii) ‘Multiple Sample Analysis’ consists of clustering samples based on the abundance of expressed genes using a variety of clustering methods. (viii) Clusters of samples can be examined in the context of pathways, whereby enzymes are displayed with colors representing the cluster.

FUTURE PLANS

IMG’s genome sequence data content is maintained through regular updates managed by the IMG submission system and involving new genomes sequenced at JGI, genomes sequenced at other organizations and submitted for inclusion into IMG by scientists worldwide, and genomes from GenBank. For genomes with multiple submissions, only the latest version is kept in IMG. IMG genome data are distributed through genome data portals available at http://genome.jgi.doe.gov. IMG’s data annotation and integration pipelines have been automated, thus improving their ability to keep pace with the rapidly increasing number of sequenced genomes.

IMG’s integrated data framework allows assessing and improving the quality of genome annotations. The quality of gene models for genomes available in public resources is known to vary greatly depending on the quality of sequence and the software used for annotation. An analysis conducted at JGI of the protein-coding genes of microbial genomes in GenBank indicates that 10% (>1 million) of predicted protein-coding genes are erroneous: they are false-positive genes, unidentified pseudogene fragments or genes with translational exceptions, or have incorrectly predicted start sites. To improve the consistency of annotation and the quality of predicted genes, all public microbial genomes in IMG will be re-annotated using IMG’s annotation pipeline.

A rapidly increasing number of single cell genomes are included into IMG. Typically, the first version of a single cell genome is analyzed for identifying contigs that may come from contaminant (e.g. Pseudomonas, Ralstonia) organisms. The sequence of analysis steps needed to identify and remove contaminated contigs is described in (31).



Page 7: IMG 4 version of the integrated microbial genomes ...cmore.soest.hawaii.edu/summercourse/2015/documents...Genome curation is usually carried out for identifying missing genes or for

automated, thus improving their ability to keep pace with the rapidly increasing number of sequenced genomes.

IMG's integrated data framework allows assessing and improving the quality of genome annotations. The quality of gene models for genomes available in public resources is known to vary greatly depending on the quality of sequence and the software used for annotation. An analysis conducted at JGI of the protein-coding genes of microbial genomes in GenBank indicates that 10% (>1 million) of predicted protein-coding genes are erroneous: they are false-positive genes, unidentified pseudogene fragments or genes with translational exceptions, or have incorrectly predicted start sites. To improve the consistency of annotation and the quality of predicted genes, all public microbial genomes in IMG will be re-annotated using IMG's annotation pipeline.

A rapidly increasing number of single cell genomes are included into IMG. Typically, the first version of a single cell genome is analyzed for identifying contigs that may come from contaminant (e.g. Pseudomonas, Ralstonia) organisms. The sequence of analysis steps needed to identify and remove contaminated contigs is described in (31).

The importance of functional genomics in validating gene function in an integrated comparative genomics context is also being underscored, pushing experimental data from methylomics and transposon mutagenesis experiments into IMG. Systematic paradigms for associating computationally predicted gene structural and functional information with experimental functional genomics are being constructed. Tools are being developed for mining and visualizing different types of Omics datasets in an integrated genomic context.

IMG's users are faced with the increasing burden of analyzing a rapidly growing number of genomic datasets. This analytical challenge can be alleviated by synthesizing genomic data using the 'pangenome' conceptual abstraction (32). A pangenome consists of the core part of a species (i.e. the genes present in all of the sequenced strains or in all samples of a microbial community) and the variable part (the genes present in some but not all of the strains or samples). An experimental version of IMG has been extended with five pangenomes as well as analysis tools and viewers that allow users to explore individual pangenomes and compare pangenomes and genomes. A public version of IMG containing pangenome data and analysis tools is expected to be released in the near future.
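The core/variable split described above can be illustrated with plain set operations. The following is a minimal sketch, not IMG's implementation: the function name `split_pangenome`, the strain labels and the gene names are hypothetical, and it assumes that genes have already been grouped into comparable families so that the same identifier denotes the same gene in every strain.

```python
from typing import Dict, Set, Tuple


def split_pangenome(strain_genes: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, variable) gene sets for a group of strains or samples.

    core     = genes present in every strain
    variable = genes present in at least one strain but not in all of them
    """
    if not strain_genes:
        return set(), set()
    gene_sets = list(strain_genes.values())
    pan = set().union(*gene_sets)         # every gene seen in any strain
    core = set.intersection(*gene_sets)   # genes shared by all strains
    variable = pan - core                 # the remainder of the pangenome
    return core, variable


# Hypothetical toy input: three strains with made-up gene content.
strains = {
    "strain_A": {"dnaA", "gyrB", "recA", "capK"},
    "strain_B": {"dnaA", "gyrB", "recA", "tetM"},
    "strain_C": {"dnaA", "gyrB", "recA", "capK", "tetM"},
}

core, variable = split_pangenome(strains)
print(sorted(core))      # ['dnaA', 'gyrB', 'recA']
print(sorted(variable))  # ['capK', 'tetM']
```

In practice the hard part is deciding when genes from different strains count as the same gene (for example by sequence clustering); the sketch takes that assignment as given.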

FUNDING

The Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, U.S. Department of Energy under Contract No. [DE-AC02-05CH11231]. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. [DE-AC02-05CH11231]. Funding for open access charge: University of California.

Conflict of interest statement. None declared.

REFERENCES

1. Pruitt KD, Tatusova T, Garth RB and Maglott DR (2012) NCBI Reference Sequences (RefSeq): current status, new features and genome annotation. Nucleic Acids Res. 40, D130–D135.

2. Pagani I, Liolios K, Jansson J, Chen IM, Smirnova T, Nosrat B, Markowitz VM and Kyrpides NC (2012) The Genomes on Line Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 40, D571–D579.

3. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrahi I, Lipman DJ, Ostell J and Sayers EW (2013) GenBank. Nucleic Acids Res. 41, D36–D42.

4. Bland C, Ramsey TL, Sabree F, Lowe M, Brown K, Kyrpides NC and Hugenholtz P (2007) CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8, 209.

5. Edgar RC (2007) PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics 8, 18.

6. Lowe TM and Eddy SR (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964.

7. Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T and Ussery DW (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108.

8. Eddy SR (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform. 23, 205–211.

9. Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR, Gardner PP and Bateman A (2012) Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 41, D226–D232.

10. Nawrocki EP, Kolbe DL and Eddy SR (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337.

11. Emanuelsson O, Brunak S, von Heijne G and Nielsen H (2007) Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc. 2, 953–971.

12. Moller S, Croning MDR and Apweiler R (2001) Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 17, 646–653.

13. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW and Hauser LJ (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119.

14. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41.

15. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J et al. (2012) The Pfam protein families database. Nucleic Acids Res. 40, D290–D301.

16. Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, Richter AR and White O (2007) TIGRFAMs and genome properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res. 35, D260–D264.

17. Kanehisa M, Goto S, Sato Y, Furumichi M and Tanabe M (2012) KEGG for integration and interpretation of large scale molecular data sets. Nucleic Acids Res. 40, D109–D114.

18. Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461.

19. Caspi R, Altman T, Dreher K, Fulcher AC, Subhraveti P, Keseler IM, Kothari A, Krummenacker M, Latendresse M, Mueller LA et al. (2012) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 40, D742–D753.

20. Ivanova NN, Anderson I, Lykidis A, Mavrommatis K, Mikhailova N, Chen IA, Szeto E, Palaniappan K, Markowitz VM and Kyrpides NC (2007) Metabolic Reconstruction of Microbial Genomes and Microbial Community Metagenomes. Technical Report 62292, Lawrence Berkeley National Laboratory. http://img.jgi.doe.gov/w/doc/imgterms.html


21. Saier MH Jr, Yen MR, Noto K, Tamang DG and Elkan C (2009) The transporter classification database: recent advances. Nucleic Acids Res. 37, D274–D278.

22. Enright AJ, Iliopoulos I, Kyrpides NC and Ouzounis CA (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86–90.

23. Mavromatis K, Chu K, Ivanova N, Hooper SD, Markowitz VM and Kyrpides NC (2009) Gene context analysis in the Integrated Microbial Genomes (IMG) data management system, accepted for publication. PLoS One 4, e7979.

24. Markowitz VM, Chen IA, Palaniappan K, Chu K, Szeto E, Grechkin Y, Ratner A, Biju J, Huang J, Williams P et al. (2012) IMG: the integrated microbial genomes database and comparative analysis system. Nucleic Acids Res. 40, D115–D122.

25. Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng JF, Darling A, Malfatti S, Swan BK, Gies EA et al. (2013) Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437.

26. Markowitz VM, Mavromatis K, Ivanova NN, Chen IA, Chu K and Kyrpides NC (2009) IMG ER: a system for microbial annotation expert review and curation. Bioinformatics 25, 2271–2278.

27. Langmead B and Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359.

28. Walsh CT and Fischbach MA (2010) Natural products version 2.0: connecting genes to molecules. J. Am. Chem. Soc. 132, 2469–2493.

29. Mavromatis K, Chu K, Ivanova N, Hooper SD, Markowitz VM and Kyrpides NC (2009) Gene context analysis in the Integrated Microbial Genomes (IMG) data management system. PLoS One 4, e7979.

30. Romosan A, Shoshani A, Wu K, Markowitz VM and Mavromatis K (2013) Accelerating gene context analysis using bitmaps. Proceedings of the 25th International Conference on Scientific and Statistical Database Management (SSDBM 2013).

31. Clingenpeel S (2012) JGI microbial single cell program single cell data decontamination. http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf

32. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc. Natl Acad. Sci. USA 102, 13950–13955.

