r-076 a data analysis and coordination center for …institute for genome sciences university of...

1
A Data Analysis and Coordination Center for the Human Microbiome Project Jennifer Wortman 1 , Michelle Giglio 1 , Amy Chen 2 , Victor Felix 1 , Konstantinos Mavrommatis 3 , Heather Creasy 1 , Todd DeSantis 2 , Clark Santee 2 , Yvette Piceno 2 , Shariff Osman 2 , Joshua Orvis 1 , Jonathan Crabtree 1 , Micah Hamady 4 , Justin Kuczynski 4 , Rob Knight 4 , Gary Andersen 2 , Nikos Kyrpides 3 , Victor Markowitz 2 , Owen White 1 1 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD; 2 Lawrence Berkeley National Laboratory, Berkeley, CA; 3 Joint Genome Institute, Walnut Creek, CA; 4 University of Colorado at Boulder, Boulder, CO Abstract DACC Partners DACC Web Resources Outreach Activities Analysis of Microbiome Data Tool Development The Human Microbiome Project (HMP) is an NIH Roadmap initiative that aims to collect and analyze unprecedented amounts of sequence information from microbial communities found in and on the human body. There is abundant and growing evidence that changes in microbial community composition are highly correlated with human health and disease. Efforts are underway to determine if such changes are the result of particular human diseases or perhaps a contributing cause. To gain insight into this question, the HMP has undertaken two main areas of effort: sequence 1000 reference genomes that live in or on the human body and sequence metagenomic samples from five different body sites collected in parallel from healthy subjects and those with disease. Initially, four large sequencing centers have begun the work of sequencing the 1000 reference strains. Subsequently, centers will be funded to carry out metagenomic sequencing from various sites with subjects suffering from various conditions. This project will generate unprecedented amounts of sequence data, annotation information, and metadata about subjects and strains. The analysis of this data requires the ability to collect, integrate, and standardize information of different types and from different sources. Responsibility for these activities falls on the HMP Data Analysis and Coordination Center (DACC). Successful data integration and standardization will rely on the use of controlled vocabularies, the application of quality control measures, and the development of standard operating procedures. The DACC will provide multiple analysis services to the research community including data query, comparative genomics, 16S rRNA analysis, and phylogenetic analysis. The DACC will also engage in extensive training and outreach. All information and analyses produced from the HMP will be available on a comprehensive web resource. The web presentation toolsets for the DACC will be based on those of the Integrated Microbial Genomes resource for both single genomes and metagenomes (IMG and IMG/m). The HMP DACC can be found at http://hmpdacc.org . Funded by: NIH Common Fund International Human Microbiome Consortium Nine countries from around the world have formed the International Human Microbiome Consortium (IHMC) to unravel the complexities of the microbial communities living within all humans. The DACC will interact with these other centers in several ways: coordinate reference strain selection share protocols and methods to insure consistency across the entire consortium collaborate on the establishment of standards to be applied by all members collect, integrate, and display reference genome, metagenomic, and 16S rRNA data from the international members Training The DACC and sequencing centers offer a host of workshops each year. Topics included are: single genome sequencing metagenome sequencing annotation pipelines manual curation metagenomic data analysis For more information, see: http://www.hmpdacc.org/ outreach.php Reference Strain Selection We encourage feedback from the scientific community on the selection of strains to include in the HMP reference collection. http://www.hmpdacc.org/feedback_form.php Most strains chosen as reference genomes for this project will be sequenced to "draft" level. However, about 15% of the reference strains will be taken closer to a "finished" or complete state. Criteria have been established to help guide the choice of which strains to advance in the finishing process. The HMP project is also interested in collaborating with researchers who have biological materials for bacterial strains isolated from human body sites. Any researcher who would like to contribute cells or DNA from a relevant strain should contact us. Length and base-call polymorphisms may be the source of observed diversity. One organism was chosen to more thoroughly investigate the source of the observed diversity. The 30 distinct flowgram clusters (representing 661 flowgrams) putatively generated from Clostridium beijerinckii were compared to database sequences based on common k-mers. The comparison revealed that de-noised The DACC quantifies the noise introduced by the 454 16S sequencing method -- unexpected diversity seen from simple communities. Genomic DNA from 22 species (19 genera) were combined in equal molar quantities, and the V1 and V2 regions of the 16S rRNA gene amplified by PCR. Amplicons were sequenced using the 454-FLX platform. The resulting records 1 10 100 1000 10000 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 FastUnifrac Unweighted FastUniFrac Weighted UniFrac Unweighted UniFrac Weighted The HMP Project Catalog maintains an interactive list of the targeted reference strains along with sequencing status and links to public databases. The IMG/HMP web resource supports the comparative analysis of completed reference genomes, and includes extensive functional annotation and pathway information were pre-clustered on base-call similarities. Each pre-cluster was then clustered based on flowgram similarities and de-noised (Quince, 2009, in review) resulting in 393 distinct sequences. An NJ tree was cast by Clearcut (Sheneman, 2006) then visualized with ITOL (Letunic, 2007). Branches are colored by closest- matching genus (of the 19) as identified by Greengenes (DeSantis, 2006) and red bars indicate the relative abundance of each sequence type. Notice that reads from the same organism do not form distinct clades. records diverged from Clostridium beijerinckii references up to 11%. Variation was also observed in lengths of de-noised records (0.2 to 0.4 kb) but did not correspond to divergence. The DACC is collaborating with Dr. Quince to determine improved parameters for removing noisy base calls from the 454 data. Performance of Fast UniFrac (red lines) versus original implementation on sample sizes ranging from 1000 to 10,000 sequences. Note log scale on y axis: Fast UniFrac implementation is consistently about 2 orders of magnitude faster, and largely eliminates the difference in time to calculate weighted and unweighted UniFrac metrics. Unifrac is a statistical tool for metagenomic community comparisons (Lozupone, 2006) The DACC will also contribute to the development and improvement of computational tools to facilitate human microbiome data analysis. Reference Genomes Approximately 600 microbes will be sequenced during the HMP. Combined with other existing and currently planned efforts, the total reference collection should reach 1000 genomes. These sequences will provide a benchmark against which metagenomic sequence data can be compared. 16S rRNA Gene Sequencing 16S rRNA gene sequencing will be used to characterize the complexity of microbial communities at individual body sites, and to determine whether there is a core microbiome. Several body sites will be studied, including the gastrointestinal tract, oral cavity, nasopharyngeal tract, female urogenital tract, and skin. R-076 Michelle Giglio Institute for Genome Sciences University of Maryland School of Medicine 801 W. Baltimore St. Baltimore, MD 21201 phone: 410-706-7694 fax: 410-706-6756 [email protected] Please visit booth 1339 to learn more about this and other Institute for Genome Sciences projects and services

Upload: others

Post on 15-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: R-076 A Data Analysis and Coordination Center for …Institute for Genome Sciences University of Maryland School of Medicine 801 W. Baltimore St. Baltimore, MD 21201 phone: 410- 706-7694

A Data Analysis and Coordination Center for the Human Microbiome ProjectJennifer Wortman1, Michelle Giglio1, Amy Chen2, Victor Felix1, Konstantinos Mavrommatis3, Heather Creasy1, Todd DeSantis2, Clark Santee2, Yvette Piceno2, Shariff

Osman2, Joshua Orvis1, Jonathan Crabtree1, Micah Hamady4, Justin Kuczynski4, Rob Knight4, Gary Andersen2, Nikos Kyrpides3, Victor Markowitz2, Owen White1

1Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD; 2Lawrence Berkeley National Laboratory, Berkeley, CA; 3Joint Genome Institute, Walnut Creek, CA; 4University of Colorado at Boulder, Boulder, CO

Abstract

DACC Partners

DACC Web Resources

Outreach Activities

Analysis of Microbiome Data

Tool Development

The Human Microbiome Project (HMP) is an NIH Roadmapinitiative that aims to collect and analyze unprecedentedamounts of sequence information from microbial communitiesfound in and on the human body. There is abundant andgrowing evidence that changes in microbial communitycomposition are highly correlated with human health anddisease. Efforts are underway to determine if such changes arethe result of particular human diseases or perhaps a contributingcause.

To gain insight into this question, the HMP has undertaken twomain areas of effort: sequence 1000 reference genomes thatlive in or on the human body and sequence metagenomicsamples from five different body sites collected in parallel fromhealthy subjects and those with disease. Initially, four largesequencing centers have begun the work of sequencing the1000 reference strains.

Subsequently, centers will be funded to carry out metagenomicsequencing from various sites with subjects suffering fromvarious conditions. This project will generate unprecedentedamounts of sequence data, annotation information, andmetadata about subjects and strains. The analysis of this datarequires the ability to collect, integrate, and standardizeinformation of different types and from different sources.Responsibility for these activities falls on the HMP Data Analysisand Coordination Center (DACC). Successful data integrationand standardization will rely on the use of controlledvocabularies, the application of quality control measures, and thedevelopment of standard operating procedures. The DACC willprovide multiple analysis services to the research communityincluding data query, comparative genomics, 16S rRNAanalysis, and phylogenetic analysis. The DACC will also engagein extensive training and outreach.

All information and analyses produced from the HMP will beavailable on a comprehensive web resource. The webpresentation toolsets for the DACC will be based on those of theIntegrated Microbial Genomes resource for both single genomesand metagenomes (IMG and IMG/m). The HMP DACC can befound at http://hmpdacc.org.

Funded by: NIH CommonFund

International Human Microbiome Consortium

Nine countries from around the world have formed the International Human Microbiome Consortium (IHMC) to unravel the complexities of the microbial communities living within all humans. The DACC will interact with these other centers in several ways:

coordinate reference strain selection

share protocols and methods to insure consistency across the entire consortium

collaborate on the establishment of standards to be applied by all members

collect, integrate, and display reference genome, metagenomic, and 16S rRNA data from the international members

Training

The DACC and sequencing centers offer a host of workshops each year. Topics included are:

single genome sequencing

metagenome sequencing

annotation pipelines

manual curation

metagenomic data analysis

For more information, see:

http://www.hmpdacc.org/outreach.php

Reference Strain Selection

We encourage feedback from the scientificcommunity on the selection of strains to include inthe HMP reference collection.

http://www.hmpdacc.org/feedback_form.php

Most strains chosen as reference genomes for thisproject will be sequenced to "draft" level.However, about 15% of the reference strains willbe taken closer to a "finished" or complete state.Criteria have been established to help guide thechoice of which strains to advance in the finishingprocess.

The HMP project is also interested in collaboratingwith researchers who have biological materials forbacterial strains isolated from human body sites.Any researcher who would like to contribute cells orDNA from a relevant strain should contact us.

Length and base-call polymorphismsmay be the source of observeddiversity. One organism was chosen tomore thoroughly investigate the sourceof the observed diversity. The 30distinct flowgram clusters (representing661 flowgrams) putatively generatedfrom Clostridium beijerinckii werecompared to database sequencesbased on common k-mers. Thecomparison revealed that de-noised

The DACC quantifies the noise introduced by the 454 16S sequencing method -- unexpected diversity seen from simple communities.

Genomic DNA from 22 species (19 genera) were combined in equal molarquantities, and the V1 and V2 regions of the 16S rRNA gene amplified by PCR.Amplicons were sequenced using the 454-FLX platform. The resulting records

1

10

100

1000

10000

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

FastUnifrac Unweighted FastUniFrac Weighted

UniFrac Unweighted UniFrac Weighted

The HMP Project Catalog maintains an interactive list of the targeted reference strains along with sequencing status and links to public databases.

The IMG/HMP web resource supports the comparative analysis of completed reference genomes, and includes extensive functional annotation and pathway information

were pre-clustered on base-callsimilarities. Each pre-cluster wasthen clustered based on flowgramsimilarities and de-noised(Quince, 2009, in review) resultingin 393 distinct sequences. An NJtree was cast by Clearcut(Sheneman, 2006) then visualizedwith ITOL (Letunic, 2007).Branches are colored by closest-matching genus (of the 19) asidentified by Greengenes(DeSantis, 2006) and red barsindicate the relative abundance ofeach sequence type. Notice thatreads from the same organism donot form distinct clades.

records diverged from Clostridium beijerinckii references up to 11%. Variationwas also observed in lengths of de-noised records (0.2 to 0.4 kb) but did notcorrespond to divergence. The DACC is collaborating with Dr. Quince todetermine improved parameters for removing noisy base calls from the 454 data.

Performance of Fast UniFrac (red lines) versus original implementation onsample sizes ranging from 1000 to 10,000 sequences. Note log scale on y axis:Fast UniFrac implementation is consistently about 2 orders of magnitudefaster, and largely eliminates the difference in time to calculate weighted andunweighted UniFrac metrics.

Unifrac is a statistical tool for metagenomic community comparisons (Lozupone, 2006)

The DACC will also contribute to the development and improvement of computational tools to facilitate human microbiome data analysis.

Reference GenomesApproximately 600 microbes will be sequencedduring the HMP. Combined with other existingand currently planned efforts, the totalreference collection should reach 1000genomes. These sequences will provide abenchmark against which metagenomicsequence data can be compared.

16S rRNA Gene Sequencing16S rRNA gene sequencing will be used tocharacterize the complexity of microbialcommunities at individual body sites, and todetermine whether there is a core microbiome.Several body sites will be studied, including thegastrointestinal tract, oral cavity, nasopharyngealtract, female urogenital tract, and skin.

R-076 Michelle GiglioInstitute for Genome Sciences

University of Maryland School of Medicine

801 W. Baltimore St.Baltimore, MD 21201phone: 410-706-7694fax: [email protected]

Please visit booth 1339 to learn more about this and other Institute for Genome Sciences projects and services