Public data resources for metagenomics
Alex [email protected]
My background
Doctorate in pharmacology (1995-1998)
Post-doc in molecular biology (1998-2001)
Bioinformatics research (2001-2011)
Co-ordinator for InterPro and EBI metagenomics databases (2011-)
Overview
• Considerations for the analysis of metagenomic sequence data
• What public metagenomic analysis resources offer
• The EBI metagenomics resource
What is metagenomics?
“Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples.”
“Metagenomics is the study of all genomes present in any given environment without the need for prior individual identification or amplification”
“Metagenome” used by Handelsman et al., in 1998 to describe “collective genomes of soil microflora”
“Metagenomics” means literally ‘beyond genomics’
Sequencing
Filtering step
Extraction of DNA
Sampling from environment
Quality control
Taxonomic analysis
Functional analysis
16S rRNA, 18S rRNA, ITS, etc.
Identification and characterisation of protein coding sequences
Applications of taxonomic analyses
Diversity analysis
Identification of new species
Comparing populations from different sites or states
Applications of functional analyses
Bioprospecting for novel sequences with functional applications
Reconstruction of pathways present in the community
Comparing functional activities from different sites or states
• Short sequence fragments are hard to characterise
• Assembly can lead to chimeras
• Iddo Friedberg: ‘Metagenomics is like a disaster in a jigsaw shop’
• Millions of different pieces
• Thousands of different puzzles
• All mixed together
• Most of the pieces are missing
• No boxes to refer to
Why is metagenomics challenging?
Limitations and pitfalls
Data used for analysis can have limitations:
• 16S rRNA genes - limited resolving power and subject to copy number variation
• Viral sequences – currently no gold-standard reference database
• Protist sequences – little experimentally-derived annotation of protein function in public databases
Additional pitfalls
• Different functional and taxonomic analysis tools can give different results
• The same tools can give different results depending on the version and underlying algorithm (e.g., HMMER2 vs HMMER3)
• The same version of the same tools can give different results depending on the reference database used
Reference databases
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts.
Other considerations: data analysis speed
• The cost of sequencing has really gone down
• Now I can do metagenomics!
• Awesome!
• Amount of sequence generated has increased 5,000-fold
• Computational speed has increased only 10-fold
• Time taken to analyse has increased 500-fold
• $@%*!!!
Data analysis speed
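The 500-fold figure follows from the two growth rates quoted on the slide. A minimal sketch of that arithmetic, assuming analysis time scales linearly with data volume and inversely with compute speed (the variable names are illustrative):

```python
# Fold increases quoted on the slide.
sequence_output_fold = 5_000  # growth in amount of sequence generated
compute_speed_fold = 10       # growth in computational speed

# Assumption: analysis time grows with data volume and shrinks with
# compute speed, so the net change is the ratio of the two.
analysis_time_fold = sequence_output_fold / compute_speed_fold

print(f"Time taken to analyse increases {analysis_time_fold:.0f}-fold")
```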
[Figure: percentage breakdown of sequencing project costs by stage, compared between sequencing yields of ~80 bp/$ and ~2m bp/$]
Sboner et al. Genome Biology (2011) 12:125
Data analysis cost
Raw sequence data:
• Important for metagenomics as some samples are hard to replicate
• Large file sizes
Analysis results?
• Easiest to repeat, although it takes time & requires keeping track of analysis steps and versions
Data description including metadata
• Essential: what, where, who, how and when
• If absent, raw data have very limited usefulness
What data to store?
Metadata includes the in-depth, controlled description of the sample that your sequence was taken from
The importance of metadata
Where did it come from? What were the environmental conditions (lat/long, depth, pH, salinity, temperature…) or clinical observations?
How was it sampled? How was it extracted? How was it stored? What sequencing platform was used?
• If metadata is adequately described, using a standardised vocabulary, querying and interpretation across projects becomes possible
The importance of metadata
Show the microbial species found in the North Pacific
… at depths of 50 – 100 m
… in samples taken May-June
… compared to the Indian Ocean, under the same conditions
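The query above only works when every sample carries the same structured metadata fields. A minimal sketch of such a cross-project query, using hypothetical sample records and field names (none of these come from a real repository or API):

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    sample_id: str
    ocean: str          # controlled vocabulary term for the location
    depth_m: float      # sampling depth in metres
    collected: date     # collection date
    species: frozenset  # species reported for this sample


# Hypothetical records, for illustration only.
SAMPLES = [
    Sample("S1", "North Pacific", 75.0, date(2012, 5, 20),
           frozenset({"species A", "species B"})),
    Sample("S2", "North Pacific", 300.0, date(2012, 6, 2),
           frozenset({"species C"})),
    Sample("S3", "Indian Ocean", 60.0, date(2012, 6, 11),
           frozenset({"species B", "species D"})),
    Sample("S4", "North Pacific", 90.0, date(2012, 11, 3),
           frozenset({"species E"})),
]


def species_found(samples, ocean, min_depth_m, max_depth_m, months):
    """Collect species from samples matching location, depth and month."""
    found = set()
    for s in samples:
        if (s.ocean == ocean
                and min_depth_m <= s.depth_m <= max_depth_m
                and s.collected.month in months):
            found |= s.species
    return found


# North Pacific, 50-100 m, May-June, compared to the Indian Ocean.
north_pacific = species_found(SAMPLES, "North Pacific", 50, 100, {5, 6})
indian_ocean = species_found(SAMPLES, "Indian Ocean", 50, 100, {5, 6})
shared = north_pacific & indian_ocean
```

The comparison is only meaningful because both sites are filtered with identical criteria, which is what a standardised vocabulary makes possible.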
Where are you going to store this?
• Locally: back-up? long term? sharing? access?
• Amazon, Google or specialist research clouds
• Public repositories, such as ENA, NCBI or DDBJ
Considerations: storing data
• Free!
• Secure long term storage
• No need for local infrastructure
• Enforced compliance:
  • Publisher requirements (accession numbers)
  • Institutional requirements
  • Funder requirements
• Data are more useful:
  • Data are reusable and can be discovered by others
  • Available for re- and meta-analyses
Public repositories
• Transferring a 100 Gb NGS data file across the internet
  • 'Normal' network bandwidth (1 Gigabit/s) ~ 1 week*
  • High-speed bandwidth (10 Gigabit/s) < 1 day*
Considerations: moving data
* Stein, Genome Biol. (2010) 11:207
Traditional methods may be the most effective!
Metagenomics portals
http://www.ebi.ac.uk/metagenomics
http://metagenomics.anl.gov/
http://camera.calit2.net/
http://img.jgi.doe.gov/
Submit data
Sequence analysis (prebuilt workflows)
Quality filtering of sequences
Visualisation/Interpretation
What do metagenomics portals offer?
Sequence archiving
Tools to help capture & store metadata
Tools to help transfer data
Data archiving
Powerful analysis
Easy submission
A free resource for the analysis, archiving & browsing of metagenomic study data
http://www.ebi.ac.uk/metagenomics
(1) Register for an account
(2) Upload sequence data and metadata
(3) Sequence data is archived in ENA and accessioned
(4) Sequence data is analysed by the pipeline
(5) Projects, metadata and results are made available on the website for private or public browsing / download
The submission & analysis process
~ 1-2 weeks, depending on study size, compute farm usage, etc
The submission process can be run interactively
The GSC (Genomic Standards Consortium) has created minimum standards for metagenomics metadata
Metagenomics standards
Metadata is captured via GSC-compliant checklist
GSC MIxS
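raw reads → QC → processed reads (discarded reads removed)
Amplicon-based data → Qiime → Taxonomic analysis
processed reads → rRNASelector → reads with rRNA → Qiime → Taxonomic analysis
processed reads → rRNASelector → reads without rRNA → FragGeneScan → predicted CDS (pCDS) → InterProScan → Function assignment / Unknown function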
The sequence analysis pipeline
EBI Metagenomics: QC step by step
• Clipping - low quality ends trimmed and adapter sequences removed
• Quality filtering - sequences with > 10% undetermined nucleotides removed
• Read length filtering - short sequences (< 100 nt) are removed
• Duplicate sequence removal - clustered on 99% identity (UCLUST v 1.1.579) and representative sequence chosen
• Repeat masking - RepeatMasker (open-3.2.2), removes reads with 50% or more nucleotides masked (low complexity regions)
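Two of the QC steps listed above are simple per-read thresholds. A minimal sketch of the quality and length filters, assuming reads are plain nucleotide strings; the function names are hypothetical and this is not the pipeline's own code (clipping, UCLUST deduplication and RepeatMasker are omitted):

```python
def passes_quality_filter(seq, max_n_fraction=0.10):
    """Keep reads with at most 10% undetermined nucleotides (N)."""
    if not seq:
        return False
    return seq.upper().count("N") / len(seq) <= max_n_fraction


def passes_length_filter(seq, min_length=100):
    """Keep reads of at least 100 nt; shorter reads are removed."""
    return len(seq) >= min_length


def filter_reads(reads):
    """Split reads into (processed, discarded) lists."""
    processed, discarded = [], []
    for seq in reads:
        if passes_quality_filter(seq) and passes_length_filter(seq):
            processed.append(seq)
        else:
            discarded.append(seq)
    return processed, discarded


reads = [
    "ACGT" * 30,              # 120 nt, no N: kept
    "ACGT" * 10,              # 40 nt: too short
    "N" * 20 + "ACGT" * 25,   # 120 nt, ~17% N: too many undetermined
    "N" * 10 + "A" * 90,      # 100 nt, exactly 10% N: kept
]
processed, discarded = filter_reads(reads)
```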
EBI Metagenomics: QC consequences
Roche 454
Illumina
Ion Torrent
EBI Metagenomics: taxonomic analysis
rRNASelector
reads with rRNA
Amplicon-based data
processed reads
Qiime
Taxonomic analysis
Taxonomic analysis with EBI Metagenomics
EBI Metagenomics currently only provides taxonomy analysis for Prokaryotes.
rRNA sequences are identified using rRNASelector:
hidden Markov models to identify rRNA sequences
60 bp minimum overlap with well-curated HMM model
E-value < 10^-5
Annotations are associated using Qiime:
rRNA are annotated using the Greengenes reference database
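The two thresholds quoted for rRNA identification can be expressed as a simple filter over HMM hits. A minimal sketch, assuming each hit has already been parsed into a record; the `HmmHit` type and its fields are hypothetical and do not reflect rRNASelector's actual output format:

```python
from typing import NamedTuple


class HmmHit(NamedTuple):
    read_id: str
    overlap_bp: int   # length of the overlap with the rRNA model
    e_value: float    # E-value reported for the match


MIN_OVERLAP_BP = 60   # minimum overlap quoted on the slide
MAX_E_VALUE = 1e-5    # hits must have an E-value below this


def is_rrna_hit(hit):
    """Apply both thresholds: sufficient overlap and a low E-value."""
    return hit.overlap_bp >= MIN_OVERLAP_BP and hit.e_value < MAX_E_VALUE


# Hypothetical hits, for illustration only.
hits = [
    HmmHit("read1", 120, 1e-20),  # passes both thresholds
    HmmHit("read2", 45, 1e-20),   # overlap too short
    HmmHit("read3", 120, 1e-3),   # E-value too high
]
reads_with_rrna = [h.read_id for h in hits if is_rrna_hit(h)]
```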
EBI Metagenomics taxonomy visualizations
Re-analysis of: Sutton et al. (2013), Impact of Long-Term Diesel Contamination on Soil Microbial Community Structure.
Validation of taxonomic analysis
Alpha diversity analysis
polluted
clean
clean (outlier)
EBI Metagenomics: overview of functional analysis
reads without rRNA
FragGeneScan
predicted CDS
InterProScan
Function assignment
Unknown function
pCDS
EBI Metagenomics: functional annotation
EBI Metagenomics uses FragGeneScan to predict CDSs directly from the reads:
hidden Markov models to correct frame-shift using codon usage
probabilistic identification of start and stop codons
60 bp minimum ORF
Annotation is carried out using InterProScan to mine a subset of the InterPro database
Why not BLAST?
• BLAST: Basic Local Alignment Search Tool
• Relatively fast
• User friendly
• Very good at recognising similarity between closely related sequences
Using BLAST for annotation
Because BLAST performs local pairwise alignment, it:
• can sometimes struggle with multi-domain proteins
• is less useful for weakly-similar sequences (e.g., divergent homologues)
Using BLAST for annotation
BLAST alignment of 2 proteins: 60S acidic ribosomal protein P0 from 2 closely-related species
Using BLAST for annotation
60S acidic ribosomal protein P0: multiple sequence alignment
An alternative approach
• Alternatively, we can model the pattern of conserved amino acids at specific positions within a multiple sequence alignment
• We can use these models to infer relationships with the characterised sequences from which the alignment was constructed
• This is the approach taken by protein signature databases
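The idea of modelling conserved positions can be shown with a much simpler stand-in than a profile HMM. A minimal sketch, assuming an ungapped toy alignment (hypothetical sequences): it builds per-column residue frequencies with pseudocounts and scores a query by summed log-probability. Real signature methods handle gaps, insertions and background frequencies, which this does not.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column residue probabilities from an ungapped alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def score(profile, seq):
    """Sum of log-probabilities of each residue at its position."""
    if len(seq) != len(profile):
        raise ValueError("query must match the profile length")
    return sum(math.log(column[aa]) for column, aa in zip(profile, seq))


# Hypothetical toy alignment: position 1 is invariant, others vary.
ALIGNMENT = ["GKSTL", "GKTTL", "GRSTL", "GKSAL"]
profile = build_profile(ALIGNMENT)

consensus_score = score(profile, "GKSTL")
variant_score = score(profile, "GRSTL")
unrelated_score = score(profile, "WWWWW")
```

A sequence matching the conserved positions scores higher than one that does not, even without a close pairwise match to any single aligned sequence.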
Three different protein signature approaches
• Single motif methods: Patterns
• Multiple motif methods: Fingerprints
• Full alignment methods: Profiles & hidden Markov models (HMMs)
* For a detailed description, see: https://www.ebi.ac.uk/training/online/course/introduction-protein-classification-ebi
Structural domains
Functional annotation of families/domains
Protein features (sites)
Hidden Markov models
Fingerprints
Profiles
Patterns
HAMAP
The aim of InterPro
InterPro
Features of InterPro
• Manually checked and updated against a manually annotated database
• Errors are identified and fixed
• Annotated with full text abstracts and Gene Ontology terms
… with a brief diversion into the Gene Ontology…
http://geneontology.org/
Aims of the Gene Ontology
• Allow cross-species and/or cross-database comparisons
• Unify the representation of gene and gene product attributes across species
English is not a very precise language
• Same name for different concepts
• Different names for the same concept
Inconsistency in naming of biological concepts
An example …
Tactition / Tactile sense / Taction → Sensory perception of touch; GO:0050975
• A way to capture biological knowledge in a written and computable form
The Gene Ontology
• A set of concepts and their relationships to each other, arranged as a hierarchy
www.ebi.ac.uk/QuickGO
Less specific concepts
More specific concepts
The Concepts in GO
1. Molecular Function: an elemental activity or task or job
• protein kinase activity
• insulin receptor activity
2. Biological Process: a commonly recognised series of events
• cell division
3. Cellular Component: where a gene product is located
• mitochondrion
• mitochondrial matrix
• mitochondrial inner membrane
Anatomy of a GO term
Unique identifier
Term name
Definition
Synonyms
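The four parts of a GO term map naturally onto a small record type. A minimal sketch using the touch example from the earlier slide; the `GOTerm` class and `find_term` helper are hypothetical, and the definition text is deliberately left as a placeholder rather than quoted from GO:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOTerm:
    identifier: str        # unique identifier
    name: str              # term name
    definition: str        # definition
    synonyms: tuple = ()   # alternative names for the same concept


TOUCH = GOTerm(
    identifier="GO:0050975",
    name="sensory perception of touch",
    definition="(definition text not reproduced here)",
    synonyms=("tactition", "tactile sense", "taction"),
)


def find_term(label, terms):
    """Resolve a free-text label to a term via its name or synonyms."""
    wanted = label.strip().lower()
    for term in terms:
        names = {term.name.lower(), *(s.lower() for s in term.synonyms)}
        if wanted in names:
            return term
    return None


match = find_term("Taction", [TOUCH])
```

Because every synonym resolves to one identifier, different names for the same concept end up counted together.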
InterPro2GO
InterPro
We now return to your scheduled programming...
Using InterPro for annotation
• Underlies the automated system that adds annotation to UniProtKB/TrEMBL
• Provides matches to 67 million proteins - over 80% of UniProtKB
• Source of ~170 million GO mappings for ~50 million distinct UniProtKB sequences
Annotation consistency:
• Using InterPro and GO for annotation allows direct comparison with all of the proteins in UniProtKB
Analysing metagenomic sequences with InterPro
Considerations for metagenome analysis:
• Vast numbers of short reads
  • analysis speed
  • ability to cope with sequence fragments
• Making sense of output
  • visualisation on web site
  • downstream analysis and sample comparison
Structural domains
Functional annotation of families/domains
Protein features (sites)
Hidden Markov models
Fingerprints
Patterns
Databases
Assembly of metagenomics data
• Metagenomics: it is not clear how to avoid assembling sequences from different species together (chimaeras)
EBI Metagenomics does not perform assembly
We are still able to annotate metagenome data, as shown by this re-analysis of the rumen metagenomics data of Hess et al. (2011)
Visualising data: InterProScan results
Visualising data: GO Slims
• GO slims are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO
• Give a broad overview of the ontology content without the detail of the specific fine-grained terms
GO Slims
Visualising data: GO slims
• For visualisation, EMG uses a GO slim specially developed for metagenomic data sets
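Slimming amounts to rolling each fine-grained term up to its nearest ancestor in the chosen subset. A minimal sketch, assuming a toy single-parent hierarchy built from the cellular component examples given earlier; real GO terms can have several parents and relationship types, and the slim used here is hypothetical, not the one EMG uses:

```python
from collections import Counter

# Toy child -> parent links (simplified; real GO is a graph, not a tree).
PARENT = {
    "mitochondrial matrix": "mitochondrion",
    "mitochondrial inner membrane": "mitochondrion",
    "mitochondrion": "cellular component",
}

# Hypothetical slim: the subset of broad terms kept for display.
SLIM_TERMS = {"mitochondrion"}


def map_to_slim(term):
    """Walk up the hierarchy until a slim term is reached, if any."""
    current = term
    while current is not None:
        if current in SLIM_TERMS:
            return current
        current = PARENT.get(current)
    return None


def slim_counts(annotation_counts):
    """Aggregate per-term annotation counts onto slim terms."""
    totals = Counter()
    for term, count in annotation_counts.items():
        slim = map_to_slim(term)
        if slim is not None:
            totals[slim] += count
    return totals


# Hypothetical annotation counts for one sample.
counts = slim_counts({
    "mitochondrial matrix": 3,
    "mitochondrial inner membrane": 2,
    "mitochondrion": 1,
})
```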
EBI Metagenomics output files
sequence files
tab or comma separated files
TreeView, TOL, Newick Viewer …
Megan …
sequence files
Simplified overview of MG-RAST pipeline
Reads
Quality control
Feature prediction (FragGeneScan)
Clustering (Uclust)
Blat against protein databases
Abundance profiles
Metabolic reconstruction
Metabolic model
rRNAs
Blat against RNA database (SILVA)
Community profiles
http://metagenomics.anl.gov/
MG-RAST & EBI Metagenomics Functional analysis
Example: Analysis of Prairie Soil Sample
ammonia monooxygenase: NH3 + A-H2 + O2 → NH2OH + A + H2O

MG-RAST: 92 hits to 8 different databases
Hits per annotation:
  12  Ammonia monooxygenase
   2  ammonia monooxygenase family protein
   4  Ammonia monooxygenase subunit A
   5  Ammonia monooxygenase, putative
  62  Putative ammonia monooxygenase
   3  putative ammonia monooxygenase protein
   4  putative ammonia monooxygenase subunit A
Hits per database:
   8  KEGG
  18  eggNOG
  13  GenBank
  11  IMG
   8  PATRIC
  10  RefSeq
  12  TrEMBL
   9  SEED
Annotations of the 13 GenBank hits:
   1  ammonia monooxygenase family protein
   2  ammonia monooxygenase subunit A
   1  ammonia monooxygenase, putative
   6  putative ammonia monooxygenase
   2  Putative ammonia monooxygenase
   1  putative ammonia monooxygenase subunit A

EBI Metagenomics:
   3  IPR003393  Ammonia monooxygenase/particulate methane monooxygenase, subunit A
  25  IPR007820  Putative ammonia monooxygenase/protein AbrB
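The MG-RAST hit counts above show one enzyme spread across several free-text labels. A minimal sketch of how such labels could be collapsed before counting, using the seven label counts from the slide; the normalisation rule (lower-casing, dropping commas and the word "putative") is an illustrative assumption, not something either resource does:

```python
from collections import Counter

# (hit count, free-text annotation) pairs as quoted for MG-RAST.
MG_RAST_HITS = [
    (12, "Ammonia monooxygenase"),
    (2, "ammonia monooxygenase family protein"),
    (4, "Ammonia monooxygenase subunit A"),
    (5, "Ammonia monooxygenase, putative"),
    (62, "Putative ammonia monooxygenase"),
    (3, "putative ammonia monooxygenase protein"),
    (4, "putative ammonia monooxygenase subunit A"),
]


def normalise(label):
    """Lower-case, drop commas and the qualifier 'putative'."""
    words = label.lower().replace(",", " ").split()
    return " ".join(word for word in words if word != "putative")


grouped = Counter()
for count, label in MG_RAST_HITS:
    grouped[normalise(label)] += count

total_hits = sum(grouped.values())
```

Seven labels collapse to four under this rule; annotating against a stable accession such as an InterPro entry sidesteps the string matching altogether.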
MG-RAST & EBI Metagenomics Taxonomy analysis
MG-RAST
EBI Metagenomics: only Prokaryotic taxonomy (333 OTU)
Bacteria
Archaebacteria
Eukaryotes
Others (including virus)
(55 categories)
(15 categories)
(98 categories)
(3 types)
Example: Analysis of Prairie Soil Sample
domain level of taxonomy
Example: Analysis of Prairie Soil Sample
Phylum level of bacteria domain taxonomy
MG-RAST: 28 categories
EBI Metagenomics: 13 OTU
MG-RAST & EBI Metagenomics Taxonomy analysis
IMG/M
http://img.jgi.doe.gov/m
Some other metagenomics packages and tools
http://www.computationalbioenergy.org/software.html
http://ab.inf.uni-tuebingen.de/software/megan/
http://cbcb.umd.edu/software/metAMOS
CloVR metagenomics
http://clovr.org/methods/clovr-metagenomics/
Hands-on session
• Using InterProScan to analyse a single metagenomic sequence
• Exploring EMG Portal’s analysis of a metagenomic data set
• Comparing analysis results for samples within a project using STAMP
Questions?