Tools for metagenomics with 16S/ITS and whole genome shotgun sequences
Post on 10-May-2015
Computational Tools for Metagenomics
Surya Saha Twitter: @SahaSurya / LinkedIn: www.linkedin.com/in/suryasaha/
Magdalen Lindeberg Plant Pathology & Plant-Microbe Biology
Microbial Friends & Foes, Sep 25, 2012
Temperton, Current Opinion in Microbiology, 2012
Impact of Technology on Metagenomics
Types of “Meta” genomics
16S rRNA survey of bacterial microbiome
ITS survey of fungal microbiome
Bellemain, BMC Microbiology 2010 Slide: Julien Tremblay, JGI
Types of “Meta” genomics
Whole genome shotgun
• Varying complexity of microbial communities
• High coverage sequencing
• Sophisticated informatics
• Host associated metagenomes
– Deep sequencing of host meta-genome
– Bioinformatic screening of host sequences
• Environmental metagenomes
– E.g. soil samples
– Requires very high depth of coverage
– Complicated to assemble
Big picture!!
What users see
What users want!!
16S/ITS community surveys
• Multiple target regions in 16S gene and ITS region
• Comparison of results requires amplification of the same region
• Advantages
– Fast survey of large communities
– Mature set of tools and statistics for analysis
– Good for first round survey
• 454 16S tags or pyrotags (~700 bp) have been the preferred method
• Illumina MiSeq (2x150 bp, 2x250 bp) is the next workhorse
• Depth of sampling
– 2,000-6,000 reads/sample for simple communities
– 20,000 reads/sample for complex soil metagenomes
16S/ITS issues
• Lack of tools for processing ITS/fungal microbiome data sets
– RDP classifier does not target ITS
– No ITS reconstruction tools
• Amplification bias affects accuracy and replication
• Use of short reads prevents disambiguation of similar strains
• 16S or ITS may not differentiate between similar strains
– Clustering is done at 97%
– Regions may be >99% similar
• Sequencing error inflates number of OTUs
• Chloroplast 16S sequences can get amplified in plant metagenomes
16S/ITS sequence processing workflow
Filter for contaminants and low quality reads
Assemble overlapping reads
Reduce datasets (clustering)
Perform taxonomic classification and compute diversity metrics
• Quality plots and read trimming
– FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
– FASTX http://hannonlab.cshl.edu/fastx_toolkit/
• Chimera removal
– AmpliconNoise http://code.google.com/p/ampliconnoise/
– UCHIME http://www.drive5.com/uchime/
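The read-trimming step above can be sketched in a few lines. This is only an illustration of 3' quality trimming, not the algorithm used by FastQC or FASTX; the function names, the Phred+33 offset, and the thresholds (`min_q=20`, `min_len=50`) are assumptions chosen for the example.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q.

    qual is the FASTQ quality string; offset=33 assumes Phred+33 encoding.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records, min_q=20, min_len=50):
    """Yield (name, seq, qual) for reads still at least min_len long after trimming."""
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            yield name, seq, qual
```

For example, `trim_3prime("ACGTACGT", "IIIII###")` keeps the first five bases, because `#` encodes a Phred score of 2 under the assumed encoding.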
Impact of Sequence Length
Slide: Feng Chen, JGI
• Merge overlapping paired end reads
– FLASH http://www.genomics.jhu.edu/software/FLASH/index.shtml
– FastqJoin http://code.google.com/p/ea-utils/wiki/FastqJoin
– CD-HIT read-linker http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit-auxtools-manual
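To make the idea of merging overlapping paired-end reads concrete, here is a minimal sketch. It is a simplification and not how FLASH, FastqJoin or CD-HIT actually score overlaps: it takes the longest suffix/prefix overlap within a mismatch tolerance and ignores quality scores. The function names and the defaults (`min_overlap=10`, `max_mismatch_frac=0.1`) are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, rev, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read pair into one fragment, or return None if no overlap is found.

    The reverse read is reverse-complemented, then the longest overlap
    between the end of fwd and the start of that sequence is accepted
    if its mismatch fraction is within max_mismatch_frac.
    """
    rc = reverse_complement(rev)
    for olen in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(fwd[-olen:], rc[:olen]))
        if mismatches <= max_mismatch_frac * olen:
            return fwd + rc[olen:]
    return None
```

Pairs that return `None` do not overlap under these settings and would have to be handled separately, for instance by analysing only the forward read.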
• Clustering with high stringency
– UCLUST/USEARCH (16S only) http://www.drive5.com/usearch/
– CD-HIT-OTU (16S only) http://weizhong-lab.ucsd.edu/cd-hit-otu/
– phylOTU (16S only) https://github.com/sharpton/PhylOTU
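The effect of clustering at a fixed identity threshold can be illustrated with a toy greedy, centroid-based clusterer. This is a sketch in the spirit of greedy OTU picking, not the UCLUST, CD-HIT-OTU or phylOTU algorithm; it assumes equal-length, pre-aligned sequences and a simple per-position identity, and the function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold.

    A sequence that matches no existing centroid becomes a new centroid.
    Returns a list of clusters; the first member of each is its centroid.
    """
    clusters = []
    for s in seqs:
        for cluster in clusters:
            if identity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

Two 100 bp sequences that differ at two positions (98% identity) fall into the same OTU at the 97% threshold, which is why strains whose marker regions are >99% similar cannot be separated this way.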
• Composition based classifiers
– RDP database + classifier http://rdp.cme.msu.edu/classifier/classifier.jsp
• Homology based classifiers
– ARB + Silva database (16S only) http://www.arb-home.de/
– GreenGenes database (16S only) http://greengenes.lbl.gov/cgi-bin/nph-index.cgi
– UNITE database (ITS only) http://unite.ut.ee/
– FungalITSPipeline (ITS only) http://www.emerencia.org/fungalitspipeline.html
QIIME http://www.qiime.org/
• Comprehensive suite of tools
– OTU picking
– Taxonomic classification
– Construction of phylogenetic trees
– Visualization
– Compute diversity statistics
• Available as Amazon EC2 image
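As an illustration of what "diversity statistics" means at the end of the workflow, the sketch below computes observed richness and the Shannon index from a list of per-OTU read counts. It is a standalone example, not QIIME code; the function names are assumptions, and the natural logarithm is used.

```python
import math


def observed_richness(counts):
    """Number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over OTUs with reads."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads in sample")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)
```

A sample with four equally abundant OTUs gives H = ln(4), while a sample dominated by a single OTU gives H = 0, so the index reflects evenness as well as richness.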
Whole Genome Shotgun (WGS) Metagenomics
• Better classification with increasing number of complete genomes
• Focus on whole genome based phylogeny (whole genome phylotyping)
• Advantages
– No amplification bias like in 16S/ITS
• Issues
– Poor sampling of fungal diversity
– Assembly of metagenomes is complicated due to uneven coverage
– Requires high depth of coverage
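The link between uneven abundance and the need for deep sequencing can be shown with a back-of-the-envelope calculation. The sketch assumes that reads are sampled in proportion to each member's share of community DNA (relative cell abundance times genome size); the function name and the input numbers are illustrative only.

```python
def expected_coverage(total_bases, abundance, genome_size):
    """Expected fold coverage of each genome in a mixed community.

    abundance: relative cell abundance per member (fractions).
    genome_size: genome length in bp per member.
    Assumes reads are drawn in proportion to abundance * genome size.
    """
    dna_mass = sum(abundance[name] * genome_size[name] for name in abundance)
    return {name: total_bases * abundance[name] / dna_mass for name in abundance}
```

With 1 Gb of sequence and two 5 Mb genomes at 99% and 1% abundance, the dominant member receives about 198x coverage and the rare member only about 2x, which is too little to assemble.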
WGS sequence processing workflow
Filter for low quality reads
Assemble reads
Perform taxonomic classification and compute diversity metrics
• Quality plots and read trimming
– FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
– FASTX http://hannonlab.cshl.edu/fastx_toolkit/
• NGS assembly with uneven depth
– IDBA-UD http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/
– MIRA http://www.chevreux.org/projects_mira.html
– Velvet http://www.ebi.ac.uk/~zerbino/velvet/
– MetaVelvet http://metavelvet.dna.bio.keio.ac.jp/
• Hybrid composition/homology based classifiers
– FCP http://kiwi.cs.dal.ca/Software/FCP
– Phymm/PhymmBL http://www.cbcb.umd.edu/software/phymm/
– AMPHORA2 http://wolbachia.biology.virginia.edu/WuLab/Software.html
– NBC http://nbc.ece.drexel.edu/
– MEGAN http://ab.inf.uni-tuebingen.de/software/megan/
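To show what "composition based" means, the toy example below assigns a read to the reference whose k-mer profile it most resembles. It is not the method of FCP, Phymm, NBC or any other tool listed above; the cosine similarity measure, the default `k=4`, and all names are assumptions made for illustration.

```python
import math
from collections import Counter


def kmer_profile(seq, k=4):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(p, q):
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(count * q[kmer] for kmer, count in p.items())  # missing k-mers count as 0
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def classify(read, references, k=4):
    """Return the name of the reference with the most similar k-mer profile."""
    read_profile = kmer_profile(read, k)
    return max(references, key=lambda name: cosine(read_profile, kmer_profile(references[name], k)))
```

Homology based classifiers instead align the read against reference sequences; hybrid tools combine both kinds of evidence.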
• Web based classifiers
– MG-RAST http://metagenomics.anl.gov/
– CAMERA http://camera.calit2.net/
– IMG/M http://img.jgi.doe.gov/cgi-bin/m/main.cgi
MetaPhlAn
• Unique clade-specific markers for sequenced bacteria and archaea
• 400 genera/4000 genomes including HMP genomes
• Species level resolution
• MetaPhlAn 2 in the works
– Eukaryotes including fungi
– Viruses
– Higher coverage of archaea
• Krona and GraPhlAn for visualization of output
• Websites
– https://bitbucket.org/nsegata/metaphlan
– http://huttenhower.sph.harvard.edu/metaphlan
PhyloSift/pplacer
• Reference database of marker genes
• Places reads on tree of life based on homology to reference protein
• Integration with metAMOS for pre-assembling next-generation datasets
• Bacterial and archaeal classification only
• Plant and fungi marker genes are being added
• Websites
– http://phylosift.wordpress.com/
– https://github.com/gjospin/PhyloSift
Real cost of Sequencing!!
Sboner, Genome Biology, 2011
Acknowledgements
Funding
Magdalen Lindeberg Cornell University
Dave Schneider USDA-ARS, Ithaca
Citrus greening / Wolbachia (wACP)
Thank you!
Surya Saha ss2489@cornell.edu
Suggestions
• Plan informatics workflow as early as possible
• Incorporate statistics at different stages in the workflow