Advancing Science with DNA Sequence
IMG/M and metagenome analysis
Natalia Ivanova
MGM Workshop
February 5, 2009
Outline
1. Problems of metagenomic data
2. IMG/M features
3. Analysing metagenomic data: flowcharts
1. Problems of metagenomic data (metagenomic data is the problem)
(see IMG/M -> Using IMG/M -> About IMG/M -> Background for definitions)
Metagenomic data are noisy
• Definition of high quality genome sequence: an example of “finished” JGI genomes - each base is covered by at least two Sanger reads in each direction with a quality of at least Q20
• Definition of “high quality” metagenome? Too many variables:
    • species composition/abundance
    • amount of DNA available
    • average GC content of each species (applies to 454 Titanium as well)
    • “clonability” of the DNA of each species (or biases of 454 libraries)
    • amount of sequence allocated
    • no clear sequencing goal
    • …
Metagenomic data are noisy
• Sequence coverage of metagenomes is low
• Rate of sequencing artifacts is high
• Frameshifts are the most unpleasant artifacts; they lead to errors in gene prediction
US Sludge, Phrap assembly       # of scaffolds   % total scaffolds
Scaffolds, coverage > 2.0                 2954                 9.3
Scaffolds, coverage 1.03-2.0              8158                25.7
Unassembled reads                        20630                65
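The percentages in the table follow directly from the counts. A minimal Python sketch of that arithmetic, using only the numbers on the slide (the variable names are illustrative):

```python
# Counts from the US Sludge Phrap assembly table above.
counts = {
    "Scaffolds, coverage > 2.0": 2954,
    "Scaffolds, coverage 1.03-2.0": 8158,
    "Unassembled reads": 20630,
}

total = sum(counts.values())  # scaffolds plus unassembled reads

# Share of each category, rounded to one decimal place as on the slide.
percent = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, value in percent.items():
    print(f"{name}: {value}%")
```

Roughly two thirds of the entries are reads that never assembled, which is what low sequence coverage looks like in practice.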
Metagenomic data are highly fragmented
• Median scaffold length in 56 GEBA genomes – 28,179 bp
• Median scaffold length in US Sludge, Phrap assembly – 1,157 bp
• Many more gene fragments in metagenomes (median protein size in GEBA genomes – 252 aa, median protein size in US Sludge, Phrap – 195 aa)
• Problems with assignment to protein families and functional annotation
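Median sizes like those quoted above can be computed from any protein FASTA file. The sketch below is one way to do it; the helper name `fasta_lengths` and the toy sequences are hypothetical and not taken from the slides:

```python
from statistics import median

def fasta_lengths(text):
    """Return the length of every sequence in FASTA-formatted text."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

# Toy input: three made-up protein fragments, not real data.
toy_fasta = """>gene1
MKTAYIAKQR
QISFVKSH
>gene2
MSTN
>gene3
MLLAVLYCLA
"""

lengths = fasta_lengths(toy_fasta)
print(lengths, median(lengths))
```

Comparing this median between a finished genome and a metagenome assembly gives a quick sense of how many predicted genes are fragments.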
Metagenomic datasets are large (or huge)
# of CDSs   GEBA genomes   Samples in IMG           Projects in IMG
minimal     1,375          2,331 (mouse gut ob2)    2,386 (AMO community)
maximal     9,433          185,274 (soil)           333,301 (Lake Washington sediment)
median      3,562          16,053                   83,662
• No manual annotation (functional annotations in metagenomes should be taken with a grain of salt)
• “Divide and conquer” approach
2. IMG/M features
(see also IMG/M -> Using IMG/M -> Using IMG/M -> IMG User Guide and IMG/M Addendum)
Dividing the genes phylogenetically
• Bins
    Microbiome Details -> Microbiome Information -> Bins (of scaffolds)
• Phylogenetic Distribution of Genes
    Microbiome Details -> Phylogenetic Distribution of Genes
    Components:
    • histograms
    • Protein Recruitment Plots
    • summary statistics tables
    • lists of genes
[Figure: screenshots of Phylogenetic Distribution of Genes. Panel labels: histogram (phylum/class) with gene counts, gene lists, summary statistics; histogram (family) with counts, lists, statistics; histogram (species) with counts, lists; recruitment plots]
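The histograms above are, in essence, tallies of genes by the taxon of their best database hit. The following is a minimal sketch of such a tally, not IMG/M's actual implementation: the record layout, the sample data, and the 30/60/90% identity cutoffs are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical best-hit records: (gene id, phylum of best hit, percent identity).
best_hits = [
    ("g1", "Proteobacteria", 95.2),
    ("g2", "Proteobacteria", 64.0),
    ("g3", "Firmicutes", 41.5),
    ("g4", "Proteobacteria", 91.0),
    ("g5", "Firmicutes", 28.0),  # below the lowest cutoff, left unassigned
]

CUTOFFS = (90, 60, 30)  # assumed identity bins, highest first

def identity_bin(identity):
    """Return the highest cutoff the hit reaches, or None if it reaches none."""
    for cutoff in CUTOFFS:
        if identity >= cutoff:
            return cutoff
    return None

histogram = Counter()
for _gene, phylum, identity in best_hits:
    cutoff = identity_bin(identity)
    if cutoff is not None:
        histogram[(phylum, cutoff)] += 1

for (phylum, cutoff), n in sorted(histogram.items()):
    print(f"{phylum}\t{cutoff}%+\t{n}")
```

Keeping the identity bin next to the taxon matters: a phylum supported only by low-identity hits is much weaker evidence than one with many hits at 90% or more.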
Dividing the genes by abundance/ by function
• Abundance Profiles
    Compare Genomes -> Abundance Profiles Tools
    Components:
    • Abundance Profile Overview
    • Abundance Profile Search
    • Function Comparisons
    • Function Category Comparisons
    Common parameters:
    • Normalization (none/scale for size)
    • Type of count (raw counts/estimated gene copies)
    • Type of protein family (COG, Pfam, Enzyme, TIGRfam)
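To make the common parameters concrete, here is a small sketch of how they might combine. It rests on two assumptions that the slide does not spell out: that "estimated gene copies" weights each gene by the read depth of its scaffold, and that "scale for size" expresses counts per fixed number of genes. All names and numbers are hypothetical.

```python
from collections import defaultdict

# Hypothetical genes: (protein family, read depth of the scaffold carrying the gene).
genes = [
    ("COG0001", 1.0),
    ("COG0001", 3.0),
    ("COG0002", 2.0),
    ("COG0003", 1.0),
]

def family_counts(genes, estimated_copies=False):
    """Count genes per family; optionally weight each gene by scaffold depth."""
    counts = defaultdict(float)
    for family, depth in genes:
        counts[family] += depth if estimated_copies else 1.0
    return dict(counts)

def scale_for_size(counts, per=1000):
    """Express counts per `per` counted genes so samples of different size compare."""
    total = sum(counts.values())
    return {family: per * n / total for family, n in counts.items()}

raw = family_counts(genes)
copies = family_counts(genes, estimated_copies=True)
scaled = scale_for_size(raw)
print(raw, copies, scaled, sep="\n")
```

Under these assumptions, raw counts understate families that sit on deeply covered scaffolds, which is why the choice of count type can change the ranking of families.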
3. Analysing metagenomic data: flowcharts
Sanger metagenomes
[Flowchart; box labels as recovered from the slide]
• Sanger library
• 10 plate QC
• Full sequence
• 16S sequences
• raw read QC: GC content, insert-less clones, contamination
• vector and quality trimming
• assembly, annotation, binning
• taxonomic analysis (MEGAN)
• loading to IMG/M-ER
• loading to IMG/M-ER (upon request)
• manual analysis (protein families, etc.)
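The raw read QC step in the Sanger flowchart includes a GC content check. A minimal sketch of that calculation follows, assuming reads are plain strings and that ambiguous bases such as N are ignored; the function name and toy reads are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T) in a read."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0

# Toy reads, not real data.
reads = ["ATGCGC", "AATT", "GGCCNN"]
per_read = [gc_content(r) for r in reads]
print(per_read)
```

Plotting the distribution of these per-read values is a common way to spot contamination or a strongly biased library before committing to full sequencing.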
454 Titanium metagenomes
[Flowchart; box labels as recovered from the slide, with "?" kept where the slide shows it]
• Titanium library
• ¼ run QC (100 Mb)
• Full sequence (1 run, ~500 Mb)
• 16S pyrotags
• raw read QC; initial assembly ?
• dereplication, quality trimming ?
• assembly ?
• annotation ?
• binning ?
• taxonomic analysis (MEGAN)
• loading to IMG/M-ER
• loading to IMG/M-ER (upon request)
• manual analysis (protein families, etc.)
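Assuming the dereplication step in the Titanium flowchart refers to removing artificially duplicated 454 reads, the idea can be sketched as below. This is a deliberate simplification: reads count as duplicates when they share the same first `prefix_len` bases, whereas real tools cluster more tolerantly. The function name and parameter are hypothetical.

```python
def dereplicate(reads, prefix_len=20):
    """Keep one read (the longest) per group of reads sharing a prefix."""
    best = {}
    for read in reads:
        key = read[:prefix_len]
        if key not in best or len(read) > len(best[key]):
            best[key] = read
    return list(best.values())

# Toy reads, not real data.
reads = [
    "ACGTACGTACGTACGTACGTAAA",
    "ACGTACGTACGTACGTACGTAAACCC",  # same 20-base prefix, longer
    "TTTTACGTACGTACGTACGTAAA",
]
unique = dereplicate(reads)
print(len(reads), "->", len(unique))
```

Skipping this step would inflate gene counts for whatever happened to be duplicated, which then distorts any abundance comparison downstream.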
Sanger/Titanium metagenomes: unassembled data
• taxonomic analysis using Phylogenetic Distribution of genes
    • gross counts of hits to taxa
    • hits to housekeeping genes at different % identity
    • compare to 16S and MEGAN results
• abundance analysis using Function Comparisons and Function Category Comparisons
    • compare to relevant metagenomes (ecology/taxonomy)
    • compare to relevant genomes (ecology/taxonomy)
    • check “Genes in internal clusters”
• abundance analysis of custom function categories using Function Profiles
    • find the relevant genes and reference sequences in the literature
    • identify relevant protein families
    • add them to Function Cart, run Function Profiles, compare sums of counts
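The last step, comparing sums of counts for a custom function category, amounts to adding up per-family counts for each sample. A minimal sketch, assuming the Function Profile results have been exported as per-sample dictionaries; the sample names, family identifiers, and counts are made up:

```python
# Hypothetical per-sample counts of protein families, as a Function Profile might list them.
profiles = {
    "sample_A": {"pfam00001": 12, "pfam00002": 3, "COG0100": 7},
    "sample_B": {"pfam00001": 2, "COG0100": 9},
}

# Custom function category: families chosen from the literature (made up here).
category = ["pfam00001", "pfam00002"]

def category_sum(profile, families):
    """Sum counts over the chosen families; absent families count as zero."""
    return sum(profile.get(family, 0) for family in families)

sums = {sample: category_sum(profile, category) for sample, profile in profiles.items()}
print(sums)
```

The sums are only comparable across samples once they are put on the same scale, for example by normalizing for dataset size.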
Sanger/Titanium metagenomes: assembled data
• taxonomic analysis using Phylogenetic Distribution of genes
    • look for reference genomes
    • try to select a training set for binning
• abundance analysis using Function Comparisons and Function Category Comparisons
    • compare to relevant metagenomes (ecology/taxonomy)
    • compare to relevant genomes (ecology/taxonomy)
    • check “Genes in internal clusters”
• abundance analysis of custom function categories using Function Profiles
    • find the relevant genes and reference sequences in the literature
    • identify relevant protein families
    • add them to Function Cart, run Function Profiles, compare sums of counts
• binning
Sanger/Titanium metagenomes: assembled and binned data
• QC analysis of bins
    • check the genes on the scaffolds with lowest confidence
    • analysis of bin coverage: check the presence of COGs in biosynthetic pathways, ribosomal proteins, etc.
• metabolic reconstruction on bins
    • COG Pathways and Functional Categories
    • KEGG maps
    • custom pathways
• compare bin content using Phylogenetic Profiles
    • keep in mind bin coverage
    • analyze gene presence/absence in pathway context
    • be careful with unique proteins: they may be errors of gene prediction
• analyze recombination within populations using SNP VISTA
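The bin coverage check mentioned above (presence of COGs for ribosomal proteins and similar housekeeping functions) can be approximated by asking what fraction of an expected marker set a bin contains. The sketch below assumes a small marker set and per-bin COG assignments; the COG identifiers and bin contents are placeholders chosen for illustration.

```python
# Hypothetical marker set, standing in for COGs expected in every genome.
markers = {"COG0048", "COG0049", "COG0051", "COG0052"}

# Hypothetical COG assignments of genes in two bins.
bins = {
    "bin_1": ["COG0048", "COG0049", "COG0051", "COG0052", "COG1234"],
    "bin_2": ["COG0049", "COG1234"],
}

def marker_coverage(cogs, markers):
    """Fraction of marker COGs present at least once in the bin."""
    return len(markers & set(cogs)) / len(markers)

coverage = {name: marker_coverage(cogs, markers) for name, cogs in bins.items()}
print(coverage)
```

A bin with low marker coverage is likely incomplete, so a gene missing from it says little about the organism; this is the reason to keep bin coverage in mind when comparing bin content.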