
Advancing Science with DNA Sequence

IMG/M and metagenome analysis

Natalia Ivanova

MGM Workshop

February 5, 2009


Outline

1. Problems of metagenomic data

2. IMG/M features

3. Analysing metagenomic data: flowcharts


1. Problems of metagenomic data (metagenomic data is the problem)

(see IMG/M -> Using IMG/M -> About IMG/M -> Background for definitions)


Metagenomic data are noisy

• Definition of high quality genome sequence: an example of “finished” JGI genomes - each base is covered by at least two Sanger reads in each direction with a quality of at least Q20

• Definition of a “high quality” metagenome? Too many variables:
  - species composition/abundance
  - amount of DNA available
  - average GC content of each species (applies to 454 Titanium as well)
  - “clonability” of the DNA of each species (or biases of 454 libraries)
  - amount of sequence allocated
  - no clear sequencing goal
  - …
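The Q20 threshold in the definition of a finished genome can be read as an error probability. A minimal sketch, assuming the standard Phred scale (Q = -10 log10 P); the function name is illustrative and not part of any JGI tool:

```python
def phred_to_error(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)

# Q20 corresponds to 1 expected error in 100 base calls.
q20_error = phred_to_error(20)
print(f"Q20 error probability: {q20_error:.2f}")
```

A base covered by several independent reads of at least Q20 is therefore much less likely to be wrong than a base in a metagenome covered by a single read.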


Metagenomic data are noisy

• Sequence coverage of metagenomes is low

• Rate of sequencing artifacts is high

• Frameshifts are the most unpleasant artifacts: they lead to errors in gene prediction

US Sludge, Phrap assembly        # of scaffolds    % of total scaffolds
Scaffolds, coverage > 2.0                  2954                     9.3
Scaffolds, coverage 1.03-2.0               8158                    25.7
Unassembled reads                         20630                      65
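The percentages in the US Sludge table can be recomputed from the counts. This is only a sketch, assuming the three rows together make up the total (that is, each unassembled read is counted as a scaffold of its own); the variable names are illustrative:

```python
# Counts as given in the US Sludge (Phrap assembly) table.
counts = {
    "Scaffolds, coverage > 2.0": 2954,
    "Scaffolds, coverage 1.03-2.0": 8158,
    "Unassembled reads": 20630,
}

# Assumption: the total is the sum of the three rows.
total = sum(counts.values())

# Share of each category, in percent of the total.
percent = {name: 100.0 * n / total for name, n in counts.items()}

for name, value in percent.items():
    print(f"{name}: {value:.1f}%")
```

Under that assumption roughly two thirds of the dataset has single-read coverage, which is what "sequence coverage of metagenomes is low" means in practice.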


Metagenomic data are highly fragmented

• Median scaffold length in 56 GEBA genomes – 28,179 bp

• Median scaffold length in US Sludge, Phrap assembly – 1,157 bp

• Many more gene fragments in metagenomes (median protein size in GEBA genomes – 252 aa, median protein size in US Sludge, Phrap – 195 aa)

• Problems with assignment to protein families and functional annotation


Metagenomic datasets are large (or huge)

# of CDSs    GEBA genomes    Samples in IMG           Projects in IMG
minimal      1,375           2,331 (mouse gut ob2)    2,386 (AMO community)
maximal      9,433           185,274 (soil)           333,301 (Lake Washington sediment)
median       3,562           16,053                   83,662

• No manual annotation (functional annotations in metagenomes should be taken with a grain of salt)

• “Divide and conquer” approach


2. IMG/M features

(see also IMG/M -> Using IMG/M -> Using IMG/M -> IMG User Guide and IMG/M Addendum)


IMG/M User Interface Map


Dividing the genes phylogenetically

• Bins: Microbiome Details -> Microbiome Information -> Bins (of scaffolds)

• Phylogenetic Distribution of Genes: Microbiome Details -> Phylogenetic Distribution of Genes

Components: histograms, Protein Recruitment Plots, summary statistics tables, lists of genes

[Figure: navigation within Phylogenetic Distribution of Genes. Panels: histogram (phylum/class), histogram (family), histogram (species). Linked outputs: gene counts, gene lists, summary statistics, recruitment plots.]


Dividing the genes by abundance/ by function

• Abundance Profiles: Compare Genomes -> Abundance Profiles Tools

Components:
  - Abundance Profile Overview
  - Abundance Profile Search
  - Function Comparisons
  - Function Category Comparisons

Common parameters:
  - Normalization (none/scale for size)
  - Type of count (raw counts/estimated gene copies)
  - Type of protein family (COG, Pfam, Enzyme, TIGRfam)
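To see why the normalization parameter matters, here is a minimal sketch of scaling raw protein-family counts by dataset size. The slides do not give the formula IMG/M uses, so the function `scale_for_size`, the per-100,000-genes scale and all counts below are assumptions made for illustration only:

```python
def scale_for_size(family_counts, total_genes, scale=100_000):
    """Express raw family counts per `scale` genes (hypothetical normalization)."""
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")
    return {fam: scale * n / total_genes for fam, n in family_counts.items()}

# Made-up raw counts for two samples that differ tenfold in size.
sample_a = {"COG0001": 20, "COG0002": 5}
sample_b = {"COG0001": 200, "COG0002": 10}

norm_a = scale_for_size(sample_a, total_genes=10_000)
norm_b = scale_for_size(sample_b, total_genes=100_000)

for fam in sample_a:
    print(fam, norm_a[fam], norm_b[fam])
```

In this toy case the raw counts suggest both families are more abundant in the larger sample, while the scaled counts show the first family at the same relative abundance in both and the second at a lower one in the larger sample.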


3. Analysing metagenomic data: flowcharts


Sanger metagenomes

[Flowchart. Sequencing stages: 10 plate QC; Full sequence; 16S sequences. Processing boxes:]
  - raw read QC: GC content, insert-less clones, contamination
  - taxonomic analysis (MEGAN)
  - Sanger library loading to IMG/M-ER (upon request)
  - manual analysis (protein families, etc.)
  - vector and quality trimming
  - assembly
  - annotation
  - binning
  - loading to IMG/M-ER


454 Titanium metagenomes

[Flowchart. Sequencing stages: ¼ run QC (100 Mb); Full sequence (1 run, ~500 Mb); 16S pyrotags. Processing boxes; "?" marks steps shown as open questions on the slide:]
  - raw read QC; initial assembly ?
  - taxonomic analysis (MEGAN)
  - Titanium library loading to IMG/M-ER (upon request)
  - manual analysis (protein families, etc.)
  - dereplication
  - quality trimming ?
  - assembly ?
  - annotation ?
  - binning ?
  - loading to IMG/M-ER


Sanger/Titanium metagenomes: unassembled data

• taxonomic analysis using Phylogenetic Distribution of genes
  - gross counts of hits to taxa
  - hits to housekeeping genes at different % identity
  - compare to 16S and MEGAN results

• abundance analysis using Function Comparisons and Function Category Comparisons
  - compare to relevant metagenomes (ecology/taxonomy)
  - compare to relevant genomes (ecology/taxonomy)
  - check “Genes in internal clusters”

• abundance analysis of custom function categories using Function Profiles
  - find the relevant genes and reference sequences in the literature
  - identify relevant protein families
  - add them to Function Cart, run Function Profiles, compare sums of counts
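The last step, comparing sums of counts for a custom function category, can be sketched outside IMG/M once a function profile has been exported. This is not IMG/M code: the family identifiers, sample names and counts are hypothetical placeholders, and the profile is assumed to be a simple mapping from protein family to gene count:

```python
def category_total(profile, families):
    """Sum the counts of the given protein families in one function profile."""
    return sum(profile.get(fam, 0) for fam in families)

# Hypothetical custom category; the identifiers are placeholders.
custom_category = ["pfamA", "pfamB", "COG_X"]

# Hypothetical profiles: protein family -> gene count, per metagenome.
profiles = {
    "metagenome_1": {"pfamA": 12, "pfamB": 3, "COG_X": 0},
    "metagenome_2": {"pfamA": 2, "COG_X": 7},
}

totals = {name: category_total(p, custom_category) for name, p in profiles.items()}

for name, value in totals.items():
    print(name, value)
```

A family missing from a profile is treated as a count of zero, which matches the case where a sample has no hits to that family.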


Sanger/Titanium metagenomes: assembled data

• taxonomic analysis using Phylogenetic Distribution of genes
  - look for reference genomes
  - try to select a training set for binning

• abundance analysis using Function Comparisons and Function Category Comparisons
  - compare to relevant metagenomes (ecology/taxonomy)
  - compare to relevant genomes (ecology/taxonomy)
  - check “Genes in internal clusters”

• abundance analysis of custom function categories using Function Profiles
  - find the relevant genes and reference sequences in the literature
  - identify relevant protein families
  - add them to Function Cart, run Function Profiles, compare sums of counts

• binning


Sanger/Titanium metagenomes: assembled and binned data

• QC analysis of bins
  - check the genes on the scaffolds with lowest confidence
  - analysis of bin coverage: check the presence of COGs in biosynthetic pathways, ribosomal proteins, etc.

• metabolic reconstruction on bins
  - COG Pathways and Functional Categories
  - KEGG maps
  - custom pathways

• compare bin content using Phylogenetic Profiles
  - keep in mind bin coverage
  - analyze gene presence/absence in pathway context
  - be careful with unique proteins: they may be errors of gene prediction

• analyze recombination within populations using SNP VISTA
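The bin coverage check (presence of COGs for ribosomal proteins and biosynthetic pathways) can be approximated as the fraction of an expected marker set found in a bin. A minimal sketch, assuming the COG assignments of a bin have been exported as a list of identifiers; the marker set and bin content below are placeholders, not a curated list:

```python
def marker_completeness(bin_cogs, marker_cogs):
    """Return (fraction of marker COGs present in the bin, sorted missing markers)."""
    markers = set(marker_cogs)
    if not markers:
        raise ValueError("marker_cogs must not be empty")
    found = markers & set(bin_cogs)
    missing = sorted(markers - found)
    return len(found) / len(markers), missing

# Placeholder identifiers standing in for ribosomal-protein COGs.
markers = ["COG_A", "COG_B", "COG_C", "COG_D"]

# Placeholder COG assignments for the genes of one bin.
bin_content = ["COG_A", "COG_C", "COG_Z", "COG_A"]

fraction, missing = marker_completeness(bin_content, markers)
print(f"{fraction:.0%} of markers present; missing: {missing}")
```

A low fraction suggests that gene absences in that bin should not be interpreted biologically, which is the reason to keep bin coverage in mind when comparing bin content.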
