

Using long native reads to partition and assemble genomes from complex metagenomic samples

Long PCR-free nanopore reads allow partitioning and assembly of individual genomes from complex mixtures of different organisms, using several different bioinformatics approaches

© 2019 Oxford Nanopore Technologies. All rights reserved. Oxford Nanopore Technologies’ products are currently for research use only. P17004 - Version 7.0

Fig. 1 De novo metagenomic assembly a) laboratory workflow b) typical bioinformatics pipeline

Long reads provide more genomic context, improving assembly from complex samples

Fig. 2 Assembly by coverage binning a) workflow b–e) performance on Zymo mock community

Binning using differential coverage profiles to improve assembly contiguity

Fig. 1a steps: (1) Total DNA extracted from complex sample and long-fragment library prepared; (2) Library sequenced and long reads obtained; (3) Genome sequences reconstructed by metagenomic assembly of long reads.

Contact: [email protected] More information at: www.nanoporetech.com and publications.nanoporetech.com

Most microbes cannot be cultured in the laboratory, so the most direct way to derive whole-genome sequences from complex mixtures of organisms is metagenomic assembly, in which all genomes in the sample are assembled together (Fig. 1a). Such mixtures often contain many similar genomes at different levels of abundance, which frequently leads to misassembly. A common approach to this problem is to bin reads into subsets that ideally each represent a single genome, and then to assemble each bin individually. Long reads help in two ways: they increase the sensitivity and specificity of binning strategies, and they provide longer overlaps for assembly. An example of such a workflow is shown in Fig. 1b.
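The Fig. 1b pipeline lists a read filter (length > 2 kb, Q score > 7) ahead of classification. The following is a minimal Python sketch of such a filter, not the tool used for the poster. It assumes reads are already parsed into hypothetical `(name, sequence, quality)` tuples with Phred+33 quality strings, and it assumes the mean Q score is averaged in error-probability space; the poster does not say how the Q score was computed.

```python
import math


def mean_qscore(quality: str, offset: int = 33) -> float:
    """Mean Q score of a read, averaged over per-base error probabilities.

    Averaging in probability space is an assumption made for this sketch.
    """
    if not quality:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filter(sequence: str, quality: str,
                  min_length: int = 2000, min_q: float = 7.0) -> bool:
    """Keep reads longer than min_length bases with mean Q above min_q."""
    return len(sequence) > min_length and mean_qscore(quality) > min_q


def filter_reads(records, min_length: int = 2000, min_q: float = 7.0):
    """records: iterable of (name, sequence, quality) tuples (hypothetical layout)."""
    return [r for r in records if passes_filter(r[1], r[2], min_length, min_q)]


# Illustrative reads only: '5' encodes Q20 and "'" encodes Q6 in Phred+33.
reads = [
    ("long_good", "A" * 2500, "5" * 2500),
    ("short", "A" * 1500, "5" * 1500),
    ("long_bad", "A" * 2500, "'" * 2500),
]
kept = filter_reads(reads)
print([name for name, _, _ in kept])
```

Only the read that clears both thresholds is kept; the thresholds are parameters so they can be tuned per sample.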

For metagenomic samples where the microbial genomes are not well represented in reference databases, differences in organism abundance across samples can be exploited as a binning strategy (Fig. 2a). We used three different extraction protocols on the Zymo mock community to create different genome abundances (Fig. 2b). We aligned reads from each sample to contigs assembled from the combined set of samples, and used aligned read depth to measure contig abundance in each sample (Fig. 2c). This allows contigs to be binned by matching abundance profiles (Fig. 2d). Finally, the initial bins can be refined by taking all reads that align to the contigs in a bin and conducting a second, bin-specific assembly (Fig. 2e).
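The idea of grouping contigs by matching abundance profiles can be illustrated with a small Python sketch. This is a simplified stand-in, not the method used by metabat2: it assumes per-sample mean read depths are already available for each contig, compares profiles by cosine similarity, and assigns each contig greedily to the first bin whose representative profile is similar enough. The contig names, depth values, and the `min_similarity` cutoff are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two depth profiles of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bin_contigs(depths, min_similarity=0.99):
    """Greedy binning of contigs by coverage profile.

    depths: {contig_name: [depth_in_sample_1, depth_in_sample_2, ...]}
    Returns a list of bins, each a list of contig names.
    """
    bins = []  # each entry: (representative_profile, member_names)
    for name, profile in depths.items():
        for representative, members in bins:
            if cosine(profile, representative) >= min_similarity:
                members.append(name)
                break
        else:
            # No existing bin matches: start a new one with this profile.
            bins.append((profile, [name]))
    return [members for _, members in bins]


# Hypothetical depths in three samples (e.g. supernatant, pellet 2, pellet 1).
depths = {
    "contig_a": [100, 10, 5],
    "contig_b": [98, 11, 5],
    "contig_c": [5, 80, 120],
    "contig_d": [6, 82, 118],
}
coverage_bins = bin_contigs(depths)
print(coverage_bins)
```

Cosine similarity ignores overall depth and compares only the shape of the profile, which is looser than matching absolute depth; a production binner would also use sequence composition and a statistical model of coverage variance.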

Fig. 3 Partitioning native bacterial reads using strain-specific methylation patterns a) overview of experimental set-up b) bioinformatics workflow c) hexbin plot showing partitioned reads

Separating reads from closely related bacterial genomes using Tombo to identify strain-specific patterns of Dam and Dcm methylation

The high degree of sequence similarity between multiple strains in microbial communities can present significant challenges for analysis. One way to resolve strain-specific sequences is to take advantage of the patterns of DNA methylation that are often present in microbial genomes. Methylation occurs at specific target motifs, yet there exists a great diversity of these motifs in the bacterial world, even among members of the same species. These naturally occurring methylation patterns can be detected in nanopore reads and can serve as epigenetic barcodes for binning reads by strain.

In the example shown, two strains of E. coli were co-cultured: a wild-type K12 strain and a K12 mutant lacking the Dcm and Dam methyltransferases that methylate the 5’-CCWGG-3’ and 5’-GATC-3’ motifs, respectively (Fig. 3a). Nanopore sequencing resulted in a mixture of reads from each strain. After aligning all reads to an E. coli K12 reference sequence, the methylation detection tool Tombo was used to characterise the methylation status at 5’-GATC-3’ and 5’-CCWGG-3’ sites. Read-level statistics were compiled by assessing all motif sites from each read and taking the median methylation score for these sites (Fig. 3b). The resulting hexbin plot shows a division of reads solely based on these read-level methylation assessments at the two motifs in question: one group has high scores for both the Dcm and Dam motifs, while the other group has low methylation scores for both motifs (Fig. 3c).
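The read-level step described above (pool the motif-site scores on each read, take the median per motif, then separate reads by those medians) can be sketched in Python as follows. This is an illustration under stated assumptions, not Tombo's API or output format: the input dictionary layout, the convention that higher scores mean more methylation, and the cutoff of 0 are all hypothetical.

```python
from statistics import median


def read_level_scores(site_scores):
    """Median per-motif score for each read.

    site_scores: {read_id: {"dam": [scores at GATC sites],
                            "dcm": [scores at CCWGG sites]}}  (hypothetical layout)
    Reads lacking sites for either motif are skipped.
    """
    medians = {}
    for read_id, motifs in site_scores.items():
        if motifs["dam"] and motifs["dcm"]:
            medians[read_id] = (median(motifs["dam"]), median(motifs["dcm"]))
    return medians


def partition_reads(read_scores, dam_cutoff=0.0, dcm_cutoff=0.0):
    """Split reads into high/high, low/low, and mixed groups (assumed cutoffs)."""
    groups = {"methylated": [], "unmethylated": [], "ambiguous": []}
    for read_id, (dam, dcm) in read_scores.items():
        if dam > dam_cutoff and dcm > dcm_cutoff:
            groups["methylated"].append(read_id)
        elif dam <= dam_cutoff and dcm <= dcm_cutoff:
            groups["unmethylated"].append(read_id)
        else:
            groups["ambiguous"].append(read_id)
    return groups


# Illustrative per-site scores only; not real data.
site_scores = {
    "r1": {"dam": [2.0, 1.5, 3.0], "dcm": [1.0, 2.0]},
    "r2": {"dam": [-2.0, -1.0], "dcm": [-1.5, -0.5, -3.0]},
    "r3": {"dam": [1.0], "dcm": [-1.0]},
    "r4": {"dam": [], "dcm": [1.0]},
}
scores = read_level_scores(site_scores)
groups = partition_reads(scores)
print(groups)
```

Using the median rather than the mean keeps a few noisy sites from dominating a read's score; reads whose two motifs disagree are set aside rather than forced into a strain bin.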

Fig. 1b pipeline labels: Nanopore reads; Filter (length > 2 kb, Q score > 7); Classify (Kaiju); Bin by genus or species (classified reads, unclassified reads, taxonomic bins); for each bin: Initial assembly (miniasm), Recruit reads (minimap2), Final assembly (canu), QC/identification (BLASTN).

PLOTTING: min_cov: 5 window: 20

Fig. 2a workflow labels: Nanopore reads (Sample 1, Sample 2, Sample N); Filter (length > 2 kb, Q score > 7); Co-assemble all samples (wtdgb2); Align sample 1, sample 2, sample N (minimap2); Create coverage profiles and assign contigs to bins (metabat2); Coverage bins; for each bin: Evaluate bin quality (CheckM), Recruit reads (minimap2), Final assembly (canu/wtdgb2), QC/identification (BLASTN).

Fig. 2b: relative abundance (0.0 to 1.0) in the Supernatant, Pellet 2 and Pellet 1 samples; legend: Bacillus, Enterococcus, Escherichia, Lactobacillus, Listeria, Pseudomonas, Salmonella, Staphylococcus, Unassigned.

Fig. 2c: axes Supernatant and Pellet 2 (log-scale ticks from 10^1 to 10^3); legend: 50 kb, 500 kb, 1,000 kb, 2,000 kb.

Fig. 2d: heatmap of species (E. faecalis, E. coli, S. enterica, L. fermentum, L. monocytogenes, S. aureus, P. aeruginosa, B. subtilis, unknown) against bins (10, 8, 5, 2, 3, 6, 9, 7, 4, 1, n/a); colour scale 0 to 1.0.

Fig. 2e: % Pseudomonas genome (0 to 100) for the Initial and Final assembly; annotations N = 53 and N = 18.

Fig. 3a: co-cultured strains with distinct MTase activities: E. coli K12 and E. coli K12 Dam-/Dcm-; modifications shown: N6-methyladenine (6mA) at 5’-GATC-3’ and 5-methylcytosine (5mC) at 5’-CCWGG-3’.

Fig. 3b workflow labels: Nanopore reads; Align to reference (minimap2); Call methylation at motifs of interest (Tombo); Pool motif scores on each read; Separate reads by motif scores; Strain binning by methylation.

Fig. 3c: hexbin plot, "Read-level methylation detection"; axes Median Dam and Median Dcm; colour scale Read count (0 to 140).