
Technical Application Note GQ-16

gqinnovationcenter.com

McGill University and Génome Québec Innovation Centre, 740 Docteur-Penfield Ave., Montréal, QC, Canada H3A 0G1 • T 514 398-7211 • [email protected]

Achieving complete and accurate assemblies with PacBio RSII and HGAP2

Introduction

Achieving complete genome assembly using next-generation sequencing is often challenging because repeated elements can be longer than the sequencing reads generated. As we showed in a previous Technical Application Note (GQ-13), the use of Pacific Biosciences (PacBio) long reads, in combination with standard short-read sequencing, dramatically improves contig sizes for de novo genome assembly. In this application note, we describe recent improvements in the PacBio instrument and chemistry, as well as a new analysis pipeline adopted at the McGill University and Génome Québec Innovation Centre that further improves assembly results and simplifies the sequencing requirements. We show that PacBio reads and companion software tools can produce unfragmented and accurate assemblies of small genomes at levels unattainable with other currently available technologies. The analysis pipeline is now offered as a service; we consider it an indispensable complement to PacBio sequencing for de novo assembly.

PacBio Evolution with the RSII

Two major improvements were made recently to the PacBio system. The first is the PacBio RSII upgrade, which consists of a new optical beam splitter that allows the excitation and imaging of all 150,000 zero-mode waveguides (ZMWs) contained in a SMRTcell, thus doubling the number of reads produced. We also increased our library insert sizes to 20Kb by shearing samples at lower speed in gTUBEs (Covaris). Together, these changes resulted in an overall three-fold increase in bases produced in a standard two-hour sequencing run and pushed the median read length to over 5Kb (Figure 1). In addition, the sequencing polymerase chemistry was improved: the new P4-C2 reagents and protocols raise the raw accuracy from 85-87% to ~90%.

Figure 1. Increased throughput with the RSII upgrade. The plots above show histograms for the number of reads (left) and number of bases (right) in relation to read length for pre- (blue) and post- (red) RSII upgrade. In this typical example, the total bases produced increased from 114Mb to 336Mb post-upgrade (all subreads) and the average read length increased from 2Kb to 5Kb. Also, more than 50% of the bases produced are now in reads of 6.5Kb or larger (dashed line), compared to less than 10% before the upgrade.

[Figure 1 axis labels: x-axis, read length (bins: 100bp) in both panels; y-axis, read count (left panel) and base pairs (right panel).]


The second major improvement comes from a new correction and de novo genome assembly pipeline called the Hierarchical Genome Assembly Process (HGAP) [1], described in Figures 2 and 3. Since errors are randomly distributed along the PacBio reads, one can use the redundancy from the coverage to determine a consensus sequence. First, a subset of longer reads, selected on the basis of a length threshold, is corrected with shorter reads, and the corrected long reads are then used to generate an assembly. The pipeline then uses the program Quiver to realign the full set of PacBio reads onto the assembly and uses the detailed quality information contained in the PacBio raw files to further correct any base-calling errors.
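The idea that randomly distributed errors cancel out with enough coverage can be illustrated with a toy column-wise majority vote. This is only a sketch and not the consensus method HGAP itself uses: it assumes the reads are already aligned to the same coordinates and gap-padded with '-', and the function name `majority_consensus` and the example reads are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Return the per-column majority base of pre-aligned, equal-length reads.

    Assumption: reads are already aligned and gap-padded with '-'.
    A column whose majority symbol is '-' is dropped from the consensus.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# Five copies of the same 8-base region. The first is error-free; each of
# the others carries a single error at a different column.
reads = [
    "ACGTTGCA",
    "ACGATGCA",  # substitution at column 4
    "AC-TTGCA",  # deletion at column 3
    "ACGTTGGA",  # substitution at column 7
    "TCGTTGCA",  # substitution at column 1
]
print(majority_consensus(reads))
```

Because each error falls in a different column, every column still has a clear majority, which is why higher coverage makes a consensus more reliable than any single raw read.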

Figure 2. HGAP pipeline. Typically 100X of sequence coverage is collected. Raw reads are split into longest “seed” reads and shorter reads by using a length threshold. This threshold is typically chosen such that we have at least 30X of coverage in the “seed” group. The shorter reads are then used to correct the seed reads through alignment and consensus resulting in corrected reads. These corrected long reads have high enough quality for assembly. After we obtain the genome sequence, a contig polishing software called Quiver [1] is used to further improve the accuracy of the assembly.
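The choice of seed length threshold described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under our own assumptions, not the code used by the pipeline: the function name `seed_length_threshold` is hypothetical, and the read lengths and genome size are toy numbers chosen so the arithmetic is easy to follow.

```python
def seed_length_threshold(read_lengths, genome_size, target_coverage=30.0):
    """Largest length cutoff such that reads >= cutoff reach the target coverage.

    Returns None if even the full read set falls short of the target.
    """
    needed_bases = target_coverage * genome_size
    cumulative = 0
    # Walk from the longest read down, accumulating bases until the
    # seed group alone provides the requested coverage.
    for length in sorted(read_lengths, reverse=True):
        cumulative += length
        if cumulative >= needed_bases:
            return length
    return None


# Toy data: a 1,000-base "genome" and nine reads; we ask for 2X seed coverage.
toy_lengths = [900, 800, 700, 600, 500, 400, 300, 200, 100]
cutoff = seed_length_threshold(toy_lengths, genome_size=1000, target_coverage=2.0)
seed_bases = sum(length for length in toy_lengths if length >= cutoff)
print(cutoff, seed_bases / 1000)
```

Reads at or above the returned cutoff form the seed group; the remaining reads are the ones used to correct them.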

Figure 3. PacBio raw read transformation with the HGAP pipeline. This figure illustrates the importance of every step in the HGAP pipeline. The largest correction takes place from raw reads to corrected reads. In this example, all insertions and miscalled bases have been removed, but some deletions remain in the corrected reads. The assembly process further repairs the data but, as seen in the highlighted position, errors can still remain. Finally, the polishing step corrects the remaining deletion. This is because, unlike the assembly, where only the longest reads are used, the polishing uses all the available coverage.

[Figure 2 flowchart: long raw reads → length threshold → longest raw “seed” reads → align long reads to “seed” reads → produce corrected reads → assemble corrected reads → produce genome sequence → realign all raw reads on genome → produce polished genome sequence.]


Evaluating PacBio Assembly Accuracy

In our previous Application Note, we evaluated hybrid assemblies in which short, high-quality reads were used to correct PacBio long reads for assembly. The hybrid approach resulted in fewer, longer contigs.

To evaluate HGAP assembly, we made use of the 2.2 Mb genome of Streptococcus agalactiae. Using PacBio reads only and HGAP we assembled the genome de novo in a single contig using two SMRTcells (with two-hour movies). This resulted in 39X coverage in reads above 2890 bases (HGAP threshold used) from a total of 161X. We also generated MiSeq reads of 1x100 for an additional coverage of 185X.

Figure 4 illustrates the coverage in PacBio reads on an HGAP assembly. When realigning the Illumina reads to a polished PacBio HGAP assembly, the vast majority of the bases in the genome are covered. Areas with no coverage by Illumina reads are shown by the marks named “no_cvg_in_MiSeq”, which add up to 6 bases on the genome. Using samtools, we observed only one single-base insertion in the PacBio-HGAP assembly. This method yields >99.99% concordance between PacBio-HGAP and Illumina reads.
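As a rough check of the figures above, the concordance can be worked out from the counts reported in the text. The sketch below makes two assumptions of ours: that the genome is exactly 2.2 Mb, and that concordance is defined as the fraction of compared positions at which the two data sets agree. The function name `concordance` is hypothetical.

```python
def concordance(genome_size, discordant_bases, uncovered_bases=0):
    """Fraction of compared positions at which the two data sets agree."""
    compared = genome_size - uncovered_bases
    if compared <= 0:
        raise ValueError("no positions left to compare")
    return 1 - discordant_bases / compared


# Counts taken from the text: one single-base insertion and 6 bases
# without MiSeq coverage on a genome assumed to be 2,200,000 bases.
strict = concordance(2_200_000, discordant_bases=1, uncovered_bases=6)
# Conservative variant: treat the uncovered bases as disagreements too.
conservative = concordance(2_200_000, discordant_bases=1 + 6)
print(f"strict: {strict:.4%}, conservative: {conservative:.4%}")
```

Under either convention the result stays above the 99.99% quoted above.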

An interesting particularity of the PacBio data is that we can visualize log phase growth of the bacteria using the coverage of PacBio reads. The parabolic shape of the coverage is due to the fact that we are sequencing the actual genomic material from the cell, with no amplification step that could introduce biases. The peaks in coverage occur at the start and end of the genome because the reference sequence starts at the origin of replication of the bacteria.

Figure 5 shows the PacBio performance in AT-rich regions, using the region defined by a red box in Figure 4. The variation in Illumina read coverage correlates with AT content levels and can be attributed to the PCR steps used in the protocol. Using only Illumina reads would likely result in poor assemblies in these regions, or even fragmentation. The PacBio read coverage (in purple), on the other hand, is insensitive to AT content and is very uniform.
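An AT-content track like the one used in Figure 5 can be approximated with a simple windowed scan. This is a minimal sketch under our own assumptions: the window size and step are illustrative choices (the note does not state how its track was computed), the 60% threshold matches the Figure 5 caption, and the function name `at_rich_windows` is hypothetical.

```python
def at_rich_windows(sequence, window=100, step=100, threshold=0.60):
    """Yield (start, end, at_fraction) for windows whose AT fraction exceeds threshold."""
    seq = sequence.upper()
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        at_fraction = (chunk.count("A") + chunk.count("T")) / window
        if at_fraction > threshold:
            yield start, start + window, at_fraction


# Toy sequence: an AT-only block, a GC-only block, and a mostly-AT block.
toy = "ATATATATAT" + "GCGCGCGCGC" + "AATTAATTGC"
for start, end, fraction in at_rich_windows(toy, window=10, step=10):
    print(start, end, fraction)
```

Comparing such windows against per-base read depth is one way to check whether coverage dips coincide with AT-rich stretches.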

Figure 4. Concordance with Illumina reads. PacBio read alignments are shown in purple. Positions where we found no coverage in Illumina reads are marked in black in the track named “no_cvg_in_MiSeq”. The yellow line highlights the shape of the coverage of PacBio reads.

Figure 5. Close up of an AT-rich region. This figure shows a 30Kb span of our HGAP-assembled S. agalactiae corresponding to the red box illustrated in Figure 4. The tracks shown are for greater than 3Kb PacBio read alignments (in purple), Illumina (MiSeq) read alignments (in green), a track showing AT content above 60% across the genome (in orange) and the positions with no coverage in Illumina (in black).


2799_GQ16 (05-14)


Client Management Office: 514 398-7211 [email protected]

Assistant Scientific Director: Alexandre Montpetit, PhD 514 398-3311 ext. 00913 [email protected]

Haïg Djambazian, Jessica Wasserscheid, Nikoleta Juretic, Geneviève Geneau, Patrick Willett, Pierre Bérubé, Gary Leveque, Julien Tremblay, Louis Létourneau, Alfredo Staffa, Rob Sladek, Alexandre Montpetit and Ken Dewar

McGill University and Génome Québec Innovation Centre; Montréal, Québec, Canada.

Acknowledgements

We would like to thank the following researchers for allowing us to use their isolates and data for this Technical Application Note:

Dr. Nahuel Fittipaldi, Public Health Ontario
Dr. Marcel Behr, McGill International TB Centre, McGill University Health Centre
Dr. Adrian Tsang, Center for Functional and Structural Genomics, Concordia University (Genozymes Project)
Dr. Peter Loewen, University of Manitoba
Dr. John Nash, Laboratory for Foodborne Zoonoses, Public Health Agency of Canada
Dr. Andrew Kropinski, Molecular & Cellular Biology, University of Guelph
Dr. Janet MacInnes, Dept. of Pathobiology, University of Guelph
Dr. Sadjia Bekal, Laboratoire de santé publique du Québec

References

[1] Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. http://www.nature.com/nmeth/journal/v10/n6/full/nmeth.2474.html
[2] Reducing assembly complexity of microbial genomes with single-molecule sequencing. http://genomebiology.com/2013/14/9/R101/abstract

Conclusion

Long PacBio reads are revolutionizing genome assemblies. With the HGAP pipeline, we now have a complete software solution regularly capable of producing closed and finished genomes with very limited or no additional manual work. The increase in read length and throughput allows easy assembly across even larger repeated elements. In practice, PacBio data will assemble most bacterial genomes into a single contig, along with their plasmids if present. However, for certain bacterial genomes, even though PacBio long reads still offer great improvements relative to short-read technologies, assembly remains limited if the repeated regions are too large [2]. Beyond bacterial assemblies, we are now also successfully assembling fungal genomes in the tens of megabases into complete chromosomes and have started work with even larger genomes in the hundreds of megabases.

The HGAP analysis pipeline is now used routinely at the Innovation Centre for various PacBio de novo assembly sequencing projects. It is offered as a service; we consider it an indispensable complement to PacBio sequencing for de novo assembly.

Evaluating PacBio Assembly Accuracy (Cont’d)

We have also applied HGAP to assemble genomes from many different organisms. Table 1 shows detailed statistics for the 8 different assemblies. HGAP often produces single-contig assemblies for bacterial chromosomes and additional contigs for plasmids if present. However, in some cases the repeated elements are still too large and lead to a fragmented assembly. This is illustrated by the Vibrio cholerae assembly, where we expected two chromosomes (1 Mb and 3 Mb) but obtained 22 contigs, the largest nearing 1Mb. The fragmentation here can be explained by the presence of repeats larger than 20Kb-30Kb, which PacBio reads are not long enough to span [2]. We also used HGAP to try to assemble a larger genome, in this case a 28Mb fungal genome. We obtained over 30 contigs (these fungi usually have between 6 and 8 chromosomes), which is a major improvement over previous Roche GS-FLX assemblies. In this case, increasing coverage closer to 100X would probably further improve the assembly.
