

Page 1: Fundamentals and Applications of Single Molecule Real-Time ... · Single-Molecule, Real-Time DNA Sequencing (SMRT) Is: ... Hierarchical Genome Assembly Process (HGAP) Chin CS., et

FIND MEANING IN COMPLEXITY

© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.

Fundamentals and Applications of Single Molecule Real-Time SMRT® Sequencing

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. Celera is a trademark of Celera Corporation; and HiSeq and MiSeq are trademarks of Illumina, Inc.

© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved.

CAT-AgroFood Plant Research International Workshop for PacBio Sequencing

March 26, 2014 Dr. Christoph König

Page 2

Single-Molecule, Real-Time DNA Sequencing (SMRT) Is:

• DNA Polymerase
• ZMW Confinement
• Phospholinked Nucleotides

Page 3

PacBio® RS II Typical Performance

Page 4

Read Definitions in RS System & SMRT® Analysis v2.0

SMRTbell™ Template

Polymerase Read

Definition:
• Formerly called “read”
• 1 pass
• With adapters
• 1 molecule, 1 polymerase read

Uses:
• QC of instrument run

Subreads

Definition:
• Adapters removed
• 1 pass
• 1 molecule, 1+ subread

Uses:
• Applications such as assembly and base modification

Read (of Insert)

Definition:
• The highest quality single sequence for an insert
• 1+ passes including partial passes
• 1 molecule, 1 read

Uses:
• Insert size distribution
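The relationship between a polymerase read and its subreads can be sketched in a few lines. This is only an illustration, not the SMRT Analysis implementation: it assumes the adapter locations are already known as (start, end) index pairs, and the function name `split_subreads`, the toy sequences, and the "NNNN" stand-in for an adapter are all hypothetical.

```python
from typing import List, Tuple


def split_subreads(polymerase_read: str,
                   adapter_spans: List[Tuple[int, int]]) -> List[str]:
    """Cut a polymerase read at known adapter spans and return the pieces.

    adapter_spans holds half-open (start, end) positions of each adapter
    within the polymerase read (an assumption of this sketch).
    """
    subreads = []
    cursor = 0
    for start, end in sorted(adapter_spans):
        if start > cursor:
            # Sequence between the previous adapter and this one is a subread.
            subreads.append(polymerase_read[cursor:start])
        cursor = max(cursor, end)
    if cursor < len(polymerase_read):
        # Trailing partial pass after the last adapter.
        subreads.append(polymerase_read[cursor:])
    return subreads


# Toy polymerase read: two full passes and one partial pass,
# separated by a hypothetical adapter written as "NNNN".
read = "ACGTAC" + "NNNN" + "GTACGT" + "NNNN" + "ACG"
spans = [(6, 10), (16, 20)]
subreads = split_subreads(read, spans)
print(subreads)
```

One molecule yields one polymerase read but one or more subreads, which matches the "1 molecule, 1+ subread" definition above.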

Page 5

Blue Pippin™ System for Size Selection

[Figure: comparison of a size-selected 20 kb Mouse Lemur library with a 20 kb AMPure® Mouse Lemur library. Legend: input gDNA; size-selected.]

Page 6

Most Uniform Coverage

• Ross et al. (2013) Characterizing and measuring bias in sequence data. Genome Biology, May 29;14(5):R51

“Pacific Biosciences coverage levels are the least biased”

Page 7

Detection of DNA Base Modifications by SMRT Sequencing

Flusberg et al. (2010) Nature Methods 7: 461-465

Page 8

Summary Sequence Performance

1. Long sequence reads

– Finish genomes, de novo assemblies

– Full-length cDNA sequencing

– Long-range haplotype phasing

2. High Consensus Accuracy

– >99.999% (QV50)

– Lack of systematic sequencing errors

3. Lack of sequence context bias

– GC content

– Low complexity sequence

4. Base modification detection

– Epigenome characterization
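The pairing of ">99.999%" with "QV50" follows from the standard Phred scale, where QV = -10 · log10(error probability). A minimal sketch of the conversion (the function names are chosen here for illustration):

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to accuracy (1 - error probability)."""
    return 1.0 - 10 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert an accuracy in (0, 1) to a Phred-scaled quality value."""
    return -10.0 * math.log10(1.0 - accuracy)


# QV50 corresponds to one expected error per 100,000 consensus bases.
print(f"QV50 accuracy: {qv_to_accuracy(50):.5%}")
print(f"99.999% as QV: {accuracy_to_qv(0.99999):.1f}")
```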

Page 9

De Novo Assembly

Page 10

Advantages of SMRT® Sequencing:

Impact of Long Read Lengths on De Novo Assembly

Koren S. et al. (2013) Reducing assembly complexity of microbial genomes with single molecule sequencing. Genome Biology, 14:R101

What can be achieved with infinite coverage given the read length?

PacBio

Page 12

SMRT® Sequencing:

Gold Standard for microbial De Novo Assembly

Page 13

Page 14

Progress of PacBio-Only De Novo Assembly

Timeline: 2013 to 2014

• Bacteria, 1-10 Mb: finished genomes
• Yeast, 12 Mb: resolve most chromosomes
• Spinach, 1 Gb: contig N50 531 kb
• Drosophila, 170 Mb: contig N50 4.5 Mb
• Arabidopsis, 120 Mb: contig N50 7.1 Mb
• Human (haploid), 3.2 Gb: contig N50 4.4 Mb, max = 44 Mb
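Contig N50, the metric quoted for each genome above, is the length L such that contigs of length L or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows; the contig lengths in the example are made up for illustration and do not come from any of these assemblies.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or read) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() needs at least one length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical contig lengths in kb.
example_contigs = [2, 3, 4, 5, 6]
print(n50(example_contigs))
```

The same function applies to read lengths, where the result is the read length at or above which half of the sequenced bases lie.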

Page 15

PacBio-Only Sequencing of Arabidopsis

Metric                    Short-read (Ler 1)*   PacBio reads (Ler-0)   Improvement
Est. Genome Size (Mb)     110.4                 124.6                  11.5%
Polished Contigs          4,662                 545                    8.5X
N50 Contig Length (Mb)    0.067                 6.36                   95X
Max Contig Length (Mb)    0.46                  13.21                  29X


• Original Col-0 strain assembly (Sanger + manual finishing)

• ~$70M, several years

• PacBio® data recently used to assemble Ler-0 strain

*http://1001genomes.org/data/MPI/MPISchneeberger2011/releases/current/

Page 16

SNP Discovery with PacBio® Assemblies

[Figure: Venn diagrams of SNP calls from Illumina paired-end (ILMN PE) reads and from PacBio assemblies, for Ler0 and for Cvi]

Ler0 (Ler0 ILMN PE vs. PacBio Ler0 Assembly): 509,836 (95%/68%); 27,106; 238,637
Cvi (Cvi ILMN PE vs. PacBio Cvi Assembly): 685,104 (92%/72%); 55,947; 271,335

Called SNPs between Cvi and Col

Mapping of ILMN PE or PacBio Assembly to TAIR 10

Discovery of single nucleotide polymorphism by PacBio assemblies

Mapping of ILMN PE to PacBio Assembly

• Ler0 PE – Ler0 Assembly: 885 homozygous SNPs
• Cvi PE – Cvi Assembly: 838 homozygous SNPs
• SNP frequency: 7.5 x 10^-6

These SNPs are highly enriched in peri-centromeric regions and are associated with aberrantly high coverage.
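The paired percentages on this slide (95%/68% and 92%/72%) can be reproduced from the counts if each large number is read as the overlap of the two call sets and the two smaller numbers as the calls exclusive to each set. That grouping is an assumption made for this sketch; the transcript does not say which exclusive count belongs to which platform, so the code avoids naming them.

```python
def overlap_percent(shared: int, exclusive: int) -> int:
    """Percent of one call set that is also present in the other call set."""
    return round(100 * shared / (shared + exclusive))


# (shared calls, smaller exclusive count, larger exclusive count).
# The grouping of counts into diagrams is assumed, not stated on the slide.
venn_counts = {
    "Ler0": (509_836, 27_106, 238_637),
    "Cvi": (685_104, 55_947, 271_335),
}

percentages = {
    name: (overlap_percent(shared, small), overlap_percent(shared, large))
    for name, (shared, small, large) in venn_counts.items()
}

for name, (high, low) in percentages.items():
    print(f"{name}: {high}%/{low}%")
```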

Page 17

SNP Discovery with PacBio® Assemblies


PacBio assembly identifies SNPs in Illumina low-coverage (unmappable) regions

Called SNPs between Cvi and Col

[Figure legend: Both; Illumina only; PacBio only]

Analysis by Jason Chin

Page 20

Long-Read Shotgun Human Genome Data Release


• 54x coverage of CHM1 cell line
• Avg SMRT® Cell throughput: 608 Mb
• Avg DNA insert length: 7,680 bp
• Half of sequenced bases in reads greater than: 10,739 bp
• Longest DNA insert sequenced: 42,774 bp
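The coverage and throughput figures above imply a rough number of SMRT Cells for the data set. The sketch below is back-of-the-envelope arithmetic only: the 3.2 Gb haploid human genome size is taken from the assembly progress slide, and the slides themselves do not state how many cells were actually run.

```python
import math


def smrt_cells_needed(coverage: float,
                      genome_size_bp: float,
                      throughput_bp_per_cell: float) -> int:
    """Estimate the number of SMRT Cells needed to reach a target coverage."""
    total_bases = coverage * genome_size_bp
    return math.ceil(total_bases / throughput_bp_per_cell)


# 54x coverage, assumed 3.2 Gb genome, 608 Mb average throughput per cell.
cells = smrt_cells_needed(54, 3.2e9, 608e6)
print(cells)
```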


Page 21

Human Genome De Novo Assemblies Comparison

[Figure: bar chart of contig N50 (kb) for each assembly, 2007 to 2014]

Assembly        Year  Technology            Assembly method           Library types  Size (Gb)  Contig N50 (kb)
HuRef (Venter)  2007  ABI 3730              Celera Assembler          4              2.78       107
BGI YH          2009  Illumina GA           SOAP de novo              5              2.46       7.4
KB1             2010  454 GS FLX Titanium   Newbler                   2              2.79       5.5
NA12878         2010  Illumina GA           ALLPATHS-LG               5              2.82       24
RP11_0.7        2013  454 GS, HiSeq, MiSeq  Newbler                   3              2.81       127
CHM1            2013  HiSeq, BAC clones     Reference Guided          NA             2.83       144
CHM1            2014  PacBio RS II          FALCON, Celera Assembler  1              3.25       4378

Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/early/2010/12/20/1017351108.abstract Table3); CHM1 (http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/)

Page 22

Comparison of Human CHM1 Assemblies

[Figure: 2014 PacBio® de novo assembly compared with the 2013 reference-guided short-read assembly with BACs. Annotations: gaps; MHC region; 44 Mb contig.]

Page 23

The Next Challenge: Assembling Diploid Genomes

Developing bioinformatics and visualization tools to resolve diploid genomes

Early assembly result for the Ler-0 + Col-0 “synthetic” diploid

Watch Jason Chin’s 2014 AGBT presentation “String Graph Assembly for Diploid Genomes with Long Reads”

Page 24

Benefits of PacBio® Sequencing for Large Genomes

• PacBio data complements short reads to improve new and existing de novo assemblies
• Improve N50 contig length even with modest 5x coverage
• Scaffold PacBio long reads to set framework for genome completion
• Resolve troublesome gaps with low-complexity and repetitive genomic regions
• Catalog transposable elements
• Conduct gene-specific surveys


Page 25

PacBio® Isoform Sequencing of Full-length Transcripts

Page 26

Transcript Diversity

Page 27

Current State of Transcript Assembly

“The way we do RNA-seq now is… you take the transcriptome, you blow it up into pieces and then you try to figure out how they all go back together again… If you think about it, it’s kind of a crazy way to do things”

Michael Snyder, Professor and Chair of Genetics, Stanford University

Tal Nawy, End to end RNA Sequencing, Nature Methods, v10, n10, Dec. 2013, p1144–1145

Ian Korf (2013) Genomics: the state of the art in RNA-seq analysis, Nature Methods, Nov 26;10(12):1165-6. doi: 10.1038/nmeth.2735.

Page 28

SampleNet: Iso-Seq Method with Clontech cDNA Synthesis Kit

PacBio’s Iso-Seq™ Method for High-quality, Full-length Transcripts

Experimental Pipeline

1. PolyA mRNA
2. cDNA synthesis with adapters
3. Size partitioning & PCR amplification
4. SMRTbell™ ligation
5. PacBio® RS II sequencing

Informatics Pipeline

1. PacBio raw sequence reads
2. Remove adapters; remove artifacts
3. Clean sequence reads
4. Reads clustering
5. Isoform clusters
6. Consensus calling
7. Nonredundant transcript isoforms
8. Quality filtering
9. Final isoforms
10. Map to reference genome
11. Evidence-based gene models

[Figure 1: (a) experimental pipeline; (b) informatics pipeline. Read structure labels: SMRT adapter, 5’ primer, coding sequence, polyA tail (AAA)n with complementary (TTT)n, 3’ primer, SMRT adapter, Reads of Insert.]

DevNet: Iso-Seq wiki page
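The "remove adapters / remove artifacts" step of the informatics pipeline depends on recognizing reads that span a whole transcript: 5’ primer, coding sequence, polyA tail, 3’ primer. The sketch below illustrates that idea only and is not the Iso-Seq software. The primer sequences are placeholders invented for the example (not the real kit primers), matching is exact, and reverse-complement reads are ignored; a real pipeline would need to tolerate sequencing errors and both orientations.

```python
import re
from typing import Optional

# Placeholder primer sequences, invented for this example.
FIVE_PRIME = "AAGCAGTGGT"
THREE_PRIME = "GTACTCTGCG"
# A run of at least 8 A's at the end of the trimmed read counts as a polyA tail
# (the threshold is an arbitrary choice for this sketch).
POLY_A = re.compile(r"A{8,}$")


def trim_full_length(read: str) -> Optional[str]:
    """Return the transcript sequence if the read looks full-length, else None."""
    if not (read.startswith(FIVE_PRIME) and read.endswith(THREE_PRIME)):
        return None  # one or both primers missing: not full-length
    body = read[len(FIVE_PRIME):len(read) - len(THREE_PRIME)]
    tail = POLY_A.search(body)
    if tail is None:
        return None  # no polyA tail before the 3' primer
    return body[:tail.start()]


transcript = "ATGGCCTTGC"
full_length = FIVE_PRIME + transcript + "A" * 12 + THREE_PRIME
print(trim_full_length(full_length))
```

Reads that pass such a filter would then feed the clustering and consensus-calling steps listed above.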

Page 30

“Gene Identification, Even in Well-Characterized Human Cell Lines and Tissues, is Likely Far From Complete”

Au et al. (2013) Characterization of the human ESC transcriptome by hybrid sequencing. PNAS doi: 10.1073/pnas.1320101110.

8,048 RefSeq-annotated, full-length isoforms and 5,459 predicted isoforms

“Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified”

Page 31

ABRF NGS RNA-Seq Comparative Study:

Iso-Seq™ Application provides Most Uniform 5’ to 3’ Coverage

Page 33

PacBio® Sequences Used for Gene Model Validation in Lettuce

[Figure: gene model confidence without PacBio reads and including PacBio reads]

Additional ~5000 gene models validated

PAG 2014, Marilena Christopoulou “Targeted transcriptome analysis using PacBio sequencing to dissect multi-gene families encoding NBS-LRR resistance proteins in lettuce”

Page 34

PacBio® Iso-Seq Data Used to Confirm Predicted Scaffolds in Norway Spruce Genome

PAG 2014: Yao-Cheng Lin “PacBio cDNA sequencing of Norway spruce”

14 SMRT® Cells of PacBio data using early chemistry & protocols

Page 36

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.