2nd (Next) Generation Sequencing 2/2/2018


Why do we want to sequence a genome?

- To see the sequence (assembly)
- To validate an experiment (insert or knockout)
- To compare to another genome and find variations (cancer, populations)

The problem: We cannot sequence the genome from start to end. We need to shear the DNA into smaller fragments and sequence the smaller pieces.

Sanger sequencing is slow and low-throughput: the first human genome took about 13 years.


Next Generation Sequencing (NGS) produces millions to billions of short reads in just a few days, at lower cost.

You can sequence your genome at 30X depth for 1000-1500 USD.

[Figure: sequencing instruments: Roche 454, Ion Torrent, Illumina HiSeq 2500]


Surya Saha, Boyce Thompson Institute, Ithaca, NY (BTI plant bioinformatics course)


Alex Sanchez, Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona


1. Reads come from molecule fragments
2. Read length is the same for an entire dataset (e.g. 101 bases long)
3. Either single or paired-end reads
4. Mate reads
5. Physical coverage and depth
6. Number of reads
7. Duplicates (PCR or sequence)
8. Dark matter (PCR cannot find repeats)

[Figure labels: fragment, pair 1, pair 2]

Lex Nederbragt, SeRC Nordic Assembly Workshop in Stockholm, Sweden, May 14th 2014
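To make the fragment and read-pair idea concrete, here is a minimal Python sketch of how a pair of reads relates to its fragment. It assumes the common forward-reverse layout, where read 1 is the start of the fragment and read 2 is read from the other end on the opposite strand. The fragment sequence and the function names are hypothetical, chosen only for illustration.

```python
from typing import Tuple

# Complement table for DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int) -> Tuple[str, str]:
    """Return (read 1, read 2) for one fragment.

    Assumption: forward-reverse orientation. Read 1 is the first
    `read_length` bases; read 2 is the reverse complement of the last
    `read_length` bases.
    """
    if read_length > len(fragment):
        raise ValueError("read_length cannot exceed the fragment length")
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2


fragment = "ACGTTGCAAGGCTTAACCGGTTAGC"  # hypothetical 25-base fragment
read1, read2 = paired_end_reads(fragment, read_length=8)
print(read1)  # first 8 bases of the fragment
print(read2)  # reverse complement of the last 8 bases
```

A single-end run would keep only `read1`. The unsequenced middle of the fragment is what makes physical coverage larger than read depth.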


[Figure: genome browser view of chr22:11M-12M with RepeatMasker and Gap tracks]


➔ Illumina sequencers can only sequence DNA fragments up to ~300nt long

➔ DNA must be size-selected, usually by gel cut

➔ ~200-300nt band cut, purified, prepared for sequencing

➔ Fragment length follows a normal distribution around target cut size


➔ Each sequencing run generates a certain # of total reads

➔ # of reads per sample ~= # total reads/number of samples

➔ # of reads for one sample: library size

➔ Can choose target library size for your instrument based on:

◆ Desired depth

◆ Desired coverage

For more see https://genohub.com/recommended-sequencing-coverage-by-application/
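As a rough illustration of choosing a target library size, the sketch below uses the standard relation mean depth = (number of reads × read length) / genome size, together with the per-sample split from the bullets above. The function names and the example numbers are assumptions for illustration, not values from the slides.

```python
import math


def reads_needed(genome_size: int, read_length: int, target_depth: float) -> int:
    """Smallest number of reads with reads * read_length / genome_size >= target_depth."""
    return math.ceil(target_depth * genome_size / read_length)


def reads_per_sample(total_reads: int, n_samples: int) -> int:
    """Approximate library size when a run is split evenly across samples."""
    return total_reads // n_samples


# Assumed example values: a 3.1 Gb genome, 101-base reads, 30X depth.
n_reads = reads_needed(genome_size=3_100_000_000, read_length=101, target_depth=30)
print(f"reads needed for 30X: {n_reads:,}")

# Assumed example: a run yielding 400 million reads shared by 8 samples.
print(f"reads per sample: {reads_per_sample(400_000_000, 8):,}")
```

For paired-end data each fragment contributes two reads, so the number of fragments to sequence is about half of `n_reads`.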


[Figure: single-end vs. paired-end reads]


● Question: “Given a read and a reference sequence, where, if anywhere, in the reference does the read sequence occur?”

● E.g. chr3:2,358,092-2,358,193

● More on this next lecture
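The mapping question can be sketched in its simplest form as exact substring search. This is only a naive illustration: real read mappers use an index of the reference and tolerate mismatches and indels, and they also search the reverse strand. The function name and the toy sequences are hypothetical.

```python
from typing import List


def find_occurrences(read: str, reference: str) -> List[int]:
    """Return the 0-based start positions where `read` matches `reference` exactly."""
    if not read:
        raise ValueError("read must not be empty")
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base further so overlapping matches are also reported.
        start = reference.find(read, start + 1)
    return positions


reference = "ACGTACGTTTACG"  # hypothetical toy reference
hits = find_occurrences("ACG", reference)
print(hits)  # an empty list would mean the read does not occur
```

A read that matches several positions, as here, is the small-scale version of the repeat problem: the read alone cannot tell which copy it came from.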


[Figure: mapped (aligned) reads over a genome locus]

Coverage: fraction of the genomic locus covered by at least one read

Depth: number of sequenced bases that map to a given location
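The two definitions can be computed directly from read alignments. The sketch below assumes each alignment is given as a 0-based, end-exclusive `(start, end)` interval on the locus; the interval values and function names are made up for illustration.

```python
from typing import List, Tuple


def depth_profile(locus_length: int, alignments: List[Tuple[int, int]]) -> List[int]:
    """Per-position depth: how many aligned reads overlap each position."""
    depth = [0] * locus_length
    for start, end in alignments:
        # Clip each alignment to the locus boundaries.
        for pos in range(max(start, 0), min(end, locus_length)):
            depth[pos] += 1
    return depth


def coverage(depth: List[int]) -> float:
    """Fraction of positions covered by at least one read."""
    return sum(1 for d in depth if d > 0) / len(depth)


alignments = [(0, 4), (2, 6), (8, 10)]  # hypothetical reads on a 10-base locus
depth = depth_profile(10, alignments)
print(depth)
print(coverage(depth))
print(sum(depth) / len(depth))  # mean depth over the locus
```

Here positions 6 and 7 have depth 0, so the locus has 80% coverage even though its mean depth is 1.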


➔ Illumina is now the most common sequencer.

➔ Its errors are roughly uniformly distributed (~0.1% per base) and are almost all substitutions; indels are rare.

➔ Older Illumina machines showed a drop in quality towards the end of the read.


➔ Fragment (insert) size follows a truncated normal distribution.

➔ Sequencing depth is defined by the number of fragments covering a base pair of the DNA, not the number of reads. Use "read depth" for the latter.

➔ Physical coverage is the amount of the genome expected to be covered. However, "coverage" is usually used to mean depth!

➔ Coverage is commonly modeled as a Poisson distribution with lambda = physical depth; a negative binomial fits better when the counts are overdispersed.

➔ Read length is a fixed number for Illumina reads. Error is usually higher toward the ends, hence trimming.
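Under the Poisson model, with lambda taken as the mean depth, the probability that a base has depth k is e^(-lambda) · lambda^k / k!, so the expected fraction of the genome with at least one read is 1 − e^(-lambda). The sketch below works through that arithmetic. The function names are illustrative, and the model ignores biases such as GC content and repeats.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(depth == k) when depth ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_covered_fraction(lam: float) -> float:
    """Expected fraction of bases with depth >= 1, i.e. 1 - P(depth == 0)."""
    return 1.0 - math.exp(-lam)


def prob_depth_below(threshold: int, lam: float) -> float:
    """P(depth < threshold): fraction of bases expected to fall short."""
    return sum(poisson_pmf(k, lam) for k in range(threshold))


for lam in (1, 5, 30):
    print(
        f"mean depth {lam:>2}X: "
        f"covered {expected_covered_fraction(lam):.4f}, "
        f"P(depth < 10) = {prob_depth_below(10, lam):.4g}"
    )
```

This is one way to see why a mean depth well above 1X is needed: at 1X, more than a third of the bases are expected to have no read at all.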


[Figure: good coverage vs. bad coverage]


- The machines output files containing short reads in FASTQ format.
- For each read there are 4 lines:

    @read_header comment
    Read_sequence
    +[read_header]
    Quality_string (in ASCII)

- Scores estimate the probability that a base is called incorrectly.
- Q30 means 99.9% accuracy.
- Reads are short, so we need a “reference sequence” to resolve where they come from (resequencing).
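The link between a quality score and an error probability is the Phred relation Q = −10 · log10(p), which gives p = 0.001 (99.9% accuracy) for Q30. The sketch below also decodes an ASCII quality string, assuming the Phred+33 offset used by current Illumina output; older files may use a different offset. The function names are illustrative.

```python
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality_string: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to Phred scores (assumed Phred+33)."""
    return [ord(char) - offset for char in quality_string]


scores = decode_quality("#<BF")
for char, q in zip("#<BF", scores):
    print(f"{char!r}: Q{q}, error probability {phred_to_error_prob(q):.5f}")
```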


@SRR1997412.1 1 length=125
NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCATTTGAACTCAGTTTGAACCTGTCTCTTATACACATCT
+SRR1997412.1 1 length=125
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<[...]
@SRR1997412.2 2 length=125
NTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTATCTTTCTTCTACTCACTTGCTTCTCAATTGAAAGAGCGGAAA
+SRR1997412.2 2 length=125

- "@": start new read
- "SRR1997412.1": unique read header
- "1 length=125": comments separated by space, could be anything
- Second line: sequence of the read
- "+": start quality line
- "SRR1997412.1 1 length=125" after the "+": repeat of read header and comment, not required
- Fourth line: quality sequence of the read, in ASCII
- "@SRR1997412.2 ...": next read
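The four-line layout annotated above can be read with a few lines of Python. This is a minimal sketch that assumes strict four-line records (no wrapped sequence lines) and uses two short, made-up reads rather than the example from the slide. `FastqRecord` and `read_fastq` are hypothetical names, not part of any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # unique read header (text after "@", up to the first space)
    comment: str   # free-form comment after the first space; may be empty
    sequence: str
    quality: str   # ASCII-encoded quality string, same length as sequence


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming 4 lines per read."""
    while True:
        header_line = handle.readline()
        if not header_line:
            return  # end of file
        header_line = header_line.rstrip("\r\n")
        if not header_line:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus_line = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header_line.startswith("@") or not plus_line.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header_line!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header_line!r}")
        header, _, comment = header_line[1:].partition(" ")
        yield FastqRecord(header, comment, sequence, quality)


# Two hypothetical 8-base reads; the second omits the optional repeat after "+".
example = io.StringIO(
    "@read1 1 length=8\nNTTGTAGC\n+read1 1 length=8\n#<<BBFFF\n"
    "@read2 2 length=8\nNTATTTAG\n+\n#<<BBFFF\n"
)
records = list(read_fastq(example))
for record in records:
    print(record.header, record.comment, len(record.sequence))
```

Checking the line count rather than searching for "@" matters because "@" is also a valid quality character and can appear at the start of a quality line.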


Quail, Michael A., et al. "A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers." BMC Genomics 13.1 (2012): 341.


➔ NCBI (https://www.ncbi.nlm.nih.gov/sra)

➔ Illumina basespace (https://basespace.illumina.com/home/index)

➔ Google genomics cloud (https://console.cloud.google.com/genomics/)

➔ Genome In A Bottle (GIAB) (http://jimb.stanford.edu/giab/)

➔ REPOSITIVE (https://discover.repositive.io/datasets/)

➔ GDC (https://portal.gdc.cancer.gov/)

➔ Seven Bridges (https://igor.sbgenomics.com/)


ART: WGS simulator

WGSIM: WGS simulator

PBSIM: PacBio simulator

See more on OMICtools (https://omictools.com/read-simulators-category)
