2nd (Next) Generation Sequencing
2/2/2018
Why do we want to sequence a genome?
- To see the sequence (assembly)
- To validate an experiment (insert or knockout)
- To compare to another genome and find variations (cancer, populations)
The problem: We cannot sequence the genome from start to end. We need to shear the DNA into smaller fragments and sequence the smaller pieces.
Sanger sequencing is slow and not high throughput: 13 years for a human genome.
Next Generation Sequencing (NGS) produces millions to billions of short reads in just a few days, at a lower cost.
You can sequence your genome at 30X depth for 1000-1500 USD.
Roche 454 Ion Torrent
Illumina HiSeq 2500
Surya Saha, Boyce Thompson Institute, Ithaca, NY (BTI plant bioinformatics course)
Alex Sanchez, Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona
1. Reads come from molecule fragments
2. Read length is the same for an entire dataset (e.g. 101 bases long)
3. Either single or paired-end reads
4. Mate reads
5. Physical coverage and depth
6. Number of reads
7. Duplicates (PCR or sequence)
8. Dark matter (PCR cannot find repeats)
fragment, pair 1, pair 2
Lex Nederbragt, SeRC Nordic Assembly Workshop in Stockholm, Sweden, May 14th 2014
chr22:11M-12M
RepeatMasker, Gap
➔ Illumina sequencers can only sequence DNA fragments up to ~300nt long
➔ DNA must be size-selected, usually by gel cut
➔ ~200-300nt band cut, purified, prepared for sequencing
➔ Fragment length follows a normal distribution around target cut size
➔ Each sequencing run generates a certain # of total reads
➔ # of reads per sample ~= # total reads/number of samples
➔ # of reads for one sample: library size
➔ Can choose target library size for your instrument based on:
◆ Desired depth
◆ Desired coverage
For more see https://genohub.com/recommended-sequencing-coverage-by-application/
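As a rough sketch of how a target library size follows from a desired depth: mean depth is approximately (number of reads × read length) / genome size, so the number of reads needed is about depth × genome size / read length. The function names and the example numbers below (a 3.1 Gb genome, 101-base reads) are illustrative assumptions, not values tied to any particular instrument.

```python
import math

def reads_needed(target_depth, genome_size, read_length, paired=False):
    """Approximate number of reads (or read pairs) for a target mean depth.

    Uses depth ~= n_reads * read_length / genome_size and ignores
    duplicates, unmapped reads and uneven coverage.
    """
    bases_per_unit = read_length * (2 if paired else 1)
    return math.ceil(target_depth * genome_size / bases_per_unit)

def reads_per_sample(total_reads, n_samples):
    """Library size when a run is split evenly across samples."""
    return total_reads // n_samples

# Illustrative numbers only: 30X over a 3.1 Gb genome with 101-base reads.
n_single = reads_needed(30, 3_100_000_000, 101)
n_pairs = reads_needed(30, 3_100_000_000, 101, paired=True)
print(n_single, n_pairs)
```

In practice the target is usually padded upward, since some reads are duplicates or fail to map.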
Single End
Paired End
● Question: “Given a read and a reference sequence, where, if anywhere, in the reference does the read sequence occur?”
● E.g. chr3:2,358,092-2,358,193
● More on this next lecture
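To make the question concrete, here is a minimal sketch that looks for exact occurrences of a read in a reference string. It is only an illustration of the problem statement: real aligners use indexed data structures and tolerate mismatches and gaps. The function name and the toy sequences are made up for this example.

```python
def find_read(reference, read):
    """Return every 0-based start position where `read` occurs exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping hits
    return positions

reference = "ACGTACGTTAGCACGT"  # toy reference, not real data
print(find_read(reference, "ACGT"))  # positions where the read matches
print(find_read(reference, "GGGG"))  # empty list: the read does not occur
```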
Mapped or aligned reads
Genome locus
Coverage: fraction of genomic locus covered by at least one read
Depth: number of sequenced bases that map to a given location
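The two definitions above can be sketched on a toy locus. The code assumes each aligned read is given as a (start, end) pair in 0-based, end-exclusive coordinates relative to the locus; the function names and the three example alignments are hypothetical.

```python
def depth_profile(locus_length, alignments):
    """Per-base depth over a locus from (start, end) alignments.

    Coordinates are 0-based, end-exclusive, relative to the locus start.
    """
    depth = [0] * locus_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, locus_length)):
            depth[pos] += 1
    return depth

def coverage_fraction(depth):
    """Fraction of positions covered by at least one read."""
    return sum(1 for d in depth if d > 0) / len(depth)

def mean_depth(depth):
    """Average number of reads covering a position."""
    return sum(depth) / len(depth)

# Toy example: a 10-base locus with three aligned reads.
depth = depth_profile(10, [(0, 4), (2, 6), (3, 7)])
print(depth)
print(coverage_fraction(depth), mean_depth(depth))
```

Here 7 of the 10 positions have at least one read, so coverage is 0.7, while depth varies from 0 to 3 along the locus.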
➔ Illumina is now the most common sequencer.
➔ Its errors are roughly uniformly distributed (~0.1%) and are almost exclusively substitutions (indels are rare).
➔ Older Illumina machines had a drop in quality towards the end of the read.
➔ Fragment (insert) size follows a truncated normal distribution.
➔ Sequencing depth is defined by the number of fragments covering a base pair of the DNA, not the number of reads. Use "read depth" for the number of reads.
➔ Physical coverage is the amount of the genome expected to be covered. However, "coverage" is usually used to mean depth!
➔ Coverage approximately follows a Poisson distribution with lambda = physical depth; a negative binomial fits better when coverage is overdispersed.
➔ Read length is a fixed number for Illumina reads. Error is usually higher toward the ends of reads, so the ends are often trimmed.
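Under the Poisson assumption, a few useful quantities follow directly from lambda (the mean depth): the chance that a base has depth 0 is e^(-lambda), so the expected covered fraction is 1 - e^(-lambda). The sketch below uses only the standard library; the function names and the depth threshold of 10 are illustrative choices.

```python
import math

def poisson_pmf(k, lam):
    """P(depth == k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def expected_covered_fraction(lam):
    """Expected fraction of bases with depth >= 1, i.e. 1 - P(depth == 0)."""
    return 1.0 - math.exp(-lam)

def prob_depth_below(threshold, lam):
    """P(depth < threshold): how much of the genome is thinly covered."""
    return sum(poisson_pmf(k, lam) for k in range(threshold))

for lam in (1, 5, 30):
    print(lam, expected_covered_fraction(lam), prob_depth_below(10, lam))
```

Real data are usually more dispersed than this model predicts (GC bias, repeats, duplicates), so these numbers are best read as optimistic estimates.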
Good coverage Bad coverage
- The machines output files containing short reads in FASTQ format.
- For each read there are 4 lines:
@read_header comment
Read_sequence
+[read_header]
Quality_string (in ASCII)
- Scores estimate the probability that a base is called incorrectly.
- Q30 means 99.9% accuracy.
- Reads are short, we need a “reference sequence” to resolve where they come from (resequencing).
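The relation between a quality score and the error probability is p = 10^(-Q/10), which is why Q30 corresponds to a 0.1% error rate (99.9% accuracy). The sketch below decodes an ASCII quality string under the assumption that the file uses the Phred+33 offset, which is common for current Illumina output but not stated in these slides; the function names are made up for illustration.

```python
def phred_to_error_prob(q):
    """Probability that a base with Phred score q is called incorrectly."""
    return 10 ** (-q / 10)

def decode_quality(quality_string, offset=33):
    """Convert an ASCII quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

scores = decode_quality("#<BF")
print(scores)
print([phred_to_error_prob(q) for q in scores])
```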
@SRR1997412.1 1 length=125
NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCATTTGAACTCAGTTTGAACCTGTCTCTTATACACATCT
+SRR1997412.1 1 length=125
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<[…]
@SRR1997412.2 2 length=125
NTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTATCTTTCTTCTACTCACTTGCTTCTCAATTGAAAGAGCGGAAA
+SRR1997412.2 2 length=125
- "@": start new read
- "SRR1997412.1": unique read header
- "1 length=125": comments separated by space, could be anything
- Second line: sequence of the read
- "+": start quality line
- Text after "+": repeat read header and comment, not required
- Fourth line: quality sequence of the read, in ASCII
- "@SRR1997412.2": next read
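The record layout described above can be read with a few lines of code. This is a minimal sketch that assumes every record is exactly four lines (no wrapped sequences) and splits the header into a read ID and a free-form comment at the first space. The function name and the short example record are made up for illustration; for real work an established FASTQ parser is the safer choice.

```python
def parse_fastq(lines):
    """Yield (read_id, comment, sequence, quality) from 4-line FASTQ records.

    Assumes each record is exactly four lines (no wrapped sequences).
    """
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Header is "@<read_id> <comment>"; the comment may be empty.
        read_id, _, comment = header[1:].partition(" ")
        yield read_id, comment, sequence, quality

# Made-up record for illustration (not taken from SRR1997412).
example = [
    "@read1 1 length=8",
    "NTTGTAGC",
    "+read1 1 length=8",
    "#<<BBFFF",
]
for record in parse_fastq(example):
    print(record)
```

Because the quality string can itself contain "@" and "+", counting lines in groups of four is more reliable than scanning for those characters.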
Quail, Michael A., et al. "A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers." BMC Genomics 13.1 (2012): 341.
➔ NCBI (https://www.ncbi.nlm.nih.gov/sra)
➔ Illumina basespace (https://basespace.illumina.com/home/index)
➔ Google genomics cloud (https://console.cloud.google.com/genomics/)
➔ Genome In A Bottle (GIAB) (http://jimb.stanford.edu/giab/)
➔ REPOSITIVE (https://discover.repositive.io/datasets/)
➔ GDC (https://portal.gdc.cancer.gov/)
➔ Seven Bridges (https://igor.sbgenomics.com/)
ART: WGS simulator
WGSIM: WGS simulator
PBSIM: PacBio simulator
See more on OMIC tools (https://omictools.com/read-simulators-category)