Dealing with ‘raw reads’
Analysis of Next-Generation Sequencing Data

Friederike Dündar

Applied Bioinformatics Core

Slides at https://bit.ly/2T3sjRg (https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/schedule_2020/)

January 28, 2020

F. Dündar (ABC, WCM) Dealing with ‘raw reads’ January 28, 2020

1 Fluorescence-based microscopy

2 Single and paired-end reads

3 Illumina’s “raw reads”

4 Quality control of sequencing reads

5 Sequence Read Archive

6 References

Fluorescence-based microscopy

Re-cap: Sequencing by synthesis after library preparation

The number of sequencing cycles determines the read length. One cycle consists of: (1) incorporate fluor-dNTP, (2) detect, (3) deblock, (4) cleave fluor.

Fluorophores and fluorescence detection

Fluorophores: molecules that re-emit light upon absorption of light.

Fluorescence microscopes separate emitted light (dim) from excitation light (bright).

See Sanderson et al. [2014] for an overview of fluorescence microscopy techniques (not just DNA-sequencing-related).

Single and paired-end reads

Types of reads

Single reads are cheaper. (why?)

Paired-end (PE) reads are helpful for:

- alignment along repetitive regions
- chromosomal rearrangements and gene fusion detection
- de novo genome and transcriptome assembly
- precise information about the size of the original fragment (insert size)
- PCR duplicate identification

Paired-end read generation

Illumina’s “raw reads”

Illumina’s read output: turning images into text files

TIFF images → BCL files → FASTQ files

BCL files:

- base call files (binary files)
- during sequencing, base calls for every location of the flow cell are added live for every cycle

FASTQ files:

- base calls are gathered per read rather than per cycle
- reads are sorted into different files per sample as identified by the barcodes (demultiplexing)

All steps here are performed by Illumina’s proprietary CASAVA software.

The file name usually includes some information about the sample: <sample name>_<barcode sequence>_L<lane>_R<read number>_<set number>.fastq.gz, e.g. MyExperiment_AGCTTGTTC_L001_R1_001.fastq.gz
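As a rough illustration of the naming scheme above, the following Python sketch splits such a file name into its parts. The regular expression and the function name `parse_fastq_name` are assumptions made for this example; they are not part of CASAVA, and file names from other pipelines or repositories may well not match.

```python
import re

# Pattern for the naming scheme described above (an assumption for this sketch):
# <sample name>_<barcode sequence>_L<lane>_R<read number>_<set number>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+)_L(?P<lane>\d+)"
    r"_R(?P<read>\d)_(?P<set>\d+)\.fastq\.gz$"
)

def parse_fastq_name(filename):
    """Return the name components as a dict, or None if the name does not match."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None

info = parse_fastq_name("MyExperiment_AGCTTGTTC_L001_R1_001.fastq.gz")
print(info)
```

All values are returned as strings, so lane "001" keeps its leading zeros.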

The FASTQ format: FASTA + quality score

1 read = 4 lines

1. @Read ID and sequencing run information
2. sequence
3. + (additional description possible; usually an empty line)
4. quality scores
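The four-line layout can be read with a few lines of Python. This is a minimal sketch, assuming that every record occupies exactly four lines (no wrapped sequences); the function name `read_fastq` and the two example reads are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from an open FASTQ text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality

# Made-up two-read example, not taken from the slides.
example = io.StringIO(
    "@read1\nGATTACA\n+\nIIIIHHG\n"
    "@read2\nACGTN\n+\nII5#!\n"
)
records = list(read_fastq(example))
print(records)
```

For compressed files, the same function should work on a handle opened with `gzip.open(path, "rt")`.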

The read ID line is standardized by Casava 1.8

CAUTION: This will only be true if you receive FASTQ files fresh off the sequencer. If you download FASTQ files from public repositories, the read ID might have been changed significantly.

See https://en.wikipedia.org/wiki/FASTQ_format
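The figure for this slide did not survive the transcript. As a stand-in, here is a small sketch that splits a Casava 1.8-style read ID into named fields. The field names and the example ID follow the Wikipedia page referenced above, not the slide itself; the class `ReadID` and the function `parse_read_id` are hypothetical helpers. As the caution says, IDs from public repositories will often not have this shape.

```python
from typing import NamedTuple

class ReadID(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int            # member of a pair: 1 or 2
    is_filtered: str     # "Y" if the read was filtered out, "N" otherwise
    control_number: int
    index_sequence: str

def parse_read_id(line):
    """Split a Casava 1.8-style read ID line into its named fields."""
    location, description = line.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, is_filtered, control, index = description.split(":")
    return ReadID(instrument, int(run), flowcell, int(lane), int(tile),
                  int(x), int(y), int(read), is_filtered, int(control), index)

# Example ID as given on the Wikipedia FASTQ page.
rid = parse_read_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(rid)
```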

The quality scores: summarizing numerical scores into single-character representations

Illumina’s CASAVA pipeline:

- BCL files: base calls (A/C/T/G) are immediately recorded with an error probability (see the QC section for reasons for base call uncertainties).
- The error probabilities are translated into ASCII symbols in the FASTQ files.

ASCII symbols

www.ascii-code.com

ASCII encodes 128 specified characters into seven-bit integers, which is useful for digital communication.

The first 33 characters (codes 0 to 32) are unprintable control codes (e.g. “Start of Text”) plus the space character; therefore the Phred scores were originally encoded by using an offset of +33 (⇒ “!”).
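To make the offset concrete, the sketch below converts between error probabilities, Phred scores and ASCII symbols. It assumes the usual Phred definition, Q = -10 · log10(p), which the slides refer to but do not spell out; the function names are made up for this example.

```python
import math

OFFSET = 33  # Phred+33: a score of 0 is written as "!"

def error_prob_to_phred(p):
    """Q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(p))

def phred_to_char(q, offset=OFFSET):
    """Encode a Phred score as a single ASCII character."""
    return chr(q + offset)

def char_to_phred(c, offset=OFFSET):
    """Decode a single ASCII character back into a Phred score."""
    return ord(c) - offset

def phred_to_error_prob(q):
    """Invert the Phred definition: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

for p in (0.1, 0.01, 0.001, 0.0001):
    q = error_prob_to_phred(p)
    print(p, q, phred_to_char(q))
```

With this encoding, a base call with a 1-in-1000 error probability has a score of 30 and is written as "?".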

Printable ASCII symbols start at 33

Different offsets have been used by different Casava versions

Both the range of the base call score as well as its translation via the ASCII code (offset) are somewhat arbitrary and have undergone numerous changes.

Today’s standard:

min. score: 0

max. score: 41

ASCII offset: 33

Make sure you know which version you’re dealing with.
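One way to check which version you are dealing with is to look at the range of quality characters in the file. The heuristic below is only a sketch: it assumes that the two candidates are the current offset of 33 and the older offset of 64 (as listed on the Wikipedia FASTQ page), and the cut-offs 59 and 74 are derived from those ranges. It can remain undecided when only a few high-quality reads are inspected.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from observed quality characters.

    Returns None if the observed range is compatible with both encodings.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:    # below the lowest symbol used by the +64 encodings
        return 33
    if highest > 74:   # above "J", i.e. above score 41 with offset 33
        return 64
    return None

print(guess_offset(["II5#!"]))   # contains "!" and "#"
print(guess_offset(["hhhhgf"]))  # contains symbols above "J"
print(guess_offset(["IIII"]))    # ambiguous
```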

Quality control of sequencing reads

Two basic QC questions

1. Did our library prep generate a faithful representation of the DNA/RNA molecules of our samples?
   - ideally, the entire universe of nucleotides was captured (diverse library)
   - no contaminations
   - no degradation
   - no bias towards fragments of certain GC contents and/or sizes
2. How successful was the actual sequencing?
   - consistently high base call confidence
   - uniform nucleotide frequencies

Biases: QC should help identify systematic distortions of data and their possible sources.

FastQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc

- unpublished, but most widely used QC tool
- supports all NGS technologies
- continuously developed and maintained by long-time bioinformatics experts
- will only use the first 200K reads for the diagnosis!

Sequencing quality

Based on ASCII-encoded Phred scores within the FASTQ file.
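The kind of per-cycle summary behind such a quality plot can be sketched as follows. This is a simplification for illustration only: it computes just the mean Phred score per read position, assumes an ASCII offset of 33, and is not FastQC's implementation.

```python
def mean_quality_per_cycle(quality_strings, offset=33):
    """Return the mean Phred score at each read position (cycle)."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, char in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]

# Made-up quality strings of unequal length.
quals = ["IIII", "II5", "I5!!"]
means = mean_quality_per_cycle(quals)
print(means)
```

Reads of different lengths are handled by counting, for every position, only the reads that are long enough to cover it.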

Sequencing quality: reasons for sequencing noise

Noise = fluorophore intensity signal is not as strong and clear as expected.

- laser not well calibrated
- interfering signals from neighbouring clusters or bases with similar emission spectra
- unsynchronized fragments in each cluster:
  - phasing: small fraction of fragments in each cluster fails to incorporate any base
  - prephasing: more than one base is incorporated
- decaying chemicals (runs often last several days to a week!)
- extraneous objects on the flow cell (e.g. dust, air bubbles)

Physically localized error rates: tiles vs. time

Contaminations: threats to the full representation of our original fragment pool (and waste of seq. reads)

Sources:

- primer contamination
- adapter contamination
  - sequence read length larger than the fragment size (3’ contamination)
  - adapter dimers without insert
- DNA from other species/libraries

Consequences:

- noise
- reduced alignment rates

Can be identified by examining sequence composition and overrepresented sequences/k-mers.
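Spotting overrepresented sequences boils down to counting identical reads and reporting those that make up an unexpectedly large share. The sketch below does this naively; the function name and the `min_fraction` parameter are assumptions for this example, and real QC tools differ in detail (e.g. by subsampling or truncating long reads).

```python
from collections import Counter

def overrepresented(sequences, min_fraction=0.001):
    """Return (sequence, count, fraction) for sequences at or above min_fraction."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]

# Toy example: 10 reads, one sequence makes up 60% of them.
reads = ["ACGT"] * 6 + ["TTTT"] * 3 + ["GGCA"]
hits = overrepresented(reads, min_fraction=0.2)
print(hits)
```

Sequences reported this way can then be compared against known adapter and primer sequences.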

Detecting contaminations

Per Base Sequence Content

If the fragments are a random and diverse representation of the entire genome, there should be a uniform distribution of all four bases across all cycles.
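The numbers behind a per-base sequence content plot are simply the base fractions at each read position. A minimal sketch (the function name is made up; this is not FastQC's code):

```python
from collections import Counter

def base_content_per_cycle(sequences):
    """Return one dict per read position with the fraction of A, C, G, T and N."""
    counters = []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counters):  # first read that reaches this position
                counters.append(Counter())
            counters[i][base] += 1
    return [
        {base: counter[base] / sum(counter.values()) for base in "ACGTN"}
        for counter in counters
    ]

# Toy example with a strong bias at the first position.
content = base_content_per_cycle(["ACGT", "AGGT", "ATGA", "ACG"])
for position, fractions in enumerate(content, start=1):
    print(position, fractions)
```

In an unbiased library the fractions would stay roughly constant from one position to the next; here position 1 is 100% A, which is the kind of pattern the plot is meant to reveal.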

Per Base Sequence Content – more examples

- irregularities in the first ca. 8 bp are often seen for RNA-seq and ATAC-seq and indicate a bias for certain sequences at the fragment beginning
- more severe deviations from uniformity often indicate contaminations and/or lack of library diversity

Overrepresented sequences & adapter sequence frequencies

Trimming contaminations & low-quality bases

- Mostly done to improve alignment.
- Can be done before alignment or, if contaminations/low-quality bases are low in number, might be left to the “soft-clipping” function of read aligners (i.e., ignoring mismatched bases at the beginning/end of a read).
- There are numerous tools out there to do the job, e.g. Cutadapt [Martin, 2011] and TrimGalore.
- For de novo assemblies, it is probably more meaningful to perform some error-correction based on overlapping reads rather than trimming the reads [Salzberg et al., 2012, Yang et al., 2013].
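To illustrate what trimming does to a read, here is a deliberately naive sketch. It only finds exact adapter matches and cuts low-quality bases from the 3' end one by one. Dedicated tools such as Cutadapt tolerate mismatches and use more refined quality-trimming rules, so this is not how they are implemented. The function names, the `min_overlap` and `min_q` defaults, and the example read are assumptions; the adapter string is the commonly used start of the Illumina adapter sequence.

```python
def trim_adapter(seq, qual, adapter, min_overlap=3):
    """Remove a 3' adapter: a full occurrence, or a prefix of it at the read end."""
    pos = seq.find(adapter)
    if pos == -1:
        # Look for a partial adapter hanging off the 3' end, longest first.
        for overlap in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:overlap]):
                pos = len(seq) - overlap
                break
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]

def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Cut bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter prefix
full = trim_adapter("ACGTACGT" + ADAPTER + "TTT", "I" * 24, ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", "I" * 13, ADAPTER)
tail = trim_low_quality_tail("ACGTACGT", "IIIII5#!")
print(full, partial, tail)
```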

Duplicate reads: types

- optical duplicates (same DNA cluster erroneously reported as separate clusters)
- natural duplicates (multiple independent original fragments with very similar sequence)
  - more likely to occur for small(ish) genomes/transcriptomes and experiments that enrich for relatively few and small regions of the genome
- PCR duplicates (1 original fragment)
  - often sample-specific and very difficult to correct in silico
  - can be reduced by avoiding excessive PCR

The Problem: There is no way to distinguish natural from PCR duplicates!

Duplicate reads: FastQC assessment

Proportion of reads (y-axis) that contain sequences in each of the different duplication level bins (x-axis).

- Blue line: all reads (= first 100K!) – how many times are individual sequences found?
- Red line: sequences after de-duplication – how many different sequences were found to be duplicated?
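The two lines can be mimicked with a simple count. For every duplication level, the sketch below reports the share of all reads at that level (the blue line) and the share of distinct sequences at that level (the red line). It is a simplified illustration that uses exact levels and all reads, whereas FastQC bins the levels and subsamples the file; the function name is made up.

```python
from collections import Counter

def duplication_levels(sequences):
    """Return {level: (fraction of all reads, fraction of distinct sequences)}."""
    per_sequence = Counter(sequences)           # sequence -> number of copies
    per_level = Counter(per_sequence.values())  # copies -> number of distinct sequences
    n_reads = sum(per_sequence.values())
    n_distinct = len(per_sequence)
    return {
        level: (level * n_seqs / n_reads, n_seqs / n_distinct)
        for level, n_seqs in sorted(per_level.items())
    }

# Toy example: 10 reads, 5 distinct sequences.
reads = ["AAAA"] * 4 + ["CCCC"] * 2 + ["GGGG"] * 2 + ["TTTT", "ACGT"]
levels = duplication_levels(reads)
print(levels)
```

In the toy example, 40% of all reads come from a single sequence that occurs four times, while that sequence accounts for only 20% of the distinct sequences.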

Check that the red line is flat and that the number of remaining reads afterde-duplication is acceptable.

QC summary

Figure from Zhou and Rokas [2014] (highly recommended reading!)

Sequence Read Archive

Where are all the reads?

SRA = main repository for publicly available DNA and RNA sequencing data, of which three instances are maintained world-wide.

GEO (https://www.ncbi.nlm.nih.gov/geo/) can be used to find SRA data, too.

See O’Sullivan et al. [2017] for many more details.

References

Marcel Martin. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 2011. doi: 10.14806/ej.17.1.200.

Christopher O’Sullivan, Benjamin Busby, and Ilene Karsch Mizrachi. In Jonathan M. Keith, editor, Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, chapter Managing Sequence Data. Humana Press, 2017. doi: 10.1007/978-1-4939-6622-6_4.

Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey Koren, Todd J. Treangen, Michael C. Schatz, Arthur L. Delcher, Michael Roberts, Guillaume Marçais, Mihai Pop, and James A. Yorke. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research, 2012. doi: 10.1101/gr.131383.111.

Michael J. Sanderson, Ian Smith, Ian Parker, and Martin D. Bootman. Fluorescence microscopy. Cold Spring Harbor Protocols, 2014. doi: 10.1101/pdb.top071795.

Xiao Yang, Sriram P. Chockalingam, and Srinivas Aluru. A survey of error-correction methods for next-generation sequencing. Briefings in Bioinformatics, 2013. doi: 10.1093/bib/bbs015.

Xiaofan Zhou and Antonis Rokas. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies. Molecular Ecology, 23(7):1679–1700, 2014. doi: 10.1111/mec.12680.