
Dealing with ‘raw reads’

Analysis of Next-Generation Sequencing Data

Friederike Dündar

Applied Bioinformatics Core

Slides at https://bit.ly/2T3sjRg (see also https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/schedule_2020/)

January 28, 2020

F. Dündar (ABC, WCM) Dealing with ‘raw reads’ January 28, 2020 1 / 43

1 Fluorescence-based microscopy

2 Single and paired-end reads

3 Illumina’s “raw reads”

4 Quality control of sequencing reads

5 Sequence Read Archive

6 References


Fluorescence-based microscopy


Re-cap: Sequencing by synthesis after library preparation

The number of sequencing cycles determines the read length. Each cycle consists of four steps: (1) incorporate fluor-dNTP, (2) detect, (3) deblock, (4) cleave fluor.

Fluorophores and fluorescence detection

Fluorophores: molecules that re-emit light upon absorption of light

Fluorescence microscopes separate emitted light (dim) from excitation light (bright).

See Sanderson et al. [2014] for an overview of fluorescence microscopy techniques (not just DNA-sequencing-related).

Single and paired-end reads


Types of reads

Single reads are cheaper. (why?)

Paired-end (PE) reads are helpful for:

- alignment along repetitive regions
- chromosomal rearrangements and gene fusion detection
- de novo genome and transcriptome assembly
- precise information about the size of the original fragment (insert size)
- PCR duplicate identification

Paired-end read generation


Illumina’s “raw reads”


Illumina’s read output: turning images into text files

TIFF images → BCL files → FASTQ files

BCL files

- base call files (binary text files)
- during sequencing, base calls for every location of the flow cell are added live for every cycle

FASTQ files

- base calls are gathered per read rather than per cycle
- reads are sorted into different files per sample as identified by the barcodes (demultiplexing)

All steps here are performed by Illumina’s proprietary CASAVA software.

The file name usually includes some information about the sample:
<sample name>_<barcode sequence>_<L(lane)>_<R(read number)>_<set number>.fastq.gz, e.g. MyExperiment_AGCTTGTTC_L001_R1_001.fastq.gz
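The naming scheme above can be taken apart with a regular expression. This is a minimal sketch, assuming file names follow the pattern exactly as given on the slide; the function name `parse_fastq_name` and the field names are made up for illustration, and files from other pipelines or repositories may not match.

```python
import re

# Pattern for the naming scheme described above (an assumption, not a guarantee):
# <sample name>_<barcode sequence>_L<lane>_R<read number>_<set number>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+)_L(?P<lane>\d+)"
    r"_R(?P<read>[12])_(?P<set>\d+)\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return the fields encoded in the file name, or None if it does not match."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    fields["lane"] = int(fields["lane"])   # "001" -> 1
    fields["read"] = int(fields["read"])   # 1 = forward read, 2 = reverse read
    return fields


info = parse_fastq_name("MyExperiment_AGCTTGTTC_L001_R1_001.fastq.gz")
print(info)
```

Returning None for non-matching names (rather than raising) makes it easy to skip renamed files when scanning a directory.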


The FASTQ format: FASTA + quality score

1 read = 4 lines

1 @Read ID and sequencing run information
2 sequence
3 + (additional description possible; usually an empty line)
4 quality scores
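The four-lines-per-read layout makes FASTQ easy to read by hand. The following is a small sketch, assuming each record occupies exactly four lines (no wrapped sequences); `read_fastq` is an illustrative helper, not part of any particular library.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a text handle, 4 lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:          # end of file
            break
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Two made-up reads for demonstration
example = io.StringIO("@read1\nGATTACA\n+\nIIIIHH#\n@read2\nACGT\n+\n!!II\n")
records = list(read_fastq(example))
print(records)
```

For compressed files, the same function should work on a handle opened with `gzip.open(path, "rt")`.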


The read ID line is standardized by Casava 1.8

CAUTION

This will only be true if you receive FASTQ files fresh off the sequencer. If you download FASTQ files from public repositories, the read ID might have been changed significantly.

see https://en.wikipedia.org/wiki/FASTQ_format
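As a sketch of what the standardized read ID contains, the snippet below splits an ID line into its fields, assuming the Casava 1.8 layout described on the Wikipedia page linked above (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`). The class and function names are illustrative; IDs from public repositories will often not parse this way.

```python
from typing import NamedTuple


class ReadID(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int             # member of a pair: 1 or 2
    is_filtered: bool     # "Y" = read did not pass the filter, "N" = passed
    control_number: int
    index: str            # index (barcode) sequence


def parse_read_id(line):
    """Split a Casava 1.8-style read ID line into its fields."""
    left, right = line.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x_pos, y_pos = left.split(":")
    read, filtered, control, index = right.split(":")
    return ReadID(instrument, int(run), flowcell, int(lane), int(tile),
                  int(x_pos), int(y_pos), int(read), filtered == "Y",
                  int(control), index)


# Example ID taken from the Wikipedia article on the FASTQ format
rid = parse_read_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(rid.lane, rid.tile, rid.index)
```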


The quality scores: summarizing numerical scores into single-character representations

Illumina’s CASAVA pipeline:

- BCL files: base calls (A/C/T/G) are immediately recorded with an error probability (see the QC section for reasons for base call uncertainties).
- The error probabilities are translated into ASCII symbols in the FASTQ files.
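The translation from error probability to a single character can be sketched as follows. This assumes the standard Phred definition, Q = -10 · log10(p), which the slides do not spell out, and an ASCII offset of 33; the function names are illustrative.

```python
import math

OFFSET = 33  # assumed ASCII offset (Phred+33)


def error_prob_to_phred(p):
    """Phred score for an error probability p, rounded to an integer."""
    return round(-10 * math.log10(p))


def phred_to_char(q, offset=OFFSET):
    """Single-character representation of a Phred score."""
    return chr(q + offset)


def char_to_error_prob(c, offset=OFFSET):
    """Error probability encoded by one quality character."""
    q = ord(c) - offset
    return 10 ** (-q / 10)


for p in (0.1, 0.01, 0.001, 0.0001):
    q = error_prob_to_phred(p)
    print(f"p={p}  Q={q}  char={phred_to_char(q)}")
```

With this encoding, a score of 30 (1 error in 1,000 base calls) is written as `?`, and a score of 40 as `I`.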


ASCII symbols

www.ascii-code.com

ASCII encodes 128 specified characters into seven-bit integers, which is useful for digital communication.

The first 33 characters represent unprintable control codes (e.g. “Start of Text”); therefore, the Phred scores were originally encoded using an offset of +33 (score 0 ⇒ “!”).


Printable ASCII symbols start at 33


Different offsets have been used by different Casava versions

Both the range of the base call score as well as its translation via the ASCII code (offset) are somewhat arbitrary and have undergone numerous changes.

Today’s standard:

min. score: 0

max. score: 41

ASCII offset: 33

Make sure you know which version you’re dealing with.
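One way to check which encoding a file uses is to look at the range of quality characters. The heuristic below is a rough sketch, assuming only the two offsets 33 and 64 are in play and that Phred+33 scores do not exceed 41 (character `J`); `guess_offset` is a hypothetical helper and may return None when the observed characters fit both encodings.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from observed quality characters.

    Returns None if the characters are compatible with both encodings.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    if not codes:
        raise ValueError("no quality characters given")
    if min(codes) < 59:    # characters below ';' cannot occur with offset 64
        return 33
    if max(codes) > 74:    # characters above 'J' exceed score 41 with offset 33
        return 64
    return None


print(guess_offset(["IIIIHH#", "IIII5555"]))   # contains '#', so offset 33
print(guess_offset(["hhhhgggf"]))              # only high characters, so offset 64
```

The more reads are inspected, the less likely an ambiguous result becomes, because low-quality base calls eventually show up.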


Quality control of sequencing reads


Two basic QC questions

1 Did our library prep generate a faithful representation of the DNA/RNA molecules of our samples?
  - ideally, the entire universe of nucleotides was captured (diverse library)
  - no contaminations
  - no degradation
  - no bias towards fragments of certain GC contents and/or sizes

2 How successful was the actual sequencing?
  - consistently high base call confidence
  - uniform nucleotide frequencies

Biases

QC should help identify systematic distortions of data and their possible sources.

FastQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc

- unpublished, but most widely used QC tool
- supports all NGS technologies
- continuously developed and maintained by long-time bioinformatics experts
- will only use the first 200K reads for the diagnosis!

Sequencing quality

Based on ASCII-encoded Phred scores within the FASTQ file.

Sequencing quality: reasons for sequencing noise

Noise = fluorophore intensity signal is not as strong and clear as expected.

- laser not well calibrated
- interfering signals from neighbouring clusters or bases with similar emission spectra
- unsynchronized fragments in each cluster:
  - phasing: small fraction of fragments in each cluster fails to incorporate any base
  - prephasing: more than one base is incorporated
- decaying chemicals (runs often last several days to a week!)
- extraneous objects on the flow cell (e.g. dust, air bubbles)

Physically localized error rates: tiles vs. time


Contaminations: threats to the full representation of our original fragment pool (and waste of seq. reads)

Sources:

- primer contamination
- adapter contamination
  - sequence read length larger than the fragment size (3’ contamination)
  - adapter dimers without insert
- DNA from other species/libraries

Consequences:

- noise
- reduced alignment rates

Can be identified by examining sequence composition and overrepresented sequences/k-mers.
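Finding overrepresented sequences boils down to counting identical reads. This is a simplified sketch; the default threshold of 0.1% of all reads and the name `overrepresented` are assumptions for illustration, and QC tools typically add refinements such as sampling or matching against known adapters.

```python
from collections import Counter


def overrepresented(sequences, min_fraction=0.001):
    """Return (sequence, count, fraction) for sequences above min_fraction of all reads."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()   # most frequent first
        if n / total > min_fraction
    ]


# Toy data: one sequence (e.g. an adapter dimer) dominates the library
reads = ["AGATCGGAAGAGC"] * 3 + ["ACGTTGCA", "TTGACCGA"]
hits = overrepresented(reads, min_fraction=0.5)
print(hits)
```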


Detecting contaminations

Per Base Sequence Content

If the fragments are a random and diverse representation of the entire genome, there should be a uniform distribution of all four bases across all cycles, i.e. the proportion of each base should stay roughly constant from cycle to cycle.
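The per-base sequence content can be computed by counting bases per cycle (read position). A minimal sketch, assuming the reads are held in a list of strings; `per_base_content` is an illustrative name, and ambiguous calls (N) count towards the total but are not reported.

```python
from collections import Counter


def per_base_content(sequences):
    """Return one dict per cycle with the fraction of A, C, G and T at that position."""
    n_cycles = max((len(s) for s in sequences), default=0)
    table = []
    for cycle in range(n_cycles):
        # Only reads that are long enough contribute to this cycle
        counts = Counter(s[cycle] for s in sequences if len(s) > cycle)
        total = sum(counts.values())
        table.append({base: counts[base] / total for base in "ACGT"})
    return table


reads = ["ACGT", "ACGA", "TCGT", "GCGC"]
content = per_base_content(reads)
for cycle, fractions in enumerate(content, start=1):
    print(cycle, fractions)
```

In this toy example, cycles 2 and 3 consist of a single base each, which is the kind of deviation from uniformity that would hint at contamination or low library diversity.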


Per Base Sequence Content – more examples

- irregularities in the first ca. 8 bp are often seen for RNA-seq and ATAC-seq and indicate a bias for certain sequences at the fragment beginning
- more severe deviations from uniformity often indicate contaminations and/or lack of library diversity

Overrepresented sequences & adapter sequence frequencies


Trimming contaminations & low-quality bases

- Mostly done to improve alignment.
- Can be done before alignment or, if contaminations/low-quality bases are low in number, might be left to the “soft-clipping” function of read aligners (i.e. ignoring mismatched bases at the beginning/end of a read).
- There are numerous tools out there to do the job, e.g. Cutadapt [Martin, 2011] and TrimGalore.
- For de novo assemblies, it is probably more meaningful to perform some error-correction based on overlapping reads rather than trimming the reads [Salzberg et al., 2012, Yang et al., 2013].
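To illustrate what trimming does, here is a deliberately naive sketch: exact-match removal of a 3’ adapter and removal of trailing low-quality bases. It is not the algorithm used by Cutadapt or TrimGalore (which, for instance, tolerate mismatches); the function names, the minimum overlap of 3 and the score cutoff of 20 are assumptions for illustration.

```python
def trim_adapter(sequence, adapter, min_overlap=3):
    """Remove a 3' adapter: a full occurrence anywhere, or a partial one at the read end."""
    pos = sequence.find(adapter)
    if pos != -1:
        return sequence[:pos]
    # Check whether the read ends with the first few bases of the adapter
    longest = min(len(adapter) - 1, len(sequence))
    for overlap in range(longest, min_overlap - 1, -1):
        if sequence.endswith(adapter[:overlap]):
            return sequence[:len(sequence) - overlap]
    return sequence


def trim_low_quality(sequence, quality, min_score=20, offset=33):
    """Drop trailing bases whose Phred score is below min_score."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_score:
        end -= 1
    return sequence[:end], quality[:end]


adapter = "AGATCGGAAGAGC"
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTT", adapter))  # full adapter inside the read
print(trim_adapter("ACGTACGTAGATC", adapter))            # partial adapter at the 3' end
print(trim_low_quality("ACGTAC", "IIII##"))              # last two bases have score 2
```

When the adapter is trimmed, the quality string has to be shortened to the same length before the read is written back to a FASTQ file.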


Duplicate reads: types

- optical duplicates (same DNA cluster erroneously reported as separate clusters)
- natural duplicates (multiple independent original fragments with very similar sequence)
  - more likely to occur for small(ish) genomes/transcriptomes and experiments that enrich for relatively few and small regions of the genome
- PCR duplicates (1 original fragment)
  - often sample-specific and very difficult to correct in silico
  - can be reduced by avoiding excessive PCR

The Problem

From the read sequences alone, there is no way to distinguish natural from PCR duplicates!

Duplicate reads: FastQC assessment

Proportion of reads (y-axis) that contain sequences in each of the different duplication level bins (x-axis).

Blue line: all reads (= first 100K!); how many times are individual sequences found?

Red line: sequences after de-duplication; how many different sequences were found to be duplicated?
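The two lines of the duplication plot can be approximated by grouping sequences by their copy number. This is a simplified sketch of the idea, not FastQC's implementation (which samples reads and bins high duplication levels); `duplication_levels` is an illustrative name.

```python
from collections import Counter


def duplication_levels(sequences):
    """Map duplication level -> (fraction of all reads, fraction of distinct sequences)."""
    copies_per_sequence = Counter(sequences)                    # sequence -> copies
    distinct_per_level = Counter(copies_per_sequence.values())  # level -> distinct sequences
    total_reads = sum(copies_per_sequence.values())
    total_distinct = len(copies_per_sequence)
    levels = {}
    for level in sorted(distinct_per_level):
        n_distinct = distinct_per_level[level]
        levels[level] = (
            level * n_distinct / total_reads,   # "blue line": share of all reads
            n_distinct / total_distinct,        # "red line": share after de-duplication
        )
    return levels


reads = ["ACGT"] * 4 + ["TTGA"] * 2 + ["GGCA", "CATG"]
levels = duplication_levels(reads)
for level, (blue, red) in levels.items():
    print(level, blue, red)
print("remaining after de-duplication:", len(set(reads)) / len(reads))
```

In the toy data, half of all reads come from a single sequence present four times, while that sequence makes up only a quarter of the de-duplicated set.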

Check that the red line is flat and that the number of remaining reads after de-duplication is acceptable.


Two basic QC questions

1 Did our library prep generate a faithful representation of the DNA/RNA molecules of our samples?
  - ideally, the entire universe of nucleotides was captured (diverse library)
  - no contaminations
  - no bias towards fragments of certain GC contents and/or sizes
  - no degradation

2 How successful was the actual sequencing?
  - consistently high base call confidence
  - uniform nucleotide frequencies

QC summary

Figure from Zhou and Rokas [2014] (highly recommended reading!)

Sequence Read Archive


Where are all the reads?

SRA = main repository for publicly available DNA and RNA sequencing data, of which three instances are maintained world-wide.

GEO (https://www.ncbi.nlm.nih.gov/geo/) can be used to find SRA data, too.

See O’Sullivan et al. [2017] for many more details.

References


Marcel Martin. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 2011. doi: 10.14806/ej.17.1.200.

Christopher O’Sullivan, Benjamin Busby, and Ilene Karsch Mizrachi. In Jonathan M. Keith, editor, Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, chapter Managing Sequence Data. Humana Press, 2017. doi: 10.1007/978-1-4939-6622-6_4.

Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey Koren, Todd J. Treangen, Michael C. Schatz, Arthur L. Delcher, Michael Roberts, Guillaume Marçais, Mihai Pop, and James A. Yorke. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research, 2012. doi: 10.1101/gr.131383.111.

Michael J. Sanderson, Ian Smith, Ian Parker, and Martin D. Bootman. Fluorescence microscopy. Cold Spring Harbor Protocols, 2014. doi: 10.1101/pdb.top071795.

Xiao Yang, Sriram P. Chockalingam, and Srinivas Aluru. A survey of error-correction methods for next-generation sequencing. Briefings in Bioinformatics, 2013. doi: 10.1093/bib/bbs015.

Xiaofan Zhou and Antonis Rokas. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies. Molecular Ecology, 23(7):1679–1700, 2014. doi: 10.1111/mec.12680.
