Errors, biases and Quality control in Next Gen Sequencing

Dr David Humphreys

- Lab scientist : Bioinformatician
- RNA biologist
- small RNAs (miRNA)

Victor Chang Cardiac Research Institute, Sydney, Australia



Page 2

Testing hypotheses and theories

Errors/Biases:

- Present in all experiments

- Be aware/informed

- Minimise

- Test

[Figure: HTS/NGS timeline of data points, 1994 to 2013, annotated "2009 ME!" and "2013 You???"]

Next generation sequencing:

- Series of experiments

- Biases/error accumulate!

Page 3

Anscombe’s Quartet

Image source: Wikipedia

• Maths is a tool for analysis.

• It is easy to blindly overlook biases and errors in data sets.

- The mean, stdev, variance and correlation of all four data sets are the same!

Anscombe F.J (1973)

American Statistician
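The point of the quartet can be checked numerically. The sketch below uses the standard published values of Anscombe's four data sets and computes the summary statistics named on the slide; the helper names (`pearson`, `summarise`) are illustrative and not part of the talk.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet (Anscombe 1973); sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def summarise(xs, ys):
    """Summary statistics that look the same for all four data sets."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),  # sample variance
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }


SUMMARIES = {name: summarise(xs, ys) for name, (xs, ys) in ANSCOMBE.items()}

for name, s in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

The four printed rows agree to about two decimal places even though the scatter plots look completely different, which is why the data should be plotted and not just summarised.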

Page 4

Workflow: High Throughput Sequencing

Sample preparation → Library preparation → Clonal amplification → Sequencing → Bioinformatics

Challenges:

Quantification

Purity

(1) Awareness

Community

Literature

Network;

(2) QC considerations

Time
Cost
Gels
Stains
Absorbance
Molarity
Titrations
Fluorescence
CPU
Cores
Scripts
Command line
RAM
Threads
Consumption
Throughput
Genes
Genome
SNPs
Sensitivity/specificity
Cumulative error

Page 5

Quantification: Nanodrop spectrophotometer

http://www.nanodrop.com/Library/CVStech_17_11_FINAL.pdf

WARNING!
• Be careful of accuracy below 50 ng/ul
• Be careful with concentrations above 1 ug/ul
• Does not assess quality!

* http://seqanswers.com/forums/showthread.php?t=21280

Contaminants:
230nm: EDTA, carbohydrates, sodium acetate*, tris*
270nm: Phenol (plus at 230nm*)
280nm: DTT

WARNING!
• Contaminants can affect downstream enzymatic reactions

Ratios
260/280 : 1.8 (DNA) 2.0 (RNA)
260/270 : 1.2 – 1.3?
260/230 : 2.0 – 2.2

• Quick
• Consumes 1-2ul sample
• Large dynamic range (10 – 10,000ng/ul)
• Can identify contaminations

Solution: Re-precipitate/buffer exchange
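As a rough illustration of how the absorbance ratios above might be screened in a script, here is a minimal sketch. The target ratios come from the slide; the function name `purity_report`, the flag wording and the `tolerance` of 0.1 are assumptions chosen for the example, not instrument settings.

```python
# Target ratios as listed on the slide.
EXPECTED_260_280 = {"DNA": 1.8, "RNA": 2.0}
RANGE_260_230 = (2.0, 2.2)


def purity_report(a230, a260, a280, kind="DNA", tolerance=0.1):
    """Compute absorbance ratios and flag values away from the slide's targets.

    `tolerance` is an arbitrary illustrative margin (an assumption).
    """
    ratio_280 = a260 / a280
    ratio_230 = a260 / a230
    flags = []
    if abs(ratio_280 - EXPECTED_260_280[kind]) > tolerance:
        flags.append("260/280 off target")
    low, high = RANGE_260_230
    if not (low - tolerance <= ratio_230 <= high + tolerance):
        flags.append("260/230 out of range")
    return {
        "260/280": round(ratio_280, 2),
        "260/230": round(ratio_230, 2),
        "flags": flags,
    }


# A clean-looking DNA reading, then an RNA reading with strong 230 nm absorbance.
print(purity_report(a230=0.5, a260=1.0, a280=0.55, kind="DNA"))
print(purity_report(a230=1.0, a260=1.0, a280=0.5, kind="RNA"))
```

A flagged 260/230 ratio points towards the 230 nm contaminants listed above, in which case re-precipitation or buffer exchange is the suggested remedy.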

Page 6

Quantification: Qubit fluorimeter

WARNING!

• Known biases in quantifying ssRNA < 50ng/ul

• Cannot quantitate ssDNA in presence of dsDNA

• More sensitive than Nanodrop

• Consumes small amount of sample

• Specific assays

Page 7

Quantification: Agilent Bioanalyzer

• Consumes small amount of sample
• Quantification
• Estimating nucleic acid size

WARNING!
• Each chip has a quantitative range
• Sensitive to salts
• Limitations on size range
• Not accurate when quantitating broad smears

* RNA integrity number (RIN)
- Use at least 50ng for meaningful RIN
Schroeder et al (2006) BMC Mol Bio.

Application | Quantitative range (one block per chip)

Total RNA * | 5-500 ng/ul
mRNA | 25-250 ng/ul

Total RNA * | 50-5000 pg/ul
mRNA | 250-5000 pg/ul

dsDNA (50-7000bp) | 5-500 pg/ul

Page 8

Criteria | RNA | DNA | QC
High complexity | Trizol vs column based | Phenol:chloroform vs column based | qPCR, Northern blotting??
High quality | RIN > 8 | Unfragmented | Bioanalyzer, gel electrophoresis
Accurate quantification | pg - ng - ug | pg - ng - ug | Qubit/Nanodrop, Agilent Bioanalyser
Contamination (salts, organics) | A260/280 = 2, A260/230 > 2 | A260/280 = 1.8, A260/230 > 2 | Qubit, Nanodrop
Enrichment | Deplete ribosomes | Exome capture | qPCR/Agilent
Fragment | Uniform peaks better than broad | Uniform peaks better than broad | Agilent

GOAL: to have a final sample with high complexity


Sample Purification/Assessment/Processing

1) Library manual as provided by the manufacturer

2) http://nxseq.bitesizebio.com/articles/

Page 9

Purification biases

miRNAs: miR-141, miR-29b, miR-21, miR-106b, miR-15a, miR-34a decreased in cells grown at low confluence/loss of adhesion

Library prep + Sequence

Cell number: (L) = 200,000; (H) = 800,000
1mL Trizol
Kim et al. (2011) Molecular Cell 43, 1005-1014

Cell number: Low = 500,000; High = 800,000
[Figure: axis label "Ratio 141/200c"]
Kim et al. (2012) Molecular Cell 46, 893-895

• Small RNAs precipitate together with longer RNAs
• Most susceptible: low GC content, secondary structure

Page 10

miRNA library biases

Hafner et al. (2011) “RNA-ligase-dependent biases in miRNA ….. cDNA libraries” RNA 17(9), 1-16

Input:
- 770 synthetic miRNAs
- 45 designed RNAs
Pool A = Equimolar
Pool B = 10 fold serial dilution

Ligation biases
- Enzyme
- Temperature
- Sequence

Reverse transcription bias
- Not a significant source of sequence-specific biases

PCR bias
- Dilute 1:10000
- 10 PCR cycles
- 5 x
- No appreciable distortion!

WARNING!
• Don’t compare NGS data sets from different library preps
• Be consistent with incubation times/temperatures

Page 11

Sequencing platforms

• Ross et al., Characterizing and measuring bias in sequence data. Genome Biology 2013
• Bragg et al., Shining a light on Dark sequencing characterising errors. PLoS Comp Biol 2013
• Loman et al., Performance comparison of benchtop HTS platforms. Nature Biotech 2012
• Quail et al., Tale of three NGS platforms. BMC Genomics, 2012
• Lam et al., Performance comparison of whole genome sequencing platforms. Nat Biotech 2012

Ion Torrent
Illumina
Complete Genomics

Kapa Biosystems
Standard reagents

Flowcell/lane variations do occur, but they are smaller than those observed between platforms

Page 12

Raw sequencing files

Assessing sequence quality

Align (pipeline)

Assessing alignment data

Page 13

The Basics:

Numerical : 0 to 40 (one value per character below)
Phred+33 : ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I
Phred+64 : @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h

Quality values: Phred score

File types: fastq, csfasta, qual, fasta, xsq

Sequence: A T C G N/.

Header: Coordinates/other
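To make the two quality encodings concrete, the following minimal sketch decodes a quality string into Phred scores by subtracting the ASCII offset (33 or 64), and converts a score Q into an error probability using the standard definition P = 10^(-Q/10). The function names are illustrative only.

```python
def decode_phred(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call is wrong, from its Phred score."""
    return 10 ** (-q / 10)


# '!' , '5' and 'I' are scores 0, 20 and 40 in Phred+33;
# '@' , 'T' and 'h' are the same scores in Phred+64.
print(decode_phred("!5I", offset=33))
print(decode_phred("@Th", offset=64))
print(error_probability(20))
```

Decoding a Phred+64 file with an offset of 33 gives scores that are 31 too high, so it is worth confirming the encoding before filtering on quality.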


Page 14

• Free Java utility (FastQC) that can assess QC metrics of HTS data sets.

- GUI

- Command line

- Can create html output

• Input: fastq (standard, gzip, colorspace, casava), SAM/BAM

Not all data sets require a full complement of green ticks!

Page 15

[Figure: per-position sequence quality box plot. Background bands: very good, reasonable, poor. Plot elements: median, mean, 25% and 75% (box), 10% and 90% (whiskers).]
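The per-position summary shown in this kind of plot can be approximated in a few lines. The sketch below only computes the mean and median Phred score at each read position (not the percentiles), assumes Phred+33 encoding by default, and uses a hypothetical function name; it is not how any particular QC tool is implemented.

```python
from statistics import mean, median


def per_position_quality(quality_strings, offset=33):
    """Return (position, mean, median) Phred scores, with 1-based positions.

    Reads may differ in length; each position is summarised over the reads
    that are long enough to cover it.
    """
    columns = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(columns):
                columns.append([])  # first read to reach this position
            columns[i].append(ord(ch) - offset)
    return [(i + 1, mean(col), median(col)) for i, col in enumerate(columns)]


# Three toy reads whose quality drops towards the 3' end.
for position, avg, med in per_position_quality(["II5", "I5!", "I5"]):
    print(position, round(avg, 1), med)
```

A steady fall in the median towards the end of the read is the typical pattern that suggests trimming the 3' end before alignment.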

Page 16

Identify adaptors and primers

Identifies whether a subset of sequences has low quality

May identify cycles that are unreliable

Helps assess raw data files prior to mapping:
- Low-quality data may cause incorrect alignments
- Low-quality data may lead to incorrectly called variations
- Sequences with trailing adaptor sequence may not map
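The two clean-up steps implied here, removing trailing adaptor sequence and trimming low-quality 3' ends, can be sketched as below. This is a simplified exact-match illustration, not the algorithm of any specific trimming tool; the adaptor string, `min_overlap` and `min_q` values are assumptions picked for the example.

```python
def trim_adaptor(seq, adaptor, min_overlap=5):
    """Remove a 3' adaptor, allowing it to be truncated by the read end.

    Exact matching only: a full adaptor anywhere in the read, or a read
    suffix of at least `min_overlap` bases that matches the adaptor start.
    """
    idx = seq.find(adaptor)
    if idx != -1:
        return seq[:idx]
    first = max(0, len(seq) - len(adaptor) + 1)
    for start in range(first, len(seq) - min_overlap + 1):
        if adaptor.startswith(seq[start:]):
            return seq[:start]
    return seq


def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Strip trailing bases whose Phred score is below `min_q`."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


ADAPTOR = "TGGAATTCTCGGGTGCCAAGG"  # illustrative adaptor string (assumption)

print(trim_adaptor("ACGTACGT" + ADAPTOR[:12], ADAPTOR))
print(trim_low_quality_tail("ACGTAC", "IIII#!"))
```

Real trimmers also tolerate mismatches within the adaptor, which matters because sequencing errors are most frequent at the 3' end where the adaptor sits.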

Page 17

Aligners

Reference
- Choose a suitable reference
- Include mitochondrial sequence
- Design a filter set to capture repeated sequences (rRNA, tRNA)

Be aware of the default options
- Accepted errors
- Multimappers

Different aligners can give different results.
Hatem et al. (2013) Benchmarking short sequence mapping tools. BMC Bioinformatics.

Page 18

Assessing alignment data

Mapping statistics
- % mapped
- % mapped at what length
- Outcome: Pass / Questionable

Alignment feature statistics
- Coverage
- Expression
- Discovery

Test

Filter raw data
- Filter
- Trim

Important!
• Know your mapping statistics
• Know what to expect from your data sets
• Test on existing data set
• Include a filter
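The two mapping statistics named on this slide, the overall percentage mapped and the percentage mapped at each read length, could be tallied as in the sketch below. The input format, a list of `(read_length, is_mapped)` pairs, is an assumption made for the example; in practice these values would be extracted from the aligner output.

```python
from collections import defaultdict


def mapping_stats(reads):
    """Return (overall % mapped, {read length: % mapped}).

    `reads` is an iterable of (read_length, is_mapped) pairs.
    """
    total = mapped = 0
    by_length = defaultdict(lambda: [0, 0])  # length -> [mapped, total]
    for length, is_mapped in reads:
        total += 1
        by_length[length][1] += 1
        if is_mapped:
            mapped += 1
            by_length[length][0] += 1
    overall = 100.0 * mapped / total if total else 0.0
    per_length = {
        length: 100.0 * m / t for length, (m, t) in sorted(by_length.items())
    }
    return overall, per_length


toy_reads = [(22, True), (22, True), (22, False), (18, False), (30, True)]
overall, per_length = mapping_stats(toy_reads)
print(round(overall, 1), {k: round(v, 1) for k, v in per_length.items()})
```

Breaking the rate down by length helps distinguish a generally poor library from one where only very short, over-trimmed reads fail to map.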

Page 19

Take home messages

Be familiar with existing data sets

• NGS is a collection of experiments

• Biases/errors can/will occur at all steps of a high throughput sequencing study

• QC measures should be applied at all steps of a high throughput sequencing study

• Don’t be alarmed, stay informed

Page 20

miRspring: miRNA sequencing profiling

• Small (<2MB) HTML document that replicates the miRNA aligned sequencing data.

• Needs NO internet connectivity.

• Provides visualization of sequence data

• Reports on miRNA processing

• Complete transparency.

Humphreys D.T., and Suter C.M. Nucleic Acids Research 2013.

http://miRspring.victorchang.edu.au

Page 21

microRNAs

• Small non-coding RNAs (22nt)
• Bind to 3’UTRs → decay and/or translational repression
• Biogenesis: derived from longer stem-loop precursors

miRspring reporting tools

i) 5’ isomiRs
ii) 3’ isomiRs
iii) Non-canonical
iv) Arm bias
v) miRNA length
vi) RNA editing (A → G, C → T)

Page 22

miRspring

miRNA clusters: Mono-cistronic, Poly-cistronic

miRNA seed analysis

miR-196a UAGGUAGUUUCCUGUUGUUGGG (seed: AGGUAGU)
let-7a UGAGGUAGUAGGUUGUAUAGUUU (seed: GAGGUAG)

[Figure: axis label "Genomic"]
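The seeds shown for miR-196a and let-7a correspond to nucleotides 2 to 8 of each mature sequence. A minimal sketch of extracting and grouping seeds that way follows; the function names are illustrative, and the 2 to 8 window is a parameter so that other seed definitions can be substituted.

```python
from collections import defaultdict


def seed(mirna_seq, start=2, end=8):
    """Return the seed region using 1-based, inclusive positions."""
    return mirna_seq[start - 1:end]


def group_by_seed(mirnas):
    """Map each seed sequence to the names of the miRNAs that carry it."""
    groups = defaultdict(list)
    for name, seq in mirnas.items():
        groups[seed(seq)].append(name)
    return dict(groups)


# Sequences as given on the slide.
MIRNAS = {
    "miR-196a": "UAGGUAGUUUCCUGUUGUUGGG",
    "let-7a": "UGAGGUAGUAGGUUGUAUAGUUU",
}

print(group_by_seed(MIRNAS))
```

Because the seed is defined from the 5' end, a 5' isomiR that starts one base earlier or later has a different seed, so inaccurate 5' end calls change the result.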

Page 23

miRspring QC features

Sampling bias!

Tissue Atlas: Heart, Kidney, Liver, Lung, Ovary, Spleen, Testes, Thymus, Brain, Placenta

AGO IP: THP-1

ENCODE: HeLa S3, A549, Ag04450, Bj, Gm1287, H1hesc, HepG2, Huvec, K562, MCF7, NheK, Sknshra

• 73 miRspring documents

• 895 million sequence tags

• < 55 megabytes of disk space

Page 24

miRspring reporting features

Top 100 miRNAs typically:

- 22nt long

- Good correlation with miRBase

miRspring provides a quick, easy way to analyse QC parameters of your data set

[Figure: axis label "Centile Rank"]

Page 25

Final points

• Many NGS protocols are well established.

- Worth understanding what variations/features are found in data sets.

• miRspring is a powerful tool to help you assess a data set

- Yes, it only examines one data set at a time.

- Provides complete transparency

- Allows ANYONE to examine an NGS data set.

Example miRspring documents can be found at http://miRspring.victorchang.edu.au

Page 26

Acknowledgements

VCCRI

Cath Suter

Paul Young

Rupert Shuttleworth

Diane Fatkin

Monique Ohanian

Djordje Djordjevic

Chris Hayward

Kavitha Muthiah

Richard Harvey

Mirana Ramialison

Ashley Waardenberg

IT

Timothy Kersten

Pardeep Dhiman

Thomas Preiss (VCCRI/ANU)

- Pardeep Patel

- Carly Hynes

- Tennille Sibbritt

- Jennifer Clancy

Matthias Hentze (EMBL)

Funding bodies

ARC

NHMRC

Viertel Charitable Foundation

Perpetual Trust

VCCRI