Think before you start sequencing

Judith Boer

Pediatric Oncology, Erasmus MC-Sophia Children’s Hospital, Rotterdam

Center for Human and Clinical Genetics, LUMC, Leiden

Sample & Experiment

• Understand the sample analysed

• organism(s): haploid, diploid, ...

• potential contaminants

• Understand the sequencing experiment

• primers used

• PCR, tagging, sequencing, barcoding, ...

• individual or pooled sample

• insert size of sequencing library

• region to analyse: genome or target region only

• controls included (spike-ins)

• sequencer used: specific errors

• Start immediately

• before data received

• generate & analyse model data: test software
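Generating model data before the real data arrive needs very little code. The following is a minimal sketch, assuming a random toy reference and uniform substitution errors only (no indels, no quality values); the function name `simulate_reads` and all parameter values are illustrative and not taken from the slides.

```python
import random


def simulate_reads(reference, n_reads, read_len, error_rate, seed=0):
    """Sample reads uniformly from `reference` and add substitution errors.

    Returns a list of (name, sequence) tuples. The true start position is
    stored in the read name so a mapper's output can be checked against it.
    """
    rng = random.Random(seed)
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = []
        for base in reference[start:start + read_len]:
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                base = rng.choice([b for b in "ACGT" if b != base])
            bases.append(base)
        reads.append((f"read{i}_pos{start}", "".join(bases)))
    return reads


# Hypothetical toy reference of 1,000 random bases.
ref_rng = random.Random(42)
reference = "".join(ref_rng.choice("ACGT") for _ in range(1000))
model_reads = simulate_reads(reference, n_reads=200, read_len=36, error_rate=0.01)
```

Because every read name records where the read came from, the mapping software and its settings can be scored on data for which the right answer is known.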

introduction

Overview

• Technology choices

• Replicates

• Alignment choices

Technology choices

• Which sequencing platform?

• Single or paired end run?

• How long?

• How deep?

technology choices

Which sequencing platform?

• Depends on your application!

• short tags more suitable for quantitative analysis (counting)

• de novo assembly: best to combine two platforms

• Length and number of reads

• Number of samples

• Quality of reads: technology-specific errors, GC-bias

• Amount and quality of input material

• Library primer sequences influence yield in PCR

• platform-specific bias

Morozova & Marra Genomics 2008; Matt Hestand; Johan den Dunnen

Many short or fewer long?

Counting: many short

Re-sequencing: many short

Alignment to reference sequence

De novo assembly: fewer long (and many short)

Platform characteristics 1

| Platform | FLX Titanium | HiSeq 2500 | SOLiD 4 | HeliScope | Ion Torrent/Proton | PacBio |
| --- | --- | --- | --- | --- | --- | --- |
| Company | Roche | Illumina | Applied Biosystems | Helicos BioSciences | Life Technologies | Pacific Biosciences |
| Read length (bp) | 400-600 | 2x125 (2x250 in rapid mode) | 50+25 | ~30 | 1x400 / 1x200 | average 15,000 |
| Samples/run | 16 | 16 | 16 | 50 | 1 | 16 |
| Reads/run | ~1 M | >3500 M | >700 M | ~500 M | 5 M / 80 M | 0.6 M |
| Run time | 10 hrs | 6 days | 11-13 d | 8 days | 4 hrs | 2.5 days |
| DeepCAGE/DeepSAGE | * | **** | **** | *** | ** | |
| whole cDNA seq / RNA-seq | *** | **** | **** | *** | * | - |
| miRNA-seq | | **** | **** | *** | ** | - |
| ChIP-seq | ** | **** | **** | ** | ** | mC |
| Resequencing | ** | **** | **** | ** | *** | **** |
| De novo seq | **** | ** | ** | * | ** | **** |

Platform characteristics 2

| Platform | HiSeq 2500 | HiSeq X10 | NextSeq 500 | MiSeq |
| --- | --- | --- | --- | --- |
| Company | Illumina | Illumina | Illumina | Illumina |
| Read length (bp) | 2x125 (2x250 in rapid mode) | 2x150 | 2x150 | 2x300 |
| Samples/run | 16 | ? lanes | 4 (same pool) | 1 |
| Reads/run | >3500 M | 12000 M | 330 M | 25 M |
| Run time | 6 days | 3 days | 30 h | 60 h |
| DeepCAGE/DeepSAGE | **** | **** | *** | ** |
| RNA-seq | **** | **** | *** | ** |
| miRNA-seq | **** | **** | *** | ** |
| ChIP-seq | **** | **** | *** | ** |
| Resequencing | **** | ** | ** | *** |
| De novo seq | ** | ** | ** | *** |

Single or paired end run?

Paired end run gives more sequence for a relatively small increase in costs:

• More depth (or better quality)

• Include length of product in alignment

• DNA sequencing: better alignment in repeat regions

• RNA-Seq: more information about transcript variants

• But: need software that exploits this

• And: it still costs more...

Paired end run

• Re-sequencing of mutation hotspot in cancer gene

• Paired reads increase coverage and improve quality

How long? Read length

• Number of cycles on sequencer (read) plus QC

• For mapping to reference genome and counting: 25-30

• Paired end makes no sense

• Trisomy detection, SAGE-like RNA expression, ChIP-Seq

• Genome or transcriptome assembly: the longer the better

• But less than the library insert size!

• Paired end very informative

How deep? mRNA-Seq

• Transcriptome complexity and dynamic range

• Very low abundant transcripts are expressed at ~0.1 copies per cell, average of 5 tags in 10 million tags

• Tag sequencing: few million tags per sample sufficient

• need more for heterogeneous sample (e.g. tissue)

• Whole transcriptome sequencing: at least 3 times more

• reasonable coverage over the entire transcript

• more difficult sequence alignments (exon-exon borders)
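The relation between transcript abundance and sequencing depth can be turned into a small calculation. This is a sketch only: it assumes tag counts follow a Poisson distribution, and it assumes roughly 200,000 mRNA molecules per cell, a figure that is not on the slide but is implied by its numbers (0.1 copies per cell giving 5 tags in 10 million). The function names are illustrative.

```python
import math

# Assumption: total mRNA molecules per cell, implied by the slide's
# "0.1 copies per cell = 5 tags in 10 million tags".
MRNA_PER_CELL = 200_000


def expected_tags(copies_per_cell, total_tags, mrna_per_cell=MRNA_PER_CELL):
    """Expected number of tags for a transcript at a given sequencing depth."""
    return total_tags * copies_per_cell / mrna_per_cell


def prob_at_least(k, mean):
    """Poisson probability of observing at least k tags when `mean` are expected."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


# How often would a 0.1 copies/cell transcript yield at least 5 tags?
for depth in (1_000_000, 3_000_000, 10_000_000):
    mean = expected_tags(0.1, depth)
    print(depth, round(mean, 2), round(prob_at_least(5, mean), 3))
```

Changing the minimum tag count or the assumed mRNA content shows how quickly the required depth grows for rare transcripts and heterogeneous samples.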

How deep? miRNA-Seq

• Few (~2) million tags per sample sufficient

• fewer miRNAs

• big dynamic range

• but not only miRNAs are sequenced

• Multiple samples combined using barcoding

• LGTC pilot experiment using three barcoded samples per Illumina GA lane gave identical results to three individual lanes without barcoding

Henk Buermans

How deep? ChromatinIP-Seq

• Depends on the number of binding sites in genome

• and the enrichment factor in the chromatin immunoprecipitation step

• Example transcription factor:

• average fragment size 200 nt

• 50,000 binding sites in 3.2 Gb human genome: 0.3 %

• average enrichment 100 x (good antibody)

• 1/3 of tags will be in bound regions, 2/3 background

• 10 million tags average 60 tags per peak

• tags/peak will go down with decreasing enrichment or increased number of binding sites

• histone modifications and meDIP: ~30-100 million / sample
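The transcription factor example can be recomputed for other antibodies and genomes. This is a rough sketch, assuming that a bound fragment is sampled `enrichment` times more often than a background fragment of the same size; the function name and the model itself are illustrative assumptions, not the calculation behind the slide. With the slide's inputs this simple model gives figures somewhat below the rounded 1/3 and 60 tags per peak, but in the same range.

```python
def chip_seq_tags_per_peak(n_sites, fragment_nt, genome_nt, enrichment, total_tags):
    """Return (bound genome fraction, fraction of tags in bound regions, tags per peak)."""
    bound_fraction = n_sites * fragment_nt / genome_nt
    # Relative sampling weights of bound and background DNA after the IP step.
    bound_weight = bound_fraction * enrichment
    background_weight = 1.0 - bound_fraction
    tag_fraction_bound = bound_weight / (bound_weight + background_weight)
    tags_per_peak = total_tags * tag_fraction_bound / n_sites
    return bound_fraction, tag_fraction_bound, tags_per_peak


# Inputs from the slide: 50,000 sites, 200 nt fragments, 3.2 Gb genome,
# 100x enrichment, 10 million tags.
bound, in_peaks, per_peak = chip_seq_tags_per_peak(
    n_sites=50_000, fragment_nt=200, genome_nt=3.2e9,
    enrichment=100, total_tags=10_000_000,
)
print(f"{bound:.2%} of genome bound, {in_peaks:.0%} of tags in peaks, "
      f"{per_peak:.0f} tags per peak")
```

Lowering `enrichment` or raising `n_sites` reproduces the last point of the example: the number of tags per peak drops.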

How deep? Genomic sequencing

• Depends on ploidy of genome

• Reliable calling of heterozygote SNPs needs more tags

• Deep sequencing is more sensitive: be prepared to get signals from e.g. somatic variation

• With higher coverage chances increase that you pick up a signal caused by random error

• Allele calling: ratio is important

• 8 x A, 12 x G = OK

• 75 x A, 25 x G = too far from 50:50

• Minimal coverage (e.g. 20x) and frequency thresholds given for variant calling algorithms

Li, Ruan & Durbin, Genome Research 2008
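The allele ratio examples can be checked with an exact binomial test against the 50:50 ratio expected for a heterozygous site in a diploid genome. This is a minimal sketch and not the model used by any particular variant caller; `het_pvalue` is a hypothetical helper name.

```python
from math import comb


def het_pvalue(ref_count, alt_count, expected_alt_fraction=0.5):
    """Two-sided exact binomial p-value for the observed allele counts.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under the expected allele fraction.
    """
    n = ref_count + alt_count
    p = expected_alt_fraction
    probs = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[alt_count]
    # Small tolerance so that ties are not lost to floating-point rounding.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


print(het_pvalue(8, 12))   # 8 x A, 12 x G: compatible with 50:50
print(het_pvalue(75, 25))  # 75 x A, 25 x G: far from 50:50
```

The same deviation from 50:50 is unremarkable at low coverage and highly significant at high coverage, which is why coverage and frequency thresholds have to be set together.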

How deep? Copy number aberrations

• Trisomy detection in free DNA from maternal plasma

• sequencing depth

• percentage of fetal DNA in maternal plasma

• Fan and Quake, PLoS ONE 2010

• Power calculation based on Poisson distribution after GC bias correction

• ~10 million aligned reads enables detection of fetal trisomy 21 at p<0.001 in a sample containing > 3.9 % fetal DNA

Overview


• Replicates


Number of biological replicates

• Power depends on

• experimental variation – platform, wet lab

• biological variation – organism, cell line

• effect size – differential expression, methylation, binding

• Compared to microarrays: NGS less technical variation, lower background, larger effect size

• Some rule of thumb examples

• inbred mice, cultured cells: 3-4 per group

• human samples: 20-25 per group for large effect sizes
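How power depends on variation, effect size and group size can be illustrated with a textbook normal approximation for a two-group comparison. This is a sketch under strong assumptions: a single test, no multiple-testing correction, and the effect size expressed as the difference in means divided by the standard deviation (biological plus experimental). It is not the method of any specific power analysis package, and `approx_power` is a hypothetical name.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Uses the normal approximation; it is optimistic for very small
    groups, where a t-distribution would be more appropriate.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Illustrative effect sizes: low variation (e.g. inbred mice) versus
# high variation (e.g. human samples) for the same absolute difference.
for d, n in ((2.0, 4), (0.8, 4), (0.8, 25)):
    print(f"effect size {d}, n = {n} per group: power ~ {approx_power(d, n):.2f}")
```

The same absolute difference needs far more samples per group once biological variation increases, which is the logic behind the rule-of-thumb numbers.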

replicates

Power estimation mRNA profiling

• Platform comparison study (P.-B. 't Hoen)

• 4-5 mice per group

• Power plotted against sample size per group

• R package SSPA (van Iterson et al., BMC Genomics 2009)

[Figure: power curves for Illumina sequencing and microarrays]

Minimize technical variance: randomize!

• Avoid bias: don't order your samples in a logical way!

Example for 2 groups with 4 samples each

Similar when pooling indexed libraries into a single lane
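Randomizing the processing order or lane assignment takes only a few lines. This is a minimal sketch using simple randomization, for 2 groups of 4 samples on 8 lanes; the sample names and the function `randomize_lanes` are made up for illustration.

```python
import random


def randomize_lanes(samples_by_group, n_lanes, seed=None):
    """Shuffle all samples and assign them to lanes in the shuffled order.

    Returns a list of (lane, group, sample) tuples. Record the seed so
    the layout can be reproduced and documented.
    """
    rng = random.Random(seed)
    samples = [(group, name)
               for group, names in samples_by_group.items()
               for name in names]
    rng.shuffle(samples)
    return [(i % n_lanes + 1, group, name)
            for i, (group, name) in enumerate(samples)]


design = {
    "control": ["C1", "C2", "C3", "C4"],
    "treated": ["T1", "T2", "T3", "T4"],
}
layout = randomize_lanes(design, n_lanes=8, seed=2024)
for lane, group, name in layout:
    print(lane, group, name)
```

Simple shuffling can still produce an unlucky layout by chance (for example all controls in the first lanes), so it is worth inspecting the result before going to the lab.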

Overview


• Replicates


Reference sequence

• Source ?

• errors, variants, ...

• Is it complete ?

• human genome far from complete

• Map against all ?

• full genome

• including X/Y, mtDNA, unmapped, … ?

• repeat for masked genome

• target region (mark probes / PCR fragments)

compare results!

alignment choices

Mapping software

• Settings

• which can be used ?

• which were used ?

• where do indels go ?

• paired-end / mate-pairs ?

• Effect of settings ?

• mapping non-unique reads: 2 or more positions

• out / to first position / probabilistic

• mapping paired reads

• allowed distance between mapped reads

Non-unique mapping

[Figure: a duplicated region, both copies with reference sequence GATTTGGGCAGAGCGATGG. True situation: the four reads from the first copy all match the reference (0% variant); of the four reads from the second copy, two carry GATTTGGGTAGAGCGATGG (50% variant).]

( e.g. globin genes: heterozygous SNP in one gene )

Non-unique mapping 2

[Figure: non-unique reads go to the dust bin; both copies appear deleted.]

Non-unique mapping 3

[Figure: map to first position. All eight reads (six GATTTGGGCAGAGCGATGG, two GATTTGGGTAGAGCGATGG) pile up on the first copy: 25% variant there, while the second copy appears deleted.]

Non-unique mapping 4

[Figure: probabilistic mapping. The eight reads are distributed at random over the two copies; in this outcome each copy receives three reads GATTTGGGCAGAGCGATGG and one read GATTTGGGTAGAGCGATGG: 25% variant at both copies.]

Non-unique mapping 4

[Figure: probabilistic mapping, another possible outcome. One copy receives both reads GATTTGGGTAGAGCGATGG plus two reads GATTTGGGCAGAGCGATGG (50% variant); the other copy receives four reads GATTTGGGCAGAGCGATGG (0% variant).]

set thresholds to detect variants according to alignment method used
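The effect of the three strategies for non-unique reads can be reproduced with a toy simulation. This is a sketch, assuming two identical gene copies, four reads from each, and a heterozygous SNP in the second copy only; the policy names `discard`, `first` and `random` are labels chosen here and do not correspond to options of any particular mapper.

```python
import random

READ_REF = "GATTTGGGCAGAGCGATGG"
READ_VAR = "GATTTGGGTAGAGCGATGG"

# True situation: copy 1 yields 4 reference reads, copy 2 yields 2 + 2.
READS = [READ_REF] * 6 + [READ_VAR] * 2


def place_reads(reads, policy, rng):
    """Assign reads that match both copies according to the chosen policy."""
    loci = {"copy1": [], "copy2": []}
    for read in reads:
        if policy == "discard":      # non-unique reads go to the dust bin
            continue
        if policy == "first":        # always report the first position
            loci["copy1"].append(read)
        elif policy == "random":     # probabilistic placement
            loci[rng.choice(["copy1", "copy2"])].append(read)
        else:
            raise ValueError(f"unknown policy: {policy}")
    return loci


def variant_fraction(reads):
    """Fraction of variant reads, or None when the locus has no coverage."""
    if not reads:
        return None
    return sum(read == READ_VAR for read in reads) / len(reads)


rng = random.Random(1)
for policy in ("discard", "first", "random"):
    placed = place_reads(READS, policy, rng)
    print(policy, {locus: variant_fraction(r) for locus, r in placed.items()})
```

None of the three policies recovers the true 0% and 50%, so the variant frequency threshold has to be chosen with the alignment method in mind.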

Dust bin analysis

• de novo assembly

• blast contigs

• target sequence

• genome

• repeat database

• all other organisms

• visual inspection

• count seqs

• sort sequences

• mutated primers

• blast

• against all databases

And finally…

Submit your NGS data to a public repository:

• Raw data (.srf or .sff) + processed data, e.g. transcript summarized and scaled

• Minimal Information About a high-throughput SEQuencing Experiment (MINSEQE)

• Carefully document your experiment, incl. data analysis!

• Draft MINSEQE proposal: http://www.mged.org/minseqe/

• It is a standard for publication!

Sequencing data repositories

• NCBI’s GEO and Short Read Archive

• mRNA expression profiling

• ChIP-Seq

• bisulfite sequencing

• small RNA discovery and profiling

• SAGE (Web submission available)

• EBI’s ArrayExpress for non-human samples; European Genome-phenome Archive (EGA) for human identifiable sequencing data (secure storage)

• MIAMExpress (MAGE-TAB spreadsheet)

www.ebi.ac.uk/miamexpress

www.ncbi.nlm.nih.gov/projects/geo/info/seq.html

Summary

• Think ahead of the final data analysis when you plan the experiment, include QC measures in your design!

• Choose the technology, alignment method and threshold settings that will help you answer your research question

• Data in = sequence out

• We certainly don’t have all the answers, but we can help you ask the right questions ...

With help from ...

Peter-Bram 't Hoen

Alex Hoogkamer

Maarten van Iterson

Henk Buermans

Michiel van Galen

Johan den Dunnen

http://www.lgtc.nl/home/index.php
