Think before you start sequencing
Judith Boer
Pediatric Oncology, Erasmus MC-Sophia Children’s Hospital, Rotterdam
Center for Human and Clinical Genetics, LUMC, Leiden
Sample & Experiment
• Understand the sample analysed
• organism(s): haploid, diploid, ...
• potential contaminants
• Understand the sequencing experiment
• primers used
• PCR, tagging, sequencing, barcoding, ...
• individual or pooled sample
• insert size of sequencing library
• region to analyse: genome or target region only
• controls included (spike-ins)
• sequencer used: specific errors
• Start immediately
• before data received
• generate & analyse model data: test software
introduction
Overview
• Technology choices
• Replicates
• Alignment choices
Technology choices
• Which sequencing platform?
• Single or paired end run?
• How long?
• How deep?
technology choices
Which sequencing platform?
• Depends on your application!
• short tags more suitable for quantitative analysis (counting)
• de novo assembly: best to combine two platforms
• Length and number of reads
• Number of samples
• Quality of reads: technology-specific errors, GC-bias
• Amount and quality of input material
• Library primer sequences influence yield in PCR
• platform-specific bias
Morozova & Marra Genomics 2008; Matt Hestand; Johan den Dunnen
Many short or fewer long?
Counting: many short
Re-sequencing: many short
Alignment to reference sequence
De novo assembly: fewer long (and many short)
Platform characteristics 1
Platform | FLX Titanium | HiSeq 2500 | SOLiD 4 | HeliScope | Ion Torrent/Proton | PacBio
Company | Roche | Illumina | Applied Biosystems | Helicos BioSciences | Life Technologies | Pacific Biosciences
Read length (bp) | 400-600 | 2x125 (2x250 in rapid mode) | 50+25 | ~30 | 1x400 / 1x200 | average 15,000
Samples/run | 16 | 16 | 16 | 50 | 1 | 16
Reads/run | ~1 M | >3500 M | >700 M | ~500 M | 5 M / 80 M | 0.6 M
Run time | 10 hrs | 6 days | 11-13 d | 8 days | 4 hrs | 2.5 days
DeepCAGE/DeepSAGE | * | **** | **** | *** | ** | whole cDNA seq
RNA-seq | *** | **** | **** | *** | * | -
miRNA-seq | | **** | **** | *** | ** | -
ChIP-seq | ** | **** | **** | ** | ** | mC
Resequencing | ** | **** | **** | ** | *** | ****
De novo seq | **** | ** | ** | * | ** | ****
Platform characteristics 2
Platform | HiSeq 2500 | HiSeq X10 | NextSeq 500 | MiSeq
Company | Illumina | Illumina | Illumina | Illumina
Read length (bp) | 2x125 (2x250 in rapid mode) | 2x150 | 2x150 | 2x300
Samples/run | 16 | ? lanes | 4 (same pool) | 1
Reads/run | >3500 M | 12000 M | 330 M | 25 M
Run time | 6 days | 3 days | 30 h | 60 h
DeepCAGE/DeepSAGE | **** | **** | *** | **
RNA-seq | **** | **** | *** | **
miRNA-seq | **** | **** | *** | **
ChIP-seq | **** | **** | *** | **
Resequencing | **** | ** | ** | ***
De novo seq | ** | ** | ** | ***
Single or paired end run?
Paired end run gives more sequence for a relatively small
increase in costs:
• More depth (or better quality)
• Include length of product in alignment
• DNA sequencing: better alignment in repeat regions
• RNA-Seq: more information about transcript variants
• But: need software that exploits this
• And: it still costs more...
Paired end run
• Re-sequencing of mutation hotspot in cancer gene
• Paired reads increase coverage and improve quality
How long? Read length
• Number of cycles on sequencer (read) plus QC
• For mapping to reference genome and counting: 25-30
• Paired end makes no sense
• Trisomy detection, SAGE-like RNA expression, ChIP-Seq
• Genome or transcriptome assembly: the longer the better
• But less than the library insert size!
• Paired end very informative
How deep? mRNA-Seq
• Transcriptome complexity and dynamic range
• Very low abundant transcripts are expressed at ~0.1
copies per cell, average of 5 tags in 10 million tags
• Tag sequencing: few million tags per sample sufficient
• need more for heterogeneous sample (e.g. tissue)
• Whole transcriptome sequencing: at least 3 times more
• reasonable coverage over the entire transcript
• more difficult sequence alignments (exon-exon borders)
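The depth argument for low-abundance transcripts can be made concrete with a small Poisson sketch. It takes the slide's figure of about 5 tags per 10 million tags for a very low abundant transcript and asks how likely such a transcript is to be seen at a given depth. The Poisson model, the function names and the example depths are assumptions for illustration, not part of the original slides.

```python
import math

# From the slide: a very low abundant transcript gives on average
# 5 tags in 10 million tags.
RATE_PER_TAG = 5 / 10_000_000


def prob_at_least(k, expected):
    """P(X >= k) for X ~ Poisson(expected)."""
    below = sum(
        math.exp(-expected) * expected ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


def detection_probability(depth, min_tags=1):
    """Chance of seeing at least min_tags tags of the transcript at this depth."""
    return prob_at_least(min_tags, depth * RATE_PER_TAG)


# Illustrative depths only (assumed values).
for depth in (2_000_000, 10_000_000, 30_000_000):
    print(
        depth,
        round(detection_probability(depth, min_tags=1), 3),
        round(detection_probability(depth, min_tags=5), 3),
    )
```

Requiring several tags instead of one is a stricter detection rule, which is why the second probability drops quickly at shallow depth.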
How deep? miRNA-Seq
• Few (~2) million tags per sample sufficient
• fewer miRNAs
• big dynamic range
• but not only miRNAs are sequenced
• Multiple samples combined using barcoding
• LGTC pilot experiment using three barcoded samples per
Illumina GA lane gave identical results to three individual
lanes without barcoding
Henk Buermans
How deep? ChromatinIP-Seq
• Depends on the number of binding sites in genome
• and the enrichment factor in the chromatin
immunoprecipitation step
• Example transcription factor:
• average fragment size 200 nt
• 50,000 binding sites in 3.2 Gb human genome: 0.3 %
• average enrichment 100 x (good antibody)
• 1/3 of tags will be in bound regions, 2/3 background
• 10 million tags average 60 tags per peak
• tags/peak will go down with decreasing enrichment or
increased number of binding sites
• histone modifications and meDIP: ~30-100 million / sample
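The transcription factor example on this slide is plain arithmetic, sketched here so the numbers can be changed for another antibody or genome. The function name is hypothetical, and the tag fraction uses the slide's simple approximation (genome fraction times enrichment), which is an assumption and slightly overestimates the bound fraction compared with a normalised calculation.

```python
def tags_per_peak(total_tags, n_sites, fragment_nt, genome_nt, enrichment):
    """Return (fraction of tags in bound regions, average tags per peak)."""
    # Fraction of the genome covered by bound fragments.
    bound_genome_fraction = n_sites * fragment_nt / genome_nt
    # Slide-style approximation: genome fraction times enrichment, capped at 1.
    tag_fraction = min(1.0, bound_genome_fraction * enrichment)
    return tag_fraction, total_tags * tag_fraction / n_sites


fraction, per_peak = tags_per_peak(
    total_tags=10_000_000,    # 10 million tags
    n_sites=50_000,           # binding sites
    fragment_nt=200,          # average fragment size
    genome_nt=3_200_000_000,  # 3.2 Gb human genome
    enrichment=100,           # good antibody
)
print(f"fraction of tags in bound regions: {fraction:.2f}")
print(f"average tags per peak: {per_peak:.1f}")
```

Without the slide's rounding of 0.31 % to 0.3 %, this gives values close to the quoted 1/3 and 60 tags per peak. Lowering `enrichment` or raising `n_sites` shows how quickly tags per peak fall.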
How deep? Genomic sequencing
• Depends on ploidy of genome
• Reliable calling of heterozygote SNPs needs more tags
• Deep sequencing is more sensitive: be prepared to get
signals from e.g. somatic variation
• With higher coverage chances increase that you pick up
a signal caused by random error
• Allele calling: ratio is important
• 8 x A, 12 x G = OK
• 75 x A, 25 x G = too far from 50:50
• Minimal coverage (e.g. 20x) and frequency thresholds
given for variant calling algorithms
Li, Ruan & Durbin, Genome Research 2008
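The allele ratio example can be illustrated with an exact binomial test against the 50:50 ratio expected for a heterozygous position in a diploid genome. This is a minimal sketch, not the method of any particular variant caller; the function name and the choice of a two-sided test are assumptions.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    probs = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so that ties with the observed outcome are included.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


# Counts from the slide: number of A reads out of all reads at the position.
p_8_of_20 = binomial_two_sided_p(8, 20)      # 8 x A, 12 x G
p_75_of_100 = binomial_two_sided_p(75, 100)  # 75 x A, 25 x G

print(f"8 A vs 12 G:  p = {p_8_of_20:.3f}")
print(f"75 A vs 25 G: p = {p_75_of_100:.2e}")
```

A large p-value means the counts are compatible with 50:50. A very small one, as for 75:25, says the position is not a simple heterozygote and needs another explanation, for example somatic variation, copy number or mapping artefacts.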
How deep? Copy number aberrations
• Trisomy detection in free DNA from maternal plasma
• sequencing depth
• percentage of fetal DNA in maternal plasma
• Fan and Quake, Plos One 2010
• Power calculation based on Poisson distribution after GC
bias correction
• ~10 million aligned reads enables detection of fetal
trisomy 21 at p<0.001 in a sample containing > 3.9 %
fetal DNA
Overview
• Technology choices
• Replicates
• Alignment choices
Number of biological replicates
• Power depends on
• experimental variation – platform, wet lab
• biological variation – organism, cell line
• effect size – differential expression, methylation, binding
• Compared to microarrays: NGS less technical variation,
lower background, larger effect size
• Some rule of thumb examples
• inbred mice, cultured cells: 3-4 per group
• human samples: 20-25 per group for large effect sizes
replicates
Power estimation mRNA profiling
• Platform comparison study (P.-B. 't Hoen)
• 4-5 mice per group
• Power plotted against sample size per group
• R package SSPA (van Iterson et al., BMC Genomics 2009)
Figure legend: Illumina sequencing, Microarrays
Minimize technical variance: randomize!
• Avoid bias: don't order your samples in a logical way !!!
Example for 2 groups with 4 samples each
Similar when pooling indexed libraries into a single lane
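A randomised processing order for the 2 x 4 example can be generated in a few lines. This is a sketch under assumptions: the sample names and the function name are hypothetical, and a simple shuffle is used rather than any specific blocking scheme. Recording the seed makes the order reproducible for the lab notebook.

```python
import random


def randomized_run_order(groups, seed=None):
    """Return all (group, sample) pairs in a random order."""
    rng = random.Random(seed)
    samples = [
        (group, sample)
        for group, members in groups.items()
        for sample in members
    ]
    rng.shuffle(samples)
    return samples


# Hypothetical sample names for 2 groups with 4 samples each.
groups = {
    "control": ["C1", "C2", "C3", "C4"],
    "treated": ["T1", "T2", "T3", "T4"],
}

order = randomized_run_order(groups, seed=2024)
for position, (group, sample) in enumerate(order, start=1):
    print(position, group, sample)
```

The same list can be used to assign lanes or index positions when pooling libraries.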
Overview
• Technology choices
• Replicates
• Alignment choices
Reference sequence
• Source ?
• errors, variants, ...
• Is it complete ?
• human genome far from complete
• Map against all ?
• full genome
• including X/Y, mtDNA, unmapped, … ?
• repeat for masked genome
• target region (mark probes / PCR fragments)
compare results!
alignment choices
Mapping software
• Settings
• which can be used ?
• which were used ?
• where do indels go ?
• paired-end / mate-pairs ?
• Effect of settings ?
• mapping non-unique reads: 2 or more positions
• out / to first position / probabilistic
• mapping paired reads
• allowed distance between mapped reads
Non-unique mapping
Reference with two identical copies:
GATTTGGGCAGAGCGATGG GATTTGGGCAGAGCGATGG
True situation ( e.g. globin genes: heterozygous SNP in one gene )
Copy 1, 0% variant:
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
+
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
Copy 2, 50% variant:
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
+
GATTTGGGTAGAGCGATGG
GATTTGGGTAGAGCGATGG
Non-unique mapping 2
Reference with two identical copies:
GATTTGGGCAGAGCGATGG GATTTGGGCAGAGCGATGG
Non-unique reads: to dust bin
Copy 1: deletion
Copy 2: deletion
Non-unique mapping 3
Reference with two identical copies:
GATTTGGGCAGAGCGATGG GATTTGGGCAGAGCGATGG
Non-unique reads: map to first position
Copy 1, 25% variant:
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
GATTTGGGTAGAGCGATGG
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
GATTTGGGTAGAGCGATGG
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
Copy 2: deletion
Non-unique mapping 4
Reference with two identical copies:
GATTTGGGCAGAGCGATGG GATTTGGGCAGAGCGATGG
Non-unique reads: probabilistic mapping
Copy 1, 25% variant:
GATTTGGGCAGAGCGATGG
GATTTGGGTAGAGCGATGG
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
Copy 2, 25% variant:
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
GATTTGGGTAGAGCGATGG
Non-unique mapping 4
Reference with two identical copies:
GATTTGGGCAGAGCGATGG GATTTGGGCAGAGCGATGG
Non-unique reads: probabilistic mapping
Copy 1, 50% variant:
GATTTGGGCAGAGCGATGG
GATTTGGGTAGAGCGATGG
GATTTGGGTAGAGCGATGG
GATTTGGGCAGAGCGATGG
Copy 2, 0% variant:
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
GATTTGGGCAGAGCGATGG
Set thresholds to detect variants according to the alignment method used
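The effect of the three policies for non-unique reads can be reproduced with a toy simulation of the 8 reads from these slides. This is only a sketch: "probabilistic mapping" is modelled here as a random choice between the two copies, which is an assumption and not how any specific aligner works, and all names are hypothetical.

```python
import random

REF = "GATTTGGGCAGAGCGATGG"
VAR = "GATTTGGGTAGAGCGATGG"

# True situation from the slides: copy 1 has no variant,
# copy 2 carries a heterozygous SNP (50% variant reads).
TRUE_READS = {"copy1": [REF] * 4, "copy2": [REF, REF, VAR, VAR]}


def place_reads(reads, strategy, rng=None):
    """Assign non-unique reads to the two identical reference copies."""
    rng = rng or random.Random()
    placed = {"copy1": [], "copy2": []}
    for read in reads:
        if strategy == "discard":    # to dust bin
            continue
        elif strategy == "first":    # map to first position
            placed["copy1"].append(read)
        elif strategy == "random":   # assumed model of probabilistic mapping
            placed[rng.choice(["copy1", "copy2"])].append(read)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return placed


def variant_fraction(reads):
    """Fraction of variant reads, or None without coverage (looks like a deletion)."""
    if not reads:
        return None
    return sum(read == VAR for read in reads) / len(reads)


all_reads = TRUE_READS["copy1"] + TRUE_READS["copy2"]
for strategy in ("discard", "first", "random"):
    placed = place_reads(all_reads, strategy, rng=random.Random(1))
    print(strategy, {copy: variant_fraction(r) for copy, r in placed.items()})
```

Running the random strategy with different seeds gives different variant fractions per copy, which is the reason thresholds have to match the alignment method.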
Dust bin analysis
• de novo assembly
• blast contigs
• target sequence
• genome
• repeat database
• all other organisms
• visual inspection
• count seqs
• sort sequences
• mutated primers
• blast
• against all databases
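The "count seqs" and "sort sequences" steps of a dust bin analysis can be sketched as below: count identical unmapped sequences and list the most abundant ones as first candidates for BLAST or visual inspection. The function name and the example reads are hypothetical, and real data would be read from the aligner's unmapped output.

```python
from collections import Counter


def most_abundant(sequences, n=3):
    """Count identical sequences and return the n most frequent with their counts."""
    return Counter(sequences).most_common(n)


# Hypothetical unmapped reads, for illustration only.
unmapped = [
    "GATTTGGGTAGAGCGATGG",
    "ACGTACGTACGTACGTACG",
    "GATTTGGGTAGAGCGATGG",
    "TTTTTTTTTTTTTTTTTTT",
    "GATTTGGGTAGAGCGATGG",
    "ACGTACGTACGTACGTACG",
]

for sequence, count in most_abundant(unmapped):
    print(count, sequence)
```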
And finally…
Submit your NGS data to a public repository:
• Raw data (.srf or .sff) + processed data, e.g. transcript
summarized and scaled
• Minimal Information About a high-throughput
SEQuencing Experiment (MINSEQE)
• Carefully document your experiment, incl. data analysis!
• Draft MINSEQE proposal: http://www.mged.org/minseqe/
• It is a standard for publication!
www.mged.org/minseqe
Sequencing data repositories
• NCBI’s GEO and Short Read Archive
• mRNA expression profiling
• ChIP-Seq
• bisulfite sequencing
• small RNA discovery and profiling
• SAGE (Web submission available)
• EBI’s ArrayExpress for non-human samples;
European Genome-phenome Archive (EGA) for human
identifiable sequencing data (secure storage)
• MIAMExpress (MAGE-TAB spreadsheet)
www.ebi.ac.uk/miamexpress www.ncbi.nlm.nih.gov/projects/geo/info/seq.html
Summary
• Think ahead of the final data analysis when you plan
the experiment, include QC measures in your design!
• Choose the technology, alignment method and
threshold settings that will help you answer your
research question
• Data in = sequence out
• We certainly don’t have all the answers, but we can
help you ask the right questions ...
With help from ...
Peter-Bram 't Hoen
Alex Hoogkamer
Maarten van Iterson
Henk Buermans
Michiel van Galen
Johan den Dunnen