© 2010 Illumina, Inc. All rights reserved.
Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro,
GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
Illumina Primary
and Secondary
Analysis
David Townley, Ph.D.
Bioinformatics specialist
Illumina UK
2
3
Illumina Data Analysis Overview
Data Analysis is grouped into three main categories
– Primary Analysis
– Secondary Analysis
– Data Visualization
Note: Primary Analysis prior to v1.6 was called Pipeline
4
Illumina Sequencing Workflow
– Sample Prep
– Instrumentation: cBot; HiSeq 2000; Genome Analyzer IIx, Paired-end Module
– Analysis: Primary Analysis, Secondary Analysis
5
Illumina Sequencing Workflow Outcomes
– Sample Prep: (DNA Library)
– Instrumentation: Clusters, Images/TIFF files
– Primary Analysis: Intensities, Basecalling
– Secondary Analysis: Alignments
6
Data Volumes
Data Volume                       Total          Final     Comment
HiSeq 2000 200G run
Image Data                        32 TB          0
Intensity Data                    2 TB           0         Optionally transferred
Base Call / Quality Score Data    250 GB         250 GB    1 byte/base (raw) assuming qseq generation offline
Alignment Output                  6 TB (3 TB)    1.2 TB    Remove intermediate files
GAIIx 50G run
Image Data                        6.9            0         Optionally transferred
Intensity Data                    0.93           0.93
Base Call / Quality Score Data    0.17           0.17
Alignment Output                  1.2 TB         1.2 TB
7
Primary Analysis
8
Primary Analysis
Instrument Control Software
– Provides a graphical interface
while running the instrument
Real Time Analysis (RTA)
– Component within the Instrument
Control Software that monitors the
run's progress, optimizes run
conditions and provides run-time
quality statistics
Off-Line Base Caller (OLB)
– Provides the option to perform
data analysis off-line
Primary Data Analysis Components
[Diagram of the analysis components and the files passed between them]
– Primary Analysis: Instrument Control Software, RTA, Off-Line Basecaller, Firecrest (image analysis), Bustard (base calling)
– Files: CIF files, qseq or .bcl files, Samplesheet.csv
– Secondary Analysis: CASAVA, CASAVA Build
– Data Visualization: GenomeStudio or third-party apps
9
Image Analysis Algorithm
[Figure labels: Threshold, Maximum, Cluster]
10
Base Calling
11
Base Calling
[Figure: corrected intensity shown for each base (A, C, G, T)]
12
Phred Quality Scores
– A quality score is a prediction of the probability of an error in base calling
– Phred is a method for assigning quality scores to sequencing data, using numerical predictors of base quality
– Quoted on a logarithmic scale: Q = -10 × log10(probability of an incorrect base call)
Phred Quality Scores are produced by a model that uses quality predictors as inputs and produces Q-scores as outputs
Phred Quality Score   Probability of Incorrect Base Call   Base Call Accuracy   Q-score
10                    1 in 10                              90%                  Q10
20                    1 in 100                             99%                  Q20
30                    1 in 1000                            99.9%                Q30
40                    1 in 10000                           99.99%               Q40
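The rows of the table follow the standard Phred relationship Q = -10 × log10(P), where P is the probability of an incorrect base call. A minimal Python sketch of the conversion in both directions is given here; the function names are illustrative and not part of any Illumina tool.

```python
import math


def phred_from_error_prob(p_error: float) -> float:
    """Convert an error probability into a Phred quality score."""
    return -10.0 * math.log10(p_error)


def error_prob_from_phred(q: float) -> float:
    """Convert a Phred quality score back into an error probability."""
    return 10.0 ** (-q / 10.0)


# Recompute the table rows for Q10 through Q40.
for q in (10, 20, 30, 40):
    p = error_prob_from_phred(q)
    print(f"Q{q}: 1 in {round(1 / p)} incorrect, accuracy {100 * (1 - p):.2f}%")
```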
13
Predicting Quality Scores for New Data
Phred output is a table
– predictor 1 value, predictor 2 value, ..., quality score
– predictor 1 value, predictor 2 value, ..., quality score
– ...
To get new quality scores
1. Compute predictors for new base call
2. Compare predictors to each line of table.
3. When you find a line where all of the predictors in the table are bigger than the
predictors for the base, use the corresponding quality value
Table example with 2 predictors:
– 0.1 0.5 Q30
– 0.2 0.2 Q25
– 0.25 0.3 Q20
New base: if predictors are 0.15 0.15, then base is Q25
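The lookup procedure above can be sketched in a few lines of Python. The table values are the ones from the slide example; the function name, the tuple layout, and the choice to return None when no row matches are assumptions made for illustration, not the behavior of the actual base caller.

```python
from typing import Optional, Sequence, Tuple

# Each row: (predictor thresholds, quality score), in table order.
QUALITY_TABLE = [
    ((0.1, 0.5), 30),
    ((0.2, 0.2), 25),
    ((0.25, 0.3), 20),
]


def lookup_quality(
    predictors: Sequence[float],
    table: Sequence[Tuple[Sequence[float], int]] = QUALITY_TABLE,
) -> Optional[int]:
    """Return the score of the first row whose predictors all exceed the base's."""
    for thresholds, quality in table:
        if all(t > p for t, p in zip(thresholds, predictors)):
            return quality
    return None  # assumption: no matching row means no score is assigned


print(lookup_quality((0.15, 0.15)))
```

For the new base with predictors 0.15 and 0.15, the first row fails because 0.1 is not bigger than 0.15, and the second row matches, which gives Q25 as stated on the slide.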
14
Quality Score Representation
Quality scores are represented as ASCII characters (to save space)
– One ASCII character per base
To get Phred score:
– Phred Score = ASCII value – 64
Sanger quality scores use the same principle
– Same as a Phred score but the ASCII score calculation is different
– Sanger Score = ASCII value – 33
Examples of ASCII
Character   ASCII Value   Phred Score
^           94            30
_           95            31
`           96            32
a           97            33
b           98            34
c           99            35
d           100           36
e           101           37
f           102           38
g           103           39
Note: Get more info on ASCII tables at http://www.asciitable.com
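As a small illustration of the two offsets described on this slide (64 for the Phred score, 33 for the Sanger score), the following Python sketch decodes a quality string one character per base. The function name and the sample strings are made up for the example.

```python
def decode_qualities(quality_string: str, offset: int = 64) -> list:
    """Decode one ASCII character per base into a numeric quality score."""
    return [ord(char) - offset for char in quality_string]


# Characters from the slide's example table, decoded with the 64 offset.
illumina_scores = decode_qualities("^_`abcdefg", offset=64)

# The same principle with the Sanger offset of 33.
sanger_scores = decode_qualities("?I", offset=33)

print(illumina_scores)
print(sanger_scores)
```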
15
qseq.txt File
Tab-delimited: easy to parse, easy to import into databases
Files are split per read on a read pair / multiple read run
Fields:
– Instrument
– Run ID
– Lane
– Tile
– X-coord
– Y-coord
– Index #
– Read #
– Sequence
– ASCII Character Q-score
– PF (0,1)
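Because the file is tab-delimited, a record can be parsed with a plain split. The sketch below assumes eleven columns in the order Instrument, Run ID, Lane, Tile, X-coord, Y-coord, Index #, Read #, Sequence, Q-score string, PF; the slide only labels the fields, so that column order is an assumption, and the sample line (including the instrument name) is invented for illustration.

```python
from typing import NamedTuple


class QseqRecord(NamedTuple):
    instrument: str
    run_id: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    read: int
    sequence: str
    quality: str
    passed_filter: bool


def parse_qseq_line(line: str) -> QseqRecord:
    """Parse one tab-delimited qseq line (assumed 11-column layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, found {len(fields)}")
    return QseqRecord(
        fields[0], fields[1],
        int(fields[2]), int(fields[3]), int(fields[4]), int(fields[5]),
        fields[6], int(fields[7]),
        fields[8], fields[9],
        fields[10] == "1",  # PF flag: 1 = passed filter, 0 = failed
    )


# Hypothetical record, not taken from a real run.
sample = "EXAMPLE-INSTR\t1\t3\t42\t1000\t2000\t0\t1\tACGT\t^_`a\t1\n"
record = parse_qseq_line(sample)
print(record.lane, record.sequence, record.passed_filter)
```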
16
HiSeq 2000 Real-time Metrics – Extensive/Interactive
17
Status.xml
– Data visualization
– Location: \\RunName\Data\Status.xml
– Provides analysis status/progress, updated throughout run, available off-line
Provides access to the most important runtime statistics
– Run Info
– Tile Status
– Charts
– Cluster Density
– Data by Cycle
18
Real Time Metrics: Data by Cycle
Box-plot graphs deciphered:
– Red line – median
– Box – interquartile range (IQR), the middle 50% of the data
– Error bars – min and max for the metric
– Outliers – values more than 1.5 × IQR below or above the box
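The quantities behind such a box plot can be computed with the Python standard library. This is a sketch of the usual box-plot convention (quartiles plus a 1.5 × IQR outlier rule), not the code used by the instrument software; the quartile interpolation method and the sample values are assumptions.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Return median, quartiles, IQR, min/max, and outliers for a metric."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "min": data[0],
        "max": data[-1],
        "outliers": outliers,
    }


# Made-up per-tile metric values with one obvious outlier.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```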
19
Status.xml: Cluster Density
The cluster density plot shows the cluster density of each lane as a box plot
20
Quality score by cycle
21
HiSeq 2000 Real-time Metrics – Q Score Distribution
22
Secondary Analysis - CASAVA
23
What is CASAVA?
Consensus Assessment of
Sequence And VAriation (CASAVA)
is a Linux application designed to:
– Demultiplex samples
– Align reads
– Call alleles and SNPs
– Find indels
– Count expression levels for exons, genes and splice junctions in RNA-seq runs
CASAVA's output is a folder structure (called a “CASAVA Build”) ready for import into GenomeStudio for visualization and further analysis
[Diagram: Primary Analysis, Secondary Analysis, Data Visualization. Labels within Secondary Analysis: script for multiplexed run; Gerald (Eland v2); Multiple *_export.txt + other files; Counting (for mRNA); SNP and Indel Calls; CASAVA Build]
24
Alignment
25
GERALD
Generation of Recursive Analyses
Linked by Dependency (GERALD)
is the alignment module in
CASAVA
Configuration through GERALD
configuration file
It still works essentially the same
as before but with enhanced
alignment options/algorithms
[Diagram: Primary Analysis, Secondary Analysis, Data Visualization. Inputs: SampleSheet.csv, qseq or .bcl files. Labels within Secondary Analysis: script for multiplexed run; Gerald (Eland v2); Multiple *_export.txt + other files; Counting (for mRNA); SNP and Indel Calls; CASAVA Build]
26
Ungapped vs. Gapped Alignment
[Figure: a read with an insertion and a read with a deletion, each aligned to the reference genome, comparing the true alignment with the extension alignment. Matches are marked |, mismatches X, gaps *.]
– Ungapped (no extension alignment): bases after the indel are marked as mismatches
– Gapped (up to 20 bases): the gap is opened and the true alignment is recovered
Read with Insertion (gapped):
ATCGTTAACGTAA******CCGATAG
ATCGTTAACGTAAGTTAGTCCGATAG
Read with Deletion (gapped):
ATCGTTAACGTAAAACGTCCGATAG
ATCGTTAACGTAA*****CCGATAG
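To make the difference concrete, the sketch below compares an ungapped comparison with a search that allows one deletion gap, using the 25-base reference and the 20-base read with a deletion shown on this slide. It is a toy illustration of the idea only, not the ELAND or GERALD algorithm; the function names are hypothetical, the read is assumed to span the whole reference window, and the 20-base limit is the gap size quoted on the slide.

```python
REFERENCE = "ATCGTTAACGTAAAACGTCCGATAG"
READ_WITH_DELETION = "ATCGTTAACGTAACCGATAG"


def count_mismatches(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def ungapped_mismatches(reference: str, read: str) -> int:
    """Compare the read base by base against the start of the reference."""
    return count_mismatches(reference[: len(read)], read)


def best_single_deletion(reference: str, read: str, max_gap: int = 20):
    """Try every position for one gap; return (mismatches, position, gap length)."""
    gap = len(reference) - len(read)
    if not 0 < gap <= max_gap:
        raise ValueError("read must be shorter than the reference by 1..max_gap bases")
    best = None
    for pos in range(len(read) + 1):
        mismatches = count_mismatches(read[:pos], reference[:pos])
        mismatches += count_mismatches(read[pos:], reference[pos + gap:])
        if best is None or mismatches < best[0]:
            best = (mismatches, pos, gap)
    return best


print(ungapped_mismatches(REFERENCE, READ_WITH_DELETION))
print(best_single_deletion(REFERENCE, READ_WITH_DELETION))
```

Without a gap, most bases after the deletion point are counted as mismatches; with a single 5-base gap placed after the first 13 bases, the read matches the reference exactly.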
27
Singleseed vs. Multiseed Alignment
[Figure: Read 1 aligned to the reference genome under singleseed and multiseed alignment, with matches marked | and mismatches marked X.]
– Panels: No Extension Alignment; Extension Alignment
– Singleseed: first 32 base seed (Seed Alignment)
– Multiseed: first 32 base seed, then a second seed (up to four seeds)
– Extension alignment is the second seed
28
Variants Detection (SNPs + Indels) + Read Counts (RNA-Seq)
Inputs: genome_size.xml, _export.txt files, fasta Files
Steps: Import GERALD Files, Sort, Call Alleles, Call SNPs, Call Indels, Remove Duplicates
Outputs: Sorted Text Files, Sort Count Files, Indel Files, Count Files, SNP Text Files
…
Paired-End Read Only
RNA Only
29
CASAVA Build
Note: Pumpkin text denotes folders, blue text represents files
30
Other tools
31
Leveraging the GA Informatics Community
De Novo Assembly
Velvet – De novo assembly of short reads
– Daniel Zerbino and Ewan Birney, EMBL-EBI
– http://www.ebi.ac.uk/~zerbino/velvet/
SSAKE – Assembly of short reads
– Group: Rene Warren, et al; British Columbia
– http://bioinformatics.oxfordjournals.org/cgi/content/full/23/4/500
Euler SR – Genomic Assembly
– Group: Pavel Pevzner, Mark Chaisson; UC San Diego
– http://nbcr.sdsc.edu/euler/
SOAPdenovo
– http://soap.genomics.org.cn/
32
Leveraging the GA Informatics Community
Alignment and Polymorphism Detection
SOAP – Short Oligonucleotide Alignment Program
– Ruiqiang Li, Beijing Genomics Institute
– http://soap.genomics.org.cn/
BWA
– Heng Li, Sanger Institute
– http://bio-bwa.sourceforge.net/
Bowtie
– Ben Langmead, University of Maryland
– http://bowtie-bio.sourceforge.net/index.shtml
33
iConnect Program:
Connecting with the larger informatics universe
~30 vendors and academic partners in the program
http://www.illumina.com/pagesnrn.ilmn?ID=229
Third-party tools are available for a broad range of genetic analysis
applications including:
Sequence alignment, SNP calling, indel detection
Sequencing informatics workflow and data management
Whole-genome association
Copy number variation analysis
Gene expression analysis
eQTL analysis
Multi-assay data integration
Biological pathway and network analysis