
© 2010 Illumina, Inc. All rights reserved.

Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

Illumina Primary and Secondary Analysis

David Townley, Ph.D.

Bioinformatics specialist

Illumina UK

Illumina Data Analysis Overview

Data Analysis is grouped into three main categories
– Primary Analysis
– Secondary Analysis
– Data Visualization

Note: Primary Analysis prior to v1.6 was called Pipeline

Illumina Sequencing Workflow

Stages: Sample Prep, Instrumentation, Analysis

– Instrumentation: cBot; HiSeq 2000; Genome Analyzer IIx, Paired-end Module
– Analysis: Primary Analysis, Secondary Analysis

Illumina Sequencing Workflow Outcomes

– Sample Prep: DNA Library
– Instrumentation: Clusters, Images/TIFF files
– Analysis, Primary: Intensities, Basecalling
– Analysis, Secondary: Alignments

Data Volumes

Data Volume                      Total         Final    Comment

HiSeq 2000 200G run
Image Data                       32 TB         0
Intensity Data                   2 TB          0        Optionally transferred
Base Call / Quality Score Data   250 GB        250 GB   1 byte/base (raw) assuming qseq generation offline
Alignment Output                 6 TB (3 TB)   1.2 TB   Remove intermediate files

GAIIx 50G run
Image Data                       6.9           0        Optionally transferred
Intensity Data                   0.93          0.93
Base Call / Quality Score Data   0.17          0.17
Alignment Output                 1.2 TB        1.2 TB


Primary Analysis

Primary Data Analysis Components

Instrument Control Software
– Provides a graphical interface while running the instrument

Real Time Analysis (RTA)
– Component within the Instrument Control Software that monitors the run's progress, optimizes run conditions and provides run-time quality statistics

Off-Line Base Caller (OLB)
– Provides the option to perform data analysis off-line

[Diagram labels: Primary Analysis: Instrument Control Software, RTA; Off-Line Basecaller: Firecrest (image analysis), Bustard (base calling); CIF files; qseq or .bcl files. Secondary Analysis: CASAVA, SampleSheet.csv, CASAVA Build. GenomeStudio or third party apps.]

Note: Primary Analysis prior to v1.6 was called Pipeline

Image Analysis Algorithm

[Figure: cluster detection, labelled Threshold, Maximum, Cluster]

Base Calling

[Figure: corrected intensity for the base channels (A, C, G, T)]

Phred Quality Scores

– A quality score is a prediction of the probability of an error in base calling
– A method for assigning quality scores to sequencing data, using numerical predictors of base quality
– Quoted on a logarithmic scale: Q = -10 × log10(P), where P is the probability of an incorrect base call

Phred Quality Scores are produced by a model that uses quality predictors as inputs and produces Q-scores as outputs

Phred Quality Score   Probability of Incorrect Base Call   Base Call Accuracy   Q-score
10                    1 in 10                              90%                  Q10
20                    1 in 100                             99%                  Q20
30                    1 in 1000                            99.9%                Q30
40                    1 in 10000                           99.99%               Q40
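The rows of the table follow the standard Phred relation Q = -10 × log10(P). A minimal Python sketch of the conversion in both directions; the function names are illustrative and not part of any Illumina software:

```python
import math


def phred_to_error_prob(q):
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Reproduce the rows of the table above.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 error in {round(1 / p)} calls, accuracy {100 * (1 - p):.2f}%")
```

Each step of 10 in Q corresponds to a tenfold drop in the error probability.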


Predicting Quality Scores for New Data

Phred output is a table

– predictor 1 value, predictor 2 value, ..., quality score

– predictor 1 value, predictor 2 value, ..., quality score

– ...

To get new quality scores

1. Compute predictors for new base call

2. Compare predictors to each line of table.

3. When you find the first line where all of the predictors in the table are bigger than the predictors for the base, use the corresponding quality value

Table example with 2 predictors:

– 0.1 0.5 Q30

– 0.2 0.2 Q25

– 0.25 0.3 Q20

New base: if predictors are 0.15 0.15, then base is Q25
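The lookup described on this slide can be sketched in a few lines of Python. This is an illustration of the three steps only, assuming the table is scanned top to bottom and the first matching line wins; the function name and the table layout (predictor thresholds followed by a quality value) are hypothetical.

```python
def lookup_quality(table, predictors):
    """Return the quality of the first table line whose predictor
    values are all bigger than the predictors of the new base call."""
    for *thresholds, quality in table:
        if all(t > p for t, p in zip(thresholds, predictors)):
            return quality
    return None  # no line matched


# The two-predictor example table from the slide.
TABLE = [
    (0.1, 0.5, 30),
    (0.2, 0.2, 25),
    (0.25, 0.3, 20),
]

print(lookup_quality(TABLE, (0.15, 0.15)))  # 25
```

The first line is rejected because 0.1 is not bigger than 0.15; the second line is the first where both values are bigger, so the base is Q25.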

Quality Score Representation

Quality scores are represented as ASCII characters (to save space)
– One ASCII character per base

To get Phred score:
– Phred Score = ASCII value - 64

Sanger quality scores use the same principle
– Same as a Phred score but the ASCII score calculation is different
– Sanger Score = ASCII value - 33

Examples of ASCII

Character   ASCII Value   Phred Score
^           94            30
_           95            31
`           96            32
a           97            33
b           98            34
c           99            35
d           100           36
e           101           37
f           102           38
g           103           39

Note: Get more info on ASCII tables at http://www.asciitable.com
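As a small illustration of the two offsets on this slide, the sketch below decodes a quality string into numeric scores. The function name is hypothetical; only the offsets 64 and 33 come from the slide.

```python
def decode_qualities(quality_string, offset=64):
    """Convert one ASCII character per base into numeric quality scores.

    offset=64 gives the Phred score as described on the slide,
    offset=33 gives the Sanger score.
    """
    return [ord(ch) - offset for ch in quality_string]


print(decode_qualities("^_abc"))             # [30, 31, 33, 34, 35]
print(decode_qualities("?@BCD", offset=33))  # [30, 31, 33, 34, 35]
```

The two example strings encode the same scores, which shows why the offset has to be known before a quality string can be interpreted.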

qseq.txt File

Tab-delimited: easy to parse, easy to import into databases

Split files per read on a read pair / multiple read run

Fields: Instrument, Run ID, Lane, Tile, X-coord, Y-coord, Index #, Read #, Sequence, ASCII Character Q-score, PF (0,1)
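Because the file is tab-delimited, a record can be parsed with a plain split. The sketch below assumes the eleven fields labelled on this slide appear in the order instrument, run ID, lane, tile, X, Y, index, read number, sequence, quality, PF; that column order is an assumption to check against a real qseq.txt file, and the sample line is made up for illustration.

```python
FIELD_NAMES = (
    "instrument", "run_id", "lane", "tile", "x", "y",
    "index", "read_number", "sequence", "quality", "passed_filter",
)


def parse_qseq_line(line):
    """Parse one tab-delimited qseq record into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    record = dict(zip(FIELD_NAMES, fields))
    for key in ("lane", "tile", "x", "y", "read_number"):
        record[key] = int(record[key])
    # PF is 0 or 1; keep it as a boolean.
    record["passed_filter"] = record["passed_filter"] == "1"
    return record


# Hypothetical record, not taken from a real run.
sample = "\t".join(
    ["INSTRUMENT-1", "7", "2", "34", "1056", "987", "0", "1",
     "ACGTACGTA", "ggfedcba^", "1"]
)
record = parse_qseq_line(sample)
print(record["lane"], record["sequence"], record["passed_filter"])
```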


HiSeq 2000 Real-time Metrics – Extensive/Interactive

Status.xml

– Data visualization
– Location: \\RunName\Data\Status.xml
– Provides analysis status/progress, updated throughout run, available off-line

Provides access to the most important runtime statistics
– Run Info
– Tile Status
– Charts
– Cluster Density
– Data by Cycle

Real Time Metrics: Data by Cycle

Box-plot graphs deciphered:
– Red line – median
– Box – interquartile range (IQR) – middle 50% of the data
– Error bars – min and max for the metric
– Outliers – more than 1.5 × IQR below/above the box
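To make the box-plot elements concrete, the sketch below computes them for a list of values. It uses the common convention that outliers lie more than 1.5 times the interquartile range beyond the quartiles; the exact quartile method used by the instrument software is not stated in the slides, so the "inclusive" method here is an assumption, as are the function name and the example data.

```python
import statistics


def box_plot_summary(values):
    """Median, quartiles, IQR, extremes and outliers for one box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "min": min(values),
        "max": max(values),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Made-up metric values for one cycle.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(summary["median"], summary["iqr"], summary["outliers"])
```

In this example the quartiles are 3 and 7, so the IQR is 4 and only the value 100 falls outside the fences at -3 and 13.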

Status.xml: Cluster Density

The cluster density plot shows the density for each lane as a box plot


Quality score by cycle


HiSeq 2000 Real-time Metrics – Q Score Distribution


Secondary Analysis - CASAVA

What is CASAVA?

Consensus Assessment of Sequence And VAriation (CASAVA) is a Linux application designed to:
– Demultiplex samples
– Align reads
– Call alleles and SNPs
– Find indels
– Count expression level for exons, genes and splice junctions in case of RNA-seq runs

CASAVA's output is a folder structure (called a “CASAVA Build”) ready for import into GenomeStudio for visualization and further analysis

[Diagram labels: Primary Analysis. Secondary Analysis: GERALD (ELAND v2), script for multiplexed run, multiple *_export.txt + other files, SNP and indel calls, counting for mRNA, CASAVA Build. Data Visualization.]

Alignment

GERALD

Generation of Recursive Analyses Linked by Dependency (GERALD) is the alignment module in CASAVA

Configured through the GERALD configuration file

It still works essentially the same as before but with enhanced alignment options/algorithms

[Diagram labels: Primary Analysis: SampleSheet.csv, qseq or .bcl files. Secondary Analysis: GERALD (ELAND v2), script for multiplexed run, multiple *_export.txt + other files, SNP and indel calls, counting for mRNA, CASAVA Build. Data Visualization.]

Ungapped vs. Gapped Alignment

[Figure: a read with an insertion and a read with a deletion, each aligned to the reference genome without and with gaps. Ungapped (no extension alignment): the bases before the indel match (|) and the bases after it are scored as mismatches (X). Gapped (up to 20 bases): a gap (*) is opened at the indel, so the bases on both sides match and the true alignment is recovered. Example sequences: ATCGTTAACGTAACCGATAG and ATCGTTAACGTAAGTTAGTCCGATAG, shown gapped as ATCGTTAACGTAA******CCGATAG.]
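The difference between the two modes on this slide can be illustrated with a toy example. The sketch below is not ELAND's algorithm: it compares sequences position by position for the ungapped case, and for the gapped case tries a single gap of up to 20 bases at every position. Which of the two example sequences is the read and which is the reference is an assumption, and all function names are hypothetical.

```python
def count_mismatches(a, b):
    """Position-by-position mismatches over the shared length."""
    return sum(x != y for x, y in zip(a, b))


def single_gap_alignment(read, ref, max_gap=20):
    """Try one gap (insertion or deletion in the read) at every position."""
    gap = abs(len(read) - len(ref))
    if gap == 0 or gap > max_gap:
        return None
    longer, shorter = (read, ref) if len(read) > len(ref) else (ref, read)
    best = None
    for pos in range(len(shorter) + 1):
        # Cut `gap` bases out of the longer sequence at this position.
        spliced = longer[:pos] + longer[pos + gap:]
        mismatches = count_mismatches(spliced, shorter)
        if best is None or mismatches < best[0]:
            best = (mismatches, pos)
    kind = "insertion" if len(read) > len(ref) else "deletion"
    return {"kind": kind, "position": best[1], "gap": gap, "mismatches": best[0]}


ref = "ATCGTTAACGTAACCGATAG"
read = "ATCGTTAACGTAAGTTAGTCCGATAG"  # assumed read with a 6-base insertion

print(count_mismatches(read, ref))      # ungapped: 6 mismatches after the indel
print(single_gap_alignment(read, ref))  # gapped: 0 mismatches, gap at position 13
```

Without the gap, the bases after position 13 are shifted against the reference and mostly mismatch; with the gap, both flanks align cleanly.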

Singleseed vs. Multiseed Alignment

[Figure: Read 1 aligned to the reference genome by seed alignment and extension alignment. Singleseed: first 32 base seed only; the example shows no extension alignment. Multiseed: first 32 base seed plus a second seed (up to four seeds); extension alignment is the second seed. Example read: ATCGTTAACGCCTGACGTCCGATAG, with mismatches (X) near the start and matches (|) over the remaining bases.]
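The idea behind using more than one seed can be sketched as follows. This is a toy illustration, not the ELAND implementation: the seed length of 32 and the limit of four seeds come from the slide, while the mismatch tolerance, the brute-force scan and all names are assumptions. The example uses 8-base seeds to keep the sequences short.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def seed_hits(read, ref, seed_len=32, max_seeds=4, max_mismatches=2):
    """Return (seed number, reference position) for every seed that
    matches the reference within the mismatch tolerance."""
    hits = []
    n_seeds = min(max_seeds, len(read) // seed_len)
    for s in range(n_seeds):
        seed = read[s * seed_len:(s + 1) * seed_len]
        for pos in range(len(ref) - seed_len + 1):
            if hamming(seed, ref[pos:pos + seed_len]) <= max_mismatches:
                hits.append((s, pos))
    return hits


ref = "ACGTACGTAAAATTTTCCGATAGC"
read = "GGGGGGGG" + "CCGATAGC"  # first seed is poor, second seed is clean

print(seed_hits(read, ref, seed_len=8, max_seeds=1))  # []
print(seed_hits(read, ref, seed_len=8, max_seeds=4))  # [(1, 16)]
```

With a single seed the read finds no anchor because its first bases do not match; allowing a second seed anchors the read at reference position 16.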

Variants Detection (SNPs + Indels) + Read Counts (RNA-Seq)

[Diagram: variant detection and read counting workflow
– Steps: Import GERALD Files, Sort, Call Alleles, Call SNPs, Call Indels, Remove Duplicates, …
– Inputs: genome_size.xml, _export.txt files, fasta files
– Outputs: Sorted Text Files, Sort Count Files, Indel Files, Count Files, SNP Text Files
– Some steps are marked Paired-End Read Only or RNA Only]


CASAVA Build

Note: Pumpkin text denotes folders, blue text represents files


Other tools

Leveraging the GA Informatics Community: De Novo Assembly

Velvet – De novo assembly of short reads
– Daniel Zerbino and Ewan Birney, EMBL-EBI
– http://www.ebi.ac.uk/~zerbino/velvet/

SSAKE – Assembly of short reads
– Group: Rene Warren, et al; British Columbia
– http://bioinformatics.oxfordjournals.org/cgi/content/full/23/4/500

Euler SR – Genomic Assembly
– Group: Pavel Pevzner, Mark Chaisson; UC San Diego
– http://nbcr.sdsc.edu/euler/

SOAPdenovo
– http://soap.genomics.org.cn/

Leveraging the GA Informatics Community: Alignment and Polymorphism Detection

SOAP – Short Oligonucleotide Alignment Program
– Ruiqiang Li, Beijing Genomics Institute
– http://soap.genomics.org.cn/

BWA
– Heng Li, Sanger Institute
– http://bio-bwa.sourceforge.net/

Bowtie
– Ben Langmead, University of Maryland
– http://bowtie-bio.sourceforge.net/index.shtml


iConnect Program:

Connecting with the larger informatics universe

~30 vendors and academic partners in the program

http://www.illumina.com/pagesnrn.ilmn?ID=229

Third-party tools are available for a broad range of genetic analysis

applications including:

Sequence alignment, SNP calling, indel detection

Sequencing informatics workflow and data management

Whole-genome association

Copy number variation analysis

Gene expression analysis

eQTL analysis

Multi-assay data integration

Biological pathway and network analysis


http://www.bcplatforms.com/

http://www.genedata.com/

http://www.genologics.com/products/genomics_solution.php

http://www.inforsense.com/index.php?id=77

http://www.ingenuity.com/company/partners.html

http://www.partek.com/html/products/products.html

http://www.rosettabio.com/about/partners.htm

http://www.sapiosciences.com/general/alliances.html

http://www.jmp.com/genomics

http://www.genesifter.net/web/

http://www.well.ox.ac.uk/QuantiSNP/
