Illumina Primary and Secondary Analysis

David Townley, Ph.D.
Bioinformatics specialist
Illumina UK

© 2010 Illumina, Inc. All rights reserved. Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

Page 3: Illumina Data Analysis Overview

Data analysis is grouped into three main categories:
– Primary Analysis
– Secondary Analysis
– Data Visualization

Note: Primary Analysis prior to v1.6 was called Pipeline

Page 4: Illumina Sequencing Workflow

Workflow stages: Sample Prep; Instrumentation; Analysis

Diagram labels: cBot; HiSeq 2000; Genome Analyzer IIx; Paired-end Module; Primary Analysis; Secondary Analysis

Page 5: Illumina Sequencing Workflow Outcomes

Workflow stages: Sample Prep; Instrumentation; Analysis (Primary, Secondary)

Diagram labels: (DNA Library); Clusters; Images/TIFF files; Intensities; Basecalling; Alignments

Page 6: Data Volumes

Data Volume                      Total          Final     Comment

HiSeq 2000 200G run
Image Data                       32 TB          0
Intensity Data                   2 TB           0         Optionally transferred
Base Call / Quality Score Data   250 GB         250 GB    1 byte/base (raw), assuming qseq generation offline
Alignment Output                 6 TB (3 TB)    1.2 TB    Remove intermediate files

GAIIx 50G run
Image Data                       6.9            0         Optionally transferred
Intensity Data                   0.93           0.93
Base Call / Quality Score Data   0.17           0.17
Alignment Output                 1.2 TB         1.2 TB

Page 7: Primary Analysis

Page 8: Primary Data Analysis Components

Instrument Control Software
– Provides a graphical interface while running the instrument

Real Time Analysis (RTA)
– Component within the Instrument Control Software that monitors the run's progress, optimizes run conditions and provides run-time quality statistics

Off-Line Base Caller (OLB)
– Provides the option to perform data analysis off-line

Diagram labels: Primary Analysis; Instrument Control Software; RTA; Off-Line Basecaller; Firecrest (image analysis); Bustard (base calling); CIF files; qseq or .bcl; Secondary Analysis; CASAVA; SampleSheet.csv; CASAVA Build; GenomeStudio or third-party apps

Note: Primary Analysis prior to v1.6 was called Pipeline

Page 9: Image Analysis Algorithm

Figure labels: Threshold; Maximum; Cluster

Page 10: Base Calling

Page 11: Base Calling

[Figure: corrected intensity per base (A, C, G, T)]

Page 12: Phred Quality Scores

Phred Quality Scores
– A quality score is a prediction of the probability of an error in base calling
– A method for assigning quality scores to sequencing data, using numerical predictors of base quality
– Typically quoted as a log-odds ratio

Phred Quality Scores are produced by a model that uses quality predictors as inputs and produces Q-scores as outputs

Phred Quality Score   Probability of Incorrect Base Call   Base Call Accuracy   Q-score
10                    1 in 10                              90%                  Q10
20                    1 in 100                             99%                  Q20
30                    1 in 1000                            99.9%                Q30
40                    1 in 10000                           99.99%               Q40
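The table values are consistent with the usual Phred relation Q = -10 · log10(P), where P is the probability of an incorrect base call. A minimal Python sketch of the conversion in both directions; the function names are illustrative and do not come from any Illumina tool:

```python
import math

def phred_to_error_prob(q):
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Reproduce the rows of the table above.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p)}, accuracy {100 * (1 - p):.2f}%")
```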

Page 13: Predicting Quality Scores for New Data

Phred output is a table
– predictor 1 value, predictor 2 value, ..., quality score
– predictor 1 value, predictor 2 value, ..., quality score
– ...

To get new quality scores
1. Compute predictors for new base call
2. Compare predictors to each line of table
3. When you find a line where all of the predictors in the table are bigger than the predictors for the base, use the corresponding quality value

Table example with 2 predictors:
– 0.1 0.5 Q30
– 0.2 0.2 Q25
– 0.25 0.3 Q20

New base: if predictors are 0.15 0.15, then base is Q25
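The lookup steps above can be sketched in a few lines of Python. This is only an illustration of the procedure as the slide describes it, using the slide's example table; the names `QUALITY_TABLE` and `lookup_quality` are hypothetical, and returning `None` when no row matches is an assumption, since the slide does not cover that case.

```python
# Rows of (predictor thresholds, quality score), kept in table order.
QUALITY_TABLE = [
    ((0.1, 0.5), 30),
    ((0.2, 0.2), 25),
    ((0.25, 0.3), 20),
]

def lookup_quality(predictors, table=QUALITY_TABLE):
    """Return the Q-score of the first row whose values all exceed the predictors."""
    for thresholds, quality in table:
        if all(t > p for t, p in zip(thresholds, predictors)):
            return quality
    return None  # assumption: no matching row means no score is assigned

print(lookup_quality((0.15, 0.15)))  # prints 25, as in the slide's example
```

The first row is skipped because 0.1 is not bigger than 0.15; the second row is the first one where both table values exceed the predictors.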

Page 14: Quality Score Representation

Quality scores are represented as ASCII characters (to save space)
– One ASCII character per base

To get Phred score:
– Phred Score = ASCII value – 64

Sanger quality scores use the same principle
– Same as a Phred score but the ASCII score calculation is different
– Sanger Score = ASCII value – 33

Examples of ASCII

Character   ASCII Value   Phred Score
^           94            30
_           95            31
`           96            32
a           97            33
b           98            34
c           99            35
d           100           36
e           101           37
f           102           38
g           103           39

Note: Get more info on ASCII tables at http://www.asciitable.com
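As a small illustration of the two formulas, the following Python sketch decodes a quality string with either offset. The offsets 64 and 33 are the ones given on the slide; the function names are illustrative only.

```python
ILLUMINA_OFFSET = 64  # Phred score = ASCII value - 64 (per the slide)
SANGER_OFFSET = 33    # Sanger score = ASCII value - 33 (per the slide)

def decode_qualities(quality_string, offset=ILLUMINA_OFFSET):
    """Turn one ASCII character per base into a list of integer scores."""
    return [ord(ch) - offset for ch in quality_string]

def encode_qualities(scores, offset=ILLUMINA_OFFSET):
    """Inverse of decode_qualities: integer scores back to ASCII characters."""
    return "".join(chr(score + offset) for score in scores)

print(decode_qualities("^_`abc"))  # prints [30, 31, 32, 33, 34, 35]
```

Decoding the same string with the wrong offset shifts every score by 31, so it matters which encoding a file uses.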

Page 15: qseq.txt File

Tab-delimited: easy to parse, easy to import into databases

Split files per read on a read pair / multiple read run

Fields annotated on the slide: Instrument; Run ID; Lane; Tile; X-coord; Y-coord; Index #; Read #; Sequence; ASCII Character Q-score; PF (0,1)
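Because the file is tab-delimited, a record can be split with standard string handling. The sketch below assumes eleven fields in the order instrument, run ID, lane, tile, X, Y, index, read number, sequence, quality, filter flag; that order is an assumption based on the usual qseq layout and should be checked against real files. The example line is made up for illustration.

```python
from collections import namedtuple

# Assumed field order; verify against an actual qseq.txt file.
QseqRecord = namedtuple(
    "QseqRecord",
    "instrument run_id lane tile x y index read sequence quality passed_filter",
)

def parse_qseq_line(line):
    """Split one tab-delimited qseq line into a QseqRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(fields)}")
    instrument, run_id, lane, tile, x, y, index, read, sequence, quality, pf = fields
    return QseqRecord(instrument, run_id, int(lane), int(tile), int(x), int(y),
                      index, int(read), sequence, quality, pf == "1")

# Hypothetical record, not taken from a real run.
example = "\t".join(["HYPOTHETICAL-INSTR", "6", "1", "12", "1028", "1934",
                     "0", "1", "ACGTACGTAC", "abcdefgabc", "1"])
record = parse_qseq_line(example)
print(record.lane, record.sequence, record.passed_filter)
```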

Page 16: HiSeq 2000 Real-time Metrics – Extensive/Interactive

Page 17: Status.xml

Status.xml
– Data visualization
– Location: \\RunName\Data\Status.xml
– Provides analysis status/progress, updated throughout run, available off-line

Provides access to the most important runtime statistics
– Run Info
– Tile Status
– Charts
– Cluster Density
– Data by Cycle

Page 18: Real Time Metrics: Data by Cycle

Box-plot graphs deciphered:
– Red line – median
– Box – interquartile range (IQR), the middle 50% of the data
– Error bars – min and max for the metric
– Outliers – values more than 1.5 × IQR below/above the box
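The box-plot quantities listed above can be computed as in the following Python sketch. It uses the standard library's default quartile convention, which is an assumption; the slide does not say which convention RTA uses, and the sample values are invented.

```python
import statistics

def box_plot_summary(values):
    """Median, quartiles, IQR and 1.5 x IQR outliers for one metric."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"median": median, "q1": q1, "q3": q3, "iqr": iqr, "outliers": outliers}

# Invented per-tile values for a single cycle.
summary = box_plot_summary([10, 12, 13, 14, 15, 16, 18, 40])
print(summary["median"], summary["outliers"])
```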

Page 19: Status.xml: Cluster Density

The cluster density plot shows the density as a box plot for each lane

Page 20: Quality score by cycle

Page 21: HiSeq 2000 Real-time Metrics – Q Score Distribution

Page 22: Secondary Analysis - CASAVA

Page 23: What is CASAVA?

Consensus Assessment of Sequence And VAriation (CASAVA) is a Linux application designed to:
– Demultiplex samples
– Align reads
– Call alleles and SNPs
– Find indels
– Count expression levels for exons, genes and splice junctions in RNA-seq runs

CASAVA's output is a folder structure (called a "CASAVA Build") ready for import into GenomeStudio for visualization and further analysis

Diagram labels: Primary Analysis; Secondary Analysis; Data Visualization; script for multiplexed run; Gerald (Eland v2); multiple *_export.txt + other files; Counting for mRNA; SNP and Indel Calls; CASAVA Build

Page 24: Alignment

Page 25: GERALD

Generation of Recursive Analyses Linked by Dependency (GERALD) is the alignment module in CASAVA

Configuration through the GERALD configuration file

It still works essentially the same as before but with enhanced alignment options/algorithms

Diagram labels: Primary Analysis; Secondary Analysis; Data Visualization; SampleSheet.csv; qseq or .bcl files; script for multiplexed run; Gerald (Eland v2); multiple *_export.txt + other files; Counting for mRNA; SNP and Indel Calls; CASAVA Build

Page 26: Ungapped vs. Gapped Alignment

Read with Insertion, true alignment (* marks the gap)

Reference:  ATCGTTAACGTAA******CCGATAG
Read:       ATCGTTAACGTAAGTTAGTCCGATAG

Read with Deletion, true alignment (* marks the gap)

Reference:  ATCGTTAACGTAAAACGTCCGATAG
Read:       ATCGTTAACGTAA*****CCGATAG

Extension Alignment: Ungapped or Gapped
– Ungapped (no extension alignment): bases after the indel are compared out of register and show up as mismatches (X)
– Gapped (up to 20 bases): the gap is opened and the true alignment is recovered
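To make the difference concrete, here is a toy Python sketch that compares an ungapped comparison with a search for a single gap of up to 20 bases. It is not the ELAND algorithm, only an illustration of the idea. It assumes that in the deletion example the 25-base sequence is the reference and the 20-base sequence is the read, and that the read starts at the first reference base.

```python
def count_mismatches(a, b):
    """Count mismatching positions over the overlap of two sequences."""
    return sum(x != y for x, y in zip(a, b))

def best_single_deletion(read, reference, max_gap=20):
    """Try one gap of up to max_gap reference bases inside the read.

    Returns (mismatches, gap_position, gap_length); gap_length 0 means ungapped.
    """
    best = (count_mismatches(read, reference), 0, 0)
    for gap in range(1, max_gap + 1):
        if len(read) + gap > len(reference):
            break  # the gapped read would run past the end of this toy reference
        for pos in range(1, len(read)):
            mismatches = (count_mismatches(read[:pos], reference[:pos])
                          + count_mismatches(read[pos:], reference[pos + gap:]))
            if mismatches < best[0]:
                best = (mismatches, pos, gap)
    return best

reference = "ATCGTTAACGTAAAACGTCCGATAG"
read = "ATCGTTAACGTAACCGATAG"  # read with a 5-base deletion
print(count_mismatches(read, reference))      # ungapped mismatch count
print(best_single_deletion(read, reference))  # best gapped placement
```

With these sequences the ungapped comparison reports 6 mismatches, while a 5-base gap after read position 13 gives a perfect match, matching the slide's "true alignment".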

Page 27: Singleseed vs. Multiseed Alignment

Singleseed
– Seed alignment uses the first 32-base seed only

Multiseed
– Seed(s): first 32-base seed, second seed, ... (up to four seeds)
– Extension alignment is the second seed

Panel labels: Reference genome; Read 1; Seed Alignment; Extension Alignment; No Extension Alignment

[Figure: a read with mismatches (X) in its first seed is compared with the reference genome; its later bases match (|), so a second seed can place it]

Sequences shown: ATCGTTAACGTAAAACGTCCGATAG; ATATGCTTTCCCTGACGTCCGATAG; ATCGTTAACGCCTGACGTCCGATAG
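The idea behind multiple seeds can be sketched with a toy exact-match seed search in Python. This is an illustration under several assumptions, not the ELAND implementation: seeds must match exactly, a 10-base seed stands in for the 32-base seed so the slide's short sequences can be used, and which of the two sequences is the read is a guess.

```python
def seed_alignment(read, reference, seed_len, max_seeds):
    """Return the implied read start in the reference from the first
    seed that matches exactly, or None if no seed matches."""
    for n in range(max_seeds):
        start = n * seed_len
        seed = read[start:start + seed_len]
        if len(seed) < seed_len:
            break  # not enough read left for another full seed
        hit = reference.find(seed)
        if hit != -1:
            return hit - start
    return None

reference = "ATCGTTAACGCCTGACGTCCGATAG"  # assumed to be the reference
read = "ATATGCTTTCCCTGACGTCCGATAG"       # assumed to be the read

single = seed_alignment(read, reference, seed_len=10, max_seeds=1)
multi = seed_alignment(read, reference, seed_len=10, max_seeds=4)
print(single, multi)
```

With one seed the read is not placed, because its first bases do not occur in the reference; with up to four seeds the second seed matches and places the read at reference position 0.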

Page 28: Variants Detection (SNPs + Indels) + Read Counts (RNA-Seq)

Steps shown: Import GERALD Files; Sort; Call Alleles; Call SNPs; Call Indels; Remove Duplicates

Input files: genome_size.xml; _export.txt; fasta files

Output files: Sorted Text Files; Sort Count Files; SNP Text Files; Indel Files; Count Files

Annotations: Paired-End Read Only; RNA Only

Page 29: CASAVA Build

Note: Pumpkin text denotes folders, blue text represents files

Page 30: Other tools

Page 31: Leveraging the GA Informatics Community – De Novo Assembly

Velvet – De novo assembly of short reads
– Daniel Zerbino and Ewan Birney, EMBL-EBI
– http://www.ebi.ac.uk/~zerbino/velvet/

SSAKE – Assembly of short reads
– Group: Rene Warren, et al; British Columbia
– http://bioinformatics.oxfordjournals.org/cgi/content/full/23/4/500

Euler SR – Genomic Assembly
– Group: Pavel Pevzner, Mark Chaisson; UC San Diego
– http://nbcr.sdsc.edu/euler/

SOAPdenovo
– http://soap.genomics.org.cn/

Page 32: Leveraging the GA Informatics Community – Alignment and Polymorphism Detection

SOAP – Short Oligonucleotide Alignment Program
– Ruiqiang Li, Beijing Genomics Institute
– http://soap.genomics.org.cn/

BWA
– Heng Li, Sanger Institute
– http://bio-bwa.sourceforge.net/

Bowtie
– Ben Langmead, University of Maryland
– http://bowtie-bio.sourceforge.net/index.shtml

Page 33: iConnect Program: Connecting with the larger informatics universe

~30 vendors and academic partners in the program

http://www.illumina.com/pagesnrn.ilmn?ID=229

Third-party tools are available for a broad range of genetic analysis applications including:
– Sequence alignment, SNP calling, indel detection
– Sequencing informatics workflow and data management
– Whole-genome association
– Copy number variation analysis
– Gene expression analysis
– eQTL analysis
– Multi-assay data integration
– Biological pathway and network analysis