NGS data analysis on the grid

Barbera van Schaik

NGS Bioassist meeting, 20-08-2010

2

Outline

Introduction

Port BWA to the grid

3

Hardware sequencing

Run   Reads         GBs
a     388,850,958   82
b     518,902,304   108
c     500,529,852   105

4

Initial analysis

First analysis

Match sequence reads to the human genome

Generate a SNP list

Visualize the results

Tools:

BWA, Samtools and IGV

http://bio-bwa.sourceforge.net/

http://samtools.sourceforge.net/

http://www.broadinstitute.org/igv/

5

Data size

Data used in the grid-enabled BWA sequencing analysis:

D10: sequencing data in the csFasta format, 25-35 GB
D20: quality files in the .qual format, 50-80 GB
D30: reference DB in the Fasta BS format, 3.2 GB (human genome), 140 MB (one chromosome)
D35: reference BWA index, 4.5 GB (human genome), 240 MB (one chromosome)
D40: sequencing data in the FastQ format (fastq.gz), 20-30 GB
D45: results in .sai format (direct output of BWA), 2-3 GB
D50: results in .sam format, 55-75 GB
D60: results in .bam format, 20-30 GB

6

Small cluster

Existing hardware

PC

7

Buy a bigger cluster (centralized model)

8

Grid computing

Distributed resources

Computing

Data storage

Open protocols

It's about sharing

Resources

Methods

Collaborations

9

Dutch life science grid (hardware)

grid

http://www.biggrid.nl/

10

Software at the AMC

http://www.vl-e.nl/vlemed

http://www.bioinformaticslaboratory.nl/

Olabarriaga SD, Glatard T, de Boer PT: A Virtual Laboratory for Medical Image Analysis. IEEE Transactions on Information Technology in Biomedicine, in press

User interface

http://www.vl-e.nl/vbrowser/

SARA, AMC

11

12

Workflow: pre-processing and split data

Sequence reads:

csFasta + quality values
-> solid2fastq.pl -> fastq
-> split (multiples of 4 lines) -> fastq, fastq, fastq, fastq
-> gzip -> fastq.gz

Reference database:

chr9.fa
-> bwa index -> chr9.fa.amb, chr9.fa.ann, chr9.fa.bwt, chr9.fa.pac, chr9.fa.rbwt, chr9.fa.rpac, chr9.fa.rsa, chr9.fa.sa
-> tar zcvf -> chr9.tar.gz
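The split step works because every FASTQ record occupies exactly four lines, so cutting at a multiple of four never breaks a read. A minimal Python sketch of that idea follows. It is not the script used in the original workflow; the function name `split_fastq`, the chunk file names and the `records_per_chunk` parameter are assumptions made for illustration.

```python
import gzip
from itertools import islice
from pathlib import Path


def split_fastq(fastq_path, out_dir, records_per_chunk):
    """Split a FASTQ file into gzipped chunks of whole 4-line records."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_chunk = 4 * records_per_chunk  # always a multiple of 4
    chunk_paths = []
    with open(fastq_path, "rt") as handle:
        while True:
            # Read the next block of lines; an empty block means end of file.
            lines = list(islice(handle, lines_per_chunk))
            if not lines:
                break
            if len(lines) % 4 != 0:
                raise ValueError("truncated FASTQ: line count is not a multiple of 4")
            # Hypothetical naming scheme: chunk_0000.fastq.gz, chunk_0001.fastq.gz, ...
            chunk_path = out_dir / f"chunk_{len(chunk_paths):04d}.fastq.gz"
            with gzip.open(chunk_path, "wt") as out:
                out.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Each chunk can then be aligned as an independent grid job; the chunk size is the knob that trades job count against per-job runtime.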

13

Align reads with BWA

Input: chr9.tar.gz, fastq.gz

tar zxvf -> chr9.fa.*
bwa aln -> result.sai
bwa samse (sai to sam) -> result.sam
samtools view (sam to bam) -> result.bam
samtools sort -> result_sorted.bam

Do this for every split fastq file.

Do this for every chromosome.
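The per-file alignment steps can be written down as a short command list. The sketch below only builds the command strings and does not run them. It assumes the legacy samtools 0.1.x syntax that was current in 2010 (`samtools sort in.bam out_prefix`), and it leaves out any colour-space options the original SOLiD workflow may have needed. The function name `alignment_commands` and its parameters are hypothetical.

```python
import shlex


def alignment_commands(index_prefix, fastq_gz, out_prefix="result"):
    """Return the shell commands for one split FASTQ file, in execution order."""
    ref = shlex.quote(index_prefix)
    reads = shlex.quote(fastq_gz)
    out = shlex.quote(out_prefix)
    return [
        # Align reads against the BWA index; writes suffix-array coordinates.
        f"bwa aln {ref} {reads} > {out}.sai",
        # Convert single-end .sai alignments to SAM (a bwa subcommand).
        f"bwa samse {ref} {out}.sai {reads} > {out}.sam",
        # SAM to BAM; -S marks SAM input in legacy samtools.
        f"samtools view -bS {out}.sam > {out}.bam",
        # Legacy syntax: the second argument is an output prefix, giving <prefix>.bam.
        f"samtools sort {out}.bam {out}_sorted",
    ]


for command in alignment_commands("chr9.fa", "reads.fastq.gz"):
    print(command)
```

With a current samtools release the last step would be written as `samtools sort -o result_sorted.bam result.bam` instead.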

14

Merge results and create a SNP list

Input: result_sorted.bam files

samtools merge -> result_sorted_merged.bam
samtools index -> result_sorted_merged.bai
samtools pileup -> SNP-list.pileup
samtools.pl varFilter raw.pileup | awk '$6>=20' > final.pileup
-> SNP-list-filtered.pileup
pileup-to-bed.pl SNP-list-filtered.pileup

Local server
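The `awk '$6>=20'` filter keeps only pileup rows whose sixth column (SNP quality) is at least 20. A rough Python equivalent is sketched below, assuming whitespace-separated columns and a numeric sixth column. The function name `filter_pileup` is hypothetical, and the second example row is made up to show a rejected line.

```python
def filter_pileup(lines, min_snp_quality=20):
    """Yield pileup lines whose sixth column is at least min_snp_quality."""
    for line in lines:
        fields = line.split()
        # Column 6 (index 5) holds the SNP quality in the pileup consensus format.
        if len(fields) >= 6 and float(fields[5]) >= min_snp_quality:
            yield line


rows = [
    "chr9 869 c G 36 36 25 4 GGG, SF]!",  # SNP quality 36: kept
    "chrX 100 a G 12 15 30 3 ,G. ]]]",    # made-up row, SNP quality 15: dropped
]
kept = list(filter_pileup(rows))
```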

15

Integrative Genomics Viewer (IGV)

16

SNP list

Column  Definition
------  --------------------------------------------------------
1       Chromosome
2       Position (1-based)
3       Reference base at that position
4       Consensus bases
5       Consensus quality
6       SNP quality
7       Maximum mapping quality
8       Coverage (# reads aligning over that position)
9       Bases within reads where (see Galaxy wiki for more info)
10      Quality values (phred33 scale, see Galaxy wiki for more)

chr9 49 * */+C 47 47 36 14 * +C 10 3 1 1 0
chr9 152 * */+A 530 530 33 40 * +A 21 19 0 0 0
chr9 190 * */-t 1037 1037 36 78 * -t 47 31 0 0 0
chr9 274 * */-c 521 521 30 67 * -c 50 17 0 0 0
chr9 340 * +A/+A 13 59 35 5 +A * 3 2 0 0 0
chr9 362 c Y 39 40 32 8 .,+1t,+1a,+1atgtt :]J5F/LA
chr9 469 g S 52 52 36 11 .CCCC....^F.^F. ]]]Y]]]]]][
chr9 576 c S 27 27 35 33 .$.,......,...Gg.gg......,g...,,,^F, ][]]]X]RY]]]]]]R]]]]]]]]]]Z]]LNO]
chr9 712 a R 59 59 36 24 ,....g.G.G....,..g,,...G ]]W]]]]]]]W]]V]]UH]U]]]]
chr9 869 c G 36 36 25 4 GGG, SF]!
chr9 1508 c S 34 34 34 24 ,$,,.,,,..,g,GG,,..GG.,.. ]]]+Y[]SHI\X]]]Z]Z]]]W]]
chr9 1547 t Y 157 157 33 32 ..CCCCCC...,,..,...cA,cC.CC,cC., BGYWOSTT\O]K/T]M]T]L!DB]]]]!8]]Q
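The consensus column uses IUPAC ambiguity codes, so `Y` at a position with reference `c` means the reads support both C and T (a heterozygous call). The sketch below parses the first eight columns of a SNP row and expands the consensus code. It assumes the column layout listed above, skips indel rows (reference `*`), and uses the hypothetical function name `parse_snp_row`.

```python
# IUPAC nucleotide codes: one letter per set of possible bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def parse_snp_row(line):
    """Parse the first eight columns of a SNP (non-indel) pileup row."""
    fields = line.split()
    if fields[2] == "*":
        raise ValueError("indel rows use a different layout and are not handled here")
    return {
        "chrom": fields[0],
        "position": int(fields[1]),          # 1-based
        "reference": fields[2].upper(),
        "alleles": IUPAC[fields[3].upper()],  # expanded consensus bases
        "consensus_quality": int(fields[4]),
        "snp_quality": int(fields[5]),
        "max_mapping_quality": int(fields[6]),
        "coverage": int(fields[7]),
    }


row = parse_snp_row("chr9 362 c Y 39 40 32 8 .,+1t,+1a,+1atgtt :]J5F/LA")
is_heterozygous = len(row["alleles"]) == 2
```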

17

VarScan

Chrom  Position  Ref  Var  Reads1  Reads2  VarFreq  Strands1  Strands2  Qual1  Qual2  Pvalue  MapQual1  MapQual2  Reads1Plus  Reads1Minus  Reads2Plus  Reads2Minus
chr10  83042     G    A    29      2       6.45%    2         2         48     60     0.98    1         1         17          12           1           1
chr10  83161     G    A    36      11      23.40%   2         2         57     58     0.98    1         1         16          20           3           8
chr10  83763     T    C    33      4       10.81%   2         2         57     53     0.98    1         1         13          20           2           2
chr10  83816     C    T    14      4       22.22%   2         1         52     60     0.98    1         1         7           7            4           0
chr10  84005     G    A    22      3       12.00%   2         2         58     60     0.98    1         1         12          10           1           2
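The VarFreq values in these rows are consistent with the variant-supporting reads divided by all reads at the position, Reads2 / (Reads1 + Reads2). The short check below recomputes the column from the five rows shown. The formula is inferred from the numbers on the slide, and the function name `var_freq` is hypothetical.

```python
def var_freq(reads1, reads2):
    """Return the variant allele frequency in percent."""
    total = reads1 + reads2
    if total == 0:
        raise ValueError("no reads at this position")
    return 100.0 * reads2 / total


# (Position, Reads1, Reads2) taken from the chr10 rows above.
rows = [(83042, 29, 2), (83161, 36, 11), (83763, 33, 4), (83816, 14, 4), (84005, 22, 3)]
for position, reads1, reads2 in rows:
    print(f"chr10:{position} {var_freq(reads1, reads2):.2f}%")
# prints 6.45%, 23.40%, 10.81%, 22.22% and 12.00%
```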

18

Port BWA to grid

Simple shell script to run BWA

Create GASW description (component description)

Create workflow description (Perl script or with Taverna)

Copy all binaries, Perl scripts, GASW description and workflow description to grid

Copy test-dataset to grid

Test workflow

Execute workflow on real datasets

19

BWA on grid – user interface

lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/bwa/Scufl/BWAparam.scufl

20

BWA on grid – component description

21

BWA on grid – component description

22

BWA on grid – workflow description

23

BWA on grid – workflow design in Taverna

24

BWA on grid – user interface

lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/bwa/Scufl/BWAparam.scufl

25

BWA on grid – monitor jobs

26

BWA on grid – monitor jobs

27

Overview BWA workflow

Mark Santcroos

28

Implementation workflows

SNP analysis

Compare SNPs with known SNPs/mutations

Prioritize SNPs

Quality control

Basic experiment statistics

Quality of reads

Check exon coverage

Quality of dbSNP

Mutation consequences (silent, aa change)

SNP conservation

29

Optimization / IT

Optimal input file size

Split and merge

Security of sequence data

Copy large datasets to grid storage

Data storage

Lisa cluster

Cloud computing

Other workflow systems

30

http://www.bioinformaticslaboratory.nl/
