NGS data analysis on the grid · Barbera van Schaik · NGS BioAssist meeting...
2
Outline
Introduction
Port BWA to the grid
3
Hardware sequencing
Run  Reads        GBs
a    388,850,958   82
b    518,902,304  108
c    500,529,852  105
4
Initial analysis
First analysis
Match sequence reads to the human genome
Generate a SNP list
Visualize the results
Tools:
BWA, Samtools and IGV
http://bio-bwa.sourceforge.net/
http://samtools.sourceforge.net/
http://www.broadinstitute.org/igv/
5
Data size
Data used in the grid-enabled BWA sequencing analysis:
D10: sequencing data in the csFasta format, 25-35 GB
D20: quality files in the .qual format, 50-80 GB
D30: reference DB in the Fasta BS format, 3.2 GB (human genome), 140 MB (one chromosome)
D35: reference BWA index, 4.5 GB (human genome), 240 MB (one chromosome)
D40: sequencing data in the FastQ format (fastq.gz), 20-30 GB
D45: results in .sai format (direct output of BWA), 2-3 GB
D50: results in .sam format, 55-75 GB
D60: results in .bam format, 20-30 GB
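A rough sketch of the storage footprint implied by these sizes. It assumes that every intermediate file of one run is kept at the same time and that the whole-genome reference and index are used; both are assumptions made for illustration, not statements from the slides.

```python
# Size ranges in GB per dataset (low, high), taken from the list above.
# Single values are stored as (x, x); whole-genome reference sizes are assumed.
SIZES_GB = {
    "D10 csFasta reads": (25, 35),
    "D20 .qual files": (50, 80),
    "D30 reference FASTA": (3.2, 3.2),
    "D35 BWA index": (4.5, 4.5),
    "D40 fastq.gz": (20, 30),
    "D45 .sai": (2, 3),
    "D50 .sam": (55, 75),
    "D60 .bam": (20, 30),
}


def total_range(sizes):
    """Sum the lower and upper bounds of all size ranges."""
    low = sum(lo for lo, _ in sizes.values())
    high = sum(hi for _, hi in sizes.values())
    return low, high


low, high = total_range(SIZES_GB)
print(f"Total footprint per run: {low:.1f}-{high:.1f} GB")
```

Under these assumptions a single run needs roughly 180-260 GB, which is more than twice the size of the raw csFasta and .qual input.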
6
Small cluster
Existing hardware
PC
7
Buy a bigger cluster (centralized model)
8
Grid computing
Distributed resources
Computing
Data storage
Open protocols
It's about sharing
Resources
Methods
Collaborations
9
Dutch life science grid (hardware)
grid
http://www.biggrid.nl/
10
Software at the AMC
http://www.vl-e.nl/vlemed
http://www.bioinformaticslaboratory.nl/
Olabarriaga SD, Glatard T, de Boer PT: A Virtual Laboratory for Medical Image Analysis. IEEE Transactions on Information Technology in Biomedicine, in press
User interface
http://www.vl-e.nl/vbrowser/
SARA
AMC
11
12
Workflow pre-processing and split data
Sequence reads: csFasta + quality values -> solid2fastq.pl -> fastq -> split (multiples of 4 lines) -> fastq, fastq, fastq, fastq -> gzip -> fastq.gz
Reference database: chr9.fa -> bwa index -> chr9.fa.amb, chr9.fa.ann, chr9.fa.bwt, chr9.fa.pac, chr9.fa.rbwt, chr9.fa.rpac, chr9.fa.rsa, chr9.fa.sa -> tar zcvf -> chr9.tar.gz
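The split step has to cut on multiples of 4 lines, because a FASTQ record takes four lines (header, sequence, separator, qualities). A minimal sketch of such a splitter, assuming unwrapped 4-line records; the function name and the `reads_per_chunk` parameter are hypothetical and not taken from the workflow itself:

```python
from itertools import islice


def split_fastq(lines, reads_per_chunk):
    """Yield lists of lines, each holding only whole 4-line FASTQ records."""
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    it = iter(lines)
    lines_per_chunk = 4 * reads_per_chunk
    while True:
        chunk = list(islice(it, lines_per_chunk))
        if not chunk:
            return
        if len(chunk) % 4 != 0:
            raise ValueError("input ends in the middle of a FASTQ record")
        yield chunk


# Three made-up reads, split into chunks of at most two reads each.
demo = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "IIII",
    "@read3", "GGCC", "+", "IIII",
]
chunks = list(split_fastq(demo, reads_per_chunk=2))
print([len(c) for c in chunks])  # chunk sizes in lines
```

Each chunk can then be written to its own file and gzipped, so that every grid job aligns one chunk independently.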
13
Align reads with BWA
Input: chr9.tar.gz, fastq.gz
tar zxvf -> chr9.fa.*
bwa aln -> result.sai
bwa samse (sai to sam) -> result.sam
samtools view (sam to bam) -> result.bam
samtools sort -> result_sorted.bam
Do this for every split fastq file
Do this for every chromosome
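The per-file alignment chain can be sketched as a list of shell command lines. This is only an illustration: the helper name and file names are hypothetical, `samse` is invoked as a BWA subcommand, options for SOLiD colour-space reads are left out, and the `samtools sort` call assumes the legacy 0.1.x syntax in which the second argument is an output prefix (samtools 1.x uses `-o` instead).

```python
import shlex


def alignment_commands(ref_fasta, fastq_gz, prefix="result"):
    """Return the four shell command lines for one split FASTQ file."""
    ref = shlex.quote(ref_fasta)
    reads = shlex.quote(fastq_gz)
    out = shlex.quote(prefix)
    return [
        # Align the reads against the indexed reference.
        f"bwa aln {ref} {reads} > {out}.sai",
        # Convert the single-end alignment (.sai) to SAM.
        f"bwa samse {ref} {out}.sai {reads} > {out}.sam",
        # Convert SAM to BAM.
        f"samtools view -bS {out}.sam > {out}.bam",
        # Legacy samtools 0.1.x syntax: writes <prefix>_sorted.bam.
        f"samtools sort {out}.bam {out}_sorted",
    ]


# Hypothetical file names for one chromosome and one split FASTQ file.
for cmd in alignment_commands("chr9.fa", "part_001.fastq.gz"):
    print(cmd)
```

Generating the commands per (chromosome, split file) pair makes it easy to submit each combination as a separate grid job.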
14
Merge results and create a SNP list
Input: result_sorted.bam (multiple files)
samtools merge -> result_sorted_merged.bam
samtools index -> result_sorted_merged.bai
samtools pileup -> SNP-list.pileup
samtools.pl varFilter raw.pileup | awk '$6>=20' > final.pileup
-> SNP-list-filtered.pileup
pileup-to-bed.pl SNP-list-filtered.pileup
Local server
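The `awk '$6>=20'` step keeps pileup lines whose sixth column (SNP quality) is at least 20. A rough Python equivalent, as a sketch only: the function name is hypothetical, and unlike awk it silently skips lines whose sixth field is not numeric.

```python
def filter_by_snp_quality(lines, min_quality=20):
    """Keep pileup lines whose sixth whitespace-separated field is >= min_quality."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        try:
            snp_quality = float(fields[5])
        except ValueError:
            continue  # skip lines with a non-numeric sixth field
        if snp_quality >= min_quality:
            kept.append(line)
    return kept


demo_rows = [
    "chr9 869 c G 36 36 25 4 GGG, SF]!",   # real example row, SNP quality 36
    "chr9 900 a R 12 12 30 5 ,.g,. IIIII",  # hypothetical row, SNP quality 12
]
kept = filter_by_snp_quality(demo_rows)
print(len(kept), "of", len(demo_rows), "lines kept")
```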
15
Integrated Genome Viewer
16
SNP list
Column  Definition
------- --------------------------------------------------------
1       Chromosome
2       Position (1-based)
3       Reference base at that position
4       Consensus bases
5       Consensus quality
6       SNP quality
7       Maximum mapping quality
8       Coverage (# reads aligning over that position)
9       Bases within reads where (see Galaxy wiki for more info)
10      Quality values (phred33 scale, see Galaxy wiki for more)

chr9 49 * */+C 47 47 36 14 * +C 10 3 1 1 0
chr9 152 * */+A 530 530 33 40 * +A 21 19 0 0 0
chr9 190 * */-t 1037 1037 36 78 * -t 47 31 0 0 0
chr9 274 * */-c 521 521 30 67 * -c 50 17 0 0 0
chr9 340 * +A/+A 13 59 35 5 +A * 3 2 0 0 0
chr9 362 c Y 39 40 32 8 .,+1t,+1a,+1atgtt :]J5F/LA
chr9 469 g S 52 52 36 11 .CCCC....^F.^F. ]]]Y]]]]]][
chr9 576 c S 27 27 35 33 .$.,......,...Gg.gg......,g...,,,^F, ][]]]X]RY]]]]]]R]]]]]]]]]]Z]]LNO]
chr9 712 a R 59 59 36 24 ,....g.G.G....,..g,,...G ]]W]]]]]]]W]]V]]UH]U]]]]
chr9 869 c G 36 36 25 4 GGG, SF]!
chr9 1508 c S 34 34 34 24 ,$,,.,,,..,g,GG,,..GG.,.. ]]]+Y[]SHI\X]]]Z]Z]]]W]]
chr9 1547 t Y 157 157 33 32 ..CCCCCC...,,..,...cA,cC.CC,cC., BGYWOSTT\O]K/T]M]T]L!DB]]]]!8]]Q
17
Varscan
Chrom  Position  Ref  Var  Reads1  Reads2  VarFreq  Strands1  Strands2  Qual1  Qual2  Pvalue  MapQual1  MapQual2  Reads1Plus  Reads1Minus  Reads2Plus  Reads2Minus
chr10  83042     G    A    29      2       6.45%    2         2         48     60     0.98    1         1         17          12           1           1
chr10  83161     G    A    36      11      23.40%   2         2         57     58     0.98    1         1         16          20           3           8
chr10  83763     T    C    33      4       10.81%   2         2         57     53     0.98    1         1         13          20           2           2
chr10  83816     C    T    14      4       22.22%   2         1         52     60     0.98    1         1         7           7            4           0
chr10  84005     G    A    22      3       12.00%   2         2         58     60     0.98    1         1         12          10           1           2
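In the rows shown, VarFreq appears to be the fraction of variant-supporting reads, Reads2 / (Reads1 + Reads2). A short sketch that recomputes it from the example values; the function name is hypothetical, and the formula is inferred from these five rows rather than taken from the Varscan documentation.

```python
def var_freq(reads1, reads2):
    """Percentage of reads supporting the variant allele."""
    total = reads1 + reads2
    if total == 0:
        raise ValueError("no reads at this position")
    return 100.0 * reads2 / total


# (position, Reads1, Reads2) from the chr10 example rows.
rows = [
    (83042, 29, 2),
    (83161, 36, 11),
    (83763, 33, 4),
    (83816, 14, 4),
    (84005, 22, 3),
]
for pos, r1, r2 in rows:
    print(f"chr10:{pos} {var_freq(r1, r2):.2f}%")
```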
18
Port BWA to grid
Simple shell script to run BWA
Create GASW description (component description)
Create workflow description (Perl script or with Taverna)
Copy all binaries, Perl scripts, GASW description and workflow description to grid
Copy test dataset to grid
Test workflow
Execute workflow on real datasets
19
BWA on grid – user interface
lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/bwa/Scufl/BWAparam.scufl
20
BWA on grid – component description
21
BWA on grid – component description
22
BWA on grid – workflow description
23
BWA on grid – workflow design in Taverna
24
BWA on grid – user interface
lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/bwa/Scufl/BWAparam.scufl
25
BWA on grid – monitor jobs
26
BWA on grid – monitor jobs
27
Overview BWA workflow
Mark Santcroos
28
Implementation workflows
SNP analysis
Compare SNPs with known SNPs/mutations
Prioritize SNPs
Quality control
Basic experiment statistics
Quality of reads
Check exon coverage
Quality of dbSNP
Mutation consequences (silent, aa change)
SNP conservation
29
Optimization / IT
Optimal input file size
Split and merge
Security of sequence data
Copy large datasets to grid storage
Data storage
Lisa cluster
Cloud computing
Other workflow systems