Parallel computing in bioinformatics - T. Seemann - Balti Bioinformatics - Wed 10 Sep 2014
![Page 1: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/1.jpg)
Parallel computing
in bioinformatics
Dr Torsten Seemann
![Page 2: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/2.jpg)
Ideal world
● A single computer with
o one really fast processor
o huge amount of really fast memory
● Compromise #1: a single computer with
o lots of processors
o huge memory fast enough for all processors
● Compromise #2: a bunch of computers with
o lots of fast processors on each node
o lots of memory on each node
o really fast, low latency inter-node communication
![Page 3: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/3.jpg)
The real world
● None of these exist :-(
● Computer nodes
o Good: CPU & RAM on the increase
o Bad: CPU is competing for RAM
● Node:Node communication
o Good: getting faster
o Bad: latency gets worse with more nodes
![Page 4: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/4.jpg)
Types of parallelism
● Cluster
o distribute workload across networked computers
● SMP
o symmetric multiprocessing
o use multiple cores on a single computer
o (we’ll ignore NUMA)
● SIMD
o single instruction, multiple data
o same machine code instruction on a vector of values
o (we’ll ignore MIMD, GPU)
![Page 5: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/5.jpg)
Clusters
![Page 6: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/6.jpg)
Clusters
● Can be ad hoc
o bunch of PCs over Ethernet (Beowulf)
● Cluster specific
o high density, fast interconnect (Blade)
● Highly specialised
o high density, low power, very fast interconnect, low latency, many switches (e.g. IBM BlueGene)
![Page 7: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/7.jpg)
Using clusters
Break task into subtasks:
● Independent tasks
○ “pleasantly parallel” is a good situation!
○ Submit these to cluster queue
○ Combine results
● Dependent tasks
o Need to communicate during run
o Various ways to do this (more later)
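A minimal local sketch of the split / submit / combine pattern for independent tasks. The helper name `count_lines_parallel` and the line-counting subtask are illustrative assumptions, not from the talk; background jobs stand in for cluster queue submissions, and `split -n` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# count_lines_parallel FILE N  (hypothetical helper, not from the talk)
count_lines_parallel() {
    local input=$1 chunks=$2 total=0 workdir chunk part
    workdir=$(mktemp -d)

    # Scatter: cut the input into N pieces without splitting any line.
    split -n "l/$chunks" "$input" "$workdir/chunk."

    # Run: each piece is an independent subtask. On a real cluster this
    # loop would submit one queue job per piece instead of using '&'.
    for chunk in "$workdir"/chunk.*; do
        wc -l < "$chunk" > "$chunk.count" &
    done
    wait    # block until every subtask has finished

    # Gather: combine the partial results into one answer.
    for part in "$workdir"/chunk.*.count; do
        total=$(( total + $(cat "$part") ))
    done

    rm -rf "$workdir"
    echo "$total"
}
```

Called as `count_lines_parallel reads.txt 4`, it should report the same total as `wc -l < reads.txt`, which is a handy sanity check that no piece was lost in the combine step.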
![Page 8: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/8.jpg)
SMP
![Page 9: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/9.jpg)
SMP: symmetric multiprocessing
Use multiple cores on one node:
● Simple case
○ run multiple subtasks, one per core
● Multi-threading
○ use tools that support multiple cores
■ BWA, bowtie, samtools 0.1.19+
○ use languages that support native threading
■ Java
■ C, C++, Perl, Haskell - with standard libraries
■ Python has issues here (the CPython global interpreter lock stops CPU-bound threads running in parallel)
![Page 10: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/10.jpg)
Using SMP
● POSIX threads
o standard “C” Unix interface
o a library of functions
● OpenMP
o standard API for C, C++ and Fortran
o functions and #pragmas to help the compiler parallelize
● Unix Shell
o use job control and ‘&’ and ‘wait’
o Makefiles, GNU parallel, pipelines (more later)
● Use tools that do this natively for you
![Page 11: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/11.jpg)
SMP communication
● Sometimes threads need to talk
o Just like cluster nodes need to talk
● IPC
o Inter-Process Communication
● Methods
o files, time-stamped “touch files”
o pipes, sockets, message passing
o shared memory
o semaphores
o signals
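Two of the methods listed above, a pipe and a "touch file", can be tried from the shell. This is only a sketch: the `producer` and `consumer` functions and the message contents are made up for illustration.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
fifo="$workdir/channel"
mkfifo "$fifo"                      # a named pipe both processes can open

# Producer: sends three messages through the pipe, then leaves a
# "touch file" behind to signal that it has finished.
producer() {
    printf 'chr1\nchr2\nchr3\n' > "$fifo"
    touch "$workdir/producer.done"
}

# Consumer: reads messages until the producer closes its end of the pipe.
consumer() {
    while read -r msg; do
        echo "got $msg"
    done < "$fifo"
}

producer &                          # runs as a separate background process
received=$(consumer)
wait                                # make sure the producer has exited

echo "$received"
if [ -e "$workdir/producer.done" ]; then
    echo "producer finished"
fi
rm -rf "$workdir"
```

The pipe carries the data, while the touch file only carries a yes/no fact, which is why touch files suit "is step N done yet?" checks between otherwise unrelated jobs.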
![Page 12: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/12.jpg)
SIMD
![Page 13: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/13.jpg)
Machine code 101
● CPUs run “machine code” instructions:
    load  R0, [years]   # put var in reg
    mul   R0, 365       # mult by 365
    add   R0, 1         # add 1
    store [days], R0    # put reg in mem
● Each instruction does one atomic operation
○ to change one piece of data
■ memory location (RAM variable - slow)
■ register (CPU variable - fast)
![Page 14: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/14.jpg)
Vector operations
● Example
○ vector dot product: $x \cdot y = \sum_{i=1}^{|x|} x_i \times y_i$
● Pseudo-code
    var x, y : integer[8]
    var sum : integer
    sum := 0;
    for i in 0..7:
        sum := sum + x[i] * y[i]
● Operations
○ 1 + 8 * 3 = 25 ops
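The scalar dot-product loop can be run more or less directly as a shell loop. A small sketch, with made-up values for `x` and `y` (the talk gives none); it needs bash for arrays.

```shell
#!/usr/bin/env bash
# Two 8-element integer vectors (illustrative values only).
x=(1 2 3 4 5 6 7 8)
y=(8 7 6 5 4 3 2 1)

sum=0
# One element per iteration: this is the scalar version that a
# SIMD instruction set would collapse into a handful of vector operations.
for i in "${!x[@]}"; do
    sum=$(( sum + x[i] * y[i] ))
done
echo "$sum"
```

With these values the script prints 120. The loop body executes eight times, which is where the factor of 8 in the operation count comes from.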
![Page 15: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/15.jpg)
Vector operations
● Vector registers and instructions
○ assume 8-element operations (actually common!)
● SIMD
    load  V0, [x]    # put x[] in vec register
    load  V1, [y]    # same for y[]
    mult  V0, V1     # vector multiply!
    vsum  R7, V0     # vec sum into scalar reg
● Operations
○ 1 + 1 + 1 + 1 = 4 ops
![Page 16: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/16.jpg)
SIMD Instruction Sets
● Specialised since 1970s
○ MasPar
○ Connection Machine
○ Cray super-scalar
○ DEC Alpha MVI
● Consumer grade
○ Intel MMX / AMD 3DNow! (integer) [x86]
○ Intel SSE, SSE2, SSE3, SSE4.x (floating point) [x86]
○ IBM Altivec (both) [BlueGene,POWER]
● GPUs also, but they do MIMD too.
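To see which of the x86 instruction sets listed above a given machine offers, the CPU flags can be inspected. A hedged sketch, assuming Linux (it reads `/proc/cpuinfo`); the function name `simd_flags` is made up, and on non-x86 hardware it simply prints nothing.

```shell
#!/usr/bin/env bash
# simd_flags [FILE]  (hypothetical helper; FILE defaults to /proc/cpuinfo)
# Prints the MMX/SSE/AVX-family flags advertised by the first CPU entry.
simd_flags() {
    local cpuinfo=${1:-/proc/cpuinfo}
    grep -m1 -i '^flags' "$cpuinfo" 2>/dev/null \
        | tr ' ' '\n' \
        | grep -E '^(mmx|sse|ssse|avx)' \
        | sort -u
    return 0    # an empty list is not an error
}

simd_flags
```

Note that Linux reports SSE3 under the flag name `pni`, so it will not show up in this filter even when the CPU supports it.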
![Page 17: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/17.jpg)
Using SIMD
● Not accessible from scripting languages
o they are too many layers away from machine code
● Some libraries exploit it
o NumPy (uses some SSE in CoreFunc)
o GSL - GNU Scientific Library
o BLAS - Linear algebra
● Find the tools that use it
o HMMER (profile:sequence alignment)
o FASTA 35+, SWIFT (full local/global/semi alignment)
o BWA, Bowtie (short read alignment)
![Page 18: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/18.jpg)
Automatic SIMD vectorization
● Some compilers can recognise patterns that can be converted into SIMD instructions
○ Simple loops
○ Array operations
○ Data copying
● Re-compile your C/C++ code
○ GCC (GNU Compiler Collection)
■ gcc -march=native -O3
○ ICC (Intel C++ Compiler)
■ vectorization is automatic
![Page 19: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/19.jpg)
Using SMP
![Page 20: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/20.jpg)
Spawn multiple jobs
# run 23 alignments, 1 core per chromosome
for CHR in $(seq 1 1 23); do
    bwa mem $CHR.fasta reads.fq.gz \
        1> $CHR.sam 2> $CHR.err &
done
# wait until all background jobs finish
wait
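The loop above starts all 23 jobs at once, whatever the core count. One possible way to cap the number of simultaneous jobs is `xargs -P`. In this sketch an `echo` stands in for the `bwa mem` command so that it runs anywhere; the output file names are made up, and GNU `xargs` and `nproc` are assumed.

```shell
#!/usr/bin/env bash
outdir=$(mktemp -d)
cores=$(nproc 2>/dev/null || echo 2)    # fall back to 2 if nproc is missing

# One job per chromosome, but never more than $cores running at a time.
# xargs starts the next job as soon as a running one finishes.
# The echo is a stand-in for: bwa mem {}.fasta reads.fq.gz > {}.sam
seq 1 23 \
    | xargs -P "$cores" -I{} \
        sh -c "echo aligned chromosome {} > '$outdir/{}.out'"
status=$?    # xargs exits non-zero if any job failed

echo "jobs finished with status $status"
ls "$outdir" | wc -l
```

Unlike a bare `wait`, the exit status of `xargs` reports whether any of the jobs failed, which is easy to miss with background jobs.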
![Page 21: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/21.jpg)
Use a Makefile
% ls
1.fasta 2.fasta 3.fasta
% vi Makefile
all: 1.sam 2.sam 3.sam
%.sam: %.fasta reads.fq.gz
	bwa mem $< reads.fq.gz > $@
% make -j 8 # use 8 cores
% ls
1.fasta 2.fasta 3.fasta 1.sam 2.sam 3.sam
![Page 22: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/22.jpg)
GNU Parallel
% ls
1.fasta 2.fasta 3.fasta
% parallel -j 8 \
    "bwa mem {} reads.fq.gz > {.}.sam" \
    ::: *.fasta
% ls
1.fasta 2.fasta 3.fasta 1.sam 2.sam 3.sam
{} replaced by each *.fasta in turn
{.} is {} but with file extension removed
![Page 23: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/23.jpg)
Underused multi-threaded tools
● pigz
○ parallel gzip
○ if you have fast disks, scales to 64 cores easily
○ compression parallelises better than decompression
○ command line option: --processes=16 or -p 16
● pbzip2
○ parallel bzip2
● sort
○ yes, good ol’ Unix sort!
○ command line option: --parallel=16
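A short sketch combining two of these tools in one pipeline. The helper `compress_fast` is hypothetical: it uses `pigz` when it is installed and falls back to plain `gzip` otherwise. `--parallel` and `-S` assume GNU sort, and the thread counts and buffer size are arbitrary.

```shell
#!/usr/bin/env bash
# compress_fast THREADS  (hypothetical helper): stdin -> gzip stream on stdout
compress_fast() {
    local threads=$1
    if command -v pigz >/dev/null 2>&1; then
        pigz -c -p "$threads"       # multi-threaded compression
    else
        gzip -c                     # single-threaded fallback
    fi
}

sorted_gz=$(mktemp)

# GNU sort: --parallel sets the number of sort threads,
# -S sets the size of the in-memory buffer.
seq 1 100000 \
    | sort -n -r --parallel=4 -S 64M \
    | compress_fast 4 > "$sorted_gz"

# pigz output is ordinary gzip format, so any gzip tool can read it.
gzip -dc "$sorted_gz" | head -n 3
```

Because `pigz` writes standard gzip files, it can be swapped into an existing pipeline without changing anything downstream.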
![Page 24: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/24.jpg)
Dedicated pipeline system
● Ruffus / Rubra
● BPIPE
● Nesoni
.... and so many more
.......... and so many more still coming!
![Page 25: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/25.jpg)
Implicit Unix SMP
![Page 26: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/26.jpg)
Pipes
● When you pipe two commands together
○ two separate processes are started: A and B
○ a “pipe” connects A:stdout to B:stdin (A | B)
● Example
○ frequency distribution of initial 4-mers in English
    cat /usr/dict/words |   # already sorted
    cut -c 1-4 |            # first 4 characters
    tr 'A-Z' 'a-z' |        # canonicalize to lc
    uniq -c |               # count dupes
    sort -n -r |            # most freq first
    head -10                # top 10
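That the two sides of a pipe really are separate processes running at the same time can be checked with a stopwatch. A small sketch with two made-up stages that each just sleep for 2 seconds:

```shell
#!/usr/bin/env bash
start=$(date +%s)

# Stage A sleeps then writes; stage B sleeps then copies stdin to stdout.
# The shell starts both stages at once, so the sleeps overlap.
out=$( { sleep 2; echo "from A"; } | { sleep 2; cat; } )

end=$(date +%s)
elapsed=$(( end - start ))

echo "$out"
echo "elapsed: ${elapsed}s"
```

If the stages ran one after the other this would take about 4 seconds; run as a pipeline it should take roughly 2, which is the "implicit SMP" the section title refers to.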
![Page 27: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/27.jpg)
Pipes (result)
428 over
410 inte
300 comp
272 unde
262 cons
261 tran
248 cont
211 disc
197 comm
171 fore
![Page 28: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/28.jpg)
Sub-shells
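● Use case:
○ software alignerX only accepts .fastq files
○ you have compressed .fastq.gz files
○ your disk is slow and has no space left
● Sub-shells (bash process substitution) to the rescue!
    # needs uncompressed copies on disk:
    alignerX ref.fa R1.fq R2.fq
    # streams the decompressed reads, nothing extra written to disk:
    alignerX ref.fa <(zcat R1.fq.gz) \
                    <(zcat R2.fq.gz)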
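The same trick can be tried without an aligner. In this sketch `paste` plays the role of alignerX (a program that expects two file names), and the tiny gzipped inputs are made up; `<(...)` requires bash.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
printf 'read1\nread2\n' | gzip -c > "$workdir/R1.txt.gz"
printf 'mate1\nmate2\n' | gzip -c > "$workdir/R2.txt.gz"

# Each <(...) runs in its own process. bash replaces it with a path
# (such as /dev/fd/63) that paste opens like an ordinary file, so the
# decompressed data never touches the disk.
paired=$(paste <(gzip -dc "$workdir/R1.txt.gz") \
               <(gzip -dc "$workdir/R2.txt.gz"))

echo "$paired"
rm -rf "$workdir"
```

One caveat: the path behaves like a pipe, not a regular file, so a tool that needs to seek in its input or read it twice will not work this way.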
![Page 29: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/29.jpg)
Sub shells + Pipes
● Use case:
○ software alignerX only accepts .fasta files
○ you have compressed .fastq.gz files
● Sub-shells can be pipes too!
    alignerX ref.fa \
        <(zcat R1.fq.gz | paste - - - - | cut -f 1,2 \
            | sed 's/^@/>/' | tr "\t" "\n") \
        <(zcat R2.fq.gz | paste - - - - | cut -f 1,2 \
            | sed 's/^@/>/' | tr "\t" "\n")
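The FASTQ-to-FASTA filter used inside those sub-shells is easier to follow when pulled out and run on a tiny input. A sketch, assuming strict 4-line FASTQ records; the function name `fastq_to_fasta` and the two reads are made up.

```shell
#!/usr/bin/env bash
# fastq_to_fasta  (hypothetical name for the filter used on the slide)
fastq_to_fasta() {
    paste - - - - |     # join each 4-line record into one tab-separated line
    cut -f 1,2 |        # keep header and sequence, drop '+' and qualities
    sed 's/^@/>/' |     # FASTQ header marker '@' becomes FASTA '>'
    tr '\t' '\n'        # put header and sequence back on separate lines
}

# The second record's quality string starts with '@' on purpose.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n@@II\n' | fastq_to_fasta
```

This prints `>read1`, `ACGT`, `>read2`, `GGCC` on four lines. Because `paste` groups lines strictly by position, a quality line that happens to begin with `@` is never mistaken for a header.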
![Page 30: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/30.jpg)
Nested sub shells
HC SVNT DRACONES
(here be dragons)
![Page 31: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/31.jpg)
Putting it all together
![Page 32: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/32.jpg)
Making BAMs
● Align FASTQ to reference
o bwa mem ref R1.fq.gz R2.fq.gz > SAM
● Convert to BAM
o samtools view SAM > BAM
● Sort BAM
o samtools sort BAM > SORTBAM
● Remove dupes
o samtools rmdup SORTBAM > DEDUPBAM
![Page 33: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/33.jpg)
Making BAMs
Look mum! No intermediate files! Fewer idle CPUs!
% bwa mem -t 16 ref.fa R1.fq.gz R2.fq.gz \
    | samtools view -@ 16 -S -b -u -T ref.fa - \
    | samtools sort -@ 16 -m 1G -o - \
    | samtools rmdup - out.bam
-t 16   16 threads for bwa
-@ 16   16 threads for samtools 0.1.19+
-m 1G   1 GB RAM per thread for RAM sorting
-u      pipe an uncompressed BAM
-o      use stdout instead of writing to a file
![Page 34: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/34.jpg)
Conclusions
![Page 35: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/35.jpg)
Conclusions
● The “cluster” level
○ we are pretty good at that now
● The “SIMD” level
○ too low level, depend on others to exploit
○ thankfully many of our key tools already use it
● The “SMP” level
○ our pipelines still have single-threaded bottlenecks
○ always check if your tool has a --threads option
○ exploit pipes and sub-shells wherever possible
○ and use GNU Parallel - it’s awesome (and Perl)
![Page 36: Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042802/55a5dd401a28ab83558b4572/html5/thumbnails/36.jpg)