parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014

Parallel computing in bioinformatics

Dr Torsten Seemann

Ideal world

● A single computer with

○ one really fast processor

○ a huge amount of really fast memory

● Compromise #1: a single computer with

○ lots of processors

○ huge memory fast enough for all processors

● Compromise #2: a bunch of computers with

○ lots of fast processors on each node

○ lots of memory on each node

○ really fast, low latency inter-node communication

The real world

● None of these exist :-(

● Computer nodes

○ Good: CPU & RAM on the increase

○ Bad: CPU cores are competing for RAM

● Node:Node communication

○ Good: getting faster

○ Bad: latency gets worse with more nodes

Types of parallelism

● Cluster

○ distribute workload across networked computers

● SMP

○ symmetric multiprocessing

○ use multiple cores on a single computer

○ (we’ll ignore NUMA)

● SIMD

○ single instruction, multiple data

○ same machine code instruction on vector of values

○ (we’ll ignore MIMD, GPU)

Clusters

● Can be ad hoc

○ bunch of PCs over Ethernet (Beowulf)

● Cluster specific

○ high density, fast interconnect (Blade)

● Highly specialised

○ high density, low power, very fast interconnect, low latency, many switches (e.g. IBM BlueGene)

Using clusters

Break task into subtasks:

● Independent tasks

○ “pleasantly parallel” is a good situation!

○ Submit these to cluster queue

○ Combine results

● Dependent tasks

○ Need to communicate during run

○ Various ways to do this (more later)
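The independent-task pattern (break up, run, combine) can be tried on a single machine before moving to a real queue. This is only a minimal sketch under assumptions: the input file, the chunk size and the per-chunk "work" (counting lines with wc) are hypothetical stand-ins, and on a cluster each chunk would be submitted to the scheduler instead of being run with ‘&’.

```shell
# Hypothetical workload: 1000 input records, one per line
workdir=$(mktemp -d)
seq 1 1000 > "$workdir/input.txt"

# 1. Break the task into independent subtasks (250 lines each)
split -l 250 "$workdir/input.txt" "$workdir/chunk."

# 2. Run each subtask independently (stand-in for a queue submission)
for chunk in "$workdir"/chunk.*; do
    wc -l < "$chunk" > "$chunk.count" &
done
wait    # block until every background subtask has finished

# 3. Combine the partial results
total=$(cat "$workdir"/chunk.*.count | awk '{ s += $1 } END { print s }')
rm -r "$workdir"
echo "total records: $total"
```

Because the chunks never talk to each other, the combine step only has to add up four small numbers.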

SMP: symmetric multiprocessing

Use multiple cores on one node:

● Simple case

○ run multiple subtasks, one per core

● Multi-threading

○ use tools that support multiple cores

■ BWA, bowtie, samtools 0.18+

○ use languages that support native threading

■ Java

■ C, C++, Perl, Haskell - with standard libraries

■ Python has issues here (the global interpreter lock)

Using SMP

● POSIX threads

○ standard “C” Unix interface

○ a library of functions

● OpenMP

○ standard API for C, C++ and Fortran compilers

○ functions and #pragmas to help the compiler parallelize

● Unix shell

○ use job control: ‘&’ and ‘wait’

○ Makefiles, GNU parallel, pipelines (more later)

● Use tools that do this natively for you

SMP communication

● Sometimes threads need to talk

○ Just like cluster nodes need to talk

● IPC

○ Inter-Process Communication

● Methods

○ files, time-stamped “touch files”

○ pipes, sockets, message passing

○ shared memory

○ semaphores

○ signals
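One of the methods listed, a pipe, is easy to try from the shell with a named pipe (FIFO). The sketch below is illustrative only: the scratch directory, the pipe name and the message text are made up for the example.

```shell
# Hypothetical scratch location for the named pipe
workdir=$(mktemp -d)
fifo="$workdir/channel"
mkfifo "$fifo"

# Producer: runs in the background and writes one message into the pipe
echo "chr1 done" > "$fifo" &

# Consumer: blocks until the producer has written, then reads the message
read -r message < "$fifo"
wait                      # reap the producer
rm -r "$workdir"

echo "received: $message"
```

The consumer needs no polling or sleeping: opening the FIFO simply blocks until the other side is ready, which is the main advantage over "touch files".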

Machine code 101

● CPUs run “machine code” instructions:

    load  R0, [years]   # put var in reg
    mul   R0, 365       # mult by 365
    add   R0, 1         # add 1
    store [days], R0    # put reg in mem

● Each instruction does one atomic operation

○ to change one piece of data

■ memory location (RAM variable - slow)

■ register (CPU variable - fast)

● Example

○ vector dot product: $x \cdot y = \sum_{i=1}^{|x|} x_i \times y_i$

● Pseudo-code

    var x, y : integer[8]
    var sum : integer
    sum := 0;
    for i in 0..7:
        sum := sum + x[i] * y[i]

● Operations

○ 1 + 8 * 3 = 25 ops

Vector operations

● Vector registers and instructions

○ assume 8-element operations (actually common!)

● SIMD

    load V0, [x]    # put x[] in vec register
    load V1, [y]    # same for y[]
    mult V0, V1     # vector multiply!
    vsum R7, V0     # vec sum into scalar reg

● Operations

○ 1 + 1 + 1 + 1 = 4 ops

SIMD Instruction Sets

● Specialised since 1970s

○ MASPAR

○ Connection Machine

○ Cray super-scalar

○ DEC Alpha MVI

● Consumer grade

○ Intel MMX / AMD 3DNow! (integer) [x86]

○ Intel SSE, SSE2, SSE3, SSE4.x (floating point) [x86]

○ IBM Altivec (both) [BlueGene, POWER]

● GPUs also, but they do MIMD too.

Using SIMD

● Not accessible from scripting languages

○ they are too many layers away from machine code

● Some libraries exploit it

○ NumPy (uses some SSE in CoreFunc)

○ GSL - GNU Scientific Library

○ BLAS - Linear algebra

● Find the tools that use it

○ HMMER (profile:sequence alignment)

○ FASTA 35+, SWIFT (full local/global/semi alignment)

○ BWA, Bowtie (short read alignment)

Automatic SIMD vectorization

● Some compilers can recognise patterns that can be converted into SIMD instructions

○ Simple loops

○ Array operations

○ Data copying

● Re-compile your C/C++ code

○ GCC (GNU Compiler Collection)

■ gcc -march=native -O3

○ ICC (Intel C++ Compiler)

■ vectorization is automatic

Using SMP

Spawn multiple jobs

# run 23 alignments, 1 core per chromosome
for CHR in $(seq 1 1 23); do
    bwa mem "$CHR.fasta" reads.fq.gz \
        1> "$CHR.sam" 2> "$CHR.err" &
done

# wait until all background jobs finish
wait
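A bare ‘wait’ returns success even if some background jobs failed, so a crashed alignment can go unnoticed. One possible way to catch this is to remember each job's PID and wait on them one by one. In this sketch run_subtask is a hypothetical stand-in for the real command, rigged to fail for subtask 2.

```shell
# Hypothetical stand-in for an aligner: fails for subtask 2 only
run_subtask() {
    sleep 1
    [ "$1" -ne 2 ]
}

pids=""
for i in 1 2 3; do
    run_subtask "$i" &
    pids="$pids $!"        # remember the PID of each background job
done

failed=0
for pid in $pids; do
    # 'wait PID' returns the exit status of that particular job
    wait "$pid" || failed=$((failed + 1))
done
echo "failed subtasks: $failed"
```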

Use a Makefile

% ls
1.fasta 2.fasta 3.fasta

% vi Makefile
all: 1.sam 2.sam 3.sam

%.sam: %.fasta reads.fq.gz
	bwa mem $< reads.fq.gz > $@

% make -j 8 # use 8 cores

% ls
1.fasta 2.fasta 3.fasta 1.sam 2.sam 3.sam

GNU Parallel

% ls
1.fasta 2.fasta 3.fasta

% parallel -j 8 \
    "bwa mem {} reads.fq.gz > {.}.sam" \
    ::: *.fasta

% ls
1.fasta 2.fasta 3.fasta 1.sam 2.sam 3.sam

{} is replaced by each *.fasta file in turn

{.} is {} but with the file extension removed
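To see what the two placeholders expand to without running any alignments, the substitution can be imitated with ordinary shell parameter expansion. This is only an illustration of the naming rule, not how GNU Parallel works internally; the file names are the ones from the listing above.

```shell
# Input names taken from the listing above; nothing is actually executed
for input in 1.fasta 2.fasta 3.fasta; do
    output="${input%.*}.sam"      # strip the extension, like {.}, then add .sam
    echo "bwa mem $input reads.fq.gz > $output"
done
```

GNU Parallel itself can print the commands it would run via its --dry-run option, which is a handy check before launching a large batch.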

Underused multi-threaded tools

● pigz

○ parallel gzip

○ if you have fast disks, scales to 64 cores easily

○ compression parallelises better than decompression

○ command line option: --processes 16 or -p 16

● pbzip2

○ parallel bzip2

● sort

○ yes, good ol’ Unix sort!

○ command line option: --parallel=16
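A small sketch of the sort option in use, assuming GNU coreutils sort (other implementations may lack --parallel). The data set, thread count and buffer size are arbitrary choices for illustration; giving sort a larger memory buffer with -S generally matters as much as the thread count.

```shell
# Hypothetical data: 100000 integers in descending order
workdir=$(mktemp -d)
seq 100000 -1 1 > "$workdir/numbers.txt"

# Sort numerically with up to 4 threads and a 64 MB in-memory buffer
sort -n --parallel=4 -S 64M "$workdir/numbers.txt" > "$workdir/sorted.txt"

smallest=$(head -n 1 "$workdir/sorted.txt")
largest=$(tail -n 1 "$workdir/sorted.txt")
rm -r "$workdir"
echo "range: $smallest to $largest"
```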

Dedicated pipeline system

● Ruffus / Rubra

● BPIPE

● Nesoni

.... and so many more

.......... and so many more still coming!

Implicit Unix SMP

Pipes

● When you pipe two commands together

○ two separate processes are started: A and B

○ a “pipe” connects A:stdout to B:stdin (A | B)

● Example

○ frequency distribution of initial 4-mers in English

    cat /usr/dict/words |   # already sorted
    cut -c 1-4 |            # first 4 characters
    tr 'A-Z' 'a-z' |        # canonicalize to lc
    uniq -c |               # count dupes
    sort -n -r |            # most freq first
    head -10                # top 10

Pipes (result)

428 over

410 inte

300 comp

272 unde

262 cons

261 tran

248 cont

211 disc

197 comm

171 fore
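That the stages of a pipe really do run at the same time can be checked with a toy timing experiment. The two stages here are artificial (each just sleeps for 2 seconds); run one after the other they would need about 4 seconds.

```shell
start=$(date +%s)

# Stage A and stage B each take 2 seconds, but they are started together
{ sleep 2; echo "from A"; } | { sleep 2; cat; } > /dev/null

end=$(date +%s)
elapsed=$((end - start))
echo "elapsed: ${elapsed}s"
```

The elapsed time should be close to 2 seconds rather than 4, because both processes are started up front and overlap.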

Sub-shells

● Use case:

○ software alignerX only accepts .fastq files

○ you have compressed .fastq.gz files

○ your disk is slow and has no space left

● Sub-shells (bash process substitution) to the rescue!

    # instead of decompressing to disk first:
    alignerX ref.fa R1.fq R2.fq

    # decompress on the fly:
    alignerX ref.fa <(zcat R1.fq.gz) \
                    <(zcat R2.fq.gz)
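The same trick can be tried with standard tools only. In this sketch paste plays the role of a program that insists on file names, and the two tiny compressed files are made-up stand-ins for R1.fq.gz and R2.fq.gz. The <( ) syntax needs bash (or zsh/ksh); it is not available in plain POSIX sh.

```shell
workdir=$(mktemp -d)
# Hypothetical compressed inputs standing in for R1.fq.gz and R2.fq.gz
printf 'read1\nread2\n' | gzip -c > "$workdir/R1.txt.gz"
printf 'mate1\nmate2\n' | gzip -c > "$workdir/R2.txt.gz"

# paste expects file names; <(...) hands it a readable path for each stream
paired=$(paste <(gzip -dc "$workdir/R1.txt.gz") <(gzip -dc "$workdir/R2.txt.gz"))
rm -r "$workdir"
echo "$paired"
```

No decompressed copy is ever written to disk, and both gzip processes run alongside paste.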

Nested sub shells

HC SVNT DRACONES

(here be dragons)

Putting it all together

Making BAMs

● Align FASTQ to reference

○ bwa mem ref R1.fq.gz R2.fq.gz > SAM

● Convert to BAM

○ samtools view SAM > BAM

● Sort BAM

○ samtools sort BAM > SORTBAM

● Remove dupes

○ samtools rmdup SORTBAM > DEDUPBAM

Making BAMs

Look mum! No intermediate files! Fewer idle CPUs!

% bwa mem -t 16 ref.fa R1.fq.gz R2.fq.gz \
    | samtools view -@ 16 -S -b -u -T ref.fa - \
    | samtools sort -@ 16 -m 1G -o - \
    | samtools rmdup - out.bam

-t 16    16 threads for bwa
-@ 16    16 threads for samtools 0.18+
-m 1G    1 GB RAM per thread for RAM sorting
-u       pipe an uncompressed BAM
-o       use stdout instead of writing to a file
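One caveat with long pipelines: by default the shell reports only the exit status of the last command, so a crash in an early stage can leave a truncated but plausible-looking output file. In bash, ‘set -o pipefail’ is one way to surface such failures. The sketch below uses false and cat as stand-ins for a failing early stage and a succeeding final stage.

```shell
# Default behaviour: the pipeline's status is that of the LAST command only
false | cat
default_status=$?

# With pipefail, any failing stage makes the whole pipeline fail
set -o pipefail
pipefail_status=0
false | cat || pipefail_status=$?
set +o pipefail

echo "default: $default_status, pipefail: $pipefail_status"
```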

Conclusions

● The “cluster” level

○ we are pretty good at that now

● The “SIMD” level

○ too low level, depend on others to exploit

○ thankfully many of our key tools already use it

● The “SMP” level

○ our pipelines still have single-threaded bottlenecks

○ always check if your tool has a --threads option

○ exploit pipes and sub-shells wherever possible

○ and use GNU Parallel - it’s awesome (and written in Perl)
