Accelerating Error Correction in High-Throughput Short-Read DNA
Sequencing Data with CUDA
Haixiang Shi, Bertil Schmidt, Weiguo Liu, Wolfgang Müller-Wittig
Presenter: Erkan Okuyan
Motivation
• Massive amounts of sequencing data (Illumina, 454, SOLiD): short reads with a high error rate
• Assembly is sensitive to errors in reads, so sequencing errors need to be corrected
• Error correction at this data scale is computationally demanding
Definitions
- Let R = {r_1, r_2, …, r_k} be a set of k reads with |r_i| = L.
- Let r_i be in {A, C, G, T}^L for all 1 ≤ i ≤ k.
- Let m (multiplicity) and l (length) satisfy m > 1 and l < L.
• Definition 1 (Solid and Weak): An l-tuple (a DNA string of length l) is called solid with respect to R and m if it is a substring of at least m reads in R, and weak otherwise.
– An l-tuple replicated in at least m reads is probably a correct l-tuple
• Definition 2 (Spectrum): The spectrum of R with respect to m and l, denoted T_{m,l}(R), is the set of all solid l-tuples with respect to R and m.
– The spectrum T_{m,l}(R) serves as an approximation of the set of all correct l-tuples
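Definitions 1 and 2 can be illustrated with a small host-side C++ sketch. It is not the authors' code: the function name `build_spectrum` and the use of `std::set` are assumptions made for illustration, and a real implementation would use a far more compact representation. The sketch counts, for every l-tuple, the number of distinct reads that contain it and keeps the tuples whose count is at least m.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Returns T_{m,l}(R): every l-tuple that is a substring of at least m
// distinct reads in R. (Hypothetical helper, for illustration only.)
std::set<std::string> build_spectrum(const std::vector<std::string>& reads,
                                     std::size_t l, std::size_t m) {
    assert(l > 0 && m > 0);
    std::unordered_map<std::string, std::size_t> read_count;
    for (const std::string& r : reads) {
        // Collect the distinct l-tuples of this read first, so that a tuple
        // repeated inside one read contributes only once to its count.
        std::unordered_set<std::string> seen;
        for (std::size_t i = 0; i + l <= r.size(); ++i)
            seen.insert(r.substr(i, l));
        for (const std::string& t : seen) ++read_count[t];
    }
    std::set<std::string> spectrum;
    for (const auto& kv : read_count)
        if (kv.second >= m) spectrum.insert(kv.first);  // solid l-tuple
    return spectrum;
}
```

For the reads ACGT, ACGA, and TCGT with l = 3 and m = 2, the solid tuples are ACG and CGT; CGA and TCG each occur in only one read and are therefore weak.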
• Definition 3 (T-string): A DNA string s is called a T_{m,l}(R)-string if every l-tuple in s is an element of T_{m,l}(R).
• Definition 4 (Spectral Alignment Problem, SAP): Given a DNA string s and a spectrum T_{m,l}(R), find a T_{m,l}(R)-string s* that minimizes the distance function d(s, s*).
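Definitions 3 and 4 can be made concrete with a short C++ sketch. Two assumptions are made that the slides do not state: the distance d is taken to be the Hamming distance, and the search is restricted to strings at most one substitution away from s. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Definition 3: true if every l-tuple of s is in the spectrum.
// A string shorter than l has no l-tuples and is vacuously accepted.
bool is_t_string(const std::string& s, const std::set<std::string>& spectrum,
                 std::size_t l) {
    assert(l > 0);
    for (std::size_t i = 0; i + l <= s.size(); ++i)
        if (spectrum.count(s.substr(i, l)) == 0) return false;
    return true;
}

// Assumed distance function d: number of mismatching positions.
std::size_t hamming(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

// Restricted SAP: returns s if it is already a T-string, otherwise the first
// T-string found at Hamming distance 1, otherwise an empty string.
std::string nearest_t_string_within_1(const std::string& s,
                                      const std::set<std::string>& spectrum,
                                      std::size_t l) {
    if (is_t_string(s, spectrum, l)) return s;
    const char bases[] = {'A', 'C', 'G', 'T'};
    std::string candidate = s;
    for (std::size_t pos = 0; pos < s.size(); ++pos) {
        for (char b : bases) {
            if (b == s[pos]) continue;
            candidate[pos] = b;
            if (is_t_string(candidate, spectrum, l)) return candidate;
        }
        candidate[pos] = s[pos];  // restore before trying the next position
    }
    return "";
}
```

With the spectrum {ACG, CGT, GTA} and l = 3, the string ACTTA is not a T-string, and the sketch returns ACGTA, which differs from it in one position.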
CUDA (Compute Unified Device Architecture)
Serial Code (host)
Parallel Kernel (device)
KernelA<<< nBlk,nTid >>>(args);
Serial Code (host)
Parallel Kernel (device)
KernelB<<< nBlk,nTid >>>(args);
• Integrated host+device application program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code
CUDA Execution
• A GPU device
– Is a coprocessor to the CPU (the host)
– Has its own DRAM (device memory)
– Runs many threads in parallel
• Data-parallel portions of an application are expressed as device kernels, which run on many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
– Very little creation overhead
– A GPU needs thousands of threads for full efficiency
Parallel Error Correction with CUDA
• Each kernel thread is responsible for the correction of a single read r_i.
• Voting-based algorithm
– First step: calculation of the voting matrix
– Second step: single-mutation fixing, trimming, or discarding of reads
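The one-thread-per-read mapping can be sketched as a host-side C++ emulation of a one-dimensional kernel launch. This is plain C++, not CUDA: the two loops stand in for the block and thread indices, and the index formula mirrors the usual `blockIdx.x * blockDim.x + threadIdx.x`. The block size is a hypothetical parameter; the slides do not give the launch configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed so that every read gets a thread (rounds up).
std::size_t blocks_needed(std::size_t num_reads, std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (num_reads + threads_per_block - 1) / threads_per_block;
}

// Emulates the launch and returns, for each read, how many emulated threads
// were assigned to it. Every entry should be exactly 1.
std::vector<int> assign_reads(std::size_t num_reads, std::size_t threads_per_block) {
    const std::size_t num_blocks = blocks_needed(num_reads, threads_per_block);
    std::vector<int> owners(num_reads, 0);
    for (std::size_t block_idx = 0; block_idx < num_blocks; ++block_idx) {
        for (std::size_t thread_idx = 0; thread_idx < threads_per_block; ++thread_idx) {
            // Global index, as a kernel thread would compute it.
            const std::size_t i = block_idx * threads_per_block + thread_idx;
            if (i >= num_reads) continue;  // surplus threads in the last block idle
            ++owners[i];                   // placeholder for "correct read r_i"
        }
    }
    return owners;
}
```

The bounds check matters whenever the number of reads is not a multiple of the block size: with 10 reads and 4 threads per block, 3 blocks are launched and the last 2 threads do nothing.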
Step 1: Voting Matrix Calculation
Step 2: Fixing/Trimming/Discarding Reads
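The slides give only the titles of the two steps, so the following C++ sketch rests on an assumed voting scheme: for every weak l-tuple of a read, each position is substituted by each of the 3 other nucleotides, and a substitution that makes the tuple solid earns one vote for that read position and base. The fixing step then applies the single substitution with the most votes. Trimming and discarding are not sketched, and all names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One row per read position; columns are votes for A, C, G, T.
using VoteMatrix = std::vector<std::array<int, 4>>;
constexpr char kBases[4] = {'A', 'C', 'G', 'T'};

// Step 1 (assumed scheme): collect votes from all weak l-tuples of the read.
VoteMatrix vote(const std::string& read, const std::set<std::string>& spectrum,
                std::size_t l) {
    assert(l > 0);
    VoteMatrix votes(read.size(), std::array<int, 4>{});  // all zeros
    for (std::size_t i = 0; i + l <= read.size(); ++i) {
        std::string tuple = read.substr(i, l);
        if (spectrum.count(tuple)) continue;  // solid tuple: no votes cast
        for (std::size_t j = 0; j < l; ++j) {
            const char original = tuple[j];
            for (std::size_t b = 0; b < 4; ++b) {
                if (kBases[b] == original) continue;
                tuple[j] = kBases[b];  // try one of the 3 alternatives
                if (spectrum.count(tuple)) ++votes[i + j][b];
            }
            tuple[j] = original;  // restore before the next position
        }
    }
    return votes;
}

// Step 2 (fixing only): apply the single substitution with the most votes.
// The read is returned unchanged if no substitution received a vote.
std::string fix_single_mutation(const std::string& read, const VoteMatrix& votes) {
    assert(votes.size() == read.size());
    int best = 0;
    std::size_t best_pos = 0, best_base = 0;
    for (std::size_t pos = 0; pos < votes.size(); ++pos)
        for (std::size_t b = 0; b < 4; ++b)
            if (votes[pos][b] > best) {
                best = votes[pos][b];
                best_pos = pos;
                best_base = b;
            }
    std::string fixed = read;
    if (best > 0) fixed[best_pos] = kBases[best_base];
    return fixed;
}
```

With the spectrum {ACG, CGT, GTA} and l = 3, all three tuples of the read ACTTA are weak, and each of them votes for G at position 2, so the fixed read is ACGTA. In this sketch every weak tuple costs 3·l additional membership tests, which is why fast spectrum lookups matter.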
Fast Membership Tests
• The first algorithm (kernel) dominates the runtime
– (L − l) · (l + 3 · p · l) membership tests are required, where p is the number of l-tuples that do not belong to the spectrum
– A space-efficient Bloom filter speeds up membership tests against the spectrum
• The Bloom filter is computed on the CPU and stored in texture memory (a cached, read-only memory space) on the device
Bloom Filter
• Probabilistic data structure
– No false negatives
– Small percentage of false positives
– Space efficient and fast
• Uses a bit array B of length m and d hash functions h_1, …, h_d (here m is the filter size, not the multiplicity defined earlier)
– To insert x, set B[h_i(x)] = 1 for i = 1, …, d
– To query y, check whether B[h_i(y)] = 1 for all i = 1, …, d
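The insert and query rules above can be sketched in a few lines of C++. The slides do not say which hash functions are used, so this sketch assumes double hashing, h_i(x) = (h1(x) + i · h2(x)) mod m, built from two seeded FNV-1a hashes; that choice and the class name are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class BloomFilter {
public:
    // m: number of bits in B, d: number of hash functions.
    BloomFilter(std::size_t m, std::size_t d) : bits_(m, false), d_(d) {
        assert(m > 0 && d > 0);
    }

    // Insert x: set B[h_i(x)] = 1 for every hash function.
    void insert(const std::string& x) {
        for (std::size_t i = 0; i < d_; ++i) bits_[index(x, i)] = true;
    }

    // Query y: true only if B[h_i(y)] = 1 for every hash function.
    // May return a false positive, never a false negative.
    bool query(const std::string& y) const {
        for (std::size_t i = 0; i < d_; ++i)
            if (!bits_[index(y, i)]) return false;
        return true;
    }

private:
    // 64-bit FNV-1a, with a seed mixed into the offset basis.
    static std::uint64_t fnv1a(const std::string& s, std::uint64_t seed) {
        std::uint64_t h = 14695981039346656037ULL ^ seed;
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return h;
    }

    // Assumed scheme: h_i(x) = (h1(x) + i * h2(x)) mod m.
    std::size_t index(const std::string& x, std::size_t i) const {
        const std::uint64_t h1 = fnv1a(x, 0);
        const std::uint64_t h2 = fnv1a(x, 0x9e3779b97f4a7c15ULL) | 1;  // odd step
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    std::size_t d_;
};
```

A query on an empty filter always returns false, and a query for an inserted element always returns true. A query for an element that was never inserted may still return true if other insertions happened to set all of its bits.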
Bloom Filter Example
• a and b are inserted into a Bloom filter with m = 10 bits, n = 2 inserted elements, and d = 3 hash functions
• A query for c returns false, since at least one of its bits is 0.
• A query for d returns true, since all of its bits are 1 (a false positive).
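For a sense of how likely such a false positive is, the classical estimate can be evaluated for the example parameters. It assumes independent, uniformly distributed hash functions, which the slides do not state, so the numbers are indicative only.

```latex
\begin{align*}
p &\approx \left(1 - \left(1 - \tfrac{1}{m}\right)^{dn}\right)^{d}
   \approx \left(1 - e^{-dn/m}\right)^{d} \\[4pt]
\text{With } m = 10,\; n = 2,\; d = 3:\quad
p &\approx \left(1 - 0.9^{6}\right)^{3} = (1 - 0.531441)^{3} \approx 0.103, \\
\left(1 - e^{-0.6}\right)^{3} &\approx (1 - 0.5488)^{3} \approx 0.092 .
\end{align*}
```

Under these assumptions, roughly one query in ten for an element that was never inserted returns true in a filter this small. Practical filters keep m much larger relative to n, so the rate is far lower.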
Overall Algorithm
1) Pre-computation on the CPU: Program the Bloom filter (counting Bloom filter) bit vector by hashing each l-tuple present in the read set R.
2) Data transfer from CPU to GPU: Allocate device memory and transfer the Bloom filter and the reads.
3) Execute CUDA kernel.
4) Data transfer from GPU to CPU: Transfer the set of corrected/trimmed reads.
Performance Evaluation
• System parameters
– Nvidia GeForce GTX 280 with 1 GB memory
– AMD Opteron dual-core 2.2 GHz CPU with 2 GB memory
• Datasets
– Artificial sets (1%, 2%, 3% error rates)
• Yeast chromosomes (S.cer5, S.cer7)
• Bacterial genomes (H.inf, E.col)
– Real set
• Staphylococcus aureus strain MW2 (H.Aci) (error rate ~1%)
Discussion/Conclusion (GOOD)
• Runtime reductions by a factor of 10 to 19 are reported.
• Larger datasets are not an issue as long as the Bloom filter fits in texture memory: the reads can be processed in more than one load/correct round.
• Further parallelization across distributed-memory GPU farms is possible.
Discussion/Conclusion (BAD)
• Does not exploit the fast shared memory within thread blocks: a read r_i does not have to be handled by a single thread, since its voting matrix can be constructed in parallel, so further speedup is possible.
• Requiring a predetermined read length L is somewhat restrictive.
Thank You