CUDA-BASED SEQUENCE ALIGNMENT

By: Galal and Sameh

2013

OUTLINE

• Explore the architecture of the GPU.

• Parallel Programming using CUDA

• Sequence Alignment Algorithms

• Needleman-Wunsch Algorithm

• Smith-Waterman Algorithm

• Project Plan

GRAPHICS PROCESSING UNITS (GPU)

• A GPU is a many-core processor optimized for graphics workloads.

• Example: the NVIDIA GeForce GTX 280 GPU has 240 cores; each core is in-order, single-instruction, and heavily multithreaded.

• Every 8 cores share control logic and an instruction cache.

• GPU memory bandwidth is high compared with CPU memory bandwidth (about 10 to 1).

• Combining a GPU with a CPU is effective because CPUs consist of a few cores optimized for serial processing, while GPUs consist of many cores (possibly thousands) optimized for parallel processing.

[Figure: CPU (multiple cores) vs. GPU (thousands of cores)]

GPU ARCHITECTURE

Device architecture:

• Many Streaming Multiprocessors (SMs).

• Each SM has up to 8 processors.

• Different types of memory are available.

Ref: 3

COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)

• CUDA (Compute Unified Device Architecture) is an extension of C/C++.

• It supports scalable multi-threaded programs for CUDA-enabled GPUs.

• It facilitates heterogeneous computing: CPU + GPU.

CUDA PROGRAM

• Kernels:

• Parallel portions of an application are executed on the device (GPU)

• Invoked as a set of concurrently executing threads

• Threads are organized in a hierarchy consisting of thread blocks and grids

• Grid: set of independent thread blocks

• Thread block: set of concurrent threads

• Each thread has a unique ID: (threadIdx, blockIdx) ∈ {0, ..., dimBlock-1} × {0, ..., dimGrid-1}
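The indexing scheme above can be illustrated with a small host-side sketch. It is plain Python, not CUDA: it simulates a one-dimensional launch and shows how each (blockIdx, threadIdx) pair maps to a distinct global index, the same arithmetic a kernel commonly writes as blockIdx.x * blockDim.x + threadIdx.x. The function name global_thread_ids and the launch sizes are illustrative assumptions, not part of the original slides.

```python
def global_thread_ids(dim_grid, dim_block):
    """Enumerate (blockIdx, threadIdx, global_id) for a simulated 1-D launch."""
    ids = []
    for block_idx in range(dim_grid):          # blockIdx in {0, ..., dimGrid-1}
        for thread_idx in range(dim_block):    # threadIdx in {0, ..., dimBlock-1}
            # Each (blockIdx, threadIdx) pair yields a distinct global index.
            ids.append((block_idx, thread_idx, block_idx * dim_block + thread_idx))
    return ids


# Hypothetical launch: 3 blocks of 4 threads each.
launch = global_thread_ids(dim_grid=3, dim_block=4)
```

In a real kernel each thread would use its global index to pick the data element it works on.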

CUDA PROGRAMMING MODEL

[Figure: CUDA programming model, showing the CPU (host) and the GPU (device)]

SEQUENCE ALIGNMENT

• In bioinformatics, a sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

• There are two main types of sequence alignment:

• Global: used when the two sequences have almost the same size.

• Local: used to compare two sequences of different sizes.

LOCAL ALIGNMENT / GLOBAL ALIGNMENT

Sequence A

Sequence B

Global alignment

Alignment of the sequences over their whole length

G G C T G A C C A C C - T T
|     |   | |     | |   |
G A - T C A C T T C C A T G

Local alignment

Alignment of the high-similarity regions

G A C C A C C T T
| |   | | |   | |
G A T C A C - T T

Optimal local pairwise alignment: Smith and Waterman, 1981

Optimal global pairwise alignment: Needleman and Wunsch, 1970

WHY?

• Sequence alignment is a computationally intensive task whose cost grows in proportion to the database size.

• The problem has been addressed using iterative methods based on PSSMs (Position-Specific Scoring Matrices) instead of Hamming distance, for simplicity.

• There have been some remarkable efforts to reduce execution time on traditional computing hardware using heuristic techniques such as BLAST and FASTA.

• High computing capabilities are nowadays handy and affordable. For instance, my laptop has some GPU power (Nvidia® Tesla GeForce 315M).

SEQUENCE ALIGNMENT

• The Needleman-Wunsch (NW) algorithm has three main steps:

1. Initialization

2. Fill

3. Trace back
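The three steps can be sketched as a short sequential reference in Python. This is a minimal sketch under stated assumptions: the scoring values (match = 1, mismatch = -1, gap = -1) and the function name needleman_wunsch are illustrative choices, not values taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # 1. Initialization: first row and column hold accumulated gap scores.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # 2. Fill: each cell depends on its diagonal, upper, and left neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # 3. Trace back: walk from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("GATTACA", "GCATGCU")
```

The Fill step dominates the run time (it visits every cell of the matrix), which is why it is the usual target for parallelization.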

LITERATURE REVIEW

SIRIWARDENA AND RANASINGHE

• Their motivations were:

• Global sequence alignment is more resource-consuming than local alignment.

• Very few studies have been conducted on global sequence alignment.

• Their research goal was to evaluate different memory access strategies and different block sizes.

• They parallelized the “Fill” step only (the computationally intensive step).

Accelerating Global Sequence Alignment using CUDA compatible multi-core GPU

SIRIWARDENA AND RANASINGHE

• Despite the data dependencies in the “Fill” step, the computation follows a regular pattern.

• Scores of cells on the same anti-diagonal are independent of each other.

• Memory access was planned to minimize communication between device memory and host main memory.

• Intra-block

• Inter-block

• Implementation:

1. Without a blocking strategy:

1. The host is responsible for copying the data back and forth between host and device memory.

2. The device only performs the computation that determines the score of the current cell.

2. With a blocking strategy:

1. Global-memory-based strategy: the data is copied to device global memory (GM); each SP loads a block into shared memory and, once it finishes, copies the block back to GM, with a barrier for synchronization.

2. Shared-memory-based strategy (explicit synchronization).
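The anti-diagonal independence noted above can be sketched in plain Python. This is a sequential illustration only, not the authors' CUDA implementation: the inner loop over one anti-diagonal is the part a GPU could execute with one thread per cell. The function names and scoring values are assumptions made for the example.

```python
def anti_diagonals(rows, cols):
    """Yield the cells (i, j), with i, j >= 1, of each anti-diagonal i + j = d."""
    for d in range(2, rows + cols + 1):
        yield [(i, d - i) for i in range(max(1, d - cols), min(rows, d - 1) + 1)]


def fill_wavefront(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch Fill step, visiting cells one anti-diagonal at a time."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for cells in anti_diagonals(n, m):
        # Cells in this list read only the two previous anti-diagonals,
        # so they do not depend on each other and could run concurrently.
        for i, j in cells:
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score


matrix = fill_wavefront("GATTACA", "GCATGCU")
```

The number of independent cells grows toward the middle anti-diagonals and shrinks again near the corners, so the available parallelism varies during the fill.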

SIRIWARDENA AND RANASINGHE: EVALUATION

• Specifications (CPU) :

• 2.4GHz Intel quad core Processor

• 3 GB RAM

• Linux OS

• Specifications (GPU):

• Nvidia GeForce 8800 GT GPU

• 512MB graphics memory

• 114 cores and 16KB of shared memory per block

• CUDA version 2.3

CHE et al.

• They investigated GPU performance across different application fields that demand higher speed.

• The GPU implementation was compared with single-core and multi-core CPU implementations (CUDA vs. OpenMP, CUDA vs. serial implementation).

• They reported that the multi-core CPU implementation outperformed the GPU implementation (CUDA 1.1).

A PERFORMANCE STUDY OF GENERAL-PURPOSE APPLICATIONS ON GRAPHICS PROCESSORS USING CUDA

CHE et al.

• The parallelization took place on the “Fill” step only.

• They identified two parallelism levels

• Thread level parallelism

• Block level parallelism

• They reported a 2.9x speedup of the CUDA implementation over the single-CPU implementation.

ZHENG et al.

• Smith-Waterman algorithm is used

• Computing the scoring matrix is parallelized

• 32 threads are used to process the sub-matrices in parallel

• Threads in one warp are synchronized

• Maximum score within a block -> shared memory

• Global maximum score -> global memory

Accelerating biological sequence alignment algorithm on GPU with CUDA
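The two-level maximum described on this slide (a maximum per block, then a global maximum) can be sketched sequentially in Python. This is a hedged illustration, not the implementation of Zheng et al.: the tile size, the scoring values (match = 2, mismatch = -1, gap = -1), and the function names are assumptions, and the per-tile maximum merely stands in for the value a thread block would keep in shared memory.

```python
def smith_waterman_scores(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman scoring matrix (local alignment, scores floored at 0)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
    return h


def block_maxima(h, block=4):
    """Maximum score inside each block x block tile of the matrix."""
    n, m = len(h) - 1, len(h[0]) - 1
    maxima = []
    for bi in range(1, n + 1, block):
        for bj in range(1, m + 1, block):
            maxima.append(max(h[i][j]
                              for i in range(bi, min(bi + block, n + 1))
                              for j in range(bj, min(bj + block, m + 1))))
    return maxima


# Sequences taken from the alignment slide, with the gap symbols removed.
h = smith_waterman_scores("GGCTGACCACCTT", "GATCACTTCCATG")
# Reducing the per-tile maxima gives the global maximum score.
best = max(block_maxima(h), default=0)
```

Keeping one maximum per block limits traffic to global memory: only one value per block has to be written out and reduced at the end.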

ZHENG et al.

• NVIDIA GeForce 9600GT

• 8 SMs with 8 SPs each

• 8192 registers

• 16KB shared memory

• 64KB constant memory in one SM

• 768 MB global memory

• 256 threads to be executed concurrently

ZHENG et al.

• Swiss-Prot protein sequence database

• Query sequence length: 64 to 2048 amino acids

• Speedup: 19x compared with CPU implementation

ZHENG CONT.

CUDASW++

• Based on the Smith-Waterman algorithm

• Inter-task and intra-task parallelization

• Inter-task:

• Each task is assigned to exactly one thread

• dimBlock tasks are performed in parallel by different threads in a thread block

[Figure labels: query, subject]

CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units

CUDASW++

• Intra-task parallelization:

• Each task is assigned to one thread block

• All dimBlock threads in the thread block cooperate to perform the task in parallel

• Exploits the parallel characteristics of cells in the minor diagonals

[Figure labels: query, subject]

CUDASW++

• Database: Swiss-Prot release 56.6

• Number of query sequences: 25

• Query length: 144 ~ 5,478

• Single-GPU: NVIDIA GeForce GTX 280 (30 SMs, 240 cores, 1 GB RAM)

• Multi-GPU: NVIDIA GeForce GTX 295 (60 SMs, 480 cores, 1.8 GB RAM)

OUR PLAN

• Develop and execute the two main algorithms using CUDA as well as OpenMP.

• Study different parallelization scenarios and examine different memory strategies.

• Parallelize the “Fill” step as well as the “Trace back” step, if applicable.

• Performance criteria:

• The GPU implementation will be compared against a sequential (CPU) implementation.

• The GPU performance will be compared with previous work that parallelized only the “Fill” step.

• OpenMP vs. CUDA will be investigated.

PARALLELISM POSSIBILITIES

Block-level parallelism:

1. Different block sizes will be evaluated, such as 4x4, 8x8, 16x16, and 32x32; the parallel parts will be along the anti-diagonal directions.

2. Another possibility is to split the data into column-wise portions and carry out execution in the row-wise direction.
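The first option can be sketched as a scheduling problem in Python: blocks lying on the same anti-diagonal of blocks do not depend on each other, so each anti-diagonal forms one parallel step. This is a minimal sketch under stated assumptions: the 64x64 matrix size and the function name block_wavefront_schedule are hypothetical, and the second (column-wise) option is not covered.

```python
def block_wavefront_schedule(n, m, block):
    """Group block coordinates (bi, bj) by anti-diagonal of blocks."""
    nb_rows = (n + block - 1) // block   # number of block rows (ceiling division)
    nb_cols = (m + block - 1) // block   # number of block columns
    schedule = []
    for d in range(nb_rows + nb_cols - 1):
        # All blocks with bi + bj == d can be processed in the same step.
        schedule.append([(bi, d - bi)
                         for bi in range(max(0, d - nb_cols + 1),
                                         min(nb_rows - 1, d) + 1)])
    return schedule


# For each candidate block size: (number of sequential steps, widest step).
summary = {}
for size in (4, 8, 16, 32):
    steps = block_wavefront_schedule(64, 64, size)
    summary[size] = (len(steps), max(len(s) for s in steps))
```

Smaller blocks give more blocks per step but also more sequential steps, which is the trade-off the block-size evaluation is meant to measure.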

REFERENCES

1) GPU Gems 2: Chapter 30 (https://developer.nvidia.com/content/gpu-gems-2-chapter-30-geforce-6-series-gpu-architecture)

2) What is GPU computing (http://www.nvidia.com/object/what-is-gpu-computing.html)

3) D. Kirk and W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach. Burlington, Massachusetts: Morgan Kaufmann Elsevier, 2010.

4) T. R. P. Siriwardena and D. N. Ranasinghe, “Accelerating Global Sequence Alignment Using CUDA Compatible Multi-core GPU,” in 5th International Conference on Information and Automation for Sustainability (ICIAFs), pp. 201–206, 2010.

5) S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron, “A Performance Study of General-purpose Applications on Graphics Processors Using CUDA,” J. Parallel Distrib. Comput., vol. 68, no. 10, pp. 1370–1380, Oct. 2008.

6) F. Zheng et al., “Accelerating Biological Sequence Alignment Algorithm on GPU with CUDA,” in IEEE International Conference on Computational and Information Sciences (ICCIS), 2011.

7) Y. Liu, D. L. Maskell, and B. Schmidt, “CUDASW++: Optimizing Smith-Waterman Sequence Database Searches for CUDA-enabled Graphics Processing Units,” BMC Research Notes, vol. 2, no. 1, p. 73, 2009.


THANKS

Q & A

NEEDLEMAN-WUNSCH VS. SMITH-WATERMAN

• Needleman-Wunsch

• Global alignment

• Matches the whole sequence

• No gap penalty required

• H(i,j) = max{diag + s, left + e, up + e}

• Score cannot decrease between two cells of a pathway

• Smith-Waterman

• Local alignment

• Part of the sequence can be matched

• Requires a gap penalty to work effectively

• Score can increase, decrease, or stay level between two cells of a pathway

GPU PROPERTIES

Name: 'Quadro 7000'

Index: 1

ComputeCapability: '2.0'

SupportsDouble: 1

DriverVersion: 5

MaxThreadsPerBlock: 1024

MaxShmemPerBlock: 49152

MaxThreadBlockSize: [1024 1024 64]

MaxGridSize: [65535 65535]

SIMDWidth: 32

TotalMemory: 6.4420e+09

FreeMemory: 6.3484e+09

MultiprocessorCount: 16

ClockRateKHz: 1301000

ComputeMode: 'Default'

GPUOverlapsTransfers: 1

KernelExecutionTimeout: 0

CanMapHostMemory: 1

DeviceSupported: 1

DeviceSelected: 1
