Chao “Bill” Xie, Victor Bolet, Art Vandenberg

Georgia State University, Atlanta, GA 30303, USA

February 22/23, 2006

SURA, Washington DC

Memory Efficient Pairwise Genome Alignment Algorithm –

A Small-Scale Application with Grid Potential

Introduction

• A small-scale application is studied in the grid environment

• Performance is compared across shared memory, cluster, and grid environments

• A pairwise sequence alignment program is chosen as the small-scale application

• The basic algorithm is modified into a memory-efficient algorithm

• The parallel implementation for pairwise sequence alignment is studied in different environments

• Based on work done by Nova Ahmed, NMI Integration Testbed

Specification of the Distributed Environments

• Shared Memory environment is an SGI Origin 2000 machine with 24 CPUs

• Cluster environment at UAB was a Beowulf cluster with 8 homogeneous nodes, each node with four 550 MHz Pentium III processors and 512 MB of RAM

• Grid environment is the same Beowulf cluster as the cluster environment, with the Globus Toolkit software layer over it

• Summer 2005 USC HPC resources used

The Basic Pairwise Sequence Alignment Algorithm

• A two-dimensional array, the Similarity Matrix, holds a score for each pair of positions in the two sequences

• A match or a mismatch is calculated for each position in the pair of sequences to be matched

• Dynamic programming is used

[Figure: Similarity Matrix with Sequence X and Sequence Y along its axes; the boundary cells are 0]
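The dynamic programming step can be illustrated with a short Python sketch. The slides do not give the scoring scheme, so the recurrence below is a common local-alignment form and the `match`, `mismatch`, and `gap` values are assumptions chosen for illustration; `similarity_matrix` and `best_score` are hypothetical names, not the authors' code.

```python
def similarity_matrix(x, y, match=1, mismatch=-1, gap=-1):
    """Fill a full (len(x)+1) x (len(y)+1) similarity matrix.

    Assumed scoring: +1 match, -1 mismatch, -1 gap, scores clamped at 0.
    Row 0 and column 0 stay 0, as in the figure on the slide.
    """
    rows, cols = len(x) + 1, len(y) + 1
    h = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            h[i][j] = max(
                0,                    # never go below zero
                h[i - 1][j - 1] + s,  # extend along the diagonal
                h[i - 1][j] + gap,    # gap in y
                h[i][j - 1] + gap,    # gap in x
            )
    return h


def best_score(h):
    """Highest score anywhere in the matrix."""
    return max(max(row) for row in h)


if __name__ == "__main__":
    h = similarity_matrix("GA", "GA")
    for row in h:
        print(row)
    print("best score:", best_score(h))
```

The full matrix needs memory proportional to the product of the two sequence lengths, which is the cost the reduced memory algorithm targets.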

The Reduced Memory Algorithm

• Keep only nonzero elements of the matrix

• Memory dynamically allocated as required

• New data structure for efficiency
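One way to picture the three points above is a sparse store that holds only nonzero cells and grows as cells are written. The slides do not describe the actual data structure, so the Python dictionary here is an assumed stand-in, and the scoring values are the same illustrative assumptions as for the basic algorithm.

```python
def sparse_similarity(x, y, match=1, mismatch=-1, gap=-1):
    """Compute similarity scores, keeping only nonzero cells.

    Returns a dict mapping (i, j) -> score. A missing key means 0.
    The dict is an assumption; the slides only say a new data
    structure is used.
    """
    cells = {}  # grows only when a nonzero score is produced
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            score = max(
                0,
                cells.get((i - 1, j - 1), 0) + s,
                cells.get((i - 1, j), 0) + gap,
                cells.get((i, j - 1), 0) + gap,
            )
            if score:  # zero cells are never stored
                cells[(i, j)] = score
    return cells


if __name__ == "__main__":
    cells = sparse_similarity("GA", "GA")
    print(cells)
    print("stored cells:", len(cells), "of", 3 * 3)
```

How much memory this saves depends on how many cells are nonzero for the sequences at hand.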

The Parallel Method

• The genome sequences are divided among processors

• The Similarity Matrix is divided among processors

[Figure: Similarity Matrix partitioned among processors P1 to P5. Legend: part being computed; computation completed; Pi sends edge value to Pi+1]
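The edge exchange can be sketched without MPI by simulating the processors one after another. The sketch below assumes the matrix is split into contiguous column blocks, one per processor, and that each processor hands the last column of its block to the next one; the slides do not state the exact partitioning, and the scoring values and function names are illustrative assumptions. In a real MPI run the edge values would be sent as they are produced so the processors overlap in time.

```python
def split_columns(n_cols, n_procs):
    """Split columns 1..n_cols into n_procs contiguous (start, stop) ranges."""
    base, extra = divmod(n_cols, n_procs)
    ranges, start = [], 1
    for p in range(n_procs):
        width = base + (1 if p < extra else 0)
        ranges.append((start, start + width))  # stop is exclusive
        start += width
    return ranges


def compute_block(x, y, start, stop, left_edge, match=1, mismatch=-1, gap=-1):
    """Compute columns [start, stop) for all rows.

    left_edge[i] is the score in column start-1 of row i, i.e. the
    edge values received from the previous processor.
    """
    rows, width = len(x) + 1, stop - start
    block = [[0] * width for _ in range(rows)]
    for i in range(1, rows):
        for c in range(width):
            j = start + c
            left = block[i][c - 1] if c > 0 else left_edge[i]
            diag = block[i - 1][c - 1] if c > 0 else left_edge[i - 1]
            up = block[i - 1][c]
            s = match if x[i - 1] == y[j - 1] else mismatch
            block[i][c] = max(0, diag + s, up + gap, left + gap)
    # An empty block just forwards the edge it received.
    right_edge = [row[-1] for row in block] if width else list(left_edge)
    return block, right_edge


def pipeline_alignment(x, y, n_procs):
    """Simulate P1..Pn in order; each passes its edge column to the next."""
    edge = [0] * (len(x) + 1)  # column 0 is all zeros
    best = 0
    for start, stop in split_columns(len(y), n_procs):
        block, edge = compute_block(x, y, start, stop, edge)
        best = max(best, max(max(row, default=0) for row in block))
    return best


if __name__ == "__main__":
    for p in (1, 2, 4):
        print(p, "processors ->", pipeline_alignment("tgatggaggt", "gatagg", p))
```

Because only one edge column crosses each processor boundary, communication grows with the sequence length rather than with the size of the matrix.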

Results

[Figure: Computation time: Shared Memory, Cluster, Grid-enabled Cluster environment. Computation time (seconds) vs. number of processors for genome length 3000 (Grid, Cluster, Shared Memory)]

[Figure: Computation time: Cluster, Grid-enabled Cluster environment. Computation time (seconds) for genome length 10000 (Grid, Cluster)]

[Figure: Comparison of speed up: Shared Memory, Cluster, and Grid-enabled Cluster environment. Speed up for genome length 3000 (Grid, Cluster, Shared Memory)]

[Figure: Comparison of speed up: Cluster and Grid-enabled Cluster environment. Speed up for genome length 10000 (Grid, Cluster)]
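The slides do not define speed up explicitly. A minimal sketch, assuming the conventional definition (time on one processor divided by time on p processors), is shown below. The timings in the usage lines are placeholders for illustration only and are not measurements from this study.

```python
def speedup(t_serial, t_parallel):
    """Conventional speed up: one-processor time over p-processor time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Speed up per processor; 1.0 would be ideal linear scaling."""
    return speedup(t_serial, t_parallel) / n_procs


if __name__ == "__main__":
    # Placeholder timings in seconds, NOT results from the slides.
    t1, t8 = 8.0, 2.0
    print("speed up:", speedup(t1, t8))
    print("efficiency:", efficiency(t1, t8, 8))
```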

Results

UAB multi-cluster

[Figure: Comparison of multi-Cluster Grid environments. (a) Computation time (sec) vs. number of processors; (b) Speed up vs. number of processors. Series: Single Cluster, Single Clustered, Multi Clustered Grid]

Running Example

04.08.2004 (per Nova Ahmed, UAB Beowulf Cluster: Medusa)

Here are the steps for running the genome alignment program on the grid.

First, the sample program, which aligns a very small pair of genome sequences, is tested. The genome sequences were t1.txt and t2.txt.

The object file is:

Grid-proxy-init, RSL script, globusrun

1. First the grid-proxy-init is run to get the grid certificate

Your identity: /O=Grid/OU=UAB Grid/CN=Nova Ahmed

Enter GRID pass phrase for this identity:

Creating proxy .......................................................

Your proxy is valid until: Fri Apr 9 00:54:24 2004

2. Then create the RSL script in genome.rsl to run the job

& (count=4)

(executable=/home/nova/ar7)

(jobtype=mpi)

3. The program is then run on the grid using the globusrun command

globusrun -s -r medusa.lab.ac.uab.edu -f ./genome.rsl
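Steps 2 and 3 could be scripted along these lines when the processor count is varied between runs. This is a minimal sketch: `make_rsl` and `globusrun_command` are hypothetical helper names, the RSL attributes and command-line flags are copied from the example above, and nothing is submitted to the grid.

```python
def make_rsl(count, executable, jobtype="mpi"):
    """Build RSL text in the same layout as genome.rsl above."""
    return (
        f"& (count={count})\n"
        f"(executable={executable})\n"
        f"(jobtype={jobtype})\n"
    )


def globusrun_command(host, rsl_path):
    """Argument list matching the globusrun invocation shown above."""
    return ["globusrun", "-s", "-r", host, "-f", rsl_path]


if __name__ == "__main__":
    # Prints the script and command only; it does not run globusrun.
    print(make_rsl(4, "/home/nova/ar7"), end="")
    print(" ".join(globusrun_command("medusa.lab.ac.uab.edu", "./genome.rsl")))
```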

Output

------------------------------------

MyId = 1 NumProc = 4

[1 : 1 ->2 2]

[1 : 2 ->13 3]

[1 : 3 ->1 1] [1 : 3 ->11 1]

myid = 1 finished

[2 : 0 ->1 1] [2 : 0 ->11 1]

[2 : 2 ->1 1]

[2 : 3 ->2 2]

[2 : 4 ->2 2] [2 : 4 ->13 3]

[2 : 5 ->1 1] [2 : 5 ->13 3]

myid = 2 finished

[3 : 0 ->11 1] [3 : 0 ->21 1]

[3 : 1 ->2 2]

[3 : 2 ->11 1] [3 : 2 ->31 1]

[3 : 3 ->1 1]

[3 : 4 ->1 1] [3 : 4 ->12 2] [3 : 4 ->21 1]

[3 : 5 ->2 2] [3 : 5 ->12 2] [3 : 5 ->23 3] [3 : 5 ->31 1]

myid = 3 finished

tgatggaggt

gatagg

[0 : 0 ->11 1]

[0 : 2 ->1 1]

[0 : 4 ->11 1]

[0 : 5 ->11 1]

Elapsed time is =0.014624

myid = 0 finished

//----------------------

Running the program using longer genome sequences

a1-1000, a1-2000, a1-3000 compared with

a2-1000, a2-2000, a2-3000

USC HPC – Summer 2005

[Figure: Computation time (seconds) in the Cluster and Grid environments for a varying number of processors. (a) For small set sequences; (b) for long set sequences]

USC HPC – Summer 2005

[Figure: Speed up in the Cluster and Grid environments. (a) For small set sequences; (b) for long set sequences]

Conclusion

• Grid environment shows similar performance to cluster environment

• Grid environment adds little overhead

• Shared memory environment has better speedup performance compared to cluster and grid

• Shared memory environment shows the limitation of memory for computing large genome sequences

• Small scale applications (as well as large scale) can run efficiently on a grid

• Distributed applications with minimal communication among the processors will see benefit in a grid environment, perhaps even across multiple clusters

Future Work

• Additional work in a SURAgrid environment that includes multiple clusters

• Test data that provides a more computation intensive challenge for grid environments

• Adapt the application to the grid environment so that it uses less inter-process communication

Acknowledgements

• This material is based in part upon work supported by:

– National Science Foundation under Grant No. ANI-0123937 - NMI Integration Testbed Program. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF)

– SURA Grant SURA-2005-305 - SURAgrid Application Development & Documentation

• Thanks to:

– Nova Ahmed, currently Georgia Tech Computer Science PhD program, for original work carried out as part of NMI Integration Testbed Program

– John-Paul Robinson and University of Alabama at Birmingham for access to medusa cluster

– Jim Cotillier, Shelley Henderson, University of Southern California, for access to HPC resources

– Chao “Bill” Xie, Georgia State Computer Science PhD program, for continuing Nova Ahmed’s work

– Victor Bolet, Georgia State Information Systems & Technology Advanced Campus Services unit, for support of Georgia State’s SURAgrid nodes

– John McGee, RENCI.org, for discussions of approach using globus
