March 2, 2004, BMI 731 - Biomedical Data Management
Improving Performance of Multiple Sequence Alignment Analysis in Multi-client Environments
Use of Inexpensive Storage as Grid Cache
Umit Catalyurek, Mike Gray, Eric Stahlberg, Renato Ferreira, Tahsin Kurc, Joel Saltz
Department of Biomedical Informatics, The Ohio State University
Ohio Supercomputer Center
Outline
• Multiple Sequence Alignment
• CLUSTALW
• Sequence Analysis in Multiple Client Environment
  – Caching Intermediate Results
  – Deployment on SMP Machine
  – Deployment on Distributed Memory Machine
• Experimental Results
• Conclusion
Sequence Alignment
• An alignment is a mutual arrangement of two sequences
  – It shows where the two sequences are similar and where they differ

Sequence s:     AAT    AGCAA    AGCACACA
Sequence t:     TAA    ACATA    ACACACTA
Hamming dist.:  2      3        6
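
The Hamming distances in the table can be checked with a few lines of Python. This is a minimal sketch; the function name `hamming_distance` is an illustrative choice, not something from the slides.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


# The three sequence pairs shown on the slide.
pairs = [("AAT", "TAA"), ("AGCAA", "ACATA"), ("AGCACACA", "ACACACTA")]
distances = [hamming_distance(s, t) for s, t in pairs]
print(distances)  # [2, 3, 6]
```

Hamming distance compares position by position only, so it cannot account for insertions or deletions; that is what edit distance adds.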
Edit Distance
Unit cost:

s:     AGCACAC-A         AG-CACACA
t:     A-CACACTA   or    ACACACT-A
cost:  2                 4

distance(s, t) = 2
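
The two alignment costs and the resulting distance can be reproduced with a standard unit-cost dynamic program. This is a sketch under the assumption that a gap and a mismatch each cost 1, as the "Unit Cost" label suggests; the function names are illustrative.

```python
def alignment_cost(s_aligned: str, t_aligned: str) -> int:
    """Unit cost of a given alignment: 1 per mismatch or gap column."""
    return sum(a != b for a, b in zip(s_aligned, t_aligned))


def edit_distance(s: str, t: str) -> int:
    """Minimum unit-cost alignment, computed row by row."""
    prev = list(range(len(t) + 1))            # distance from "" to t[:j]
    for i, a in enumerate(s, start=1):
        curr = [i]                            # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                  # delete a from s
                curr[j - 1] + 1,              # insert b into s
                prev[j - 1] + (a != b),       # match or substitute
            ))
        prev = curr
    return prev[-1]


cost_a = alignment_cost("AGCACAC-A", "A-CACACTA")
cost_b = alignment_cost("AG-CACACA", "ACACACT-A")
best = edit_distance("AGCACACA", "ACACACTA")
print(cost_a, cost_b, best)  # 2 4 2
```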
Multiple Sequence Alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

or

VTISCTGSSSNIG-AGNHVKWYQQLPG
VTISCTGTSSNIG--SITVNWYQQLPG
LRLSCSSSGFIFS--SYAMYWVRQAPG
LSLTCTVSGTSFD--DYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNW--YVDG
ATLVCLISDFYPG--AVTVAW--KADS
AALGCLVKDYFPE--PVTVSW--NS-G
VSLTCLVKGFYPS--DIAVEW--ESNG

Optimal: $O(2^n \prod_i |s_i|)$
• 6 sequences of length 100, if the constant is $10^{-9}$ seconds: running time $6.4 \times 10^4$ seconds
• Add 2 sequences: running time $2.6 \times 10^9$ seconds
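
The running-time figures follow directly from the bound: with n sequences of equal length L, the step count is 2^n · L^n. The sketch below redoes that arithmetic; the function name and the default of 10^-9 seconds per step are taken from the slide's stated assumption, not from any measurement.

```python
def exhaustive_msa_seconds(n_sequences: int, length: int,
                           seconds_per_step: float = 1e-9) -> float:
    """Estimated time for optimal alignment: 2^n * length^n steps."""
    steps = (2 ** n_sequences) * (length ** n_sequences)
    return steps * seconds_per_step


six = exhaustive_msa_seconds(6, 100)    # about 6.4e4 seconds
eight = exhaustive_msa_seconds(8, 100)  # about 2.56e9 seconds (2.6e9 on the slide)
print(f"{six:.2e} s, {eight:.2e} s")
```

Adding two sequences multiplies the estimate by 2^2 · 100^2 = 40,000, which is why exact multiple alignment is impractical beyond a handful of sequences.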
CLUSTAL W
• Based on Higgins & Sharp CLUSTAL [Gene88]
• Progressive alignment-based strategy
  – Pairwise Alignment, $O(n^2 l^2)$
    • A distance matrix is computed using either an approximate method (fast) or dynamic programming (more accurate, slower)
  – Computation of Guide Tree, $O(n^3)$: phylogenetic tree
    • Computed from the distance matrix
    • Iteratively selecting aligned pairs and linking them
  – Progressive Alignment, $O(n l^2)$
    • A series of pairwise alignments computed using full dynamic programming to align larger and larger groups of sequences
    • The order in the Guide Tree determines the ordering of sequence alignments
    • At each step, either two sequences are aligned, or a new sequence is aligned with a group, or two groups are aligned
• n: number of sequences in the query
• l: average sequence length
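
To make the guide-tree phase concrete, here is a small sketch that turns a distance matrix into a nested-tuple tree by repeatedly joining the two closest clusters. It uses plain average-linkage clustering as a stand-in and is not claimed to be the tree-building method CLUSTAL W itself uses; the names `guide_tree`, `names`, and `dist`, and the toy distances, are all assumptions made for illustration.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Join the closest pair of clusters until one tree remains.

    names: list of sequence labels
    dist:  dict mapping frozenset({a, b}) -> pairwise distance
    """
    # Map each subtree (a label or a nested tuple) to its member leaves.
    clusters = {name: [name] for name in names}

    def average_distance(c1, c2):
        m1, m2 = clusters[c1], clusters[c2]
        total = sum(dist[frozenset((a, b))] for a in m1 for b in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        members = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = members          # the new internal node
    return next(iter(clusters))


# Toy distances: A and B are close, C and D are close, the groups are far apart.
names = ["A", "B", "C", "D"]
dist = {frozenset(pair): d for pair, d in {
    ("A", "B"): 1.0, ("C", "D"): 2.0,
    ("A", "C"): 6.0, ("A", "D"): 6.0, ("B", "C"): 6.0, ("B", "D"): 6.0,
}.items()}
tree = guide_tree(names, dist)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

Reading the tree from the leaves upward gives the progressive alignment order: A with B, C with D, then the two groups with each other.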
Sequence Analysis in Multiple Client Environment
• Many gene and protein databases can be accessed over the Internet
  – Multiple requests from multiple clients
• Data Caching
  – Cache pairwise alignments
    • Most expensive phase
    • Computations are independent
Data Caching
• Low-cost, high-performance, high-capacity commodity hardware
  – Disks are cheap: 100GB EIDE disks cost around $250
  – A PC costs around $700-$1000
    • no monitor
    • no high-end graphics card
    • moderate-size memory (128MB-512MB)
  – Switched Fast Ethernet
    • Better performance with channel bonding
  – In 2001: 6 Pentium III PCs, 1 TB of disk storage < $10,000
  – In 2002: 5 Pentium 4 PCs, 2.5TB of disk storage < $9,000
  – BMI Storage Cluster: 7.2TB, 24 PCs = $50,000-$55,000
  – UMD Storage Cluster: 9.5 TB, 50 PCs
Caching Pairwise Alignment Scores
• Sequence -> Unique ID (UID):
  – Use a hash (tested 10 hash functions including MD5; 4 of them give results similar to MD5)
  – Resolve collisions and assign a UID to each sequence
    • For more than 1 million sequences from GenBank, the maximum number of collisions per hash value was 3: constant time
• For each pairwise alignment, store two UIDs and a float score
  – B-Tree: used GIST B-Tree implementation
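
The cache record described above (two UIDs plus a float score) can be sketched as follows. A Python dict stands in for the B-tree, and ordering the two UIDs so that (a, b) and (b, a) share one entry is an assumption; the class and method names are hypothetical.

```python
class PairwiseScoreCache:
    """Stores one float score per unordered pair of sequence UIDs."""

    def __init__(self):
        self._scores = {}     # dict used here in place of the B-tree
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(uid1: int, uid2: int):
        # Assumption: a pairwise score is symmetric, so store the pair sorted.
        return (uid1, uid2) if uid1 <= uid2 else (uid2, uid1)

    def get_or_compute(self, uid1: int, uid2: int, compute) -> float:
        """Return the cached score, calling compute() only on a miss."""
        key = self._key(uid1, uid2)
        if key in self._scores:
            self.hits += 1
        else:
            self.misses += 1
            self._scores[key] = float(compute())
        return self._scores[key]


cache = PairwiseScoreCache()
first = cache.get_or_compute(7, 3, lambda: 0.42)    # miss: score is computed
second = cache.get_or_compute(3, 7, lambda: 99.0)   # hit: compute is not called
print(first, second, cache.hits, cache.misses)      # 0.42 0.42 1 1
```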
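Sequence -> Unique ID (UID):
[Diagram: a hash table indexed by hash value i; each entry points to a collision array of sequences indexed by j. The collision index uses c bits = 2^c elements.]
Unique ID = (i << c) || j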
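
A possible reading of the diagram in code: hash the sequence to a slot i, find its position j in that slot's collision array, and pack both into one integer. This sketch reads `||` as bit concatenation and implements it with bitwise OR; the class name, the default bit widths, and the use of MD5 reduced modulo a table size are all assumptions.

```python
import hashlib


class SequenceIds:
    """Assigns a stable integer UID to each distinct sequence."""

    def __init__(self, hash_bits: int = 20, collision_bits: int = 4):
        self.hash_bits = hash_bits
        self.c = collision_bits          # c bits allow 2^c entries per slot
        self.table = {}                  # slot i -> collision array of sequences

    def _slot(self, sequence: str) -> int:
        digest = hashlib.md5(sequence.encode("ascii")).digest()
        return int.from_bytes(digest, "big") % (1 << self.hash_bits)

    def uid(self, sequence: str) -> int:
        i = self._slot(sequence)
        bucket = self.table.setdefault(i, [])
        if sequence not in bucket:
            if len(bucket) == (1 << self.c):
                raise OverflowError("collision array is full")
            bucket.append(sequence)
        j = bucket.index(sequence)       # position inside the collision array
        return (i << self.c) | j


ids = SequenceIds()
a = ids.uid("AGCACACA")
b = ids.uid("ACACACTA")
a_again = ids.uid("AGCACACA")

# With zero hash bits every sequence lands in slot 0, so the UID is just j.
forced = SequenceIds(hash_bits=0, collision_bits=4)
forced_uids = [forced.uid(s) for s in ("AAT", "TAA", "AAT", "AGCAA")]
print(forced_uids)  # [0, 1, 0, 2]
```

Because the collision arrays stay short (at most 3 entries in the GenBank test reported on the previous slide), the linear scan inside a slot is effectively constant time.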
Deployment on SMP Machine
• A hash table is used to associate a sequence with a unique integer ID (UID)
• Partitioned B-tree stores pairwise alignment results
• Cache partition chosen by min(UID1, UID2) % #Partitions
• Multiple threads for pairwise alignment computation
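
The SMP design can be sketched as a set of independently locked cache partitions filled by a pool of worker threads. Dicts stand in for the B-tree partitions, and `NUM_PARTITIONS`, `cached_score`, and `toy_align` are hypothetical names; the toy scoring function is a placeholder, not a real alignment.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from threading import Lock

NUM_PARTITIONS = 4                                   # hypothetical partition count
partitions = [{} for _ in range(NUM_PARTITIONS)]     # dicts stand in for B-trees
locks = [Lock() for _ in range(NUM_PARTITIONS)]


def partition_of(uid1: int, uid2: int) -> int:
    """Partition rule from the slide: min(UID1, UID2) % #Partitions."""
    return min(uid1, uid2) % NUM_PARTITIONS


def cached_score(uid1: int, uid2: int, align) -> float:
    key = (min(uid1, uid2), max(uid1, uid2))
    p = partition_of(uid1, uid2)
    with locks[p]:
        if key in partitions[p]:
            return partitions[p][key]
    score = align(uid1, uid2)            # cache miss: compute outside the lock
    with locks[p]:
        return partitions[p].setdefault(key, score)


def toy_align(uid1: int, uid2: int) -> float:
    """Placeholder for a real pairwise alignment; returns a made-up score."""
    return float(abs(uid1 - uid2))


pairs = list(combinations(range(8), 2))
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(lambda pair: cached_score(*pair, toy_align), pairs))
print(len(scores), [len(p) for p in partitions])
```

Locking per partition lets threads that touch different partitions proceed without waiting on each other. In CPython, CPU-bound alignment code would need processes or native code to run truly in parallel; the threads here only mirror the structure.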
DataCutter
• Component Framework for Combined Task/Data Parallelism
• Core Services
  – Indexing Service: multilevel hierarchical indexes based on the R-tree indexing method
  – Filtering Service: distributed C++ component framework
• User defines sequence of pipelined components (filters and filter groups)
  – Pleasingly Parallel
  – Generalized Reduction
• User directive tells preprocessor/runtime system to generate and instantiate copies of filters
• Stream-based communication
• Multiple filter groups can be active simultaneously
• Flow control between transparent filter copies
  – Replicated individual filters
  – Transparent: single stream illusion
Combined Data/Task Parallelism
[Figure: copies of filters R (R0-R2), Ra (Ra0-Ra2), E (E0-EK, EK+1-EN) and M placed on host1 to host5 across Cluster 1, Cluster 2 and Cluster 3.]
http://www.datacutter.org
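
The idea of transparent filter copies sharing a single stream can be illustrated with threads and queues. This is not the DataCutter API (DataCutter is a C++ framework); it is only a sketch of the concept, and `run_filter`, `run_stage`, and `STOP` are invented names.

```python
import queue
import threading

STOP = object()   # end-of-stream marker


def run_filter(work, inbox, outbox):
    """One filter copy: read buffers until STOP, write results downstream."""
    while True:
        item = inbox.get()
        if item is STOP:
            inbox.put(STOP)          # let sibling copies see the end of stream
            break
        outbox.put(work(item))


def run_stage(work, items, copies):
    """Feed items through `copies` identical filters sharing one input stream."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=run_filter, args=(work, inbox, outbox))
               for _ in range(copies)]
    for t in threads:
        t.start()
    for item in items:
        inbox.put(item)
    inbox.put(STOP)
    for t in threads:
        t.join()
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return results


squares = run_stage(lambda x: x * x, range(10), copies=3)
print(sorted(squares))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The producer writes to one queue regardless of how many copies consume from it, which is the "single stream illusion" in miniature. Results arrive in no fixed order, so the example sorts them before printing.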
Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v1
• Hash Filter
  – Stores/computes the mapping from sequences to unique IDs
  – Partitioned (declustered) hash
• Cache Filter
  – Partitioned (declustered) cache
  – Computes a pairwise alignment if it doesn't exist in the cache
    • Owner computes: computational imbalance
• CLUSTALW Filter
  – Computes guide tree generation and progressive alignment
[Diagram: filters CLUSTALW, Hash (UniqueID), Cache & Compute]
Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v2
DC-ClustalW-v1 +
• Separate Pairwise Alignment Filter
  – Cache misses computed in Pairwise Align
  – Balanced computation
• Handles multiple queries
  – Multiple copies of CLUSTALW filter
[Diagram: filters CLUSTALW, Hash (UniqueID), Cache, Pairwise Align]
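
One way to see why "owner computes" can be imbalanced, and why a separate Pairwise Align filter can even things out, is to count how many alignments each partition would compute when every pair misses the cache. The sketch assumes the partition rule min(UID1, UID2) % #Partitions from the SMP slide also applies here, that the query's UIDs are 0 to 39, and that v2 hands misses out round-robin; all three are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations


def owner_computes_load(uids, partitions):
    """v1 style: the cache partition that owns a pair also computes it."""
    load = Counter({p: 0 for p in range(partitions)})
    for a, b in combinations(uids, 2):
        load[min(a, b) % partitions] += 1
    return load


def separate_filter_load(uids, copies):
    """v2 style: misses are spread over Pairwise Align copies (round-robin here)."""
    load = Counter({p: 0 for p in range(copies)})
    for k, _pair in enumerate(combinations(uids, 2)):
        load[k % copies] += 1
    return load


owner = owner_computes_load(range(40), 8)
balanced = separate_filter_load(range(40), 8)
print(sorted(owner.values()))     # [80, 85, 90, 95, 100, 105, 110, 115]
print(sorted(balanced.values()))  # [97, 97, 97, 97, 98, 98, 98, 98]
```

Under these assumptions the busiest owner does 115 of the 780 alignments while the idlest does 80; the round-robin split differs by at most one. Real UIDs derived from hash values would give a different, less regular skew.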
Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v2
Multiple Query Processing
• QM: QueryManager Filter
• CW: ClustalW Filter
• H: Hash Filter
• C: Cache Filter
• P: Pairwise Alignment Filter
[Diagram: QM on Host-0; a CW copy on each of Host-1 to Host-n; H, C and P copies on each of Host-n+1 to Host-2n]
Experimental Setup
1. Pentium III 650 MHz, 768MB Memory
   • 1000 random sequences from GPCR
   • Average length 450 amino acids per sequence
2. 24-Processor Sun Fire 6800, 750MHz, 24GB Memory
   • 350 MSA queries from GPCR, ranging from 2 sequences per query to over 200 sequences per query
3. 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
   • 64 queries, each consisting of 40 unique protein sequences from GPCR
   • Average length 450 amino acids per sequence
Experiment 1 – Execution Time of CLUSTAL W
[Figure: Execution Time of CLUSTAL W. x-axis: number of GPCR sequences (25 to 1000); y-axis: execution time in seconds (logarithmic scale, 1 to 100,000).]
[Figure: Breakdown of CLUSTAL W Execution Time on PIII-650MHz. x-axis: number of GPCR sequences (25 to 1000); y-axis: time fractions (0% to 100%); series: pairwise, guidetree, prog-align.]
Setup: Pentium III 650 MHz, 768MB Memory
• 1000 random sequences from GPCR
• Average length 450 amino acids per sequence
Experiment 2 - SMP Results
[Figure: Breakdown of Average Execution Time on 1-processor. x-axis: cache hit ratio (NOCACHE, 12%, 25%, 50%, 75%, 100%); y-axis: execution time in seconds (0 to 60); series: pairwise, guide tree, prog.align.]
[Figure: SMP: 64 Queries Total Execution Time. x-axis: number of processors (1, 2, 4, 8); y-axis: execution time in seconds (0 to 3500); series: no cache, directio, no directio.]
Setup: 24-Processor Sun Fire 6800, 750MHz, 24GB Memory
• 350 MSA queries from GPCR; from 2 sequences per query to over 200 sequences per query
Experiment 3 – Distributed Memory
DataCutter version of ClustalW – v1
[Figure: Average Query Execution Time - v1. x-axis: number of processors (1, 2, 4, 8); y-axis: time in seconds (0 to 25); series: no-cache, 0%, 25%, 50%, 75% and 100% hit ratio.]
[Figure: Breakdown of CLUSTALW Execution Time on 1-processor. x-axis: cache hit ratio (no-cache, 0%, 25%, 50%, 75%, 100%); y-axis: time in seconds (0 to 25); series: pair align, tree gen, prog align.]
Setup: 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
• 64 queries, each consisting of 40 unique protein sequences from GPCR
• Average length 450 amino acids per sequence
Experiment 3 – Distributed Memory
DataCutter version of ClustalW – v1
[Figure: Breakdown of CLUSTALW Execution Time on 2-processor. x-axis: cache hit ratio (no-cache, 0%, 25%, 50%, 75%, 100%); y-axis: time fractions (0% to 100%); series: search, compute, insert, tree gen, prog align.]
[Figure: Breakdown of CLUSTALW Execution Time on 8-processor. x-axis: cache hit ratio (no-cache, 0%, 25%, 50%, 75%, 100%); y-axis: time in seconds (0 to 9); series: pair align, tree gen, prog align.]
Setup: 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
• 64 queries, each consisting of 40 unique protein sequences from GPCR
• Average length 450 amino acids per sequence
Experiment 3 – Distributed Memory
DataCutter version of ClustalW – v2
1 ClustalW filter: intra-query parallelization
[Figure: Average Query Execution Time - v2 (load balanced). x-axis: number of processors (1, 2, 4, 8); y-axis: time in seconds (0 to 25); series: 0%, 25%, 50%, 75% and 100% hit ratio.]
[Figure: Speedup of DataCutter version of CLUSTALW. x-axis: number of processors (0 to 9); y-axis: speedup (0 to 9); series: linear, v2 - total, v1 - total, ideal speedup, v2 - pair align, v1 - pair align.]
Setup: 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
• 64 queries, each consisting of 40 unique protein sequences from GPCR
• Average length 450 amino acids per sequence
Experiment 3 – Distributed Memory
DataCutter version of ClustalW – v2
Multiple ClustalW filters: inter-query parallelization
[Figure: 64 Queries Total Execution Time. x-axis: number of copies of each filter (ClustalW, Hash, Cache, PairAlign): 1, 2, 4, 8; y-axis: time in seconds (0 to 1400); series: 0%, 25%, 50%, 75% and 100% hit ratio.]
Setup: 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
• 8 running a copy of Hash, Cache and PairAlign; 8 running ClustalW
• 64 queries, each consisting of 40 unique protein sequences from GPCR
• Average length 450 amino acids per sequence
Conclusion
• Caching intermediate results
  – Computationally intensive application -> data-intensive application
• SMP
• Distributed Memory implementation with DataCutter