04/23/2003
Massively Parallel Solutions for Molecular Sequence Analysis
Prabhakar R. Gudla
CMSC 838T Presentation
04/23/2003 CMSC 838T – Presentation 2
Outline
Motivation
Smith-Waterman Algorithm
  Parallelization
High Performance Computing
  Hybrid Architecture
  Fuzion 150
Performance Evaluation
Conclusions and Comments
Motivation
Discovered sequences are analyzed by comparison with databases
Complexity is proportional to the product of query size times database size
☞ Analysis too slow on sequential computers
Sequence Alignment
Two possible approaches:
  Heuristics, e.g. BLAST, FASTA; but the more efficient the heuristics, the worse the quality of the results
  Parallel processing: get high-quality results in reasonable time
[Figure: BLAST, FASTA, and Smith-Waterman (S-W) placed on two axes, search speed (slower to faster) and data quality (lower to higher)]
Parallelization of S-W
Matrix cells along a single diagonal are computed in parallel
Comparison is performed in l1 + l2 - 1 steps on l1 PEs
[Figure: DP matrix for ATCTCG (length l1, one character per PE, P1 to P6) against GTCTATC (length l2), filled diagonal by diagonal as GTCTATC is shifted through the PEs]
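The diagonal-by-diagonal schedule can be sketched in a few lines of Python. This is only an illustration, not the code that runs on the accelerators: the scoring values (match 2, mismatch -1, gap 1) are taken from the worked example in the extra slides, and the function name `sw_wavefront` is made up here. The inner loop runs sequentially, but every cell it visits on one diagonal is independent of the others, which is what lets l1 PEs work on them at the same time.

```python
def sbt(x, y):
    """Substitution score: +2 for a match, -1 for a mismatch (example values)."""
    return 2 if x == y else -1


def sw_wavefront(s1, s2, gap=1):
    """Fill the Smith-Waterman matrix one anti-diagonal at a time.

    Returns (best_score, number_of_wavefront_steps).
    """
    l1, l2 = len(s1), len(s2)
    # H has an extra zero row and column for the boundary conditions.
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    steps = 0
    # Anti-diagonal d holds all cells with i + j == d (1-based i and j).
    for d in range(2, l1 + l2 + 1):
        # Each i below could be handled by its own PE at the same time,
        # because the cells only read values from diagonals d - 1 and d - 2.
        for i in range(max(1, d - l2), min(l1, d - 1) + 1):
            j = d - i
            H[i][j] = max(
                0,
                H[i - 1][j] - gap,
                H[i][j - 1] - gap,
                H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]),
            )
            best = max(best, H[i][j])
        steps += 1
    return best, steps


score, steps = sw_wavefront("ATCTCG", "GTCTATC")
print(f"best score {score} after {steps} wavefront steps")
```

For the two sequences on this slide the loop over `d` runs l1 + l2 - 1 = 12 times, one pass per diagonal.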
Parallel Architectures
Embedded Massively Parallel Accelerators
  Fuzion 150: 1536 processors on a single chip
  Systola 1024: PC add-on board with 1024 processors
  Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan
Previous Applications
Volume Visualization [Schmidt `00]
Automatic Visual Quality Control (Automobile Industry)
Computer Tomography [Schmidt, Schimmler, and Schröder `98]
Video Compression [Schmidt and Schimmler `99]
Range of Transforms (Fourier, Wavelet, Hough, Radon) [Schmidt, Schimmler, and Schröder `99]
Image Processing [Schimmler and Lang `96; Lenders and Schröder `90; Jiang, Edirisinghe, and Schröder `97]
Hybrid Architecture
[Figure: 16 Systola 1024 boards connected through a high-speed Myrinet switch]
Hybrid computer: combines the SIMD and MIMD paradigms within one parallel architecture
Architecture of Systola 1024
[Figure: board block diagram showing the ISA, interface processors, RAM NORTH, RAM WEST, controller, program memory, and host computer bus]
Instruction Systolic Array (ISA): 32 × 32 mesh of processing elements
Wavefront instruction execution
Mapping onto Systola 1024
[Figure: query characters a0 ... a1023 are stored across the 32 × 32 array; subject sequences b0 b1 ... bk, c0 c1 ... enter the array one after another]
a: query sequence (length equal to 1024)
b: subject sequence
Subject sequences can be pipelined with only one step delay
k steps for a subject sequence of length k
Efficient routing on the ISA: Row Ringshift and Broadcast
Fuzion 150 Architecture
0.25-µm, single-chip SIMD architecture
1536 PEs @ 200 MHz
300 GOPS
600 GB/s on-chip, 6.4 GB/s off-chip bandwidth
Multithreading (control units interact via semaphores)
Developed by ClearSpeed Technology (UK) for graphics and network processing
[Figure: chip block diagram showing the linear SIMD array of 1536 PEs (each with 2 Kbytes DRAM), FUZION bus, 32-bit EPU (ARC), video I/O, display, instruction fetch, SIMD controller, local memory (Rambus; 1, 2 or 4 channels; 6.4 GB/s), and host (AGP)]
Fuzion 150 Architecture
[Figure: the PEs are organized in 6 blocks (Block 0 to Block 5) of 256 PEs each, PE(b,0) to PE(b,255), attached to the Fuzion bus and local memory. Each PE has an 8-bit ALU, a 32-byte register file, and 2 KByte of PE memory (DRAM); it receives instructions and connects to its left PE, its right PE, and the block I/O channel]
Mapping onto the Fuzion 150
[Figure: query characters a0 ... a1535 are distributed over Block 0 to Block 5, 256 per block; subject sequences b0 b1 ... bk, c0 c1 ... are shifted through the PE array]
a: query sequence (length equal to 1536)
b: subject sequence
No fast global communication
2-step local communication
Subject sequences can be pipelined with only one step delay
Performance Evaluation
Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths
Parallel implementation scales linearly with sequence length
Computing time dominates data transfer time
Query sequence length         256    512   1024   2048   4096
Fuzion 150 (s)                 12     22     42     82    162
  speedup to PIII 1 GHz        88     97    102    105    106
Systola 1024 (s)              294    577   1137   2241   4611
  speedup to PIII 1 GHz         4      4      4      4      4
Cluster of 16 Systolas (s)     20     38     73    142    290
  speedup to PIII 1 GHz        53     56     58     60     59
Fuzion 150 is 25 times faster than a single Systola 1024; difference in CMOS technology (0.25 µm vs 1.0 µm)
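The claims on this slide can be checked against its own numbers with a short Python sketch. The scan times and speedups are copied from the table; the helper names `implied_baseline` and `time_ratio` are made up for this illustration, and nothing here comes from new measurements.

```python
query_lengths = [256, 512, 1024, 2048, 4096]

# (scan time in seconds, speedup over a 1 GHz Pentium III), copied from the table.
results = {
    "Fuzion 150": [(12, 88), (22, 97), (42, 102), (82, 105), (162, 106)],
    "Systola 1024": [(294, 4), (577, 4), (1137, 4), (2241, 4), (4611, 4)],
    "Cluster of 16 Systolas": [(20, 53), (38, 56), (73, 58), (142, 60), (290, 59)],
}


def implied_baseline(system):
    """PIII scan time implied by each entry: scan time multiplied by speedup."""
    return [t * s for t, s in results[system]]


def time_ratio(slow, fast):
    """How many times faster `fast` is than `slow` at each query length."""
    return [a[0] / b[0] for a, b in zip(results[slow], results[fast])]


fuzion_vs_systola = time_ratio("Systola 1024", "Fuzion 150")

# Ratio of consecutive Fuzion scan times; each step doubles the query length.
fuzion_times = [t for t, _ in results["Fuzion 150"]]
growth = [fuzion_times[k + 1] / fuzion_times[k] for k in range(len(fuzion_times) - 1)]

for n, r in zip(query_lengths, fuzion_vs_systola):
    print(f"query length {n}: Fuzion 150 is {r:.1f}x faster than one Systola 1024")
print("growth per doubling of the query length:", [round(g, 2) for g in growth])
```

A growth factor close to 2 for every doubling of the query length is what linear scaling looks like, and the Fuzion-to-Systola ratio stays in the neighborhood of the quoted factor of 25.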
Performance Evaluation
Time comparisons for a 10 Mbase search on different parallel architectures with different query lengths
[Figure: bar chart of time in seconds (log scale, 1 to 100) for SAMBA, Fuzion 150, Kestrel, and 16K-PE MasPar at query lengths 512, 1024, and 2048]
4× faster than 16K-PE MasPar
6× faster than Kestrel
5× faster than SAMBA (special-purpose 3-board architecture)
Performance Evaluation
USparc : Sun Ultrasparc 140 MHz
B-SYS: 470-PE ISA
Alpha: DEC Alpha – 433 MHz
1K MP2: 1K-PE MasPar
Paragon: 32-node Paragon
Decy-1: 1-board Decypher-II*
Merc1: 1-board Mercury+
Bcll-1: Biocellerator*
Samba: 2-board Samba+
16-MP2: 16K-PE MasPar
FDF-3: 5-Board Paracell FDF+
Kestrel: 1-board Kestrel
Decy-15: 15-board Decypher-II*
+ (single purpose); * (FPGA)
Source: Dahle et al., PDPTA, 1243-1249, 1999
Conclusions
Demonstrated how fine-grained and hybrid parallel architectures can be applied efficiently for Comparative Genomics
Significant runtime savings for full genome comparisons and database searching
Same systems can be used for accelerating other bioinformatics applications, e.g. Hidden Markov Models
Comments
☞ With hardware support, is S-W as fast as BLAST?
Time taken for the search (seconds) against the Swiss-Prot DB, by sequence under test
Search tool       ELVIS (5)   Metr (276)   Arp_arath (536)
FASTA 3.3            4.3         20.0           25.0
BLAST 2.2            1.0          4.0           10.0
SSearch (SW)         6.0        240.0          565.0
H'Ware Accl.         3.2         16.8           29.7
Comparative search speeds on a 600 MHz 21264A Alpha machine (comparable MCUPS to the Hybrid System and Fuzion 150)
* Source: Shane Sturrock, SCS, 2(1), April 2002
Comments
☞ Is it feasible to use S-W as the default?
  Currently offered as a default option at EBI (European Bioinformatics Institute), which handles 15K queries per month with a full implementation of S-W
  Depends on the "objectives" of the search
☞ Just how much more accurate is S-W?
  5-10% more "sensitive" towards divergent matches than BLAST (Shpaer et al., Genomics 38, 179-191, 1996)
  BLAST will retrieve most biologically significant similarities, but will miss a few and will include some chance similarities
Comparison of S-W vs. BLAST
Source: Shpaer et al., Genomics 38(2), pp. 179-191, 1996
☞ Is there a real difference in the results? YES
Comparison of S-W, FASTA, and BLAST
Note: The numbers in the table show for how many protein SF the method in the column performed better than the one in the row
Acknowledgements
Dr. Bertil Schmidt
Dr. Chau-Wen Tseng
Q&A
Extra Slides
Full Genome Comparison
Related organisms, but Tuberculosis causes a disease
Find common and different parts
16 × 10^6 pairwise sequence comparisons
3918 protein sequences, 1,329,298 amino acids
4289 protein sequences, 1,359,008 amino acids
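The size of this comparison follows directly from the counts on the slide. The short Python calculation below is only a back-of-the-envelope sketch; the variable names are made up, and the number of DP cell updates assumes that every protein of one genome is aligned against every protein of the other with the full O(nm) algorithm.

```python
# Counts copied from the slide: (number of proteins, total amino acids).
proteins_a, residues_a = 3918, 1_329_298
proteins_b, residues_b = 4289, 1_359_008

# Every protein of one genome is compared with every protein of the other.
pairwise_comparisons = proteins_a * proteins_b

# Summing len(x) * len(y) over all pairs equals the product of the totals.
cell_updates = residues_a * residues_b

print(f"{pairwise_comparisons:,} pairwise comparisons")
print(f"{cell_updates:.3e} DP cell updates")
```

The product of the two protein counts is about 16.8 million, in line with the roughly 16 × 10^6 comparisons quoted above.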
Smith-Waterman Algorithm
Optimal local alignment of two sequences
Performs an exhaustive search for the optimal local alignment
Complexity O(nm) for sequence lengths n and m
Based on the 'dynamic programming' (DP) algorithm
  Fill the DP matrix using a substitution (mutation) matrix
  Find the maximal value (score) in the matrix
  Trace back from the score until a 0 value is reached
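The three steps (fill, find the maximum, trace back) can be sketched as a small sequential Python function. This is a minimal illustration under simplifying assumptions, not a production implementation: it uses a flat match/mismatch score instead of a real substitution matrix and a single linear gap penalty, and the name `smith_waterman` and its default parameters are chosen here for the example.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=1):
    """Return (score, aligned_s1, aligned_s2) for one best local alignment."""
    n, m = len(s1), len(s2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    # Step 1: fill the DP matrix.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] - gap, H[i][j - 1] - gap)
            # Step 2: remember where the maximal score occurs.
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Step 3: trace back from the maximum until a 0 value is reached.
    i, j = best_pos
    a1, a2 = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a1.append(s1[i - 1])
            a2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = smith_waterman("ATCTCGTATGATG", "GTCTATCAC")
print(score)
print(top)
print(bottom)
```

When several alignments share the best score, this sketch reports only one of them; which one depends on the order in which the traceback checks the three predecessors.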
Smith-Waterman Algorithm
Aligning S1 and S2 of lengths l1 and l2 using the recurrences:
\[
H(i,j) = \max\{0,\; E(i,j),\; F(i,j),\; H(i-1,j-1) + Sbt(S1_i, S2_j)\}, \quad 1 \le i \le l1,\; 1 \le j \le l2
\]
\[
E(i,j) = \max\{H(i,j-1) - \alpha,\; E(i,j-1) - \beta\}
\]
\[
F(i,j) = \max\{H(i-1,j) - \alpha,\; F(i-1,j) - \beta\}
\]
\[
H(i,0) = E(i,0) = 0, \qquad H(0,j) = F(0,j) = 0
\]
Calculate three possible ways to extend the alignment:
  by one amino acid (AA) in each sequence
  by one AA in the first sequence, aligned with a gap in the second
  by one AA in the second sequence, aligned with a gap in the first
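A direct transcription of the H, E, and F recurrences into Python looks as follows. It is a sketch that returns only the best score: the function name `sw_affine_score` is made up here, the two gap penalties are passed as `alpha` and `beta` (the penalty charged when a gap continues from an H cell and from an E or F cell, respectively), and the substitution function in the example is the simple 2 / -1 scheme rather than a real amino-acid matrix.

```python
def sw_affine_score(s1, s2, sbt, alpha, beta):
    """Best local alignment score computed with the H, E, F recurrences."""
    l1, l2 = len(s1), len(s2)
    # Row 0 and column 0 stay at zero, which gives the boundary conditions.
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    E = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    F = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            # Gap reached from the cell to the left (i, j - 1).
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            # Gap reached from the cell above (i - 1, j).
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            H[i][j] = max(
                0,
                E[i][j],
                F[i][j],
                H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]),
            )
            best = max(best, H[i][j])
    return best


def simple_sbt(x, y):
    """Example substitution function: +2 for a match, -1 otherwise."""
    return 2 if x == y else -1


linear = sw_affine_score("ATCTCGTATGATG", "GTCTATCAC", simple_sbt, alpha=1, beta=1)
stricter = sw_affine_score("ATCTCGTATGATG", "GTCTATCAC", simple_sbt, alpha=3, beta=1)
print(linear, stricter)
```

With alpha equal to beta the scheme behaves like a single linear gap penalty; raising alpha above beta makes a new gap cost more than lengthening an existing one, so the score can only stay the same or drop.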
Smith-Waterman Algorithm
Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC with
\[
Sbt(x,y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{else} \end{cases}, \qquad \alpha = 1,\; \beta = 1
\]
Recurrence for this example:
\[
H(i,j) = \max\{0,\; H(i-1,j) - 1,\; H(i,j-1) - 1,\; H(i-1,j-1) + Sbt(S1_i, S2_j)\}
\]
[Figure: filled DP matrix with S1 along the columns and S2 along the rows; the traceback path of the optimal local alignment is highlighted in the matrix and in both sequences]
Principles of the ISA
[Figure: array of processing elements; label: Communication Register]
Interface Processors
[Figure: the ISA with Interface Processors North and Interface Processors West]
Instruction Systolic Array
[Figure: instructions (+, *, -) move through the array as a wavefront, controlled by row selectors and column selectors]
Wavefront instruction execution
Fast accumulation operations (e.g. row sum, broadcast, ringshift)
Advantage of ISAs: Performing Aggregate Functions
• Row Broadcast: C := C[WEST]
  C = 234, 0, 0, 0 → 234, 234, 0, 0 → 234, 234, 234, 0 → 234, 234, 234, 234
• Row Sum: C := C + C[WEST]
  C = 1, 2, 3, 4 → 1, 3, 3, 4 → 1, 3, 6, 4 → 1, 3, 6, 10
• Row Ringshift: C := C[WEST]; C := C[EAST]
  C = 1000, 1, 1, 1 → 1, 1000, 1, 1 → 1, 1, 1000, 1 → 1, 1, 1, 1000
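The step-by-step register contents on this slide can be reproduced with a small Python simulation of one row of PEs. This is a simplified model, not a description of the real hardware: it assumes that the instruction wavefront reaches one PE per step, moving from west to east, and it models the ringshift as a chain of exchanges between neighboring PEs. The function names are made up for the illustration.

```python
def row_broadcast(row):
    """Wavefront of C := C[WEST]: each PE copies its western neighbor in turn."""
    states = [list(row)]
    for k in range(1, len(row)):
        row = list(row)
        row[k] = row[k - 1]
        states.append(row)
    return states


def row_sum(row):
    """Wavefront of C := C + C[WEST]: the last PE ends up holding the row sum."""
    states = [list(row)]
    for k in range(1, len(row)):
        row = list(row)
        row[k] += row[k - 1]
        states.append(row)
    return states


def row_ringshift(row):
    """Neighbor exchanges moving west to east: the first value ends up last."""
    states = [list(row)]
    for k in range(len(row) - 1):
        row = list(row)
        row[k], row[k + 1] = row[k + 1], row[k]
        states.append(row)
    return states


for name, states in [
    ("broadcast", row_broadcast([234, 0, 0, 0])),
    ("sum", row_sum([1, 2, 3, 4])),
    ("ringshift", row_ringshift([1000, 1, 1, 1])),
]:
    print(name, " -> ".join(str(s) for s in states))
```

In each case a row of n PEs finishes after n - 1 steps, which is why these aggregate operations are cheap on the ISA.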
Data Transfer
In Systola 1024: input of a new character (bj) into the lower western IP and, when l1 > 2048, input of previously computed H, E, and F cells and output of H, E, and F cells
For Fuzion 150: while 16 new H-cells are computed in each PE, one new character is input via the Fuzion bus
Instruction Counts
Instruction count (IC) to update 2 H-cells in Systola 1024 and 16 H-cells in Fuzion 150:
Operations in each PE per iteration step                          Systola   Fuzion
Get H(i – 1, j), F(i – 1, j), bj, max_{i-1} from neighbor            20       22
Compute t = max{0, H(i – 1, j – 1) + Sbt(ai, bj)}                    20      576
Compute F(i, j) = max{H(i – 1, j) – α, F(i – 1, j) – β}               8      336
Compute E(i, j) = max{H(i, j – 1) – α, E(i, j – 1) – β}               8      448
Compute H(i, j) = max{t, E(i, j), F(i, j)}                            8      368
Compute max_i = max{H(i, j), max_{i-1}}                               4      184
Sum                                                                  68     1934
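Because the two machines update a different number of H-cells per iteration, the totals are easier to compare per cell. The Python snippet below redoes that arithmetic from the table; the short step labels and the dictionary names are made up for this illustration.

```python
# (step, Systola 1024 instructions, Fuzion 150 instructions), copied from the table.
steps = [
    ("get values from neighbor", 20, 22),
    ("compute t", 20, 576),
    ("compute F(i, j)", 8, 336),
    ("compute E(i, j)", 8, 448),
    ("compute H(i, j)", 8, 368),
    ("compute max_i", 4, 184),
]

# H-cells updated per PE in one iteration step.
cells_per_iteration = {"Systola 1024": 2, "Fuzion 150": 16}

totals = {
    "Systola 1024": sum(systola for _, systola, _ in steps),
    "Fuzion 150": sum(fuzion for _, _, fuzion in steps),
}
per_cell = {name: totals[name] / cells_per_iteration[name] for name in totals}

for name in totals:
    print(f"{name}: {totals[name]} instructions, {per_cell[name]:.1f} per H-cell")
```

The per-cell figure is higher on the Fuzion 150, which is consistent with its 8-bit ALU needing more instructions for each operation than the Systola PEs.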
Maximum Characters/PE
The memory per PE on Systola is 32 (16-bit) registers
  2 characters per PE is the maximum possible (2 chars × 20 AAs per substitution row × 8 bits per substitution value = 20 registers)
The memory per PE on Fuzion is 2 KBytes
  the maximum is 16 chars per PE, restricted by per-PE "indirect addressing"
Indirect Address
An addressing mode found in many processors' instruction sets where the instruction contains the address of a memory location which contains the address of the operand (the "effective address") or specifies a register which contains the effective address
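The difference between a direct and an indirect load can be shown with a toy memory model in Python. This is a generic illustration of the definition above, not Fuzion 150 code; the memory contents and the function names are invented for the example.

```python
# A tiny flat memory: the index is the address, the element is the content.
memory = [0] * 16
memory[3] = 9    # location 3 holds the address of the operand
memory[9] = 42   # location 9 holds the operand itself


def load_direct(mem, address):
    """The instruction names the location that contains the operand."""
    return mem[address]


def load_indirect(mem, address):
    """The instruction names a location that contains the effective address."""
    effective_address = mem[address]
    return mem[effective_address]


print(load_direct(memory, 3))    # content of location 3
print(load_indirect(memory, 3))  # content of the location that location 3 points to
```

In a sequence-comparison setting this kind of access is what lets a PE use a data value, such as the incoming subject character, to select an entry of its locally stored substitution row.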
Myrinet - Overview
Myrinet is a cost-effective, high-performance, packet-communication and switching technology that is widely used to interconnect clusters of workstations, PCs, servers, or single-board computers
Conventional networks (e.g., ethernet) can be used to build clusters, but do not provide the performance/features required for HPC or high-availability clustering
Myrinet - Characteristics
Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports
Flow control, error control, and "heartbeat" continuity monitoring on every link
Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications
Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts
Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets
lq ≠ number of processors: Hybrid
Query sequence length = M, number of processors in ISA = N², assuming M = k × N:
1. k < N: each k × N subarray computes the alignment of the same query sequence with different subject sequences
2. k ≥ N:
• k/N = 2: load 2 chars per PE
• k/N > 2: split the query sequence into k/(2N) passes and load 2N² chars in each pass
lq ≠ number of processors: Fuzion 150
Length of query sequence = M, number of processors = 1536:
1. k × M = 1536: k alignments of the same query sequence with different subject sequences are carried out in parallel
2. k × 1536 = M:
• Split into k passes; requires I/O of intermediate results in each step
• Data transfers can be minimized by assigning k (= M/1536) chars per PE; currently 16 chars per PE is the limit
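The case distinctions for both machines can be written down as two small planning functions. The Python below is one possible reading of the rules, not the authors' code: the function names and the returned dictionary keys are hypothetical, the Systola function assumes that N divides the query length, and rounding up for lengths that do not divide evenly is an assumption added here.

```python
import math


def plan_systola(m, n=32):
    """Mapping of a query of length m onto an n x n ISA (assumes n divides m)."""
    if m < n or m % n != 0:
        raise ValueError("this sketch expects m to be a positive multiple of n")
    k = m // n
    if k < n:
        # Several k x n subarrays work on different subject sequences.
        return {"chars_per_pe": 1, "passes": 1, "parallel_alignments": n // k}
    if k <= 2 * n:
        return {"chars_per_pe": math.ceil(k / n), "passes": 1, "parallel_alignments": 1}
    # More than 2 chars per PE do not fit: split into passes of 2 * n * n chars.
    return {"chars_per_pe": 2, "passes": math.ceil(k / (2 * n)), "parallel_alignments": 1}


def plan_fuzion(m, pes=1536, max_chars_per_pe=16):
    """Mapping of a query of length m onto the linear array of the Fuzion 150."""
    if m <= pes:
        return {"chars_per_pe": 1, "passes": 1, "parallel_alignments": pes // m}
    chars = min(math.ceil(m / pes), max_chars_per_pe)
    # Extra passes are only needed once the per-PE limit is exhausted.
    return {"chars_per_pe": chars, "passes": math.ceil(m / (pes * chars)), "parallel_alignments": 1}


for length in (256, 1024, 2048, 4096):
    print(length, plan_systola(length), plan_fuzion(length))
```

Storing several characters per PE is preferred over extra passes in both functions because each additional pass requires intermediate results to be transferred in and out.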
Concept of true and false hits
The following cases were distinguished:
  true positives: alignments between proteins of similar structure that fall above a given threshold (defined by the sequence alignment method)
  false positives: alignments between proteins of dissimilar structure that fall above a given threshold of the sequence alignment
  true negatives: alignments between proteins of dissimilar structure that fall below a given threshold
  false negatives: alignments between proteins of similar structure that fall below a given threshold
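The four cases translate directly into a small classification function. The Python sketch below is illustrative only: the scores and structure labels in `hits` are invented for the example, the function names are hypothetical, and treating a score exactly equal to the threshold as "above" is an assumption.

```python
from collections import Counter


def classify_hit(score, similar_structure, threshold):
    """Label one alignment according to the four cases listed above."""
    above = score >= threshold  # assumption: a tie counts as above the threshold
    if above:
        return "true positive" if similar_structure else "false positive"
    return "false negative" if similar_structure else "true negative"


def tally(hits, threshold):
    """Count the four outcomes for a list of (score, similar_structure) pairs."""
    return Counter(classify_hit(score, similar, threshold) for score, similar in hits)


# Hypothetical alignment scores with a known structural relationship.
hits = [(120, True), (95, False), (40, True), (30, False), (150, True)]
counts = tally(hits, threshold=90)
print(dict(counts))
```

Lowering the threshold turns false negatives into true positives at the price of more false positives, which is the trade-off behind the sensitivity comparison between S-W and BLAST.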
Guidelines
When to use S-W?
  if you are looking for a protein distantly related to your query sequence (e.g., you have a known protein sequence and you want to find possible distant homologues)
  if you are looking for the protein encoded in your low-quality DNA query sequence (e.g., you have a badly sequenced cDNA clone)
  if you are looking for a DNA sequence corresponding to your protein query sequence (e.g., you want to identify potential homologues of your protein in the EST databases)
When to use BLAST?
  if you are looking for close matches and you don't mind missing lower homology sequences
  if you want a quick answer
Performance Evaluation of SAMBA
Time in seconds, with the speed-up of SAMBA over each machine
Query sequence length          10     30    100    300    1000    3000    10000
Samba (s)                      25     25     26     30      40      77      210
DEC-Alpha, 150 MHz (s)         57    120    350   1041    3468   11510    38450
  speed-up                    2.3    4.8   13.5   34.7    86.7     150      183
SUN-Sparc 5, 110 MHz (s)       95    239    746   2215    7300   24269    80300
  speed-up                    3.8    9.5   28.6     74     183     315      382
DEC 5000/250, 40 MHz (s)      182    548   1407   4054   12920   41169   131193
  speed-up                    7.3     22     54    135     323     534      625
Source: Jamet and Lavenier, CABIOS, 12(7), 609-615, 1997
☞ The longer the query length, the better the speed-up
Performance Evaluation of Kestrel
USparc : Sun Ultrasparc 140 MHz
B-SYS: 470-PE ISA
Alpha: DEC Alpha – 433 MHz
1K MP2: 1K-PE MasPar
Paragon: 32-node Paragon
Decy-1: 1-board Decypher-II*
Merc1: 1-board Mercury+
Bcll-1: Biocellerator*
Samba: 2-board Samba+
16-MP2: 16K-PE MasPar
FDF-3: 5-Board Paracell FDF+
Kestrel: 1-board Kestrel
Decy-15: 15-board Decypher-II*
+ (single purpose); * (FPGA)
Source: Dahle et al., PDPTA, 1243-1249, 1999
Performance Evaluation of Splash-2
Hardware          Specifics            MCUPS
Splash-2          Unidir; 16 boards   43,000
Splash-2          Bidir; 16 boards    34,000
Splash-2          Unidir; 1 board      3,000
Splash-2          Bidir; 1 board       2,100
Splash-1          Bidir; 746 PEs         370
SPARC 10/30 GX    gcc -O2                1.2
VAX 6620          VMS; CC                1.0
SPARC-1           gcc -O2               0.87
486DX-50 PC       DOS; gcc -O2          0.67
Source: Hoang, IEEE-CMM, 185-191, 1993
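MCUPS stands for million cell updates per second, where one cell update is the computation of one entry of the DP matrix. The Python helpers below show how such a rate relates to search time; the function names are made up, and the 10 Mbase database with a query of length 1024 is only an example workload chosen to match the scale used earlier in the talk.

```python
def mcups(query_length, database_length, seconds):
    """Million DP cell updates per second achieved by one search."""
    return query_length * database_length / seconds / 1e6


def search_time(query_length, database_length, rate_mcups):
    """Seconds needed for one search at a given MCUPS rate."""
    return query_length * database_length / (rate_mcups * 1e6)


# Example workload: query of length 1024 against a 10 Mbase database.
fastest = search_time(1024, 10_000_000, 43_000)  # 16-board Splash-2, unidirectional
slowest = search_time(1024, 10_000_000, 0.67)    # 486DX-50 PC

print(f"Splash-2 (16 boards): {fastest:.2f} s")
print(f"486DX-50 PC: {slowest / 3600:.1f} h")
```

Because the number of cell updates is simply the product of query length and database length, a rate in MCUPS lets systems be compared independently of the particular query used.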