Efficient Implementation of a String Matching Algorithm for SRC and Cray
Reconfigurable Computers
Esam El-Araby (1), Mohamed Taher (1), Tarek El-Ghazawi (1),
Mohamed Abouellail (1), Nandakishore Sastry (2), and Kris Gaj (2)
(1) The George Washington University, (2) George Mason University
El-Araby 2 1017 / MAPLD2005
Outline
Introduction
SRC Hardware & Software
Cray XD1 Hardware & Software
String Matching Algorithms
Implementation Methodology
Results and Comparisons
Conclusions
Introduction
[Figure: a microprocessor system (several µPs, each with its own µP memory, joined by an interface and I/O) shown beside a reconfigurable processor system (several FPGAs, each with its own FPGA memory, joined by an interface and I/O)]
SRC Architecture (Hi-Bar™ Based Systems)
Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier
Up to 256 input and 256 output ports with two tiers of switch
Common Memory (CM) has controller with DMA capability
Controller can perform other functions such as scatter/gather
Up to 8 GB DDR SDRAM supported per CM node
[Figure: SRC-6 system diagram. Labels: SRC Hi-Bar Switch, MAP®, µP, Memory, SNAP™, PCI-X, Common Memory, Chaining GPIO, Gig Ethernet etc., Customers' Existing Networks (Storage Area Network, Local Area Network, Wide Area Network, Disk)]
SRC Reconfigurable Processor
SRC Programming Environment
[Figure: µP system and FPGA system, with programming levels HLL (C) and HDL (VHDL)]
SRC Programming Environment (cont'd)
[Figure: compilation flow. Application sources (.c or .f files) pass through the µP Compiler and the MAP Compiler to produce object files (.o files). User macro sources (HDL, .vhd or .v files) and the .v files emitted by the MAP Compiler pass through logic synthesis (.edf files) and place & route (.bin files, the configuration bitstreams). The Linker combines the .o and .bin files into the application executable.]
SRC Programming Environment (cont'd)
Program in C or Fortran
Main program
  Function_1(a, d, e)
  Function_2(d, e, f)
Function_1
  Macro_1(a, b, c)
  Macro_2(b, d)
  Macro_2(c, e)
Function_2
  Macro_3(s, t)
  Macro_1(n, b)
  Macro_4(t, k)
[Figure: FPGA contents after the Function_1 call: one Macro_1 instance (a; b, c) feeding two Macro_2 instances (b, d) and (c, e)]
Cray XD1 System Architecture (One Chassis)
RapidArray components in a Cray XD1 chassis
FPGA and 2nd RAP are on Expansion Module
Compute: 12 AMD Opteron 32/64-bit x86 processors, High Performance Linux
RapidArray Interconnect: 12 communications processors, 1 Tb/s switch fabric
Active Management: dedicated processor
Application Acceleration: 6 co-processors
Cray XD1 Application Acceleration Interfaces
XC2VP30-50 running at up to 200 MHz
4 QDR II RAM with over 400 HSTL-I I/O at 200 MHz DDR (400 MTransfers/s)
16-bit simplified HyperTransport I/F at 400 MHz DDR (800 MTransfers/s)
QDR and HT I/F take up <20% of XC2VP30. The rest is available for user applications
[Figure: Virtex-II Pro FPGA containing User Logic, a RapidArray Transport Core (TX/RX to the RAP) and a QDR RAM Interface Core with four QDR II SRAM ports, each with ADDR(20:0), D(35:0), Q(35:0)]
Cray XD1 Development Flow
Hardware Flow Software Flow
Standard Hardware Flow
Cray XD1 Hardware Development Flow
Standard Flow
Additional High-Level Tools
Design Methodology using Cray XD1
Write application in C for system microprocessor
Identify computation-intensive routine(s)
Generate a bitstream using Cray Cores (RT & QDRII) and language of choice
  Create module in HDL (Verilog, VHDL)
  Create module using High Level Language Tools
  Validate module
  Synthesize using XST, Leonardo, or Synplify Pro
  Create bitstream using Xilinx place & route tools
Replace routines with Cray API calls
Run Application
String Matching - Introduction
String matching: detecting the occurrence of a particular substring, called the pattern, in another string, called the text
Types of string matching:
  Exact string matching
  Approximate string matching
Exact string matching:
  The pattern must occur in the text completely, that is, unbroken and with no irrelevant data between any letters
  Numerous applications: NIDS, text editing, etc.
Approximate string matching:
  The pattern rarely matches the text completely
  Finds application in computational biology (DNA matching), image detection, handwriting recognition, etc.
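As a point of reference for the exact case, the following is a minimal software sketch of a brute-force scan. It is not the implementation discussed in these slides; the function name `find_exact` and the sample strings are illustrative assumptions.

```python
def find_exact(pattern: str, text: str) -> list[int]:
    """Return every start index at which `pattern` occurs unbroken in `text`.

    Brute-force scan: compare the pattern against each window of the text.
    (Hypothetical helper, written only to illustrate exact matching.)
    """
    if not pattern:
        return []
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        if text[start:start + len(pattern)] == pattern:
            hits.append(start)
    return hits


# "ATA" occurs twice in "CATATAC", and the two occurrences overlap.
print(find_exact("ATA", "CATATAC"))   # prints [1, 3]
# GAATC never occurs unbroken in CATAC, so exact matching finds nothing.
print(find_exact("GAATC", "CATAC"))   # prints []
```

The second call shows why exact matching is of little use for sequence comparison: two related sequences almost never contain one another verbatim, which is what motivates the approximate approach.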
DNA Matching Basics
Why align two protein or DNA sequences?
  Determine whether they are descended from a common ancestor (homologous)
  Infer a common function
  Locate functional elements
  Infer protein structure, if the structure of one of the sequences is known
Problem: find the best pairwise alignment of GAATC and CATAC
Candidate alignments:
  GAATC    GAATC-   GAAT-C   GAAT-C   -GAAT-C   GA-ATC
  CATAC    CA-TAC   C-ATAC   CA-TAC   C-A-TAC   CATA-C
We need a way to measure the quality of a candidate alignment
Alignment scores consist of two parts:
  substitution matrix
  gap penalty
DNA Matching Basics (cont'd)
Scoring aligned bases
Purine: A, G
Pyrimidine: C, T
Transition (cheap): A <-> G or C <-> T
Transversion (expensive): purine <-> pyrimidine
A hypothetical substitution matrix:
       A    C    G    T
  A   10   -5    0   -5
  C   -5   10   -5    0
  G    0   -5   10   -5
  T   -5    0   -5   10
  GAAT-C
  CA-TAC
  -5 + 10 + ? + 10 + ? + 10 = ?
Scoring gaps
Linear gap penalty: every gap receives a score of d
  GAAT-C
  CA-TAC      d = -4
  -5 + 10 + -4 + 10 + -4 + 10 = 17
Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e
  G--AATC
  CATA--C     d = -4, e = -1
  -5 + -4 + -1 + 10 + -4 + -1 + 10 = 5
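The two worked scores on this slide can be checked with a short sketch. The function name `score_alignment` and its parameters are illustrative assumptions; the substitution values are the slide's hypothetical matrix, and the gap scores are the slide's d = -4 and e = -1.

```python
# The slide's hypothetical substitution matrix (10 for a match,
# 0 for a transition, -5 for a transversion).
SUBSTITUTION = {
    "A": {"A": 10, "C": -5, "G": 0, "T": -5},
    "C": {"A": -5, "C": 10, "G": -5, "T": 0},
    "G": {"A": 0, "C": -5, "G": 10, "T": -5},
    "T": {"A": -5, "C": 0, "G": -5, "T": 10},
}


def score_alignment(top, bottom, gap_open=-4, gap_extend=None):
    """Sum the column scores of a gapped alignment.

    With gap_extend=None every gap column scores gap_open (linear penalty).
    Otherwise the first column of a gap run scores gap_open and each
    following column of the same run scores gap_extend (affine penalty).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    if gap_extend is None:
        gap_extend = gap_open
    total = 0
    previous_gap_row = None  # row that held the gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            total += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            total += SUBSTITUTION[a][b]
            previous_gap_row = None
    return total


print(score_alignment("GAAT-C", "CA-TAC"))                    # prints 17
print(score_alignment("G--AATC", "CATA--C", gap_extend=-1))   # prints 5
```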
Approximate String Matching Algorithm (Smith-Waterman Algorithm)
Read sequences A and B into two arrays
Set the traceback array and similarity matrix to size (length of A + 1) x (length of B + 1)
Set the first row and column of the similarity matrix to 0
Initialize the traceback array by setting it to -1 (default value)
Repeat until the similarity matrix is complete:
  Compute SimilarityMatrix[i][j]:
  $$F(i,j) = \max\{\,0,\; F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) - \text{gap\_penalty},\; F(i,j-1) - \text{gap\_penalty}\,\}$$
  Update the traceback array
Traceback for best alignments
NOTE: The traceback array carries the coordinates of one of the three cells involved in the calculation of cell [i][j] in the similarity matrix
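The flowchart steps can be sketched in software as follows. This is a plain sequential reference under stated assumptions, not the FPGA design from these slides: each cell takes the best of zero, the diagonal neighbour plus a substitution score, or the upper or left neighbour minus a gap penalty. The names `smith_waterman` and `dna_score` are hypothetical, the scoring follows the hypothetical substitution matrix from the DNA matching slide, and the gap penalty is assumed to be a positive magnitude (4) that is subtracted.

```python
PURINES = {"A", "G"}


def dna_score(a, b):
    """Assumed scoring: 10 for a match, 0 for a transition, -5 otherwise."""
    if a == b:
        return 10
    same_class = (a in PURINES) == (b in PURINES)
    return 0 if same_class else -5


def smith_waterman(x, y, score=dna_score, gap_penalty=4):
    """Return (best score, aligned part of x, aligned part of y)."""
    rows, cols = len(x) + 1, len(y) + 1
    # Similarity matrix: first row and column stay 0.
    F = [[0] * cols for _ in range(rows)]
    # Traceback array: (-1, -1) is the default "no predecessor" value.
    trace = [[(-1, -1)] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            candidates = [
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), (i - 1, j - 1)),
                (F[i - 1][j] - gap_penalty, (i - 1, j)),
                (F[i][j - 1] - gap_penalty, (i, j - 1)),
            ]
            value, source = max(candidates, key=lambda c: c[0])
            if value > 0:  # otherwise the cell keeps 0 and no predecessor
                F[i][j], trace[i][j] = value, source
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    # Follow the stored coordinates back from the best cell.
    top, bottom = [], []
    i, j = best_cell
    while trace[i][j] != (-1, -1):
        pi, pj = trace[i][j]
        top.append(x[i - 1] if pi < i else "-")
        bottom.append(y[j - 1] if pj < j else "-")
        i, j = pi, pj
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("GAATC", "CATAC"))   # prints (26, 'AT-C', 'ATAC')
```

Because every cell depends only on its upper, left, and upper-left neighbours, all cells on one anti-diagonal are independent of each other, which is the property that arrays of processing elements exploit in hardware.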
Implementation Schemes in SRC
Software Only Implementation
Software/Hardware Implementation
Hardware Only Implementation
[Figure: the three schemes drawn across the µP system and the FPGA system. Labels: C function for µP, C function for MAP, VHDL macro, VHDL]
Operational Environment
Operational Scenarios for Cray XD1
µP-Initiated Transfers
FPGA-Initiated Transfers
Write-Only Transfers
Performance Results
Rate = (FPGA freq.) x (cells/cycle per SWPE) x (# SWPEs)
Opteron implementation (SSEARCH34)*
  100 million cell updates per second (CUPS)
Cray Inc. implementation*
  Current unoptimized design: 80 MHz x 1 x 32 = 2.56 billion CUPS (GCUPS)
  With optimization: 100 MHz x 1 x 50 = 5.0 GCUPS
  With future Virtex-4 FPGA: 100 MHz x 1 x 150 = 15 GCUPS
  25x speedup vs. Opteron
Our implementation
  SRC-6, current unoptimized design: 100 MHz x 1 x (16x16) = 25.6 GCUPS
    10x speedup vs. Cray, 256x speedup vs. Opteron
  Cray XD1, current unoptimized design: 200 MHz x 1 x (16x16) = 51.2 GCUPS
    20x speedup vs. Cray, 512x speedup vs. Opteron
*CUG'05, New Mexico, May 2005
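The arithmetic behind these figures can be reproduced with a few lines. This is a sketch under one stated assumption: the middle factor of the rate formula is read as cell updates completed per clock cycle by each processing element, which equals 1 in every design listed. The function and variable names are illustrative.

```python
def cups(freq_hz, cells_per_cycle, num_pes):
    """Peak cell updates per second for an array of processing elements."""
    return freq_hz * cells_per_cycle * num_pes


OPTERON_CUPS = 100_000_000  # SSEARCH34 on Opteron, as quoted on the slide

# (clock in Hz, cells per cycle per PE, number of PEs), taken from the slide
designs = {
    "Cray Inc., current": (80_000_000, 1, 32),
    "Cray Inc., optimized": (100_000_000, 1, 50),
    "Cray Inc., future Virtex-4": (100_000_000, 1, 150),
    "SRC-6, this work": (100_000_000, 1, 16 * 16),
    "Cray XD1, this work": (200_000_000, 1, 16 * 16),
}

cray_current = cups(*designs["Cray Inc., current"])
for name, params in designs.items():
    rate = cups(*params)
    print(
        f"{name}: {rate / 1e9:.2f} GCUPS, "
        f"{rate / cray_current:.1f}x vs. Cray current, "
        f"{rate / OPTERON_CUPS:.1f}x vs. Opteron"
    )
```

These are peak rates derived from clock frequency and array size only; they do not account for data transfer time between the microprocessor and the FPGA.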
Conclusions
Smith-Waterman sequence alignment algorithm has been implemented on both SRC-6 and Cray XD1 systems
Similarities and differences are highlighted with regard to:
  System hardware architecture
  Ease of programming
    Programming model
    Development time
    Hardware/software libraries
  Performance
    The speed-up vs. microprocessor is reported
Primary bottlenecks limiting the performance of both systems are recognized
The capability to share and port applications between the SRC and Cray systems is explored