Introduction to Custom ComputingIntroduction to Custom Computing
Oskar MencerOskar Mencer [email protected]
ObjectivesObjectives
• What is Custom Computing, and Why
• Compilation issues
• Run-time issues
• Systems: putting it all together
Overview of Custom ComputingOverview of Custom Computing
1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices
4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card
2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes
3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven
some Historysome History
1st “FPGA” Patents
1984 2002
XilinxXilinxmajor companies
AlteraAltera
SRAM-basedlookup tables
(LUTs)
large SRAM blocks FPGA+Multipliers
199719911st FPL Conference
Reconfigurable -Custom
Computing
DEC PRLPAM Project
SAT Solverson FPGAs
Datastructureson FPGAs
Signal Processingon FPGAs
1989TriscendTriscend
FPGA+Processor
Systolic ArraysNeural Networks
Flexible+Variable Structs.DataflowVLIW,…
RSA EncryptionSpeed Record
National SemiconductorNational Semiconductor
FPGA versus MicroprocessorFPGA versus Microprocessor
partly from P. Alfke, Xilinx.
… … and the price per gate goes down …and the price per gate goes down …
data taken partly from Xilinx from P.Alfke, Xilinx.
Time/months
Update 1
Update 2
1 10 20
30 Source: Algotronix Consulting / Xilinx
FPGA versus ASIC Accelerator Card FPGA versus ASIC Accelerator Card
Number of New Users
FPGA accelerator
ASIC accelerator
HW execution
HW compilation
The Speedup ArgumentThe Speedup Argument
>100 speedup
Execution Time
DIMACS Benchmarks, Xilinx FPGAs vs. GRASP Software
Boolean Satisfiability: FPGA vs Pentium
Source: J. Babb PhD thesis, MIT, 2001
Speedup: FPGA vs MIPS ProcessorSpeedup: FPGA vs MIPS Processor
Source: J. Babb PhD thesis, MIT, 2001
Energy Reduction: FPGA vs MIPS CoreEnergy Reduction: FPGA vs MIPS Core
Overview of Custom ComputingOverview of Custom Computing
1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices
4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card
2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes
3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven
A B C D Z
0 0 0 0 00 0 0 1 00 0 1 0 00 0 1 1 10 1 0 0 10 1 0 1 1 . . . 1 1 0 0 01 1 0 1 01 1 1 0 01 1 1 1 1
Look-Up Table (LUT): A Universal GateLook-Up Table (LUT): A Universal Gate• combinatorial logic: stored in Look-Up Tables• capacity limited by number of inputs not complexity• LUT delay: independent of logic!
16-bit SRAM
Combinatorial Logic
AB
CD
Z
CLB
CLB
CLB
CLB
S witchMa trix
ProgrammableInterconnect
Xilinx FPGAs: XC4000, Virtex, Virtex IIXilinx FPGAs: XC4000, Virtex, Virtex II
Configurable Logic Block (CLB)
FPGA Chip
LUT
4-LUTs flip-flops
tri-state buffers
carry chain
CLB
Slice 0Slice 0
LUT Carry
LUT Carry D QCE
PRE
CLR
DQ
CE
PRE
CLR
Slice 1Slice 1
LUT Carry
LUT Carry D QCE
PRE
CLR
D QCE
PRE
CLR
Simplified CLB Structure of Xilinx FPGAsSimplified CLB Structure of Xilinx FPGAs
The VLSI CAD Productivity GapThe VLSI CAD Productivity Gap
1
Log i
c Tr
a nsi
sto r
s p e
r Ch i
p(K
)
P
rodu
cti v
i t yTr
ans .
/Sta
f f - M
onth
10
100
1,000
10,000
100,000
1,000,000
10,000,000
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000 Logic Transistors/ChipTransistor/Staff Month
58%/Yr. compoundComplexity growth rate
21%/Yr. compoundProductivity growth rate
Source: SEMATECH
1981
1983
1985
1 987
1989
1991
1993
1995
1 997
1999
2003
2001
200 5
2 007
200 9
xx xx x
x
x
Overview of Custom ComputingOverview of Custom Computing
1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices
4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card
2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes
3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven
Programming FPGAs ToolflowProgramming FPGAs Toolflow
C/Matlab/Premiere
Compiler
Machinecode
Run-timeinterface
Configurationinformation
Fixed processor Custom processorCustom computing system
The “Compiler” for FPGAsThe “Compiler” for FPGAs
Conventional Compiler Passes
Architecture Generation
Module Generation
Gate Level CADFPGA Netlist
loop transformationsbitwidth analysispointer analysismemory managementdatastructure transformarchitecture selection
Module instantiationModule connectionControl generation
“instruction set” generationoptimal technology mappingcarry chain generation
FPGA Vendor Tools
FPGA Configuration
Languages for Architecture GenerationLanguages for Architecture Generation
• generate particular instances of a generic architecture, such as a signal processor, a stream architecture, a neural network, a cellular automaton, etc…
• HW Description Languages: VHDL, Verilog
• SW Languages used for Custom Computing• C/C++, Java, Ruby, etc.
• C++ example: ASC – A Stream Compiler
ASC - A Stream Compiler for FPGAs ASC - A Stream Compiler for FPGAs Bell Labs
ASC generates Stream Architectures based on C++ input.
MemoryMemoryMemoryMemory
featuresdistributed registersdistributed delay FIFOslocal memory blocksexternal MemoryPAM-Blox II modules
(pre-pipelined)
ASC Example: IDEA EncryptionASC Example: IDEA EncryptionLOOP(8); LoopIndex(i);
word1 = HWmul(word1, key[i*6+0]); word2 = word2 + key[i*6+1]; word3 = word3 + key[i*6+2]; word4 = HWmul(word4, key[i*6+3]);
t2 = word1 ^ word3; t2 = HWmul(t2, key[i*6+4]);
t1 = (t2 + (word2 ^ word4)); t1 = HWmul(t1, key[i*6+5]); t2 = (t1 + t2);
word1 ^= t1; word4 ^= t2;
t2 ^= word2; word2 = word3 ^ t1; word3 = t2;LOOP_END();
1
10
100
1000
Processors XCV300 XCV600 XCV2000
Mbits/s
Performance
ratio ioncommunicatncomputatio high
*“unroll once” “unroll 8x”
OPTIMIZE=REDUNDANT;
Overview of Custom ComputingOverview of Custom Computing
1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices programming FPGAs - module generation
4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card
2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes
3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven
The “Compiler” for FPGAsThe “Compiler” for FPGAs
Conventional Compiler Passes
Architecture Generation
Module Generation
Gate Level CADFPGA Netlist
loop transformationsbitwidth analysispointer analysismemory managementdatastructure transformarchitecture selection
Module instantiationModule connectionControl generation
“instruction set” generationoptimal technology mappingcarry chain generation
FPGA Vendor Tools
FPGA Configuration
Bitwidth Analysis (Type Size Inference)Bitwidth Analysis (Type Size Inference)
• for programs written in a high-level language
• minimum bitwidth required for • each variable at every
static location of the program • each static operation
of the program• reduces memory bandwidth
dependency, MBD (slide 15, above)
int a;int b;char c;
a = c;
a = a / 2;
b = a >> 4;
b = a + b;
8 bits
7 bits
3 bits
8 bits
7 bits
8 bits 7 bits
HAGAR: Hardware Graph AcceleratorsHAGAR: Hardware Graph Accelerators
0010100001000110
3
2
1
0
3210
vvvv
vvvv
operations: insert, delete, reachability, …reachability speedup over PC: 10 to 1000 for 1K nodes
Row 3
Row 2
Row 1
Row 0
Col. 0 Col. 1 Col. 2 Col. 3
Flip-flop
.. .
. ..
..
..
.....
....
.. . ..
.
..
.. ...
11
11
1111
11
TristateBuffer
Memory Latency Dependent, e.g. data structures, pointers implement data-structure + algorithm in hardware
from Lorenz Huelsbergen
1000 Vertices
100 Vertices
high inter-chip delay
low inter-chip delay
Speedup of ReachabilitySpeedup of ReachabilitySpeedup over Software(simulation results)
Overview of Custom ComputingOverview of Custom Computing
1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices programming FPGAs - module generation
4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card
2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes
3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven
Run-time Reconfiguration Run-time Reconfiguration
FPGAs are reconfigurable within milliseconds!
1. generating configurations• at compile time• at run time
2. scheduling reconfigurations• at compile time - control driven• at run time – data driven
Generating & Scheduling ReconfigurationsGenerating & Scheduling ReconfigurationsApplication
source or executable
Phase 1
Phase 2
FPGA Configurations
C1 C2
Control driven reconfiguration from phase 1 to phase 2.Data driven reconfiguration from C1 to C2=> insert reconfiguration strategy (algorithm) into application
Generating & Scheduling Reconfigurations Generating & Scheduling Reconfigurations
@compile time
Viterbi decoder:adaptive error
correction
@run time
boolean satisfiability
media processing
sche
dulin
g re
conf
igur
atio
ns control driven
generating configurations
data driven
network intrusion detection
image processing
Rijndael AES encryption
Media Processing ExampleMedia Processing Example
• shape-adaptive template matching (SATM)• MPEG4, MPEG7: find arbitrarily shaped object in video
• template contains object of interest• Sum of Absolute Distances of template and video
• adapt design to template and search frame
SATM: FPGA versus PCSATM: FPGA versus PC • HDTV frames: 1920 by 1080 pixels• for 300 HDTV frames, use PC time as reference TPC : execution time of PC with 1.4GHz Pentium 4 SD : speedup of dynamic design SPD : speedup of partially dynamic design SS : speedup of static design
• dynamic design superior when:• a large number of consecutive frames• templates do not change often • small templates
T_D : time for dynamic design T_S : time for static design T_PD : time for partially dynamic design
SATM: Performance ResultsSATM: Performance Results
dynamicpartially-dynamic
static
Overview of Custom ComputingOverview of Custom Computing
1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices programming FPGAs - module generation
4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card
3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven
2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes - loop transformations - bitwidth analysis
System View of Custom AcceleratorsSystem View of Custom Accelerators
Microprocessor
FPGA
Memory
Interconnect
from the IBM websiteIntel Pentium
“Custom Accelerator”Local Area NetworkPCI BusPC Card/CardbusMemory Bus / DIMMCustom InterconnectOn-chip Interconnect
Xilinx Virtex-II ProXilinx Virtex-II Pro
Source: Xilinx
Case Studies Case Studies
1 PAM project at DIGITAL PRL (Compaq)• various applications on multi-FPGA platform
2 SONIC project at Imperial College and Sony• broadcast-quality video processing PCI card
Case Study 1: The PAM Project Case Study 1: The PAM Project
PProgrammable rogrammable AActive ctive MMemories emories (source: Mark Shand)(source: Mark Shand)
4 Platforms: - Perle-0 (1989)
50k gate reconfigurable VME board - DECPeRLe-1 (1992)
200k gates, 20ms turnaround - TURBOchannel Pamette (1994) - PCI Pamette (1996) PCI card with 40K - 112K gates
People: Jean Vuillemin Patrice Bertin Didier Roncin Herve Touati
Philippe BoucardHarry PrintzMark Shand
Applications: Long Int. Multiplication RSA Cryptography Dynamic Programming Laplace Heat Equation Viterbi Decoder Sound Synthesis Neural Networks
Goals: Maximum performance Rapid turnaround Exploring how to build and program PAM Exploring the application space (Non-goals: high-level synthesis, platform independence, improving FPGAs)
Stereo VisionHough TransformHigh Energy PhysicsImage AquisitionWireless LAN testbed
All applications are implemented by hand at the gate level using PamDC.
DECPeRLe-1DECPeRLe-1
Case Study 2: SONICCase Study 2: SONIC
professional video processing professional video processing at SONY and Imperial Collegeat SONY and Imperial College
PIPE 1 PIPE 2PIPEN-2
PIPEN-1 PIPE N
Host Bus
PIPEFlow BPIPEFlow A
ExternalVideoBuses
LBC(Local BusController)
SONIC architectureSONIC architecture
PIPE = Plug In Processing Element
PIPE 1 PIPE 2PIPEN-2
PIPEN-1 PIPE N
Host Bus
PIPEFlow BPIPEFlow A
ExternalVideoBuses
LBC(Local BusController)PIPE Router
(PR)
PIPE Engine(PE)
PIPEMemory
(PM)
PIPEFlowLeft
PIPEFlowRight
PIPEFlowA
PIPEFlowB
PIPE Bus
PIPEFlowIn
PIPEFlowOut
local busglobal (flow) bus
PIPE ArchitecturePIPE Architecture
PR provides abstractionfor plug-in builder
LBC
PE
PR
PM
PE
PRPM
PE
PR
PM
PE
PR
PM
Host BUS
SONIC Processing an ImageSONIC Processing an Image
Image Transfer In
Processing
Image Transfer Out
Configuration of Plug-In
Image Transfer In
Processing
Image Transfer Out
PE
PR PRPM PM
PE
Configuration of Plug-In
Plug-In
PE PEPMPR PR
PM
SONIC with Multiple Plug-insSONIC with Multiple Plug-ins
PE
PRPM
LBC
PE
PR
PM
PE
PR
PM
PE
PR
PM
PE
PR
PM
PE
PR
PM
Host BUS
Plug-In
PE
PR
PM
Plug-In2
Plug-In3
Plug-In1 1
SONIC versus PentiumSONIC versus Pentium
0
510
15
20
2530
35
40
2-D FIR 3x3 FIR 2-DTransform
ColourCorrector
3x3 FIR +Colour
Corrector
Task
Spe
ed-U
p 1 PIPE2 PIPEs4 PIPEs6 PIPEs
Speedup Compared with an MMX Pentium II 300Mhz
Current Generation: UltraSONICCurrent Generation: UltraSONIC
• PE, PR on one FPGA• 8MB SRAM on PIPE• up to 16 PIPEs• 64-bit, 66MHz PCI• real-time image registration
Pointers to Conferences and JournalsPointers to Conferences and Journals
FPGA ConferencesFPGA Conferences FPGA Conference FCCM FPL FPT ERSAHardware Design ICCAD DACComputer Architecture ISCA, ISC, HPCA, ASPLOS, MICRO
JournalsJournals IEEE Transactions on Computers
IEEE Transactions on VLSI
IEEE Transactions on CAD
Kluwer Journal on VLSI and Signal Processing
ACM Transactions on Architecture and Code Optimization