Track F: OpenCL for Altera FPGAs – Accelerating Performance and Design Productivity / Liad Weinberger
May 1, 2013 1
OpenCL for ALTERA FPGAs
Accelerating performance and design productivity
Liad Weinberger – Appilo
May 1st, 2013
Technology trends • Over the past years
– Technology scaling favors programmability and parallelism
[Figure: compute-architecture spectrum – single cores → coarse-grained CPUs and DSPs → multi-cores → coarse-grained massively parallel processor arrays → fine-grained massively parallel arrays; i.e., CPUs, DSPs, multi-cores, array GPGPUs, FPGAs]
Technology trends
[Chart: process node (nm) over the years 2000–2022]
• Moore’s law still in effect – More FPGA real-estate
• More potential for parallelism – an extremely good thing!
• Designs that utilize this real estate become harder to manage and maintain – this is not so good...
Technology trends
[Chart: Google Trends worldwide interest over the years 2007–2013 for "Verilog + VHDL"]
• Decreased interest – the number of Google searches for VHDL or Verilog is in decline
Technology trends
[Chart: Google Trends interest over the years 2007–2013 for "Verilog + VHDL" vs. "Python"]
• Software development keeps its momentum – number of Google searches for Python (as a representative language)
FPGA (hardware) development
• Design (programming) is complex – Define state machine, data-paths, arbitration, IP interfaces, etc.
– Sophisticated iterative compilation process
• Synthesis, technology mapping, clustering, placement and routing, timing closure
• Leads to long compilation times (hours vs. minutes in software)
– Debug process is also very time-consuming
• Code is not portable – Written in Verilog / VHDL
• Can’t re-target for CPUs, GPUs, DSPs, etc.
• Not scalable
[Diagram: the iterative HDL flow – HDL → set constraints → compilation → timing closure, and around again]
Software development
• Programming is straight-forward – Ideas are expressed in languages such as C/C++/Python/etc.
• Typically, start with a simple sequential implementation
• Use parallel APIs / language extensions to exploit multi-core architectures for additional performance
– Compilation times are usually reasonably short
• Simple straight-forward compilation/linking process
– Immediate feedback when debugging/profiling
• An assortment of tools available for both debugging and profiling
• Portability is still an issue – Possible, but requires pre-planning
[Diagram: C/C++/Python/etc. sources feeding a compiler & linker]
Product development point-of-view
• Product producers want: – Lower development and maintenance costs
– Competitive edge
• Higher performance
• Short time-to-market, and long time-in-market
– Agile development methods are becoming more and more popular
– Can’t afford long development cycles
– Trained developers with established experience
• Or cost-effective path for training new developers
– Flexibility
• No vendor lock-in is preferred
• Ability to rapidly adapt product to market requirement changes
Our challenge
• How do we bring the FPGA design process closer to the software development model? – Need to make FPGAs more accessible to the software development community
• Change in mind-set: look at FPGAs as massively multi-core devices that can be used to accelerate parallel applications
• A programming model that allows that
• Shorter compilation times and faster feedback for debugging and profiling the design
An ideal programming environment...
• Based on a standard programming model – Rather than something which is FPGA-specific
• Abstracts away the underlying details of the hardware – VHDL / Verilog are similar to “assembly language” programming
– Useful in rare circumstances where the highest possible efficiency is needed
• The price of abstraction is not too high – Still need to efficiently use the FPGA’s resources to achieve high throughput / low area
• Allows for software-like compilation & debug cycles – Faster compile times
– Profiling & user feedback
Introducing OpenCL
Parallel heterogeneous computing
A case for OpenCL • What is OpenCL?
– An open, royalty-free standard for cross-platform parallel software programming of heterogeneous systems
• CPU + DSPs
• CPU + GPUs
• CPU + FPGAs
• Or maybe all of them together
– Maintained by KHRONOS group
• An industry consortium creating open, royalty-free standards
• Comprised of hardware and software vendors
– Enables software to leverage silicon acceleration
• Consists of two major parts: – Application Programming Interface (API) for device management
– Device programming language based on C99 with some restrictions and extensions to support explicit parallelism
Benefits of OpenCL
• Cross-vendor software portability – Functional portability: the same code will normally execute on different hardware, from different vendors
– Not performance portability: code still needs to be optimized for a specific device (or at least a device class)
• Allows for the management of available computational resources under a single framework – Views CPUs, GPUs, FPGAs, and other accelerators as devices that can carry the computational needs of the application
OpenCL program structure
• Separation between managerial and computational code bases – Managerial code executes on a host CPU
• Any type of conventional micro-processor
• Written in any language that has bindings for the OpenCL API
– The API is in ANSI-C
– There is a formal C++ binding
– Other bindings may exist
– Computational code executes on the compute devices (accelerators)
• Written in a language called OpenCL C
– Based on C99
– Adds restrictions and extensions for explicit parallelism
• Can be compiled either offline, or online, depending on implementation
• Will most likely consist only of those portions of the application we want to accelerate
OpenCL program structure
[Diagram: a host connected to a compute device; the device contains accelerator compute units, each with its own local memory, all sharing a global memory]

Kernel program (runs on the compute device):

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }

Host program (runs on the host CPU):

    main() {
        read_data( ... );
        manipulate( ... );
        clEnqueueWriteBuffer( ... );
        clEnqueueNDRangeKernel( ..., sum, ... );
        clEnqueueReadBuffer( ... );
        display_result( ... );
    }
OpenCL host application
• Communicates with the accelerator device via a set of library routines – Abstracts away host-processor-to-accelerator communication behind a set of API calls
    main() {
        read_data( ... );
        manipulate( ... );
        clEnqueueWriteBuffer( ... );             // copy data: host → FPGA
        clEnqueueNDRangeKernel( ..., sum, ... ); // ask the FPGA to run a particular kernel
        clEnqueueReadBuffer( ... );              // copy data: FPGA → host
        display_result( ... );
    }
OpenCL kernels • Data-parallel function
– Executed by many parallel threads
• Each thread has an identifier which could be obtained with a call to the get_global_id() built-in function
• Uses qualifiers to define where memory buffers reside
• Executed by a compute device – CPU
– GPU
– FPGA
– Other accelerator
    float *a = {0, 1, 2, 3, 4, 5, 6, 7};
    float *b = {7, 6, 5, 4, 3, 2, 1, 0};
    float *y = {7, 7, 7, 7, 7, 7, 7, 7};   // result: a[gid] + b[gid]

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }
__kernel void sum( … );
OpenCL on FPGAs
How does it map?
Compiling OpenCL to FPGAs
[Diagram: the OpenCL source is split in two – the kernel programs go through the ACL compiler to produce an FPGA programming file (SOF), while the host program goes through a standard C compiler to produce an x86 binary; host and FPGA communicate over PCIe]

Kernel programs:

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }

Host program:

    main() {
        read_data( ... );
        manipulate( ... );
        clEnqueueWriteBuffer( ... );
        clEnqueueNDRangeKernel( ..., sum, ... );
        clEnqueueReadBuffer( ... );
        display_result( ... );
    }
Compiling OpenCL to FPGAs
[Diagram: the kernels compile into custom hardware – replicated pipelines of Load/Load/Store units connected to PCIe and DDRx memory interfaces]

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }

Custom hardware is generated for your kernels.
FPGA architecture for OpenCL
[Diagram: FPGA architecture for OpenCL – the kernel system holds multiple kernel pipelines, each with local memories on a local memory interconnect; a global memory interconnect connects the pipelines to external memory controllers & PHYs (DDR*) and to PCIe, which links to an x86 / external host processor]
Mapping multithreaded kernels to FPGAs
• Simplest way of mapping kernel functions to FPGAs is to replicate hardware for each thread – Inefficient and wasteful
• Technique: deep pipeline parallelism – Attempt to create a deeply pipelined representation of a kernel
– On each clock cycle, we attempt to send in input data for a new thread
– A method of mapping coarse-grained thread parallelism to fine-grained FPGA parallelism
May 1, 2013 23
Example pipeline for vector add
• On each cycle, the portions of the pipeline are processing different threads
• While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
[Figure: vector-add pipeline – two Load units feed an adder (+), which feeds a Store unit; thread IDs 0–7 of the 8-thread vector-add example are queued at the input]
[The following slides repeat this figure, advancing the threads one stage per clock cycle: thread 0 moves from Load to + to Store while threads 1–7 follow one cycle behind]
Some examples
Using Altera’s OpenCL solution
AES encryption
• Counter (CTR) based encryption/decryption – 256-bit key
• Advantage FPGA – Integer arithmetic
– Coarse grain bit operations
– Complex decision making
• Results

    Platform                         Throughput (GB/s)
    E5503 Xeon Processor (1 core)    0.01
    AMD Radeon HD 7970               0.33
    PCIe385 A7 Accelerator           5.20

– Only 42% FPGA utilization (2 kernels): conserves power; fill up the device for even higher performance
Multi-asset barrier option pricing
• Monte-Carlo simulation – Heston model
– NDRange: assets × paths (64 × 1,000,000)
• Advantage FPGA – Complex control flow
The Heston model dynamics:

    dS_t = μ·S_t·dt + √v_t·S_t·dW_t^S
    dv_t = κ(θ − v_t)·dt + ξ·√v_t·dW_t^v

• Results

    Platform                 Power (W)   Performance (Msims/s)   Msims/W
    W3690 Xeon Processor     130         32                      0.25
    nVidia Tesla C2075       225         63                      0.28
    PCIe385 D5 Accelerator   23          170                     7.40
Document filtering
• Unstructured data analytics – Bloom Filter
• Advantage FPGA – Integer arithmetic
– Flexible memory configuration
• Results
    Platform                         Power (W)   Performance (MTs)   MTs/W
    W3690 Xeon Processor             130         2070                15.92
    nVidia Tesla C2075               215         3240                15.07
    DE4 Stratix IV-530 Accelerator   21          1755                83.57
    PCIe385 A7 Accelerator           25          3602                144.08
Fractal video compression
• Find the best-matching codebook entry – Correlation via SAD (sum of absolute differences)
• Advantage FPGA – Integer arithmetic
• Results
    Platform                         Power (W)   Performance (FPS)   FPS/W
    W3690 Xeon Processor             130         4.6                 0.035
    nVidia Tesla C2075               215         53.1                0.247
    DE4 Stratix IV-530 Accelerator   21          70.9                3.376
    PCIe385 A7 Accelerator           25          74.4                2.976