Track F: OpenCL for Altera FPGAs – Accelerating Performance and Design Productivity / Liad Weinberger
May 1, 2013 1
OpenCL for ALTERA FPGAs
Accelerating performance and design productivity
Liad Weinberger – Appilo
May 1st, 2013
Technology trends • Over the past years
– Technology scaling favors programmability and parallelism
[Figure: compute-architecture spectrum – single cores → coarse-grained CPUs and DSPs → multi-cores → coarse-grained massively parallel processor arrays → fine-grained massively parallel arrays; i.e., CPUs, DSPs, multi-cores, array GPGPUs, FPGAs]
Technology trends
[Chart: process node (nm) over the years 2000–2022]
• Moore’s law still in effect – More FPGA real-estate
• More potential for parallelism – an extremely good thing!
• Designs that utilize this real estate become harder to manage and maintain – this is not so good...
Technology trends
[Chart: Google Trends worldwide interest over the years 2007–2013 for "Verilog + VHDL"]
• Decreased interest – the number of Google searches for VHDL or Verilog is in decline
Technology trends
[Chart: Google Trends interest over the years 2007–2013 for "Verilog + VHDL" vs. "Python"]
• Software development keeps its momentum – number of Google searches for Python (as a representative language)
FPGA (hardware) development
• Design (programming) is complex – Define state machine, data-paths, arbitration, IP interfaces, etc.
– Sophisticated iterative compilation process
• Synthesis, technology mapping, clustering, placement and routing, timing closure
• Leads to long compilation times (hours vs. minutes in software)
– Debug process is also very time-consuming
• Code is not portable – Written in Verilog / VHDL
• Can’t re-target for CPUs, GPUs, DSPs, etc.
• Not scalable
[Diagram: the iterative HDL flow – HDL → set constraints → compilation → timing closure, and around again]
Software development
• Programming is straight-forward – Ideas are expressed in languages such as C/C++/Python/etc.
• Typically, start with a simple sequential implementation
• Use parallel APIs / language extensions to exploit multi-core architectures for additional performance
– Compilation times are usually reasonably short
• Simple straight-forward compilation/linking process
– Immediate feedback when debugging/profiling
• An assortment of tools available for both debugging and profiling
• Portability is still an issue – Possible, but requires pre-planning
[Diagram: C/C++/Python/etc. sources feeding a compiler & linker]
Product development point-of-view
• Product producers want: – Lower development and maintenance costs
– Competitive edge
• Higher performance
• Short time-to-market, and long time-in-market
– Agile development methods are becoming more and more popular
– Can’t afford long development cycles
– Trained developers with established experience
• Or cost-effective path for training new developers
– Flexibility
• No vendor lock-in is preferred
• Ability to rapidly adapt product to market requirement changes
Our challenge
• How do we bring the FPGA design process closer to the software development model? – Need to make FPGAs more accessible to the software development community
• Change in mind-set: look at FPGAs as massively multi-core devices that can be used to accelerate parallel applications
• A programming model that allows that
• Shorter compilation times and faster feedback for debugging and profiling the design
An ideal programming environment...
• Based on a standard programming model – Rather than something which is FPGA-specific
• Abstracts away the underlying details of the hardware – VHDL / Verilog are similar to “assembly language” programming
– Useful in rare circumstances where the highest possible efficiency is needed
• The price of abstraction is not too high – Still need to efficiently use the FPGA’s resources to achieve high throughput / low area
• Allows for software-like compilation & debug cycles – Faster compile times
– Profiling & user feedback
Introducing OpenCL
Parallel heterogeneous computing
A case for OpenCL • What is OpenCL?
– An open, royalty-free standard for cross-platform parallel software programming of heterogeneous systems
• CPU + DSPs
• CPU + GPUs
• CPU + FPGAs
• Or maybe all of them together
– Maintained by KHRONOS group
• An industry consortium creating open, royalty-free standards
• Comprised of hardware and software vendors
– Enables software to leverage silicon acceleration
• Consists of two major parts: – Application Programming Interface (API) for device management
– Device programming language based on C99 with some restrictions and extensions to support explicit parallelism
Benefits of OpenCL
• Cross-vendor software portability – Functional portability: the same code will normally execute on different hardware, from different vendors
– Not performance portability: code still needs to be optimized for a specific device (or at least a device class)
• Allows for the management of available computational resources under a single framework – Views CPUs, GPUs, FPGAs, and other accelerators as devices that can carry the computational needs of the application
OpenCL program structure
• Separation between managerial and computational code bases – Managerial code executes on a host CPU
• Any type of conventional micro-processor
• Written in any language that has bindings for the OpenCL API
– The API is in ANSI-C
– There is a formal C++ binding
– Other bindings may exist
– Computational code executes on the compute devices (accelerators)
• Written in a language called OpenCL C
– Based on C99
– Adds restrictions and extensions for explicit parallelism
• Can be compiled either offline, or online, depending on implementation
• Will most likely consist only of those portions of the application we want to accelerate
OpenCL program structure
[Diagram: a host connected to a compute device; the device contains accelerator compute units, each with its own local memory, all sharing a global memory]

Kernel program (runs on the compute device):

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }

Host program (runs on the host CPU):

    main() {
        read_data( ... );
        manipulate( ... );
        clEnqueueWriteBuffer( ... );
        clEnqueueNDRangeKernel( ..., sum, ... );
        clEnqueueReadBuffer( ... );
        display_result( ... );
    }
OpenCL host application
• Communicates with the accelerator device via a set of library routines – Abstracts away host-processor-to-accelerator communication behind a set of API calls
    main() {
        read_data( ... );
        manipulate( ... );
        clEnqueueWriteBuffer( ... );             // copy data: host → FPGA
        clEnqueueNDRangeKernel( ..., sum, ... ); // ask the FPGA to run a particular kernel
        clEnqueueReadBuffer( ... );              // copy data: FPGA → host
        display_result( ... );
    }
OpenCL kernels • Data-parallel function
– Executed by many parallel threads
• Each thread has an identifier which could be obtained with a call to the get_global_id() built-in function
• Uses qualifiers to define where memory buffers reside
• Executed by a compute device – CPU
– GPU
– FPGA
– Other accelerator
    float *a = {0, 1, 2, 3, 4, 5, 6, 7};
    float *b = {7, 6, 5, 4, 3, 2, 1, 0};
    float *y = {7, 7, 7, 7, 7, 7, 7, 7};   // result: a[gid] + b[gid]

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }
__kernel void sum( … );
OpenCL on FPGAs
How does it map?
Compiling OpenCL to FPGAs
[Diagram: the OpenCL source is split in two – the kernel programs go through the ACL compiler to produce an FPGA programming file (SOF), while the host program goes through a standard C compiler to produce an x86 binary; host and FPGA communicate over PCIe]

Kernel programs:

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }

Host program:

    main() {
        read_data( ... );
        manipulate( ... );
        clEnqueueWriteBuffer( ... );
        clEnqueueNDRangeKernel( ..., sum, ... );
        clEnqueueReadBuffer( ... );
        display_result( ... );
    }
Compiling OpenCL to FPGAs
[Diagram: the kernels compile into custom hardware – replicated pipelines of Load/Load/Store units connected to PCIe and DDRx memory interfaces]

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }

Custom hardware is generated for your kernels.
FPGA architecture for OpenCL
[Diagram: FPGA architecture for OpenCL – the kernel system holds multiple kernel pipelines, each with local memories on a local memory interconnect; a global memory interconnect connects the pipelines to external memory controllers & PHYs (DDR*) and to PCIe, which links to an x86 / external host processor]
Mapping multithreaded kernels to FPGAs
• Simplest way of mapping kernel functions to FPGAs is to replicate hardware for each thread – Inefficient and wasteful
• Technique: deep pipeline parallelism – Attempt to create a deeply pipelined representation of a kernel
– On each clock cycle, we attempt to send in input data for a new thread
– A method of mapping coarse-grained thread parallelism to fine-grained FPGA parallelism
May 1, 2013 23
Example pipeline for vector add
• On each cycle, the portions of the pipeline are processing different threads
• While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
[Figure: vector-add pipeline – two Load units feed an adder (+), which feeds a Store unit; thread IDs 0–7 of the 8-thread vector-add example are queued at the input]
[The following slides repeat this figure, advancing the threads one stage per clock cycle: thread 0 moves from Load to + to Store while threads 1–7 follow one cycle behind]
Some examples
Using Altera’s OpenCL solution
AES encryption
• Counter (CTR) based encryption/decryption – 256-bit key
• Advantage FPGA – Integer arithmetic
– Coarse grain bit operations
– Complex decision making
• Results

    Platform                         Throughput (GB/s)
    E5503 Xeon Processor (1 core)    0.01
    AMD Radeon HD 7970               0.33
    PCIe385 A7 Accelerator           5.20

– Only 42% FPGA utilization (2 kernels): conserves power; fill up the device for even higher performance
Multi-asset barrier option pricing
• Monte-Carlo simulation – Heston model
– NDRange: assets × paths (64 × 1,000,000)
• Advantage FPGA – Complex control flow
The Heston model dynamics:

    dS_t = μ·S_t·dt + √v_t·S_t·dW_t^S
    dv_t = κ(θ − v_t)·dt + ξ·√v_t·dW_t^v

• Results

    Platform                 Power (W)   Performance (Msims/s)   Msims/W
    W3690 Xeon Processor     130         32                      0.25
    nVidia Tesla C2075       225         63                      0.28
    PCIe385 D5 Accelerator   23          170                     7.40
Document filtering
• Unstructured data analytics – Bloom Filter
• Advantage FPGA – Integer arithmetic
– Flexible memory configuration
• Results
    Platform                         Power (W)   Performance (MTs)   MTs/W
    W3690 Xeon Processor             130         2070                15.92
    nVidia Tesla C2075               215         3240                15.07
    DE4 Stratix IV-530 Accelerator   21          1755                83.57
    PCIe385 A7 Accelerator           25          3602                144.08
Fractal video compression
• Find the best-matching codebook entry – Correlation via SAD (sum of absolute differences)
• Advantage FPGA – Integer arithmetic
• Results
    Platform                         Power (W)   Performance (FPS)   FPS/W
    W3690 Xeon Processor             130         4.6                 0.035
    nVidia Tesla C2075               215         53.1                0.247
    DE4 Stratix IV-530 Accelerator   21          70.9                3.376
    PCIe385 A7 Accelerator           25          74.4                2.976