TRACK F: OpenCL for ALTERA FPGAs, Accelerating performance and design productivity / Liad Weinberger


Upload: chiportal

Post on 15-Dec-2014


TRANSCRIPT

Page 1: TRACK F: OpenCL for ALTERA FPGAs, Accelerating performance and design productivity/ Liad Weinberger

May 1, 2013 1

OpenCL for ALTERA FPGAs

Accelerating performance and design productivity

Liad Weinberger – Appilo

May 1st, 2013

Page 2:

Technology trends

• Over the past years

– Technology scaling favors programmability and parallelism

[Figure: spectrum of architectures. Categories: Single Cores; Multi-Cores (Coarse-Grained CPUs and DSPs); Coarse-Grained Massively Parallel Processor Arrays; Fine-Grained Massively Parallel Arrays. Device labels: CPUs, DSPs, Multi-Cores, Array, GPGPUs, FPGAs]

Page 3:

Technology trends

[Figure: process node (nm), axis 0 to 140, plotted against year, 2000 to 2022]

• Moore’s law still in effect – More FPGA real-estate

• More potential for parallelism – an extremely good thing!

• Designs that utilize this real-estate become harder to manage and maintain – this is not so good...

Page 4:

Technology trends

[Figure: Google Trends, worldwide interest over the years 2007 to 2013, for Verilog + VHDL]

• Decreased interest

– Number of Google searches for VHDL or Verilog in decline

Page 5:

Technology trends

[Figure: Google Trends, interest over the years 2007 to 2013, for Verilog + VHDL and for Python]

• Software development keeps momentum

– Number of Google searches for Python (as a representative language)

Page 6:

FPGA (hardware) development

• Design (programming) is complex – Define state machine, data-paths, arbitration, IP interfaces, etc.

– Sophisticated iterative compilation process

• Synthesis, technology mapping, clustering, placement and routing, timing closure

• Leads to long compilation times (hours vs. minutes in software)

– Debug process is also very time-consuming

• Code is not portable – Written in Verilog / VHDL

• Can’t re-target for CPUs, GPUs, DSPs, etc.

• Not scalable

[Figure: iterative HDL design loop. Labels: HDL, Set Constraints, Compilation, Timing Closure]

Page 7:

Software development

• Programming is straight-forward – Ideas are expressed in languages such as C/C++/Python/etc.

• Typically, start with simple sequential implementation

• Use parallel APIs / language extensions, in order to exploit multi-core architectures for additional performance

– Compilation times are usually reasonably short

• Simple straight-forward compilation/linking process

– Immediate feedback when debugging/profiling

• An assortment of tools available for both debugging and profiling

• Portability is still an issue – Possible, but requires pre-planning

[Figure: software build flow. Labels: C/C++/Python/etc. (three times), Compiler & Linker]

Page 8:

Product development point-of-view

• Product producers want: – Lower development and maintenance costs

– Competitive edge

• Higher performance

• Short time-in-market, and short time-to-market

– Agile development methods are becoming more and more popular

– Can’t afford long development cycles

– Trained developers with established experience

• Or cost-effective path for training new developers

– Flexibility

• No vendor lock-in is preferred

• Ability to rapidly adapt product to market requirement changes

Page 9:

Our challenge

• How do we bring the FPGA design process closer to the software development model?

– Need to make FPGAs more accessible to the software development community

• Change in mind-set: look at FPGAs as massively multi-core devices that could be used in order to accelerate parallel applications

• A programming model that allows that

• Shorter compilation times and faster feedback for debugging and profiling the design

Page 10:

An ideal programming environment...

• Based on a standard programming model – Rather than something which is FPGA-specific

• Abstracts away the underlying details of the hardware – VHDL / Verilog are similar to “assembly language” programming

– Useful in rare circumstances where the highest possible efficiency is needed

• The price of abstraction is not too high

– Still need to efficiently use the FPGA’s resources to achieve high throughput / low area

• Allows for software-like compilation & debug cycles – Faster compile times

– Profiling & user feedback

Page 11:

Introducing OpenCL

Parallel heterogeneous computing

Page 12:

A case for OpenCL

• What is OpenCL?

– An open, royalty-free standard for cross-platform parallel software programming of heterogeneous systems

• CPU + DSPs

• CPU + GPUs

• CPU + FPGAs

• Or maybe all together

– Maintained by the Khronos Group

• An industry consortium creating open, royalty-free standards

• Comprised of hardware and software vendors

– Enables software to leverage silicon acceleration

• Consists of two major parts:

– Application Programming Interface (API) for device management

– Device programming language based on C99 with some restrictions and extensions to support explicit parallelism

Page 13:

Benefits of OpenCL

• Cross-vendor software portability

– Functional portability: the same code would normally execute on different hardware from different vendors

– Not performance portable: code still needs to be optimized for a specific device (or at least a device class)

• Allows for the management of available computational resources under a single framework

– Views CPUs, GPUs, FPGAs, and other accelerators as devices that could carry the computational needs of the application

Page 14:

OpenCL program structure

• Separation between managerial and computational code bases – Managerial code executes on a host CPU

• Any type of conventional micro-processor

• Written in any language that has bindings for the OpenCL API

– The API is in ANSI-C

– There is a formal C++ binding

– Other bindings may exist

– Computational code executes on the compute devices (accelerators)

• Written in a language called OpenCL C

– Based on C99

– Adds restrictions and extensions for explicit parallelism

• Can be compiled either offline, or online, depending on implementation

• Will most likely consist only of those portions of the application we want to accelerate

Page 15:

OpenCL program structure

[Figure: OpenCL platform model. Labels: Host; Compute Device; Global Mem; Local Mem (four times); Accelerator (three times); Compute unit]

Kernel Program

__kernel void sum(__global float *a, __global float *b, __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}

Host Program

main()
{
    read_data( … );
    manipulate( … );
    clEnqueueWriteBuffer( … );
    clEnqueueNDRangeKernel(…,sum,…);
    clEnqueueReadBuffer( … );
    display_result( … );
}

Page 16:

OpenCL host application

• Communicates with the Accelerator Device via a set of library routines

– Abstracts away host processor to HW accelerator communication via a set of API calls

main()
{
    read_data( … );
    manipulate( … );
    clEnqueueWriteBuffer( … );          // Copy data: Host → FPGA
    clEnqueueNDRangeKernel(…,sum,…);    // Ask the FPGA to run a particular kernel
    clEnqueueReadBuffer( … );           // Copy data: FPGA → Host
    display_result( … );
}
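The three host-side steps on this slide (copy in, run the kernel, copy out) can be mimicked without any OpenCL runtime. What follows is a host-only C sketch, not real OpenCL. The names `mock_device`, `mock_write`, `mock_run_sum`, `mock_read`, and `mock_sum_on_device` are hypothetical, and `MOCK_N` is an assumed buffer size. They only stand in for device memory and for the roles played by clEnqueueWriteBuffer, clEnqueueNDRangeKernel, and clEnqueueReadBuffer.

```c
#include <stddef.h>
#include <string.h>

#define MOCK_N 8  /* capacity of each mock device buffer; assumption for this sketch */

/* Hypothetical stand-in for the accelerator's global memory. */
typedef struct {
    float a[MOCK_N];
    float b[MOCK_N];
    float y[MOCK_N];
} mock_device;

/* Step 1: copy data host -> device (the role of clEnqueueWriteBuffer). */
void mock_write(float *dev_buf, const float *host_buf, size_t n) {
    memcpy(dev_buf, host_buf, n * sizeof(float));
}

/* Step 2: run the "sum" kernel once per work-item
   (the role of clEnqueueNDRangeKernel). */
void mock_run_sum(mock_device *dev, size_t n) {
    for (size_t gid = 0; gid < n; ++gid)
        dev->y[gid] = dev->a[gid] + dev->b[gid];
}

/* Step 3: copy data device -> host (the role of clEnqueueReadBuffer). */
void mock_read(float *host_buf, const float *dev_buf, size_t n) {
    memcpy(host_buf, dev_buf, n * sizeof(float));
}

/* Whole flow. The host only ever touches device memory through the
   copy steps. The caller must pass n <= MOCK_N. */
void mock_sum_on_device(const float *a, const float *b, float *y, size_t n) {
    mock_device dev;
    mock_write(dev.a, a, n);
    mock_write(dev.b, b, n);
    mock_run_sum(&dev, n);
    mock_read(y, dev.y, n);
}
```

Calling `mock_sum_on_device(a, b, y, 8)` with host arrays fills `y` with the element-wise sums, and the kernel body never sees a host pointer, which is the separation the API calls enforce.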

Page 17:

OpenCL kernels

• Data-parallel function

– Executed by many parallel threads

• Each thread has an identifier which could be obtained with a call to the get_global_id() built-in function

• Uses qualifiers to define where memory buffers reside

• Executed by a compute device – CPU

– GPU

– FPGA

– Other accelerator

float *a = 0 1 2 3 4 5 6 7

float *b = 7 6 5 4 3 2 1 0

float *y = 7 7 7 7 7 7 7 7

__kernel void sum(__global float *a, __global float *b, __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}

__kernel void sum( … );
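To make the "one thread per element" idea concrete, here is a plain C sketch that runs the kernel body once per global ID. It is only an emulation: in OpenCL C the index comes from get_global_id(0), whereas here it is passed in as a parameter, and `sum_work_item`, `run_1d_range`, and `run_1d_range_reversed` are hypothetical helper names. The reversed variant is there to show that the result does not depend on the order in which threads run.

```c
#include <stddef.h>

/* Kernel body as a plain C function. The gid argument plays the role
   of get_global_id(0) (assumption of this sketch). */
void sum_work_item(size_t gid, const float *a, const float *b, float *y) {
    y[gid] = a[gid] + b[gid];
}

/* Sequential stand-in for a 1-D launch: one call per work-item,
   in ascending ID order. */
void run_1d_range(size_t global_size, const float *a, const float *b, float *y) {
    for (size_t gid = 0; gid < global_size; ++gid)
        sum_work_item(gid, a, b, y);
}

/* Same launch in descending ID order. Each work-item writes only
   y[gid], so the outcome is identical. */
void run_1d_range_reversed(size_t global_size, const float *a, const float *b, float *y) {
    for (size_t gid = global_size; gid-- > 0; )
        sum_work_item(gid, a, b, y);
}
```

With the slide's data (a = 0..7, b = 7..0) both drivers leave every element of `y` equal to 7. That independence between work-items is what lets a device run them in parallel or stream them through a pipeline.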

Page 18:

OpenCL on FPGAs

How does it map?

Page 19:

Compiling OpenCL to FPGAs

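[Figure: compilation flow. The OpenCL Host Program + Kernels is split in two: the Kernel Programs go through the ACL Compiler to produce an SOF for the FPGA, and the Host Program goes through a Standard C Compiler to produce an x86 binary. The x86 host and the FPGA are connected over PCIe.]

Kernel Programs

__kernel void sum(__global float *a, __global float *b, __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}

Host Program

main()
{
    read_data( … );
    manipulate( … );
    clEnqueueWriteBuffer( … );
    clEnqueueNDRangeKernel(…,sum,…);
    clEnqueueReadBuffer( … );
    display_result( … );
}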

Page 20:

Compiling OpenCL to FPGAs

[Figure: the Kernel Programs become Custom Hardware for Your Kernels: six copies of a Load, Load, Store pipeline on the FPGA, attached to PCIe and DDRx]

Kernel Programs

__kernel void sum(__global float *a, __global float *b, __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}

Page 21:

FPGA architecture for OpenCL

[Figure: FPGA block diagram. Inside the FPGA, a Kernel System holds three Kernel Pipelines and six on-chip Memory blocks, joined by a Local Memory Interconnect and a Global Memory Interconnect. A PCIe block links to an x86 / External Processor, and two External Memory Controller & PHY blocks link to DDR*.]

Page 22:

Mapping multithreaded kernels to FPGAs

• Simplest way of mapping kernel functions to FPGAs is to replicate hardware for each thread – Inefficient and wasteful

• Technique: deep pipeline parallelism – Attempt to create a deeply pipelined representation of a kernel

– On each clock cycle, we attempt to send in input data for a new thread

– A method of mapping coarse-grained thread parallelism to fine-grained FPGA parallelism
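The benefit of deep pipelining can be put in numbers with a small idealized model. This sketch assumes a pipeline of `depth` stages, one new thread accepted per clock cycle, and no stalls. Those assumptions and the function names `cycles_unpipelined` and `cycles_pipelined` are illustrative and do not come from the talk.

```c
/* Cycles needed when each thread occupies the whole datapath for
   `depth` cycles before the next thread may start (no pipelining). */
unsigned long cycles_unpipelined(unsigned long n_threads, unsigned long depth) {
    return n_threads * depth;
}

/* Cycles needed with an ideal pipeline: `depth` cycles until the first
   thread completes, then one more thread completes on every cycle. */
unsigned long cycles_pipelined(unsigned long n_threads, unsigned long depth) {
    if (n_threads == 0)
        return 0;
    return depth + n_threads - 1;
}
```

For 8 threads on a 3-stage pipeline (load, add, store) this model gives 24 cycles without pipelining and 10 cycles with it. As the thread count grows, the pipelined cost approaches one cycle per thread using a single copy of the hardware.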

Page 23:

Example pipeline for vector add

• On each cycle, the portions of the pipeline are processing different threads

• While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

[Figure: pipeline with two Load units feeding an adder (+), which feeds a Store unit. 8 threads for the vector add example; thread IDs 0 to 7 are waiting to enter.]

Page 24:

Example pipeline for vector add

[Figure: Load, Load, +, Store pipeline, next animation step. Thread 0 is in the load stage; thread IDs 1 to 7 are waiting.]

Page 25:

Example pipeline for vector add

[Figure: Load, Load, +, Store pipeline, next animation step. Thread 1 is in the load stage and thread 0 is in the add stage; thread IDs 2 to 7 are waiting.]

Page 26:

Example pipeline for vector add

[Figure: Load, Load, +, Store pipeline, next animation step. Thread 2 is in the load stage, thread 1 is in the add stage, and thread 0 is in the store stage; thread IDs 3 to 7 are waiting.]

Page 27:

Example pipeline for vector add

[Figure: Load, Load, +, Store pipeline, next animation step. Thread 3 is in the load stage, thread 2 is in the add stage, and thread 1 is in the store stage; thread 0 has completed and thread IDs 4 to 7 are waiting.]
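The animation across the preceding slides follows a simple rule, which the sketch below encodes. It assumes the three stages named in the bullets (load, add, store), one new thread entering per cycle, and no stalls. The function `thread_in_stage` and the `EMPTY` marker are hypothetical names used only for illustration.

```c
#define STAGES 3      /* stage 0 = load, stage 1 = add, stage 2 = store */
#define EMPTY  (-1)   /* returned when a stage holds no thread */

/* Thread ID occupying `stage` at clock `cycle`. Thread t enters the
   load stage at cycle t and advances one stage per cycle, so the
   thread found in a stage is simply cycle - stage. */
int thread_in_stage(int cycle, int stage, int n_threads) {
    int tid = cycle - stage;
    return (tid >= 0 && tid < n_threads) ? tid : EMPTY;
}
```

At cycle 2 with 8 threads this returns 2 for the load stage, 1 for the add stage, and 0 for the store stage, which matches the bullet "while thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored".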

Page 28:

Some examples

Using ALTERA’s OpenCL solution

Page 29:

AES encryption

• Counter (CTR) based encryption/decryption – 256-bit key

• Advantage FPGA – Integer arithmetic

– Coarse grain bit operations

– Complex decision making

• Results

Platform                  Throughput (GB/s)
E5503 Xeon Processor      0.01 (single core)
AMD Radeon HD 7970        0.33
PCIe385 A7 Accelerator    5.20

42% utilization (2 kernels)

• Power conservation

• Fill up for even higher performance
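Counter mode suits a data-parallel kernel because each block is combined with a keystream value that depends only on the key and that block's counter. The sketch below shows just that structure. It does NOT implement AES: `toy_block` is a placeholder 64-bit mixing function with no security value, real AES works on 128-bit blocks, and all names here (`toy_block`, `ctr_block`, `ctr_xcrypt`) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* NOT AES. A toy 64-bit mixing function standing in for the block
   cipher E_K(counter). For illustration of the CTR structure only. */
uint64_t toy_block(uint64_t key, uint64_t counter) {
    uint64_t x = counter ^ key;
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

/* Process block i of a CTR stream. It reads and writes only element i,
   so every block could be handled by its own work-item. */
void ctr_block(size_t i, uint64_t key, uint64_t nonce,
               const uint64_t *in, uint64_t *out) {
    out[i] = in[i] ^ toy_block(key, nonce + i);
}

/* Sequential driver. XOR with the same keystream undoes itself, so the
   same routine serves for both encryption and decryption. */
void ctr_xcrypt(uint64_t key, uint64_t nonce,
                const uint64_t *in, uint64_t *out, size_t n_blocks) {
    for (size_t i = 0; i < n_blocks; ++i)
        ctr_block(i, key, nonce, in, out);
}
```

Running `ctr_xcrypt` twice with the same key and nonce returns the original data. Since `ctr_block` has no dependency between blocks, it maps naturally onto one thread per block.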

Page 30:

Multi-asset barrier option pricing

• Monte Carlo simulation – Heston model

– ND range

• Assets × paths (64 × 1,000,000)

• Advantage FPGA – Complex control flow

• Results

$$dS_t = \mu S_t\,dt + \sqrt{\nu_t}\,S_t\,dW_t^{S}$$

$$d\nu_t = \kappa(\theta - \nu_t)\,dt + \xi\sqrt{\nu_t}\,dW_t^{\nu}$$

Platform                  Power (W)   Performance (Msims/s)   Msims/W
W3690 Xeon Processor      130         32                      0.25
nVidia Tesla C2075        225         63                      0.28
PCIe385 D5 Accelerator    23          170                     7.40
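The efficiency column is performance divided by power. As a quick arithmetic check on the table, this small C sketch recomputes it; the helper names `per_watt` and `efficiency_gain` are hypothetical.

```c
/* Efficiency: performance per watt, e.g. Msims/s divided by W. */
double per_watt(double performance, double power_watts) {
    return performance / power_watts;
}

/* How many times more efficient platform A is than platform B. */
double efficiency_gain(double perf_a, double power_a,
                       double perf_b, double power_b) {
    return per_watt(perf_a, power_a) / per_watt(perf_b, power_b);
}
```

Using the figures above, `per_watt(170, 23)` is about 7.39 Msims/W for the FPGA card and `per_watt(63, 225)` is 0.28 for the GPU, so `efficiency_gain(170, 23, 63, 225)` comes out at roughly 26.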

Page 31:

Document filtering

• Unstructured data analytics – Bloom Filter

• Advantage FPGA – Integer arithmetic

– Flexible memory configuration

• Results

Platform                          Power (W)   Performance (MTs)   MTs/W
W3690 Xeon Processor              130         2070                15.92
nVidia Tesla C2075                215         3240                15.07
DE4 Stratix IV-530 Accelerator    21          1755                83.57
PCIe385 A7 Accelerator            25          3602                144.08
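A Bloom filter needs only integer hashing and bit-level memory access, which is why the slide lists integer arithmetic and flexible memory configuration as FPGA advantages. Below is a minimal C sketch of the data structure itself, not the implementation benchmarked in the talk. The filter size, the number of probes, the FNV-1a-style hash, and all identifiers are assumptions chosen for brevity.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   1024u  /* filter size in bits; arbitrary for this sketch */
#define BLOOM_HASHES 3u     /* hash probes per item; arbitrary for this sketch */

typedef struct {
    uint8_t bits[BLOOM_BITS / 8];
} bloom_filter;

/* FNV-1a-style string hash with a seed mixed into the starting value,
   reduced to a bit position in the filter. */
uint32_t bloom_hash(const char *s, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;
    for (; *s; ++s) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h % BLOOM_BITS;
}

void bloom_init(bloom_filter *f) {
    memset(f->bits, 0, sizeof f->bits);
}

/* Set one bit per hash probe. */
void bloom_add(bloom_filter *f, const char *s) {
    for (uint32_t k = 0; k < BLOOM_HASHES; ++k) {
        uint32_t bit = bloom_hash(s, k * 0x9e3779b9u);
        f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* false: the item was definitely never added.
   true: the item was possibly added (false positives can occur). */
bool bloom_maybe_contains(const bloom_filter *f, const char *s) {
    for (uint32_t k = 0; k < BLOOM_HASHES; ++k) {
        uint32_t bit = bloom_hash(s, k * 0x9e3779b9u);
        if (!(f->bits[bit / 8] & (1u << (bit % 8))))
            return false;
    }
    return true;
}
```

Every term that was added is always reported as possibly present, so a document filter built this way never misses a match. The cost is an occasional false positive that a later exact check can reject.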

Page 32:

Fractal video compression

• Best matching codebook – Correlation with SAD

• Advantage FPGA – Integer arithmetic

• Results

Platform                          Power (W)   Performance (FPS)   FPS/W
W3690 Xeon Processor              130         4.6                 0.035
nVidia Tesla C2075                215         53.1                0.247
DE4 Stratix IV-530 Accelerator    21          70.9                3.376
PCIe385 A7 Accelerator            25          74.4                2.976
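The "best matching codebook" search with SAD (sum of absolute differences) uses only integer subtract, absolute value, and accumulate. The following C sketch shows a plain SAD search under simplifying assumptions: 8-bit pixels, codebook entries stored back to back, and no scaling or offset transforms of the kind a full fractal coder would apply. The function names `sad` and `best_match` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between two blocks of n pixels. */
uint32_t sad(const uint8_t *x, const uint8_t *y, size_t n) {
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += (uint32_t)abs((int)x[i] - (int)y[i]);
    return total;
}

/* Index of the codebook entry with the lowest SAD against `block`.
   `codebook` holds n_entries blocks of n pixels each, back to back.
   Requires n_entries >= 1; ties resolve to the lowest index. */
size_t best_match(const uint8_t *block, const uint8_t *codebook,
                  size_t n_entries, size_t n) {
    size_t best = 0;
    uint32_t best_sad = sad(block, codebook, n);
    for (size_t e = 1; e < n_entries; ++e) {
        uint32_t s = sad(block, codebook + e * n, n);
        if (s < best_sad) {
            best_sad = s;
            best = e;
        }
    }
    return best;
}
```

Each codebook comparison is independent of the others, so the candidate loop is a natural fit for one thread per entry followed by a minimum reduction.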