From Software to Circuits: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems


From Software to Circuits: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems

Jason Anderson

Tools to Tackle Big Data – Big Data Workshop, 3 July 2014

Dept. of Electrical and Computer Engineering, University of Toronto

LegUp Research Team

• Undergrad Researchers: Mathew Hall, Stefan Hadjis, Joy Chen
• Faculty: Stephen Brown and myself
• Industry Liaison: Tomasz Czajkowski, Altera

Andrew Canis, James Choi, Nazanin Calagar, Lanny Lian, Blair Fort

Computations in Two Ways

• Write Software
• Design Custom Circuits

Design Methodology

Write software
• Easy
• Flexibility, but lower performance

Design Custom Circuits
• Efficient, low power
• Need specialized knowledge

Hardware’s Potential

• Implementing computations in FPGA hardware can have speed/energy advantages over software:
  – Lithography simulation: 15X speed-up [Cong & Zou, TRETS’09]
  – Linear system solver: 2.2X speed-up, 5X more energy efficient [Zhang, Betz, Rose, TRETS’12]
  – Monte Carlo simulation for photodynamic therapy: 80X faster, 45X more energy efficient [Lo et al., J. Biomed Optics’09]
  – Options pricing: 4.6X faster, 25X more energy efficient [Tse, Thomas, Luk, TVLSI’12]

So Why Doesn’t Everybody Use Hardware?

• Hardware design is difficult and skills are rare:
  – Requires use of hardware description languages: Verilog and VHDL
  – Low level of abstraction (individual bits)
  – 10 software engineers for every hardware engineer*
• We need a CAD flow that simplifies hardware design for software engineers

*US Bureau of Labor Statistics, 2012

A Solution

• High-Level Synthesis
  – Design circuits using software languages
  – From a software program, a high-level synthesis tool automatically “synthesizes” a circuit that does the same computations as the program
  – Benefits of software programmability and hardware performance

LegUp High-Level Synthesis for FPGAs

• LegUp is a high-level synthesis tool we have been developing since 2009.

• Takes a C program as input, and produces a circuit.

• 1000+ downloads of our tool since its first release in 2011.

• http://legup.eecg.toronto.edu


Why Use FPGAs to Implement Circuits?

• Building fully fabricated custom chips is hard
  – Very complex design process
  – Costs $millions to prototype a chip
  – Takes 2-3 months to fabricate
  – Only done for high-volume applications or apps that require high speed or lowest power
• Alternative: pre-fabricated, programmable chips

Field-Programmable Gate Arrays (FPGAs)

• Pre-fabricated chip consists of an “array” of logic blocks surrounded by programmable interconnect
• Hardware “becomes” what you want by programming the blocks and interconnect (electrically)

[Figure: FPGA architecture — an array of configurable logic blocks (CLBs) surrounded by channels of programmable interconnect, with columns of SRAM block RAM (e.g., 18 kbits each) and hard IP blocks (common blocks: multiplier, DSP, processor, PCI, ADC, DLL).]

A Real FPGA – Altera Stratix III

FPGA Advantages over “Hard” Chips

• “Manufacture” takes seconds vs. months
• Design, test and manufacture: single-digit $millions vs. tens of $millions
• Giving:
  – Faster time-to-market for products
  – FPGA vendor handles difficult design & manufacture issues
  – FPGA vendor shares inventory risk across many customers
  – FPGA vendor does test
• Two largest FPGA vendors: Xilinx and Altera

FPGAs and High-Level Synthesis

• FPGAs mainly accessible to HW engineers
  – Vendors want to expand the user base: make FPGAs usable as computing platforms
• Area/power/delay gap between HLS-generated HW and manually crafted HW
  – In custom Si, the user must “pay” for the area gap
  – Power/performance is one of the main reasons to go custom
• FPGAs likely the IC medium through which HLS goes “mainstream”

LegUp: Top-Level Vision

[Figure: LegUp flow — program code is compiled by a C compiler onto a self-profiling processor (MIPS/ARM); profiling data (execution cycles, power, cache misses) suggests program segments to target to HW; high-level synthesis “hardens” those segments into accelerators in the FPGA fabric, and the SW binary is altered to call the HW accelerators.]

    int FIR(int ntaps, int sum) {
      int i;
      for (i = 0; i < ntaps; i++)
        sum += h[i] * z[i];
      return (sum);
    }
    ....

LegUp: Key Features

• C to Verilog high-level synthesis
• Many benchmarks (incl. 12 CHStone)
• Automated verification tests
• Support for four different FPGAs:
  – Altera Cyclone II, Stratix IV, Cyclone IV, Cyclone V-SoC
• Open source, freely downloadable

How Does High-Level Synthesis Work?

Digital Circuits

• Example: you buy a “1 GHz processor”
  – 1 GHz = 1 nanosecond time steps
  – Some computation is done in each time step

Example Circuit

[Figure: a simple datapath that calculates A+B in one 1 ns time step, with a register to store the computation after each step. Extending the idea, the expression (A+B)*(C–D) – (E*F) is computed over three 1 ns steps: the first step computes A+B, C–D, and E*F; the second multiplies (A+B) by (C–D); the third performs the final subtraction, with registers storing the intermediate results between steps.]
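As a rough software analogue of that staged computation (not from the slides; the function name staged_compute and the register names r1–r5 are illustrative), each group of assignments below corresponds to one 1 ns time step, with the intermediate variables playing the role of the registers:

    /* Illustrative software model of the three-step datapath above;
       r1..r5 stand in for the registers between time steps. */
    int staged_compute(int A, int B, int C, int D, int E, int F) {
        /* time step 1 (1 ns): three independent operations in parallel */
        int r1 = A + B;
        int r2 = C - D;
        int r3 = E * F;

        /* time step 2 (1 ns): multiply the registered results */
        int r4 = r1 * r2;

        /* time step 3 (1 ns): final subtraction */
        int r5 = r4 - r3;
        return r5;
    }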

Scheduling: Key Aspect of HLS

• How do we assign the computations of a program to the hardware time steps?

C language snippet:

    z = a + b;
    x = c + d;
    q = z + x;
    q = q - 2;
    r = q * 2;

Programs do not contain the notion of “time steps”. Here, we have: 3 add operations, 1 subtract operation, 1 multiplication operation.

Scheduling

Questions:
• Which operations can be scheduled in the same time step?
• Which operations are dependent on others?
• If addition takes 5 ns, subtraction takes 5 ns, and multiplication takes 10 ns, how do we schedule?
  – Target clock step length is 10 ns

Scheduling

[Figure: the snippet scheduled into three 10 ns steps — the two independent adds (a+b and c+d) run as parallel operations in the first step; the dependent add and the subtraction of 2 are chained within the next step (5 ns + 5 ns); the multiplication by 2 occupies a full 10 ns step.]
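To make that schedule concrete, here is the snippet again with each operation annotated with the time step it could occupy under those latencies. This is an illustrative assignment under the stated assumptions, not literal LegUp output; the wrapper function name is hypothetical.

    int schedule_example(int a, int b, int c, int d) {
        int z, x, q, r;
        z = a + b;   /* step 1: independent add (5 ns)                       */
        x = c + d;   /* step 1: runs in parallel with z = a + b              */
        q = z + x;   /* step 2: add (5 ns)...                                */
        q = q - 2;   /* step 2: ...chained with the subtract (5 + 5 = 10 ns) */
        r = q * 2;   /* step 3: the multiply needs a full 10 ns step         */
        return r;
    }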

HLS Challenges

• Performance of HLS-generated circuits not as good as human-designed circuits
• However, HLS-generated circuits are already better than SW in many cases
• Much of our research is aimed towards improving HLS quality

Loop Pipelining

    for (int i = 0; i < N; i++) {
      sum[i] = a + b + c + d;
    }

[Figure: without pipelining, the three chained adds (a+b, then +c, then +d) occupy cycles 1, 2, and 3 of each iteration.]

• Cycles: 3N
• Adders: 3
• Utilization: 33%

Loop Pipelining

    Cycle:  1   2   3   4   5   …   N   N+1  N+2
    i=0     +   +   +
    i=1         +   +   +
    i=2             +   +   +
    …
    i=N-2                       …   +   +
    i=N-1                           +   +   +

• Cycles: N+2 (~1 cycle per iteration)
• Adders: 3
• Utilization: 100% in steady state

Loop Pipelining

• Ideally, we could start a loop iteration every clock cycle
  – Initiation interval (II) = 1
• However,
  – Loops may have dependencies across iterations
  – There may be constraints on resources
    • e.g., only two memory accesses in a cycle
• Loop pipelining seeks to minimize II subject to these constraints (see the sketch below)
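As an illustration (not taken from the slides), the loop below carries a dependency from one iteration to the next through acc, so a new iteration cannot simply start every cycle; the achievable II is bounded by the latency of the chained add. The function and variable names are hypothetical.

    /* Hypothetical example of a cross-iteration (loop-carried) dependency:
       each iteration needs the acc value produced by the previous one,
       so the initiation interval (II) cannot drop below the add latency. */
    int accumulate(const int *a, int n) {
        int acc = 0;
        for (int i = 0; i < n; i++) {
            acc = acc + a[i];   /* uses acc from the previous iteration */
        }
        return acc;
    }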

Exploiting Spatial Parallelism

Motivation

• Speed benefits of HW arise from spatial parallelism
• Extracting parallelism from a sequential program is difficult
  – Auto-parallelizing compilers do not work well!
• Easier to start from parallel code
  – Pthreads/OpenMP can help!

Background

Programming Models:
• Sequential: C/C++
• Massively parallel: CUDA/OpenCL
• Pthreads/OpenMP — a standard API in C!

OpenMP example

    #pragma omp parallel for num_threads(2) private(i)
    for (i = 0; i < SIZE; i++) {
      output[i] = A_array[i] * B_array[i];
    }

Pthread Example

    #include <pthread.h>

    /* SIZE, A_array, B_array, and output are assumed to be declared globally. */
    struct thread_data {
      int start;
      int end;
    };

    void *product(void *threadarg);   /* forward declaration */

    int main() {
      pthread_t thread1, thread2;
      struct thread_data data1, data2;

      data1.start = 0;       data1.end = SIZE/2;
      data2.start = SIZE/2;  data2.end = SIZE;

      pthread_create(&thread1, NULL, product, (void*)&data1);
      pthread_create(&thread2, NULL, product, (void*)&data2);

      pthread_join(thread1, NULL);
      pthread_join(thread2, NULL);
    }

    void *product(void *threadarg) {
      int i, startIdx, endIdx;
      struct thread_data* arg = (struct thread_data*) threadarg;
      startIdx = arg->start;
      endIdx = arg->end;

      for (i = startIdx; i < endIdx; i++) {
        output[i] = A_array[i] * B_array[i];
      }
      return NULL;
    }

Pthreads vs OpenMP

• OpenMP provides an easy/implicit way of parallelizing a section of code (e.g., loops)
• Pthreads require explicit thread forks/joins
• Pthreads can be more work but give more control to the programmer
• Pthreads can execute different functions in parallel

OpenMP/Pthreads Support in LegUp

• Allow Pthreads and OpenMP to be used to specify parallel hardware.
• Automatically infer parallel-operating accelerators for the parallel-operating threads.
• Permits easy exploration of a broad parallelization landscape.
  – Incl. support for nested parallelism.

Nested Parallelism

[Figure: a Pthreads program forks three threads (add, sub, mult); each thread contains an OpenMP (OMP) parallel region, giving nested parallelism.]
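A minimal sketch of what such nested-parallel code might look like, assuming hypothetical kernel and array names (the slide does not show the source): three Pthreads, each running an OpenMP-parallelized loop.

    #include <pthread.h>

    #define SIZE 1024
    int A[SIZE], B[SIZE], add_out[SIZE], sub_out[SIZE], mult_out[SIZE];

    /* Each Pthread runs one kernel; inside, an OpenMP loop provides a
       second (nested) level of parallelism. */
    void *add_kernel(void *arg) {
        (void)arg;
        #pragma omp parallel for num_threads(2)
        for (int i = 0; i < SIZE; i++) add_out[i] = A[i] + B[i];
        return NULL;
    }
    void *sub_kernel(void *arg) {
        (void)arg;
        #pragma omp parallel for num_threads(2)
        for (int i = 0; i < SIZE; i++) sub_out[i] = A[i] - B[i];
        return NULL;
    }
    void *mult_kernel(void *arg) {
        (void)arg;
        #pragma omp parallel for num_threads(2)
        for (int i = 0; i < SIZE; i++) mult_out[i] = A[i] * B[i];
        return NULL;
    }

    int main() {
        pthread_t t_add, t_sub, t_mult;
        pthread_create(&t_add,  NULL, add_kernel,  NULL);
        pthread_create(&t_sub,  NULL, sub_kernel,  NULL);
        pthread_create(&t_mult, NULL, mult_kernel, NULL);
        pthread_join(t_add,  NULL);
        pthread_join(t_sub,  NULL);
        pthread_join(t_mult, NULL);
        return 0;
    }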

[Figure: target system for nested parallelism — a processor and accelerators (Accel 1, 2, 3) share an on-chip cache backed by off-chip memory; to serve the nested parallel threads, the on-chip cache is multi-ported so the processor and all accelerators can access it concurrently.]

Case Study

Computing the Mandelbrot Set

• Highly compute-bound application

• Each pixel is computed independently

• Fixed point calculations

• Our target image: 128x128 pixels (see the kernel sketch below)
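For concreteness, here is a sketch of the kind of per-pixel fixed-point iteration such a design might use. It is illustrative only: the Q8.24 format, the iteration cap, and all names are assumptions, not taken from the slides.

    #include <stdint.h>

    #define FRAC_BITS 24                      /* Q8.24 fixed-point format (assumed) */
    #define FIXED(x)  ((int32_t)((x) * (1 << FRAC_BITS)))
    #define MAX_ITER  256

    /* Multiply two Q8.24 values using a 64-bit intermediate product. */
    static inline int32_t fmul(int32_t a, int32_t b) {
        return (int32_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
    }

    /* Iteration count for one pixel of the Mandelbrot set at point (cr, ci).
       Each pixel is independent, so many of these kernels can run in parallel. */
    int mandel_pixel(int32_t cr, int32_t ci) {
        int32_t zr = 0, zi = 0;
        int iter = 0;
        while (iter < MAX_ITER) {
            int32_t zr2 = fmul(zr, zr);
            int32_t zi2 = fmul(zi, zi);
            if (zr2 + zi2 > FIXED(4.0))       /* escaped: |z|^2 > 4 */
                break;
            zi = fmul(FIXED(2.0), fmul(zr, zi)) + ci;   /* uses old zr, zi */
            zr = zr2 - zi2 + cr;
            iter++;
        }
        return iter;
    }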

Target Platform: Altera Cyclone V-SoC

• 28nm FPGA with embedded dual-core ARM processor in the FPGA fabric
  – 800 MHz ARM with L1 + L2 caches
  – FPGA accelerators can access the ARM cache

[Figure: the ARM processor alongside the Altera Cyclone V FPGA fabric.]

Speed Performance Results

[Figure: wall-clock time (ms) and clock frequency (MHz) for ARM software and for 1, 2, 4, and 8 HLS accelerators; the accelerated designs reach a 5.7X speed-up vs. the ARM SW.]

High-Level Synthesis for Big Data

• Seeking big data applications we can collaborate on and accelerate with HLS

• Ideal characteristics:
  – Compute bound (not I/O bound)
  – Integer or fixed point (not floating point)
  – Data parallel

• Please reach out to us

Summary

• LegUp is an open-source high-level synthesis tool being developed at the Univ. of Toronto.
  – Targets a hybrid FPGA-based processor/accelerator system.
  – Distribution includes many benchmark programs and other infrastructure.
• Active development continues.
  – Pthreads + OpenMP, debugging, memory architecture synthesis, improved HW quality.

Questions?

legup.eecg.toronto.edu
