DEFACTO: Combining Parallelizing Compiler Technology with Hardware Behavioral Synthesis*

Pedro C. Diniz, Mary W. Hall, Joonseok Park, Byoungro So and Heidi Ziegler
University of Southern California / Information Sciences Institute
4676 Admiralty Way, Suite 1001
Marina del Rey, California 90292

* The DEFACTO project was funded by the Information Technology Office (ITO) of the Defense Advanced Research Projects Agency (DARPA) under contract #F30602-98-2-0113.




Outline

• Background & Motivation
• Part 1: Application Mapping Example
• Part 2: Design Space Exploration
• Part 3: Challenges for Future FPGAs
• Related Work
• Conclusion


DEFACTO Objective & Goals

Objectives:
• Automatically Map High-Level Applications to Field-Programmable Hardware (FPGAs)
• Explore Multiple Design Choices

Goal:
• Make Reconfigurable Technology Accessible to the Average Programmer


What Are FPGAs: Key Concepts

• Configurable Hardware
• Reprogrammable (ms latency)

Architecture:
• Configurable Logic Blocks (CLBs): "universal" logic; some inputs/outputs latched
• Passive network between CLBs
• Memories, processor cores


Why Use FPGAs?

Advantages over Application-Specific Integrated Circuits (ASICs):
• Faster Time to Market
• "Post-Silicon" Modification Possible
• Reconfigurable, Possibly Even at Run-Time

Advantages over General-Purpose Processors:
• Application-Specific Customization (e.g., parallelism, small data-widths, arithmetic, bandwidth)

Disadvantages:
• Slow (typical automatic design @ 25 MHz)
• Low Density of Transistors


How to Program FPGAs?

Hardware-Oriented Languages:
• VHDL or Verilog
• Very Low-Level Programming

Commercial Tools (e.g., Monet™):
• Choose Implementation Based on User Constraints
• Time and Space Trade-Off
• Provide Estimates for the Implementation

Problem: Too Slow for Large, Complex Designs
• Place-and-Route Can Take up to 8 Hours for Large Designs
• Unclear What to Do When Things Go Wrong


Behavioral Synthesis Example

variable A : std_logic_vector(0 to 7);
…
X <= (A * B) - (C * D) + F;

Three candidate schedules for the same expression:
• 6 Registers, 2 Multipliers, 2 Adders/Subtractors, 1 (long) clock cycle
• 9 Registers, 2 Multipliers, 2 Adders/Subtractors, 2 (shorter) clock cycles
• 13 Registers, 1 Multiplier, 2 Adders/Subtractors, 3 (shorter) clock cycles


Synthesizing FPGA Designs: Status

Technology Advances Have Led to Increasingly Large Parts
• FPGAs now have Millions of "gates"

Current Practice is to Handcode Designs for FPGAs in Structural VHDL
• Tedious and Error-Prone
• Requires Weeks to Months Even for Fairly Simple Designs

A Higher-Level Approach is Needed!


DEFACTO: Key Ideas

Parallelizing Compiler Technology:
• Complements Behavioral Synthesis
• Adjusts Parallelism and Data Reuse
• Optimizes External Memory Accesses

Design Space Exploration:
• Evaluates and Compares Designs before Committing to Hardware
• Improves Design Time Efficiency
• A Form of Feedback-Directed Optimization


Opportunities: Parallelism & Storage

Behavioral Synthesis                          | Parallelizing Compiler
----------------------------------------------+------------------------------------------------
Optimizes Scalar Variables Only,              | Optimizes Scalars & Multi-Dimensional Arrays,
inside the Loop Body                          | inside the Loop Body & Across Iterations
Supports User-Controlled Loop Unrolling       | Analysis Guides Automatic Loop Transformations
Manages Registers and Inter-Operator          | Evaluates Tradeoffs of Different Memories,
Communication                                 | On- and Off-Chip
Considers One FPGA                            | System-Level View
Performs Allocation, Binding & Scheduling     | No Knowledge of the Hardware Implementation
of the Hardware                               |


Part 1: Mapping Complete Designs from C to FPGAs

Sobel Edge Detection Example


Example - Sobel Edge Detection

char img[IMAGE_SIZE][IMAGE_SIZE], edge[IMAGE_SIZE][IMAGE_SIZE];
int uh1, uh2, threshold;

for (i = 0; i < IMAGE_SIZE - 4; i++) {
  for (j = 0; j < IMAGE_SIZE - 4; j++) {
    uh1 = (((-img[i][j]) + (-(2 * img[i+1][j])) + (-img[i+2][j])) +
           ((img[i][j+2]) + (2 * img[i+1][j+2]) + (img[i+2][j+2])));
    uh2 = (((-img[i][j]) + (img[i+2][j])) + ((-(2 * img[i][j+1])) +
           (2 * img[i+2][j+1])) + ((-img[i][j+2]) + (img[i+2][j+2])));
    if ((abs(uh1) + abs(uh2)) < threshold)
      edge[i][j] = 0xFF;
    else
      edge[i][j] = 0x00;
  }
}

[Figure: the two 3×3 Sobel masks applied to img, with threshold selecting the edge output:
   -1 -2 -1        1  0 -1
    0  0  0        2  0 -2
    1  2  1        1  0 -1 ]


Sobel - A Naïve Implementation

Large Number of Adders and Multipliers (shifts in this case)
Too Many Memory Accesses!
• 8 Reads and 1 Write per Iteration of the Loop
• Observation: Across 2 Iterations, 4 out of 8 Values Can Be Reused

[Figure: naive datapath reading img[i][j], img[i][j+1], img[i][j+2], img[i+1][j], img[i+1][j+2], img[i+2][j], img[i+2][j+1], img[i+2][j+2] on every iteration and writing 0xFF or 0x00 to edge[i][j]]


Data Reuse Analysis - Sobel

[Figure: reuse among the eight img accesses. Across outer-loop (i) iterations the reuse distances are d = (1,0) and d = (2,0); across inner-loop (j) iterations they are d = (0,1) and d = (0,2).]


Data Reuse Using Tapped-Delay Lines

Reduce the Number of Memory Accesses
Exploit Array Layout and Distribution:
• Packing
• Stripping

[Figure: two example datapaths feeding edge[i][j]. Left, reusing img[i][j], img[i+1][j], img[i+2][j]: Accesses = 0.25 + 0.25 + 0.25 + 0.25 = 1.0 per iteration. Right, re-reading img[i][j], img[i][j+1], img[i][j+2]: Accesses = 1.0 + 1.0 + 1.0 + 1.0 = 4.0 per iteration.]
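The tapped-delay-line idea can be sketched in C (an illustrative sketch, not DEFACTO-generated code; the function names and window size are invented). A window of consecutive array elements is kept in scalars that shift like a hardware delay line, so each iteration performs one memory read instead of three:

```c
#define N 16

/* Naive sliding 3-element window sum: 3 memory reads per iteration. */
static void window3_naive(const unsigned char in[N], int out[N - 2]) {
    for (int j = 0; j < N - 2; j++)
        out[j] = in[j] + in[j + 1] + in[j + 2];
}

/* Tapped-delay-line version: w0..w2 form a shift register holding the
 * window, so each iteration reads only the one new element in[j + 2]. */
static void window3_delay_line(const unsigned char in[N], int out[N - 2]) {
    int w0 = in[0], w1 = in[1];
    for (int j = 0; j < N - 2; j++) {
        int w2 = in[j + 2];      /* the single new memory read */
        out[j] = w0 + w1 + w2;
        w0 = w1;                 /* shift the delay line */
        w1 = w2;
    }
}
```

In hardware, w0..w2 become registers feeding the datapath directly; this is the reuse the compiler exploits across the inner-loop iterations of Sobel.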


Overall Design Approach

[Figure: multiple application data-paths on the FPGA, each connected to its own external memory (MEM)]

Application Data-Paths:
• Extract Body of Loops
• Uses Behavioral Synthesis

Memory Interfaces:
• Uses Data Access Patterns to Generate Channel Specs
• VHDL Library Templates


WildStar™: A Complex Memory Hierarchy

[Figure: Annapolis WildStar board. FPGA 0, FPGA 1 and FPGA 2 connect to Shared Memories 0-3 and to SRAM0-SRAM3, with a PCI controller linking to off-board memory over 32-bit and 64-bit buses.]


Project Status

Complex Infrastructure:
• Different Programming Languages (C vs. VHDL)
• Different EDA Tools
• Different Vendors
• Experimental Target
• In-House Tools

Combines Compiler Techniques and Behavioral Synthesis:
• Different Execution Models
• Reconcile Representation

It Works!
• Fully Automated for Single-FPGA Designs
• Modest Manual Intervention for Multi-FPGA Designs (simulation OK)

[Figure: tool flow. Algorithm Description → Compiler Analysis → Code Transformations and Annotations (computation & data partitioning, design space exploration, memory access protocols) → SUIF2VHDL → Behavioral Synthesis & Estimation (Monet) → Logic Synthesis (Synplicity) → Place & Route (Xilinx Foundations) → Annapolis WildStar Board]


Sobel on the Annapolis WildStar Board

[Figure: input image and output (edge) image]

Manual vs. Automated:

Metrics                      Manual           Automated
Space (slices)               2238             2279 (2% increase)
Cycles                       326K             518K
Clock Rate (MHz)             42               40
Execution Time (one frame)   7.7 ms (100%)    12.95 ms (159%)
Design Time                  about 1 week     42 minutes


Part 2: Design Space Exploration

Using Behavioral Synthesis Estimates


Design Space Exploration (Current Practice)

Design Specification (Low-Level VHDL) → Logic Synthesis / Place & Route → Validation / Evaluation (Correct? Good design?) → Design Modification → repeat

2 Weeks for a Working Design
2 Months for an Optimized Design


Design Space Exploration (Our Approach)

Algorithm (C/Fortran)
→ Compiler Optimizations (SUIF):
  • Unroll and Jam
  • Scalar Replacement
  • Custom Data Layout
→ SUIF2VHDL Translation
→ Behavioral Synthesis Estimation
→ Unroll Factor Selection
→ Logic Synthesis / Place & Route

Overall, Less than 2 Hours; 5 Minutes for Optimized Design Selection


Problem Statement

Tradeoff: Lowering Execution Time (Exploit Parallelism, Reuse Data On-Chip) Raises Space Requirements (More Copies of Operators, More On-Chip Registers)

Constraint: Size of Design Less than FPGA Capacity
Goal: Minimal Execution Time
Selection Criteria: For a Given Performance, Minimal Space
• Frees up More Space for Other Computations
• Better Clock Rate Achieved
• Desirable to Use On-Chip Space Efficiently


Balance

Definition: Balance = Data Fetch Rate / Data Consumption Rate
• Consumption Rate [bits/cycle] = data bits consumed per computation time; limited by the data dependences of the computation
• Data Fetch Rate [bits/cycle] = data bits required per computation time; limited by the FPGA's effective memory bandwidth

If Balance > 1, Compute Bound
If Balance < 1, Memory Bound

Balance suggests whether more resources should be devoted to enhancing computation or storage.
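As a toy numeric illustration (the parameter names and values are invented, not the DEFACTO formulas), balance can be computed directly from per-iteration estimates:

```c
/* Rates for one loop iteration of a candidate design (illustrative model):
 * the datapath consumes `bits_consumed` over `compute_cycles`, while the
 * memory system delivers `bits_fetched` over `memory_cycles`. */
static double consumption_rate(int bits_consumed, int compute_cycles) {
    return (double)bits_consumed / compute_cycles;
}

static double fetch_rate(int bits_fetched, int memory_cycles) {
    return (double)bits_fetched / memory_cycles;
}

/* Balance > 1: compute bound (memory delivers more than the datapath
 * consumes); balance < 1: memory bound (the datapath starves). */
static double balance(double fetch, double consume) {
    return fetch / consume;
}
```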


Loop Unrolling

Exposes fine-grain parallelism by replicating the loop body.

Original:
  Do I = 1, N
    A(I) = A(I-2) + B(I)

Unrolled by 2:
  Do I = 1, N, by 2
    A(I)   = A(I-2) + B(I)
    A(I+1) = A(I-1) + B(I+1)

[Figure: dataflow graphs of the original and unrolled loop bodies; the dependence on A has distance 2, so the two copies of the body are independent.]

As the Unrolling Factor Increases, both the Data Fetch and Consumption Rates Increase.
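In C, the transformation above looks like this (a sketch with invented array sizes and seed values): because the recurrence on A has dependence distance 2, the two statements in the unrolled body touch different elements and can be scheduled in parallel in hardware; both versions compute identical results.

```c
#define N 16   /* illustrative trip count; assumed even for the unrolled loop */

/* Original loop, shifted to C indexing: A[i+2] = A[i] + B[i]. */
static void recur(int A[N + 2], const int B[N]) {
    for (int i = 0; i < N; i++)
        A[i + 2] = A[i] + B[i];
}

/* Unrolled by 2: the two statements read A[i] and A[i+1] and write
 * A[i+2] and A[i+3] -- no overlap, so synthesis can run them in parallel. */
static void recur_unrolled2(int A[N + 2], const int B[N]) {
    for (int i = 0; i < N; i += 2) {
        A[i + 2] = A[i] + B[i];
        A[i + 3] = A[i + 1] + B[i + 1];
    }
}
```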


Monotonicity Properties

[Figure: data fetch rate (bits/cycle), data consumption rate (bits/cycle), and balance (= fetch/consumption) plotted against the unroll factor; each curve is monotonic up to its saturation point.]

Saturation point: the unroll factor that saturates memory bandwidth for a given architecture.


Balance & Optimal Unroll Factor

[Figure: data fetch rate and data consumption rate (bits/cycle) versus unroll factor. The fetch rate saturates at the memory bandwidth; the optimal solution lies where the consumption rate meets the saturated fetch rate.]

Balance Guides the Design Space Exploration.
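A minimal sketch of how the monotonicity properties shorten the search (all model numbers are invented stand-ins for behavioral-synthesis estimates, not DEFACTO's real cost functions): keep doubling the unroll factor while the candidate still fits the chip and remains compute bound; once the fetch rate saturates, further unrolling buys nothing.

```c
#define MEM_BANDWIDTH 32.0   /* bits/cycle the board can sustain (assumed) */
#define FPGA_CAPACITY 4096   /* available area budget (assumed) */

/* Fetch rate grows with the unroll factor until it hits the bandwidth. */
static double model_fetch_rate(int unroll) {
    double demand = 8.0 * unroll;        /* 8 bits fetched per body copy */
    return demand < MEM_BANDWIDTH ? demand : MEM_BANDWIDTH;
}

/* Consumption rate grows with each replicated body copy. */
static double model_consumption_rate(int unroll) {
    return 8.0 * unroll;                 /* 8 bits consumed per body copy */
}

/* Area grows with each replicated operator set. */
static int model_area(int unroll) {
    return 100 + 150 * unroll;           /* base cost + per-copy cost */
}

/* Double the unroll factor while the next candidate still fits the chip
 * and is still compute bound (balance >= 1); stop at saturation. */
static int select_unroll(void) {
    int unroll = 1;
    while (model_area(unroll * 2) <= FPGA_CAPACITY &&
           model_fetch_rate(unroll * 2) / model_consumption_rate(unroll * 2) >= 1.0)
        unroll *= 2;
    return unroll;
}
```

With these toy numbers the walk stops at unroll factor 4, the point where fetch demand reaches the 32 bits/cycle bandwidth, after probing only a handful of candidates rather than the whole space.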


Experiments

Multimedia Kernels:
• FIR (Finite Impulse Response)
• Matrix Multiply
• Sobel (Edge Detection)
• Pattern Matching
• Jacobi (Five-Point Stencil)

Methodology:
• Compiler Translates C to SUIF and Behavioral VHDL
• Synthesis Tool Estimates Space and Computational Latency
• Compiler Computes Balance and Execution Time, Accounting for Memory Latency

Memory Latency:
• Pipelined: 1 cycle for read and write
• Non-pipelined: 7 cycles for read and 3 cycles for write
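The execution-time accounting can be sketched with a toy model (illustrative only; the read/write latencies are the ones quoted above, everything else is invented):

```c
/* Toy per-loop cycle estimate: total cycles = iterations * (compute
 * cycles + memory cycles), where memory cost depends on whether the
 * accesses are pipelined (1 cycle each) or not (7-cycle reads,
 * 3-cycle writes, as on the slide). */
static long loop_cycles(long iterations, int compute_cycles,
                        int reads, int writes, int pipelined) {
    int read_lat  = pipelined ? 1 : 7;
    int write_lat = pipelined ? 1 : 3;
    return iterations * (compute_cycles + (long)reads * read_lat
                                        + (long)writes * write_lat);
}
```

For the naive Sobel body (8 reads, 1 write per iteration), this kind of model makes the gap between pipelined and non-pipelined memory immediately visible.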


FIR

[Figure: FIR datapath (multipliers and adders) with space/time estimates for outer-loop unroll factors 1, 2, 4, 8, 16, 32 and 64; the selected design achieves a speedup of 17.26.]


Matrix Multiply

[Figure: matrix-multiply datapath with space/time estimates for outer-loop unroll factors 1, 2, 4, 8, 16 and 32; the selected design achieves a speedup of 13.36.]


Efficiency of Design Space Exploration

Program          Search Space    Searched Points
FIR              2048 (24)       3
Matrix Multiply  2048 (24)       4
Jacobi           512 (27)        5
Pattern          768 (20)        4
Sobel            2048 (24)       2

On average, only 0.3% (15%) of the Space is Searched.


FIR: Estimation vs. Accurate Data

Larger Designs Lead to Degradation in Clock Rates
The Compiler Can Use a Statistical Approach to Derive Confidence Intervals for Space
Our Case: the Compiler Makes the Correct Decision Using Imperfect Data


Part 3: Challenges for Future FPGAs

Heterogeneous Functional and Storage Resources

Data/Computation Partitioning and Scheduling Revisited


Field-Programmable Core Arrays

Large Number of Transistors:
• Multiple Application-Specific Cores
• Customization of Interconnect
• Other Specialized Logic

Challenges:
• Data Partitioning: Custom Storage Structures; Allocation, Binding and Scheduling; Replication and Reorganization
• Computation Partitioning: Scheduling between Cores; Coarse-Grain Pipelining

Revisiting These Issues with Parallelizing Compiler Technology

[Figure: a future part combining IP cores (ARM, DSP) with S-RAM and D-RAM blocks]


Related Work

Compilers for Special-Purpose Configurable Architectures:
• PipeRench (CMU), RaPiD (UW), RAW (MIT)

High-Level Languages Oriented towards Hardware:
• Handel-C, Cameron (CSU), PICO (HP), Napa-C (LANL)

Integrated Compiler and Logic Synthesis:
• Babb (MIT), Nimble (Synopsys)

Compiling from MatLab to FPGAs:
• MATCH Compiler (Northwestern)


Conclusion

Combines Behavioral Synthesis and Parallelizing Compiler Technologies

Fast & Automated Design Space Exploration:
• Trades Space for Functional Units via Loop Unrolling
• Uses Balance and Monotonicity Properties
• Searches only 0.3% of the Entire Design Space
• Near-Optimal Performance and Smallest Space

Future FPGAs:
• Coarser-Grained, Custom Functional and Storage Structures
• Multiprocessor on a Chip
• Data and Computation Partitioning and Coarse-Grain Scheduling


Thank You