University of Michigan, Electrical Engineering and Computer Science

Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke

University of Michigan


Automated C to Gates Solution

• SoC design
  – 10-100 Gops, 200 mW power budget
  – Low level tools ineffective

• Automated accelerator synthesis for whole application
  – Correct by construction
  – Increase designer productivity
  – Faster time to market

[Figure: app.c compiled into a set of loop accelerators (LA)]


Streaming Applications

[Figure: H.264 Encoder, from Image to Coded Image. Blocks: Motion Estimator, Transform Coder, Quantizer, Inverse Quantizer, Inverse Transform, Motion Predictor]

• Data “streaming” through kernels

• Kernels are tight loops
  – FIR, Viterbi, DCT

• Coarse grain dataflow between kernels
  – Sub-blocks of images, network packets

[Figure: W-CDMA Transmitter, from Data in to Data out. Blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter]


System Schema Overview

[Figure: Kernels 1-5 mapped onto accelerators LA 1, LA 2, and LA 3. A timeline shows successive tasks, each running Kernel 1, K2, K3, Kernel 4, and Kernel 5, plotted against time and annotated with the task throughput.]


Input Specification

• Sequential C program

• Kernel specification
  – Perfectly nested FOR loop
  – Wrapped inside C function
  – All data access made explicit

    row_trans(char inp[8][8], char out[8][8]) {
        for(i=0; i<8; i++) {
            for(j=0; j<8; j++) {
                . . . = inp[i][j];
                out[i][j] = . . . ;
            }
        }
    }

    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

• System specification
  – Function with main input/output
  – Local arrays to pass data
  – Sequence of calls to kernels

    dct(char inp[8][8], char out[8][8]) {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }

[Figure: dataflow inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]
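The kernel and system conventions can be made concrete with a small compilable sketch. The kernel bodies below are placeholders chosen only so the code runs (add one, double, transpose); they are not the DCT arithmetic, which the slide elides. The `_demo` names, the `void` return types, and the aliasing check are assumptions added for illustration.

```c
#include <assert.h>

/* Kernel: a perfectly nested loop wrapped in a function, with every
 * array access written explicitly. Placeholder body: add one. */
void row_trans_demo(char inp[8][8], char out[8][8]) {
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            out[i][j] = (char)(inp[i][j] + 1);
}

/* Kernel with a placeholder body: double each element. */
void col_trans_demo(char inp[8][8], char out[8][8]) {
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            out[i][j] = (char)(2 * inp[i][j]);
}

/* Kernel standing in for the reordering step: a transpose. */
void reorder_demo(char inp[8][8], char out[8][8]) {
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            out[j][i] = inp[i][j];
}

/* System: main input/output, local arrays that carry data between
 * kernels, and a plain sequence of kernel calls. */
void dct_demo(char inp[8][8], char out[8][8]) {
    char tmp1[8][8], tmp2[8][8];
    assert(inp != out);            /* kernels communicate through distinct arrays */
    row_trans_demo(inp, tmp1);
    col_trans_demo(tmp1, tmp2);
    reorder_demo(tmp2, out);
}
```

Calling `dct_demo(in, out)` on two distinct 8x8 arrays runs the three kernels in order; the local arrays `tmp1` and `tmp2` are what would become intermediate buffers between accelerators.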


System Level Decisions

• Throughput of each LA
  – Initiation Interval

• Grouping of loops into a multifunction LA
  – More loops in a single LA → LA occupied for longer time in current task

[Figure: loop graph of kernels K1-K4 (TC=100 each) mapped onto LA 1, LA 2, and LA 3. The schedule timeline has ticks at 100, 200, 300, and 400 cycles; LA 1 is occupied for 200 cycles.]

Throughput = 1 task / 200 cycles
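The arithmetic behind the throughput figure can be sketched as follows. The function name, the TC x II approximation of a kernel's run time (pipeline fill and drain are ignored), and the mapping used in the usage note are assumptions for illustration, not details confirmed by the slides.

```c
#include <assert.h>

/* Cycles between successive task starts when kernel k (trip count tc[k],
 * initiation interval ii[k]) runs on accelerator la[k]. The busiest
 * accelerator sets the rate. */
long task_interval(const long *tc, const long *ii, const int *la,
                   int n_kernels, int n_las)
{
    long worst = 0;
    assert(n_kernels > 0 && n_las > 0);
    for (int a = 0; a < n_las; a++) {
        long busy = 0;
        for (int k = 0; k < n_kernels; k++)
            if (la[k] == a)
                busy += tc[k] * ii[k];   /* approximate kernel time as TC x II */
        if (busy > worst)
            worst = busy;
    }
    return worst;
}
```

With four kernels of TC=100 and II=1, placing two of them on the same LA (for example `la = {0, 1, 2, 0}`) gives an interval of 200 cycles, in line with the slide's 1 task / 200 cycles; giving each kernel its own LA would give 100.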


System Decisions (Contd.)

• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance

[Figure: K1, K2, and K3 (II=1, TC=100 each) run on LA 1, LA 2, and LA 3 and communicate through buffers tmp1 and tmp2. Two timelines with ticks at 100, 200, and 300 cycles. In the first, the tmp1 buffer is in use by LA 2, which delays the next task. In the second, adjacent tasks use different buffers.]
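The effect of buffer count on task overlap can be illustrated with a small simulation. This is a sketch under a deliberately coarse model, all of it assumed: each kernel runs on its own LA, a producer holds its output buffer for its whole run, the consumer holds it for its whole run, and each intermediate array has `n_buffers` copies used round-robin. The function name and the `MAX_K`/`MAX_T` limits are hypothetical.

```c
#include <assert.h>

#define MAX_K 16   /* maximum kernels in the pipeline */
#define MAX_T 64   /* tasks simulated to reach steady state */

/* Steady-state cycles between task completions for a linear pipeline
 * where kernel k takes cycles[k] and every intermediate array has
 * n_buffers copies. */
long task_period(const long *cycles, int n_kernels, int n_buffers)
{
    long end[MAX_T][MAX_K];
    assert(n_kernels > 0 && n_kernels <= MAX_K);
    assert(n_buffers > 0 && n_buffers < MAX_T - 1);
    for (int t = 0; t < MAX_T; t++) {
        for (int k = 0; k < n_kernels; k++) {
            long start = 0;
            /* input produced by the previous kernel of the same task */
            if (k > 0 && end[t][k - 1] > start)
                start = end[t][k - 1];
            /* this LA is still busy with the previous task */
            if (t > 0 && end[t - 1][k] > start)
                start = end[t - 1][k];
            /* the output buffer copy is reused from task t - n_buffers and
             * must first be released by that task's consumer */
            if (k + 1 < n_kernels && t >= n_buffers &&
                end[t - n_buffers][k + 1] > start)
                start = end[t - n_buffers][k + 1];
            end[t][k] = start + cycles[k];
        }
    }
    return end[MAX_T - 1][n_kernels - 1] - end[MAX_T - 2][n_kernels - 1];
}
```

For three 100-cycle kernels, this model yields a period of 200 cycles with one copy of each buffer and 100 cycles with two copies, which is the "more buffers → more task overlap" trade-off in miniature.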


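Case Study: “Simple” benchmark

[Figure: loop graph of eight loops, TC=256 each, shown for three design points. Four LAs (LA 1-LA 4) with all IIs equal to 1: 512 cycles. Two LAs (LA 1, LA 2) with IIs 1, 1, 2, 1, 1, 1, 3, 3: 1792 cycles and 1536 cycles. One LA (LA 1) with all IIs equal to 1: 2048 cycles.]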
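The two-LA numbers in the case study are consistent with a balanced split of the loops: the IIs sum to 13, and 7 x 256 = 1792 while 6 x 256 = 1536. The sketch below finds such a split by exhaustive search. It is an illustration only, not the Streamroller formulation: it takes the IIs as given, and it ignores loop-graph dependences and hardware cost. The function name is hypothetical.

```c
#include <assert.h>

/* Tries every assignment of n loops to two LAs and returns the smallest
 * achievable occupancy of the busier LA. The occupancy of the other LA
 * for that assignment is stored in *other when other is non-NULL. */
long best_two_la_split(const long *tc, const long *ii, int n, long *other)
{
    long best = -1, best_other = 0;
    assert(n > 0 && n < 20);
    for (unsigned mask = 0; mask < (1u << n); mask++) {
        long occ[2] = {0, 0};
        for (int k = 0; k < n; k++)
            occ[(mask >> k) & 1u] += tc[k] * ii[k];   /* TC x II per loop */
        long hi = occ[0] > occ[1] ? occ[0] : occ[1];
        long lo = occ[0] > occ[1] ? occ[1] : occ[0];
        if (best < 0 || hi < best) {
            best = hi;
            best_other = lo;
        }
    }
    if (other)
        *other = best_other;
    return best;
}
```

Run on eight loops with TC=256 and the IIs listed in the figure, the search returns 1792 cycles for the busier LA and 1536 for the other.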


Prescribed Throughput Accelerators

• Traditional behavioral synthesis
  – Directly translate C operators into gates

• Our approach: Application-centric Architectures
  – Achieve fixed throughput
  – Maximize hardware sharing

[Figure: Application → Architecture; Operation graph → Datapath]


Loop Accelerator Template

• Parameterized execution resources, storage, connectivity

• Hardware realization of modulo scheduled loop


Loop Accelerator Design Flow

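[Figure: loop accelerator design flow. Inputs: C code (.c) and a performance (throughput) target. Stages shown: FU Alloc, Abstract Arch, Modulo Schedule (scheduled ops laid out over FUs and time), Build Datapath (RF, FUs), Concrete Arch, Instantiate Arch, Synthesize. Outputs: Verilog (.v) and control signals for the loop accelerator.]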
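One way to see how a throughput target drives FU allocation: at initiation interval II, a fully pipelined FU can start at most II operations per loop iteration, so a class with N operations needs at least ceil(N / II) units. This is the usual resource bound in modulo scheduling; whether Streamroller's "FU Alloc" step computes exactly this is an assumption, and the function name is hypothetical.

```c
#include <assert.h>

/* Fewest fully pipelined FUs of one class that can issue n_ops
 * operations within ii cycles (one issue per FU per cycle). */
int min_fus(int n_ops, int ii)
{
    assert(n_ops >= 0 && ii > 0);
    return (n_ops + ii - 1) / ii;   /* ceil(n_ops / ii) */
}
```

For a loop body with 5 additions, `min_fus(5, 2)` gives 3 adders at II=2, while relaxing the target to II=5 would allow a single adder. The bound covers issue slots only; recurrences and register pressure can require a larger II or more hardware.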


Multifunction Accelerator

• Map multiple loops to single accelerator

• Improve hardware efficiency via reuse

• Opportunities for sharing
  – Disjoint stages (loops 2, 3)
  – Pipeline slack (loops 4, 5)

[Figure: application loop graph (Loop 1; a "Frame Type?" branch selecting Loop 2 or Loop 3; Loop 4; Block 5) mapped to an accelerator pipeline in two ways: one loop accelerator per loop (LA1-LA5), or three accelerators (LA1-LA3), of which one is a loop accelerator and two are multifunction loop accelerators.]


[Figure: Union. Loop 1 and Loop 2 each pass through a cost sensitive modulo scheduler; the two resulting FU datapaths are combined by a datapath union.]

• 43% average savings over sum of accelerators
• Smart union within 3% of joint scheduling solution
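Why a union can cost less than the sum of two accelerators can be seen from a deliberately naive sketch: since the loops on a multifunction LA run at different times, the shared datapath needs only the per-class maximum of their FU counts. This is not the "smart union" from the slides; it counts FUs only and ignores registers, multiplexers, and interconnect. The FU classes, costs, and function name are hypothetical.

```c
#include <assert.h>

#define N_CLASSES 3   /* hypothetical FU classes: adder, multiplier, memory port */

/* Builds the per-class maximum of two FU mixes in merged[] and returns
 * the cost of that union datapath. */
long union_cost(const int a[N_CLASSES], const int b[N_CLASSES],
                const long cost[N_CLASSES], int merged[N_CLASSES])
{
    long total = 0;
    for (int c = 0; c < N_CLASSES; c++) {
        assert(a[c] >= 0 && b[c] >= 0);
        merged[c] = a[c] > b[c] ? a[c] : b[c];
        total += merged[c] * cost[c];
    }
    return total;
}
```

With FU mixes {3, 1, 2} and {1, 2, 2} and unit costs {10, 40, 25}, the union needs {3, 2, 2} at a cost of 160, compared with 120 + 140 = 260 for two separate accelerators. These numbers are made up for the example and are unrelated to the 43% figure above.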


Challenges: Throughput Enabling Transformations

• Algorithm-level pipeline retiming
  – Splitting loops based on tiling
  – Co-scheduling adjacent loops

[Figure: a pipeline of Loop 1, Loop 2, Loop 3, Loop 4 with its critical loop marked, next to a retimed pipeline of Loop 1, Loop 2a, Loop 2b, Loop 3,4 with its critical loop marked.]
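Loop splitting by tiling can be sketched at the source level as follows. The kernel body is a placeholder, the names are hypothetical, and the transformation as written is valid only when iterations are independent (no value carried across the tile boundary).

```c
#include <assert.h>

/* One tile [lo, hi) of a kernel's iteration space (placeholder body). */
void scale_tile(const int *in, int *out, int lo, int hi)
{
    assert(0 <= lo && lo <= hi);
    for (int i = lo; i < hi; i++)
        out[i] = 3 * in[i];
}

/* Unsplit kernel: a single loop, hence a single pipeline stage. */
void loop2(const int *in, int *out, int n) { scale_tile(in, out, 0, n); }

/* Split kernels: each half can become its own stage ("Loop 2a", "Loop 2b"). */
void loop2a(const int *in, int *out, int n) { scale_tile(in, out, 0, n / 2); }
void loop2b(const int *in, int *out, int n) { scale_tile(in, out, n / 2, n); }
```

Running `loop2a` followed by `loop2b` produces the same output array as `loop2`, while each half occupies its stage for roughly half as long, which shortens the critical stage when the unsplit loop is the bottleneck.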


Challenges: Programmable Loop Accelerator

• Support bug fixes, evolving standards
• Accelerate loops not known at design time
• Minimize additional control overhead

[Figure: programmable loop accelerator. Labels: Interconnect, FU, MEM, Local Mem, Control, II, Control signals.]


Challenges: Timing Aware Synthesis

• Technology scaling, increasing # FUs → rising interconnect cost, wire capacitance

• Strategies to eliminate long wires
  – Preemptive: predict & prevent long wires
  – Reactive: use feedback from floorplanner

[Figure: FU1, FU2, FU3 with a long path between FUs. Annotations: insert flip-flop on long path; reschedule with added latency.]
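The "insert flip-flop, then reschedule" step can be illustrated with a back-of-the-envelope calculation: if a wire's estimated delay exceeds the clock period, it is cut into segments that each fit in one period, and every inserted flip-flop adds one cycle of latency that the scheduler must then account for. The function name and the picosecond units are assumptions; setup time and FU delay on the path are ignored.

```c
#include <assert.h>

/* Flip-flops to insert on a wire so that every segment fits in one
 * clock period. Each one adds a cycle of latency to the path. */
int extra_latency_cycles(int wire_delay_ps, int clock_period_ps)
{
    assert(wire_delay_ps >= 0 && clock_period_ps > 0);
    if (wire_delay_ps <= clock_period_ps)
        return 0;
    /* ceil(delay / period) segments need one fewer flip-flop */
    return (wire_delay_ps + clock_period_ps - 1) / clock_period_ps - 1;
}
```

A 1500 ps wire at a 1000 ps clock needs one flip-flop, so an operation consuming that value would be rescheduled one cycle later.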


Challenges: Adaptable Voltage/Frequency Levels

• Allow voltage scaling beyond margins

• Using shadow latches in loop accelerator
  – Localized error detection
  – Control is predefined: simple error recovery

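[Figure: a flip-flop (D, CLK, Q) paired with a shadow latch clocked through a delay; a comparison of the two produces the error signal. In the loop accelerator, shadow latches sit between FUs, with extra queue entries.]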
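The shadow-latch idea can be modeled behaviorally: the main flip-flop captures data at the clock edge, the shadow latch captures it again after a delay, and a mismatch flags a timing error. The sketch below assumes the delayed capture is always correct and that recovery simply substitutes the shadow value; the type and function names are hypothetical, and the extra queue entries from the figure are not modeled.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned q;       /* value captured by the main flip-flop at the clock edge */
    unsigned shadow;  /* value captured by the shadow latch after the delay */
    bool error;       /* set when the two captures disagree */
} shadow_reg;

/* d_at_edge: data seen at the clock edge (may be stale at low voltage).
 * d_settled: data seen after the extra delay, assumed correct. */
shadow_reg capture(unsigned d_at_edge, unsigned d_settled)
{
    shadow_reg r = { d_at_edge, d_settled, d_at_edge != d_settled };
    return r;
}

/* Simple recovery: on error, use the shadow value in place of Q. */
unsigned recover(shadow_reg r)
{
    assert(r.error == (r.q != r.shadow));   /* r must come from capture() */
    return r.error ? r.shadow : r.q;
}
```

`capture(4, 5)` reports an error and `recover` returns 5; `capture(5, 5)` reports none. Because the accelerator's control is a predefined schedule, recovery in hardware can stay simple, which is the point made in the bullets above.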


For More Information

• Visit http://cccp.eecs.umich.edu
