Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines
Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science
Automated C to Gates Solution
• SoC design
  – 10-100 Gops, 200 mW power budget
  – Low level tools ineffective
• Automated accelerator synthesis for whole application
  – Correct by construction
  – Increase designer productivity
  – Faster time to market
[Figure: app.c mapped to a set of loop accelerators (LA)]
Streaming Applications
• Data “streaming” through kernels
• Kernels are tight loops
  – FIR, Viterbi, DCT
• Coarse grain dataflow between kernels
  – Sub-blocks of images, network packets
[Figure: H.264 Encoder. Blocks: Motion Estimator, Transform Coder, Quantizer, Inverse Quantizer, Inverse Transform, Motion Predictor. Input: Image. Output: Coded Image]
[Figure: W-CDMA Transmitter. Blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter. Input: Data in. Output: Data out]
System Schema Overview
[Figure: Kernels 1 to 5 mapped onto LA 1, LA 2, and LA 3; timeline of successive tasks (Kernel 1, K2, K3, Kernel 4, Kernel 5) overlapping in time. Axis: time. Annotation: task throughput]
Input Specification
• Sequential C program
• Kernel specification
  – Perfectly nested FOR loop
  – Wrapped inside C function
  – All data access made explicit
    row_trans(char inp[8][8], char out[8][8]) {
        for(i=0; i<8; i++) {
            for(j=0; j<8; j++) {
                . . . = inp[i][j];
                out[i][j] = . . . ;
            }
        }
    }
    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);
• System specification
  – Function with main input/output
  – Local arrays to pass data
  – Sequence of calls to kernels
    dct(char inp[8][8], char out[8][8]) {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }
[Figure: inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]
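A compilable sketch of an input in this style may help make the two levels concrete. The function names and array shapes follow the slide; the loop bodies are placeholder arithmetic chosen only so the program runs (the slide elides them), and the return types, the `N` macro, and the aliasing check are assumptions added here.

```c
#include <assert.h>

#define N 8

/* Kernel specification: one perfectly nested FOR loop per function,
 * with every data access written as an explicit array reference.
 * The arithmetic is a placeholder, not a real DCT row pass. */
void row_trans(char inp[N][N], char out[N][N]) {
    int i, j;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            out[i][j] = (char)(inp[i][j] + 1);   /* placeholder body */
        }
    }
}

void col_trans(char inp[N][N], char out[N][N]) {
    int i, j;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            out[i][j] = (char)(inp[i][j] * 2);   /* placeholder body */
        }
    }
}

void zigzag_trans(char inp[N][N], char out[N][N]) {
    int i, j;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            out[i][j] = inp[j][i];               /* placeholder reordering */
        }
    }
}

/* System specification: main input/output, local arrays that carry
 * data between kernels, and a plain sequence of kernel calls. */
void dct(char inp[N][N], char out[N][N]) {
    char tmp1[N][N], tmp2[N][N];
    assert(inp != out);   /* assumption: input and output are distinct arrays */
    row_trans(inp, tmp1);
    col_trans(tmp1, tmp2);
    zigzag_trans(tmp2, out);
}
```

Because each intermediate array has exactly one writer and one reader, the dataflow between kernels can be read directly off the call sequence in `dct`.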
System Level Decisions
• Throughput of each LA
  – Initiation Interval
• Grouping of loops into a multifunction LA
  – More loops in a single LA → LA occupied for longer time in current task
[Figure: loops K1, K2, K3, K4 (TC=100) mapped onto LA 1, LA 2, and LA 3; timeline marked at 100, 200, 300, and 400 cycles; LA 1 occupied for 200 cycles]
Throughput = 1 task / 200 cycles
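The throughput figure above can be reproduced with a small calculation. This is a sketch under stated assumptions, not the tool's cost model: it takes each loop's time on its LA to be roughly TC × II (ignoring pipeline fill and drain), and it assumes the loops that share LA 1 are K1 and K4, which the slide residue suggests but does not state outright. The struct and function names are hypothetical.

```c
#include <assert.h>

#define MAX_LAS 8

/* One loop (kernel) and the loop accelerator it is mapped to. */
struct loop {
    int trip_count;  /* TC */
    int ii;          /* initiation interval: cycles between iterations */
    int la;          /* index of the LA that runs this loop */
};

/* Cycles per task in steady state: the busiest LA sets the pace.
 * Assumes a loop occupies its LA for about trip_count * ii cycles. */
int cycles_per_task(const struct loop *loops, int n_loops, int n_las) {
    int busy[MAX_LAS] = {0};
    int i, worst = 0;
    assert(n_las > 0 && n_las <= MAX_LAS);
    for (i = 0; i < n_loops; i++) {
        assert(loops[i].la >= 0 && loops[i].la < n_las);
        busy[loops[i].la] += loops[i].trip_count * loops[i].ii;
    }
    for (i = 0; i < n_las; i++) {
        if (busy[i] > worst) {
            worst = busy[i];
        }
    }
    return worst;
}
```

With four loops of TC=100 and II=1, and two of them grouped on one LA, the busiest LA is occupied for 200 cycles, which matches the 1 task / 200 cycles on the slide.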
System Decisions (Contd.)
• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance
[Figure: K1, K2, K3 (TC=100, II=1) on LA 1, LA 2, and LA 3, connected through buffers tmp1 and tmp2. Two timelines marked at 100, 200, and 300 cycles: one annotated "tmp1 buffer in use by LA2", the other "Adjacent tasks use different buffers"]
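The buffer trade-off can be illustrated with a simplified model. The assumptions are loud ones: a consumer starts reading an intermediate array only after the producer has finished writing it, a single buffer cannot be overwritten until the consumer is done, and two or more copies let the stages overlap fully. `NBUF`, `buffer_for_task`, and `task_interval` are hypothetical names, not part of the tool.

```c
#include <assert.h>

#define NBUF 2   /* assumed number of copies of an intermediate array */

/* Copy used by a given task: adjacent tasks alternate between copies. */
int buffer_for_task(int task) {
    assert(task >= 0);
    return task % NBUF;
}

/* Cycles between successive task starts for one producer/consumer pair
 * that communicates through an intermediate array (simplified model). */
int task_interval(int producer_cycles, int consumer_cycles, int n_buffers) {
    assert(producer_cycles > 0 && consumer_cycles > 0 && n_buffers >= 1);
    if (n_buffers == 1) {
        /* The producer must wait until the consumer releases the buffer. */
        return producer_cycles + consumer_cycles;
    }
    /* With two or more copies the stages overlap; the slower one sets the pace. */
    return producer_cycles > consumer_cycles ? producer_cycles : consumer_cycles;
}
```

For two stages of 100 cycles each, this model gives a new task every 200 cycles with one buffer and every 100 cycles with two, at the cost of the extra SRAM.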
Case Study: “Simple” benchmark
[Figure: loop graph, TC=256, shown for three mappings. Four LAs (LA 1 to LA 4) with all loops labeled 1: 512 cycles. Two LAs (LA 1, LA 2) with loops labeled 1, 1, 2, 1, 1, 1, 3, 3: 1792 cycles and 1536 cycles. One LA (LA 1) with all loops labeled 1: 2048 cycles]
Prescribed Throughput Accelerators
• Traditional behavioral synthesis
  – Directly translate C operators into gates
• Our approach: Application-centric Architectures
  – Achieve fixed throughput
  – Maximize hardware sharing
[Figure: Application → Architecture; Operation graph → Datapath]
Loop Accelerator Template
• Parameterized execution resources, storage, connectivity
• Hardware realization of modulo scheduled loop
Loop Accelerator Design Flow
[Figure: design flow. Stages and artifacts shown: C code (.c) and performance (throughput); FU Alloc; Abstract Arch; Modulo Schedule; Scheduled Ops (Op1, Op2, Op3, … placed on FUs over time); Build Datapath; Concrete Arch (RF, FUs); Instantiate Arch; Verilog (.v) and control signals; Synthesize; Loop Accelerator]
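Since the flow takes a throughput target as input and starts with FU allocation, a short sketch of the standard resource bound from modulo scheduling may be useful. It assumes fully pipelined FUs that each accept one operation per cycle, and a loop time of roughly TC × II; the tool's actual allocator may weigh cost differently. Both function names are hypothetical.

```c
#include <assert.h>

/* Largest II that still finishes trip_count iterations within the
 * cycle budget, assuming loop time is about trip_count * ii. */
int max_ii(int trip_count, int budget_cycles) {
    assert(trip_count > 0 && budget_cycles >= trip_count);
    return budget_cycles / trip_count;
}

/* Fewest FUs of one type that can issue `ops` operations per iteration
 * when a new iteration starts every `ii` cycles: ceil(ops / ii). */
int min_fus(int ops, int ii) {
    assert(ops >= 0 && ii > 0);
    return (ops + ii - 1) / ii;
}
```

For example, a loop with TC=100 and a 200-cycle budget can run at II=2, so seven additions per iteration need at least four adders, while II=1 would need seven.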
Multifunction Accelerator
• Map multiple loops to single accelerator
• Improve hardware efficiency via reuse
• Opportunities for sharing
  – Disjoint stages (loops 2, 3)
  – Pipeline slack (loops 4, 5)
[Figure: application with Loop 1, a "Frame Type?" decision between Loop 2 and Loop 3, Loop 4, and Block 5; an accelerator pipeline of loop accelerators LA1 to LA5; and an accelerator pipeline LA1 to LA3 containing one loop accelerator and two multifunction loop accelerators]
[Figure: Datapath Union. Loop 1 and Loop 2 each pass through a Cost Sensitive Modulo Scheduler; the resulting FU datapaths are combined by union]
• 43% average savings over sum of accelerators
• Smart union within 3% of joint scheduling solution
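To show why a union can cost less than the sum of separate accelerators, here is a deliberately naive sketch that counts FUs only. It assumes the two loops never run at the same time, so each FU type needs only the larger of the two counts. It ignores registers, interconnect, and bit widths, and it is not the "smart union" the slide refers to; the FU classes and names are hypothetical.

```c
#include <assert.h>

#define N_FU_TYPES 3   /* hypothetical FU classes, e.g. add, multiply, memory */

/* FU counts per type for one accelerator datapath. */
struct datapath {
    int fus[N_FU_TYPES];
};

/* Naive union: per FU type, keep the larger of the two counts. */
struct datapath union_datapath(struct datapath a, struct datapath b) {
    struct datapath u;
    int t;
    for (t = 0; t < N_FU_TYPES; t++) {
        u.fus[t] = a.fus[t] > b.fus[t] ? a.fus[t] : b.fus[t];
    }
    return u;
}

/* Total FU count, used here as a crude stand-in for hardware cost. */
int total_fus(struct datapath d) {
    int t, total = 0;
    for (t = 0; t < N_FU_TYPES; t++) {
        assert(d.fus[t] >= 0);
        total += d.fus[t];
    }
    return total;
}
```

With one datapath holding 2, 1, and 1 FUs and another holding 1, 3, and 1, the union needs 6 FUs where two separate accelerators would need 9.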
Challenges: Throughput Enabling Transformations
• Algorithm-level pipeline retiming
  – Splitting loops based on tiling
  – Co-scheduling adjacent loops
[Figure: pipeline of Loop 1, Loop 2, Loop 3, Loop 4 transformed into Loop 1, Loop 2a, Loop 2b, Loop 3,4; the critical loop is marked in each version]
Challenges: Programmable Loop Accelerator
• Support bug fixes, evolving standards
• Accelerate loops not known at design time
• Minimize additional control overhead
[Figure: FUs and MEM units connected through an interconnect, with local memory and a control block (II, control signals)]
Challenges: Timing Aware Synthesis
• Technology scaling, increasing # FUs → rising interconnect cost, wire capacitance
• Strategies to eliminate long wires
  – Preemptive: predict & prevent long wires
  – Reactive: use feedback from floorplanner
[Figure: FU1, FU2, FU3. Annotations: insert flip flop on long path; reschedule with added latency]
Challenges: Adaptable Voltage/Frequency Levels
• Allow voltage scaling beyond margins
• Using shadow latches in loop accelerator
  – Localized error detection
  – Control is predefined: simple error recovery
[Figure: flip-flop (D, CLK, Q) paired with a shadow latch on a delayed clock, producing an error signal; FUs with shadow latches and extra queue entries]
For More Information
• Visit http://cccp.eecs.umich.edu