1 ©Target Compiler Technologies – Slide Israel, May 4, 2010
Design of Programmable Accelerators for SoCs
Gert Goossens, CEO
Target Compiler Technologies
SoC Design
• What do you do when the performance of your main processor is insufficient?
– Go multicore?
• Application mapping difficult, resource utilisation unbalanced
– Add hardwired accelerators?
• Balanced but inflexible SoC
– ASIPs: application-specific processors
• Anything between general-purpose µP and hardwired datapath
• Flexibility through programmability and design-time reconfigurability
• High throughput and low energy, through parallelism and specialisation
• Balanced and flexible SoC
Agenda
• ASIPs as accelerators in SoCs
• How to design ASIPs
• Programmable datapath examples
– WLAN
– FFT
• Conclusions
How to Design ASIPs?
• IP Designer tool-suite
• Benefits
– Speed-up design: few weeks per ASIP
– Design exploration: wide architectural scope, based on processor description language
– Formal approach increases correctness: 40 production chips, 0 bugs
– Automatic generation of RTL: competitive to hand-coded RTL
– Automatic generation of SDK: C compiler, “no-assembly-required”
Tool Comparison
Approach | Example vendors | Architectural style | Programmable | Architectural specialisation | Resource sharing | Business model
Retargetable ASIP design tools | Target (IP Designer), CoWare (Processor Designer) | Flexible, using processor description language | Yes | High | Yes | EDA license
Configurable ASIP templates | Tensilica, ARC, ASIP Solutions, SiliconHive | Configurable ASIP template + extension instructions | Yes | Low (within template boundaries) | Yes | Royalties
High-level synthesis from C | Mentor (CatapultC), Forte, Synfora, Cadence (C2S) | Hardwired datapath, no programmability | No | High | Depends on tool | EDA license
(*) No strong focus for CoWare?
Programmable Datapath Examples
[Figure labels: “Served by IP Designer”; “Examples shown”]
What is a Programmable Datapath?
• Hardwired datapath
– Datapath structure (hardware operators and connectivity) mimics the algorithm’s data flow
• Hardwired datapath with resource sharing
– Superposition of multiple data-flow patterns
– Hardware saving benefit, if permitted by throughput spec
– Requires local modifications to datapath structure and addition of small amounts of control
• Modification of connectivity: multiplexers
• Modification of operator behaviour: programmable instead of fixed operators
• Store intermediate data: local register files instead of registers
– Controlled from FSM
• Programmable datapath
– Datapath with resource sharing, controlled from software
– Microcode in ROM (design-time programmable), or RAM/flash (post-silicon programmable)
[Figure: datapath for the expressions d+=(a+b)*c; g+=(e-f)*f; shown with FSM states s0, s1, s2 and with a software controller (SEQ, PM, DEC)]
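The step from a hardwired datapath to a programmable one can be illustrated with a small behavioural sketch. It assumes a single shared adder/subtractor feeding a single multiply-accumulate, steered by microcode words; the field names (`sub`, `x`, `y`, `z`, `acc`) are invented for this illustration and do not come from the slides.

```python
# Hypothetical model of a resource-shared datapath: one adder/subtractor,
# one multiplier and a small register file, steered by microcode words.
# Field names (sub, x, y, z, acc) are illustrative, not from the slides.

def run(microcode, regs):
    """Execute each microcode word on the single shared ALU + multiplier."""
    for word in microcode:
        x, y, z = regs[word["x"]], regs[word["y"]], regs[word["z"]]
        s = x - y if word["sub"] else x + y   # programmable operator
        regs[word["acc"]] += s * z            # shared multiply-accumulate
    return regs

# Two data-flow patterns superimposed on the same hardware:
#   d += (a + b) * c      g += (e - f) * f
program = [
    {"sub": False, "x": "a", "y": "b", "z": "c", "acc": "d"},
    {"sub": True,  "x": "e", "y": "f", "z": "f", "acc": "g"},
]
regs = run(program, dict(a=1, b=2, c=3, d=10, e=7, f=4, g=0))
```

Storing `program` in ROM would correspond to the design-time programmable case, and storing it in RAM or flash to the post-silicon programmable case.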
Prog. Datapath Example: WLAN
• Algorithm
– Design by Motorola Labs [1]
– 802.11n, equalisation
– Characteristics
• Matrix calculations
• Specialised operators in complex domain: cmpy, conjugate, sqmod
– Equalisation matrix: multiple dataflow patterns depending on MIMO scheme
• SDM
• Symmetric SDM + STBC
• SDM + STBC
[Equations: channel matrices H_1^(k) and H_2^(k) for sub-carrier k, built from the entries H11, H12, H41, H42 (respectively H13, H14, H43, H44) and their complex conjugates, and the equalisation matrix G^(k), built from products of H_1 and H_2 with identity (Id) terms. Annotated operations: matrix inversion, address computations, complex conjugate, square modulus.]
[1] Medea+ project “Uppermost”
Prog. Datapath Example: WLAN
• Programmable datapath design
– Sample expressions: equalisation matrix
$$a = a + (d_0 \cdot \tilde{d}_1)(d_2 \cdot \tilde{d}_3), \qquad a = a + (d_1 \cdot d_2), \qquad a = a + |d_1|^2 \cdot |d_2|^2$$
– Sample expression: matrix inversion
$$a = a + d_1 \cdot \big((d_2 \cdot d_3) - (d_4 \cdot d_5)\big)$$
– 4 identical datapaths in SIMD unit
$$H^{(k)} = \begin{pmatrix} H_{11} & H_{12} & H_{13} & H_{14} \\ H_{21} & H_{22} & H_{23} & H_{24} \\ H_{31} & H_{32} & H_{33} & H_{34} \\ H_{41} & H_{42} & H_{43} & H_{44} \end{pmatrix}^{(k)}, \qquad k = \text{sub-carrier index}$$
[Figure: Channel Estimation ASIP. Common Program Control drives four GMAC units (GMAC 0 to GMAC 3), each with its own dual-port memory and each operating on a channel matrix H^(k)]
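The idea of four identical datapaths under common program control can be sketched as follows. This is a minimal illustration only: the lane memories, the `(i, j)` instruction format and the plain complex multiply-accumulate are assumptions made for the example, not the ASIP's actual instruction set.

```python
# Hypothetical sketch: four identical lanes, each with a private memory,
# all executing the same instruction stream (common program control).

def mac(acc, x, y):
    """Complex multiply-accumulate performed by one lane."""
    return acc + x * y

def run_simd(program, lane_memories):
    accs = [0j] * len(lane_memories)
    for i, j in program:  # one instruction is broadcast to every lane
        accs = [mac(acc, mem[i], mem[j])
                for acc, mem in zip(accs, lane_memories)]
    return accs

# Each lane holds the data of a different sub-carrier (values are made up).
lane_memories = [[complex(k, 1), complex(1, -k), 2 + 0j] for k in range(4)]
program = [(0, 1), (1, 2)]  # operand addresses, shared by all lanes
simd_result = run_simd(program, lane_memories)
```

Because the program is shared, only the data differs between lanes, which is what allows the control logic to be paid for once.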
Prog. Datapath Example: WLAN
• nML code of gmac instruction

Resources:
    reg R[8] <vcmpl> read(tR0, tR1, tR2, tR3, tR4, tR5);
    reg ACC <vcmpl>;
    pipe P0 <vcmpl>;
    pipe P1 <vcmpl>;
    trn tC0 <vcmpl>;
    trn tC1 <vcmpl>;
    trn tM0 <vcmpl>;
    trn tM1 <vcmpl>;

Instruction-set grammar:
    enum gmac_op {mpy_mpy_mac, mac, sq_sq_mac, minv, ...};
    opn gmac(g:gmac_op, r0:c3, r1:c3, r2:c3, r3:c3, r4:c3, r5:c3) {
      action {
        stage E1:
          switch (g) {
            case mpy_mpy_mac:
              tC0 = ccnj(tR2 = R[r2]);
              P0 = cmpy(tR1 = R[r1], tC0);
              tC1 = ccnj(tR3 = R[r3]);
              P1 = cmpy(tR4 = R[r4], tC1);
            case mac:
              P0 = tR0 = R[r0];
              P1 = tR5 = R[r5];
            case sq_sq_mac:
              P0 = cmpy(tR1 = R[r1], tR2 = R[r1]);
              P1 = cmpy(tR4 = R[r4], tR3 = R[r4]);
            case minv:
              P0 = tR0 = R[r0];
              tM0 = cmpy(tR1 = R[r1], tR2 = R[r2]);
              tM1 = cmpy(tR4 = R[r4], tR3 = R[r3]);
              P1 = csub(tM0, tM1);
            case ...
          }
        stage E2:
          tM = cmpy(P0, P1);
          ACC = cadd(tM, ACC);
      }
    }
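As a reading aid, the two-stage behaviour of the gmac instruction can be mirrored in a short behavioural model. This is a sketch that follows the nML text literally for the four listed cases, using Python complex numbers in place of the `vcmpl` vector type; the function signature and the string opcodes are assumptions of this illustration, and the elided `case ...` is not modelled.

```python
# Behavioural sketch of the gmac instruction (scalar, floating-point complex).
# Signature and string opcodes are illustrative, not part of the nML model.

def gmac(op, R, acc, r0=0, r1=0, r2=0, r3=0, r4=0, r5=0):
    # Stage E1: form the two pipeline operands P0 and P1.
    if op == "mpy_mpy_mac":
        p0 = R[r1] * R[r2].conjugate()
        p1 = R[r4] * R[r3].conjugate()
    elif op == "mac":
        p0, p1 = R[r0], R[r5]
    elif op == "sq_sq_mac":
        p0 = R[r1] * R[r1]
        p1 = R[r4] * R[r4]
    elif op == "minv":
        p0 = R[r0]
        p1 = R[r1] * R[r2] - R[r4] * R[r3]
    else:
        raise ValueError(f"unmodelled gmac_op: {op}")
    # Stage E2: multiply the operands and accumulate.
    return acc + p0 * p1

# Eight-entry register file with made-up contents.
R = [1 + 0j, 2j, 1 + 1j, 3 + 0j, 1 - 1j, 2 + 0j, 0j, 0j]
acc_mac = gmac("mac", R, 0j, r0=0, r5=5)
acc_minv = gmac("minv", R, 0j, r0=0, r1=1, r2=2, r3=3, r4=4)
acc_mmm = gmac("mpy_mpy_mac", R, 0j, r1=1, r2=2, r3=3, r4=4)
```

All four cases share stage E2, so only the operand selection in stage E1 changes with the opcode.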
Prog. Datapath Example: WLAN
• C compiler uses advanced graph matching techniques to map dataflow patterns on programmable datapath
[Figure: retargetable compiler flow. Inputs: application (C) and processor model (nML). The C front-end produces a CDFG; the nML front-end produces an ISG. Compilation engine (phase coupling): source-level transformations, code selection, register allocation, scheduling, code emission. Output: machine code (Elf / Dwarf).]
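Code selection by pattern matching can be illustrated with a toy example. The sketch below is a greedy tree-pattern matcher, which is far simpler than the graph matching with phase coupling described on the slide; the fused instruction `add_shl` and the tuple-based expression format are hypothetical and chosen only for this illustration.

```python
# Toy code selector. Expression trees are ("op", left, right) tuples or
# variable names (leaves). Instruction names are hypothetical.

def select(tree, code):
    """Cover the tree bottom-up, preferring the largest matching pattern."""
    if isinstance(tree, str):
        return tree
    op, lhs, rhs = tree
    # Fused pattern: (x + y) << n maps onto a single add_shl instruction.
    if op == "<<" and isinstance(lhs, tuple) and lhs[0] == "+":
        x, y = select(lhs[1], code), select(lhs[2], code)
        n = select(rhs, code)
        dst = f"t{len(code)}"
        code.append(("add_shl", dst, x, y, n))
        return dst
    # Fallback: one instruction per operator.
    a, b = select(lhs, code), select(rhs, code)
    dst = f"t{len(code)}"
    code.append(({"+": "add", "-": "sub", "<<": "shl"}[op], dst, a, b))
    return dst

code = []
result = select(("<<", ("+", "a", ("-", "b", "c")), "n"), code)
```

Here the addition and the shift are covered by one instruction, so three operators compile to two instructions.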
Prog. Datapath Example: FFT
• Algorithm
– Decimation in time
– Radix-2, radix-4, mixed radix
– Coefficients: complex (16,16)
– Data: complex (24,24)
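For reference, a radix-2 decimation-in-time FFT can be written in a few lines. This is a textbook floating-point sketch, not the ASIP's implementation: it ignores the (16,16) and (24,24) fixed-point formats, and it covers neither the radix-4 nor the mixed-radix variant.

```python
import cmath

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit(x[0::2])   # decimation in time: split by sample parity
    odd = fft_dit(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle coefficient
        t = w * odd[k]                          # complex multiply
        out[k] = even[k] + t                    # butterfly: sum
        out[k + n // 2] = even[k] - t           # butterfly: difference
    return out

def dft(x):
    """Direct O(n^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]
```

The inner loop consists of one complex multiply followed by one butterfly, which are the two operations a dedicated FFT datapath would accelerate.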
Prog. Datapath Example: FFT
• Programmable datapath design
[Figure: memories Mdata and Mcoef; register files A[4] and B[4]; CMPY unit (four multipliers, one subtractor, one adder); BFLY unit (two adders, two subtractors); data movement via ld A/B, ld C and st A/B]
– Datapath structure for CMPY and BFLY can be described in nML and exposed to C compiler
– CMPY and BFLY each implement a single, fixed dataflow pattern, which can alternatively be hidden in an intrinsic function
– Intrinsic’s behaviour is modelled in C, automatically converted to RTL
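The fixed dataflow patterns of the two units can be written out explicitly. The sketch below assumes complex values stored as `(re, im)` integer pairs and models a radix-2 butterfly; word lengths, rounding and saturation are not modelled, and the function names simply reuse the unit names from the slide.

```python
# Sketch of the two fixed dataflow patterns on (re, im) integer pairs.

def cmpy(a, b):
    """Complex multiply: four multiplications, one subtraction, one addition."""
    ar, ai = a
    br, bi = b
    return (ar * br - ai * bi, ar * bi + ai * br)

def bfly(a, b):
    """Butterfly: complex sum and difference (two additions, two subtractions)."""
    return (a[0] + b[0], a[1] + b[1]), (a[0] - b[0], a[1] - b[1])

product = cmpy((1, 2), (3, 4))
upper, lower = bfly((1, 2), (3, 4))
```

Each function body is a single expression with no control flow, which is why such a pattern is a natural candidate for an intrinsic.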
Prog. Datapath Example: FFT
• Instruction-level parallelism: ILP=5
– Efficient register allocation, scheduling and SW pipelining needed
– E.g. inner-loop for radix-4 FFT
– Compiled code
• 4 cycles / iteration
• 100% resource utilisation
[Figure: software-pipelined schedule of the radix-4 inner loop: four LDA, LDC, MPY chains feeding four BFLY operations, followed by four STB operations]
Compiled code (issue slots per cycle: LDA | STB | LDC | MPY | BFLY):
    /* 0 */ DO cnt,LE
    /* 1 */ /* delay slot */
    /* 2 */ md=*pa(next_bfly) | *pb(+s)=b1 | mc=*pr(next_bfly_rdx4) | a2=md*mc | b3,b2=bfly(a2,a3)
    /* 3 */ md=*pa(+s) | *pb(+s)=b3 | mc=*pr(+s) | a3=md*mc | b1,a2=bfly(a1,a2)
    /* 4 */ md=*pa(+s) | *pb(+s)=b0 | mc=*pr(+s) | a1=md*mc | b0,a3=bfly(a0,a3)
    /* 5 */ md=*pa(+s) | *pb(next_bfly)=b2 | mc=*pr(+s) | a0=md*mc | b1,b0=bfly(b1,b0)
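The 100% utilisation figure can be checked directly from the loop body. The sketch below copies the four loop-body cycles from the listing, splits each on the `|` separator and counts occupied issue slots; the assumption that the machine has exactly five issue slots is taken from the ILP=5 statement.

```python
# Count occupied issue slots in the four-cycle loop body of the listing.
loop_body = """
md=*pa(next_bfly) | *pb(+s)=b1 | mc=*pr(next_bfly_rdx4) | a2=md*mc | b3,b2=bfly(a2,a3)
md=*pa(+s) | *pb(+s)=b3 | mc=*pr(+s) | a3=md*mc | b1,a2=bfly(a1,a2)
md=*pa(+s) | *pb(+s)=b0 | mc=*pr(+s) | a1=md*mc | b0,a3=bfly(a0,a3)
md=*pa(+s) | *pb(next_bfly)=b2 | mc=*pr(+s) | a0=md*mc | b1,b0=bfly(b1,b0)
"""

ISSUE_SLOTS = 5  # LDA, STB, LDC, MPY, BFLY (from ILP=5)

cycles = [line.split("|") for line in loop_body.strip().splitlines()]
used = sum(1 for cycle in cycles for op in cycle if op.strip())
utilisation = used / (ISSUE_SLOTS * len(cycles))
```

Every one of the 20 slots holds an operation, so no cycle in the steady-state loop issues a no-op.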
Prog. Datapath Example: FFT
• C compiler uses advanced graph search techniques to
– optimise register utilisation
– schedule instructions
on programmable datapath
[Figure: same retargetable compiler flow diagram as shown for the WLAN example]
Prog. Datapath Example: FFT
• Results
– Performance
• Radix-4: 4 cycles/butterfly, radix-2: 2 cycles/butterfly
• 4096-point FFT (radix-4): 24,671 cycles
• 2048-point FFT (2x 1024-pt radix-4 + 1x 2048-pt radix-2): 12,288 cycles
– RTL metrics
• 26K gates, 123 MHz clock, 130 nm, DesignWare Basic
– 600 lines of nML code
• Custom data path, complex butterfly unit
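The reported cycle counts can be related to the per-butterfly figures with a short calculation. The sketch assumes that every butterfly runs at the steady-state rate (4 cycles for radix-4, 2 cycles for radix-2), that an N-point radix-r pass contains N/r butterflies, and that the 2048-point case adds a single radix-2 pass; the helper names are invented for this illustration.

```python
# Back-of-the-envelope cycle estimate from the per-butterfly figures.
CYCLES_PER_BUTTERFLY = {4: 4, 2: 2}  # from the slide

def butterflies_per_pass(n, radix):
    return n // radix

def passes(n, radix):
    """Number of passes (stages) of a pure radix-r FFT of size n."""
    count = 0
    while n > 1:
        n //= radix
        count += 1
    return count

def fft_cycles(n, radix):
    return passes(n, radix) * butterflies_per_pass(n, radix) * CYCLES_PER_BUTTERFLY[radix]

ideal_4096 = fft_cycles(4096, 4)  # 6 passes x 1024 butterflies x 4 cycles
ideal_2048 = 2 * fft_cycles(1024, 4) + butterflies_per_pass(2048, 2) * CYCLES_PER_BUTTERFLY[2]
overhead_4096 = 24671 - ideal_4096
```

Under these assumptions the 4096-point estimate is 24,576 cycles, 95 fewer than the reported 24,671, while the 2048-point estimate equals the reported 12,288 cycles. The slide does not say what accounts for the 95-cycle difference.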
Conclusion
• ASIPs make it possible to build programmable accelerators in SoCs
• With the IP Designer tool-suite, ASIPs can be designed quickly and programmed efficiently
• “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators
– IP Designer as an alternative to high-level synthesis
• With ASIPs, multicore SoC architectures become even more prolific