target updated track f

© Target Compiler Technologies, Israel, May 4, 2010

Design of Programmable Accelerators for SoCs

Gert Goossens, CEO

Target Compiler Technologies


SoC Design

• What do you do when the performance of your main processor is insufficient?
  – Go multicore?
    • Application mapping difficult, resource utilisation unbalanced
  – Add hardwired accelerators?
    • Balanced but inflexible SoC


SoC Design

• What do you do when the performance of your main processor is insufficient?
  – ASIPs: application-specific processors
    • Anything between general-purpose µP and hardwired datapath
    • Flexibility through programmability and design-time reconfigurability
    • High throughput and low energy, through parallelism and specialisation
    • Balanced and flexible SoC


Agenda

• ASIPs as accelerators in SoCs
• How to design ASIPs
• Programmable datapath examples
  – WLAN
  – FFT
• Conclusions


How to Design ASIPs?

• IP Designer tool-suite


How to Design ASIPs?

• Benefits
  – Speed-up design: few weeks per ASIP
  – Design exploration: wide architectural scope, based on processor description language
  – Formal approach increases correctness: 40 production chips, 0 bugs
  – Automatic generation of RTL: competitive to hand-coded RTL
  – Automatic generation of SDK: C compiler, “no-assembly-required”


Tool Comparison

• Retargetable ASIP design tools
  – Example vendors: Target (IP Designer), CoWare (Processor Designer)
  – Architectural style: flexible, using processor description language
  – Programmable: yes
  – Architectural specialisation: high
  – Resource sharing: yes
  – Business model: EDA license

• Configurable ASIP templates
  – Example vendors: Tensilica, ARC, ASIP Solutions, SiliconHive
  – Architectural style: configurable ASIP template + extension instructions
  – Programmable: yes
  – Architectural specialisation: low (within template boundaries)
  – Resource sharing: yes
  – Business model: royalties

• High-level synthesis from C
  – Example vendors: Mentor (CatapultC), Forte, Synfora, Cadence (C2S)
  – Architectural style: hardwired datapath, no programmability
  – Programmable: no
  – Architectural specialisation: high
  – Resource sharing: depends on tool
  – Business model: EDA license

— (*)

(*) No strong focus for CoWare?




Programmable Datapath Examples

[Figure: examples shown, served by IP Designer]


What is a Programmable Datapath?

• Hardwired datapath
  – Datapath structure (hardware operators and connectivity) mimics the algorithm’s data flow

• Hardwired datapath with resource sharing
  – Superposition of multiple data-flow patterns
  – Hardware saving benefit, if permitted by throughput spec
  – Requires local modifications to datapath structure and addition of small amounts of control
    • Modification of connectivity: multiplexers
    • Modification of operator behaviour: programmable instead of fixed operators
    • Store intermediate data: local register files instead of registers
  – Controlled from FSM

• Programmable datapath
  – Datapath with resource sharing, controlled from software
  – Microcode in ROM (design-time programmable), or RAM/flash (post-silicon programmable)

[Figure: datapath controlled by SEQ, PM and DEC blocks, with states s0, s1, s2; example data flows: d+=(a+b)*c; g+=(e-f)*f;]
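A minimal sketch of the resource-sharing idea, not taken from the slides: the two data flows from the figure, d+=(a+b)*c and g+=(e-f)*f, run on one shared add/subtract unit, one multiplier and one accumulate path, with a small microcode table selecting operands and operator behaviour. The microcode word format, the register names and the `run` function are assumptions made for illustration only.

```python
# Hypothetical model of a programmable datapath: one add/sub unit, one
# multiplier and one accumulate path, shared by several data-flow patterns.
# Each microcode word selects the operands (multiplexer settings) and the
# behaviour of the first operator (add or subtract).

ADD, SUB = "add", "sub"

# Microcode word: (operator, source 0, source 1, multiplier operand, accumulator)
MICROCODE = [
    (ADD, "a", "b", "c", "d"),  # d += (a + b) * c
    (SUB, "e", "f", "f", "g"),  # g += (e - f) * f
]

def run(microcode, regs):
    """Execute each microcode word on the shared datapath; return new registers."""
    regs = dict(regs)
    for op, src0, src1, mul_src, acc in microcode:
        t = regs[src0] + regs[src1] if op == ADD else regs[src0] - regs[src1]
        regs[acc] += t * regs[mul_src]
    return regs

regs = run(MICROCODE, {"a": 1, "b": 2, "c": 3, "d": 10, "e": 7, "f": 2, "g": 0})
print(regs["d"], regs["g"])
```

Changing the behaviour means changing `MICROCODE`, not the datapath, which is the distinction the slide draws between FSM control and software control.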


Prog. Datapath Example: WLAN

• Algorithm
  – Design by Motorola Labs [1]
  – 802.11n, equalisation
  – Characteristics
    • Matrix calculations
    • Specialised operators in complex domain: cmpy, conjugate, sqmod
  – Equalisation matrix: multiple dataflow patterns depending on MIMO scheme
    • SDM
    • Symmetric SDM + STBC
    • SDM + STBC

[Equations: per-sub-carrier matrices $H_1^{(k)}$ (built from $H_{11}$, $H_{12}$, $H_{41}$, $H_{42}$ and their complex conjugates) and $H_2^{(k)}$ (built from $H_{13}$, $H_{14}$, $H_{43}$, $H_{44}$ and their complex conjugates), and the equalisation matrix $G^{(k)}$, expressed in terms of $H_1$, $H_2$ and the identity matrix Id]

Operations annotated on the equations:
  – Matrix inversion
  – Matrix inversion + address computations
  – Address computations
  – Complex conjugate
  – Square modulus

[1] Medea+ project “Uppermost”


Prog. Datapath Example: WLAN

• Programmable datapath design
  – Sample expressions: equalisation matrix
    $a = a + (d_0 \tilde{d}_1)(d_2 \tilde{d}_3)$
    $a = a + (d_1 d_2)$
    $a = a + d_1^2 d_2^2$
  – Sample expression: matrix inversion
    $a = a + d_1 ((d_2 d_3) - (d_4 d_5))$
  – 4 identical datapaths in SIMD unit

[Figure: channel estimation ASIP with common program control, four dual-port memories and four GMAC units (GMAC 0 to GMAC 3). For each sub-carrier index $k$, the channel matrix is
$H^{(k)} = \begin{pmatrix} H_{11} & H_{12} & H_{13} & H_{14} \\ H_{21} & H_{22} & H_{23} & H_{24} \\ H_{31} & H_{32} & H_{33} & H_{34} \\ H_{41} & H_{42} & H_{43} & H_{44} \end{pmatrix}$]


Prog. Datapath Example: WLAN

• nML code of gmac instruction

Resources:

    reg R[8] <vcmpl> read(tR0, tR1, tR2, tR3, tR4, tR5);
    reg ACC <vcmpl>;

    pipe P0 <vcmpl>;
    pipe P1 <vcmpl>;

    trn tC0 <vcmpl>;
    trn tC1 <vcmpl>;
    trn tM0 <vcmpl>;
    trn tM1 <vcmpl>;

Instruction-set grammar:

    enum gmac_op {mpy_mpy_mac, mac, sq_sq_mac, minv, ...};
    opn gmac(g:gmac_op, r0:c3, r1:c3, r2:c3, r3:c3, r4:c3, r5:c3)
    {
      action {
        stage E1:
          switch (g) {
            case mpy_mpy_mac:
              tC0 = ccnj(tR2 = R[r2]);
              P0 = cmpy(tR1 = R[r1], tC0);
              tC1 = ccnj(tR3 = R[r3]);
              P1 = cmpy(tR4 = R[r4], tC1);
            case mac:
              P0 = tR0 = R[r0];
              P1 = tR5 = R[r5];
            case sq_sq_mac:
              P0 = cmpy(tR1 = R[r1], tR2 = R[r1]);
              P1 = cmpy(tR4 = R[r4], tR3 = R[r4]);
            case minv:
              P0 = tR0 = R[r0];
              tM0 = cmpy(tR1 = R[r1], tR2 = R[r2]);
              tM1 = cmpy(tR4 = R[r4], tR3 = R[r3]);
              P1 = csub(tM0, tM1);
            case ...
          }
        stage E2:
          tM = cmpy(P0, P1);
          ACC = cadd(tM, ACC);
      }
    }
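To make the four gmac modes easier to compare, here is a behavioural sketch in Python for a single complex lane (the nML type `vcmpl` suggests a vector of lanes, which is not modelled). The semantics are read off the switch statement above and are an interpretation, not the authors' reference model; the function name `gmac`, its default register indices and the test register contents are hypothetical.

```python
# Behavioural sketch of the gmac instruction for one complex lane.
# Stage E1 prepares two operands p0 and p1 depending on the mode;
# stage E2 multiplies them and accumulates into the accumulator.

def gmac(op, R, acc, r0=0, r1=1, r2=2, r3=3, r4=4, r5=5):
    """Return the new accumulator value after one gmac operation."""
    if op == "mpy_mpy_mac":
        p0 = R[r1] * R[r2].conjugate()
        p1 = R[r4] * R[r3].conjugate()
    elif op == "mac":
        p0, p1 = R[r0], R[r5]
    elif op == "sq_sq_mac":
        p0 = R[r1] * R[r1]
        p1 = R[r4] * R[r4]
    elif op == "minv":
        p0 = R[r0]
        p1 = R[r1] * R[r2] - R[r4] * R[r3]
    else:
        raise ValueError(f"unknown gmac operation: {op}")
    return acc + p0 * p1  # stage E2: multiply and accumulate

# Hypothetical register contents, chosen only to exercise each mode.
R = [complex(1, 1), complex(2, 0), complex(0, 1), complex(1, -1),
     complex(3, 2), complex(0, 2), 0j, 0j]

for mode in ("mpy_mpy_mac", "mac", "sq_sq_mac", "minv"):
    print(mode, gmac(mode, R, 0j))
```

All four modes end in the same multiply-accumulate stage, which is what allows one datapath to serve several expressions.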


Prog. Datapath Example: WLAN

• C compiler uses advanced graph matching techniques to map dataflow patterns on programmable datapath

[Figure: retargetable compiler flow. Inputs: application (C) and processor model (nML). The C front-end produces a CDFG and the nML front-end produces an ISG. The compilation engine (phase coupling) runs source-level transformations, code selection, register allocation, scheduling and code emission. Output: machine code (ELF/DWARF).]
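The code-selection step can be illustrated with a toy pattern matcher. This is plain tree pattern matching over nested tuples, a deliberate simplification and not the graph matching used by the actual compiler; the expression encoding, the `PATTERNS` table and the functions `match` and `select` are all assumptions made for this sketch.

```python
# Toy code selection: map an expression tree onto a datapath operation.
# Expressions are nested tuples such as ("add", x, y), ("mul", x, y) and
# ("conj", x), or a variable name. Pattern leaves that start with "?"
# bind to any subtree.

PATTERNS = {
    # acc + (x * conj(y)) * (z * conj(w))
    "mpy_mpy_mac": ("add", "?acc",
                    ("mul", ("mul", "?x", ("conj", "?y")),
                            ("mul", "?z", ("conj", "?w")))),
    # acc + x * y
    "mac": ("add", "?acc", ("mul", "?x", "?y")),
}

def match(pattern, expr, binding):
    """Return True if expr matches pattern, recording bindings on the way."""
    if isinstance(pattern, str):
        if pattern.startswith("?"):
            binding[pattern[1:]] = expr
            return True
        return pattern == expr
    return (isinstance(expr, tuple) and len(expr) == len(pattern)
            and expr[0] == pattern[0]
            and all(match(p, e, binding) for p, e in zip(pattern[1:], expr[1:])))

def select(expr):
    """Return (operation name, bindings) for the first matching pattern, else None."""
    for name, pattern in PATTERNS.items():
        binding = {}
        if match(pattern, expr, binding):
            return name, binding
    return None

expr = ("add", "a", ("mul", ("mul", "d0", ("conj", "d1")),
                            ("mul", "d2", ("conj", "d3"))))
print(select(expr))
```

The patterns are tried from most to least specific, because the `mac` pattern would also match the larger expression with a whole product bound to `x`.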


Prog. Datapath Example: FFT

• Algorithm
  – Decimation in time
  – Radix-2, radix-4, mixed radix
  – Coefficients: complex (16,16)
  – Data: complex (24,24)
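For readers who want the algorithm in front of them, a minimal radix-2 decimation-in-time FFT is sketched below. It uses Python floating-point complex numbers, so the (16,16) coefficient and (24,24) data formats listed above are not modelled, and radix-4 and mixed radix are left out. The function names `fft_dit_radix2` and `dft` are illustrative and do not come from the slides.

```python
import cmath

def fft_dit_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit_radix2(x[0::2])  # transform of even-indexed samples
    odd = fft_dit_radix2(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle coefficient
        t = w * odd[k]                         # complex multiply
        out[k] = even[k] + t                   # butterfly: sum
        out[k + n // 2] = even[k] - t          # butterfly: difference
    return out

def dft(x):
    """Direct O(n^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

samples = [1, 2, 3, 4, 0, -1, -2, -3]
print(fft_dit_radix2(samples))
```

The inner loop consists of one complex multiply followed by one butterfly per output pair.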


Prog. Datapath Example: FFT

• Programmable datapath design

[Figure: datapath with memories Mdata and Mcoef, register files A[4] and B[4], and functional units CMPY and BFLY; operations ld A/B, ld C, st A/B; operator rows "* * * * - +" and "+ + - -"]

  – Datapath structure for CMPY and BFLY can be described in nML and exposed to C compiler
  – CMPY and BFLY each implement a single, fixed dataflow pattern, which can alternatively be hidden in an intrinsic function
  – Intrinsic’s behaviour is modelled in C, automatically converted to RTL
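As a rough illustration of what such fixed dataflow patterns compute, the sketch below models a complex multiply built from four real multiplies, one subtract and one add, and a butterfly built from two adds and two subtracts, on integer (re, im) pairs. Mapping the figure's operator rows to these two units is an assumption, as are the function names; word widths, rounding and saturation are not modelled, and the slides' intrinsics are written in C rather than Python.

```python
def cmpy(d, c):
    """Complex multiply: four real multiplies, one subtract, one add."""
    dr, di = d
    cr, ci = c
    return (dr * cr - di * ci, dr * ci + di * cr)

def bfly(a, b):
    """Butterfly on two complex values: returns (a + b, a - b)."""
    ar, ai = a
    br, bi = b
    return (ar + br, ai + bi), (ar - br, ai - bi)

# One butterfly step: scale a data value by a coefficient, then combine.
product = cmpy((1, 2), (3, 4))
upper, lower = bfly((5, 5), product)
print(product, upper, lower)
```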


Prog. Datapath Example: FFT

• Instruction-level parallelism: ILP=5
  – Efficient register allocation, scheduling and SW pipelining needed
  – E.g. inner loop for radix-4 FFT
  – Compiled code
    • 4 cycles / iteration
    • 100% resource utilisation

[Figure: software-pipelined schedule of the radix-4 inner loop, overlapping four LDA/LDC/MPY groups, four BFLY operations and four STB operations across the five issue slots LDA, STB, LDC, MPY, BFLY]

    /* 0 */ DO cnt,LE
    /* 1 */ /* delay slot */
    /* 2 */ md=*pa(next_bfly) | *pb(+s)=b1        | mc=*pr(next_bfly_rdx4) | a2=md*mc | b3,b2=bfly(a2,a3)
    /* 3 */ md=*pa(+s)        | *pb(+s)=b3        | mc=*pr(+s)             | a3=md*mc | b1,a2=bfly(a1,a2)
    /* 4 */ md=*pa(+s)        | *pb(+s)=b0        | mc=*pr(+s)             | a1=md*mc | b0,a3=bfly(a0,a3)
    /* 5 */ md=*pa(+s)        | *pb(next_bfly)=b2 | mc=*pr(+s)             | a0=md*mc | b1,b0=bfly(b1,b0)


Prog. Datapath Example: FFT

• C compiler uses advanced graph search techniques to
  – optimise register utilisation
  – schedule instructions
  on programmable datapath

[Figure: retargetable compiler flow. Inputs: application (C) and processor model (nML). The C front-end produces a CDFG and the nML front-end produces an ISG. The compilation engine (phase coupling) runs source-level transformations, code selection, register allocation, scheduling and code emission. Output: machine code (ELF/DWARF).]


Prog. Datapath Example: FFT

• Results
  – Performance
    • Radix-4: 4 cycles/butterfly, radix-2: 2 cycles/butterfly
    • 4096-point FFT (radix-4): 24,671 cycles
    • 2048-point FFT (2x 1024-pt radix-4 + 1x 2048-pt radix-2): 12,288 cycles
  – RTL metrics
    • 26K gates, 123 MHz clock, 130 nm, DesignWare Basic
  – 600 lines of nML code
    • Custom data path, complex butterfly unit
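The reported cycle counts can be checked against the per-butterfly figures with a little arithmetic. The sketch below counts butterflies only and ignores any loop set-up or other overhead, which the slides do not break down; the helper names are hypothetical.

```python
def radix4_cycles(n, cycles_per_bfly=4):
    """Ideal cycle count of an n-point radix-4 FFT (n must be a power of 4)."""
    stages, m = 0, n
    while m > 1:
        m //= 4
        stages += 1
    # Each stage processes n/4 radix-4 butterflies.
    return stages * (n // 4) * cycles_per_bfly

def radix2_stage_cycles(n, cycles_per_bfly=2):
    """Ideal cycle count of a single radix-2 stage over n points."""
    return (n // 2) * cycles_per_bfly

ideal_4096 = radix4_cycles(4096)  # 6 stages x 1024 butterflies x 4 cycles
mixed_2048 = 2 * radix4_cycles(1024) + radix2_stage_cycles(2048)

print(ideal_4096, 24671 - ideal_4096)
print(mixed_2048)
```

Under these assumptions the ideal 4096-point count is 24,576 cycles, 95 fewer than the reported 24,671, and the mixed-radix 2048-point count comes out at exactly the reported 12,288 cycles.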




Conclusion

• ASIPs make accelerators in SoCs programmable

• With the IP Designer tool-suite, ASIPs can be designed quickly and programmed efficiently

• “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators
  – IP Designer as an alternative to high-level synthesis

• With ASIPs, multicore SoC architectures become even more prolific
