Design of Programmable Accelerators for SoCs
Gert Goossens, CEO, Target Compiler Technologies
© Target Compiler Technologies, Israel, May 4, 2010

SoC Design

• What do you do when the performance of your main processor is insufficient?
  – Go multicore?
    • Application mapping difficult, resource utilisation unbalanced
  – Add hardwired accelerators?
    • Balanced but inflexible SoC

SoC Design

• What do you do when the performance of your main processor is insufficient?
  – ASIPs: application-specific processors
    • Anything between general-purpose microprocessor and hardwired datapath
    • Flexibility through programmability and design-time reconfigurability
    • High throughput and low energy, through parallelism and specialisation
    • Balanced and flexible SoC

Agenda

• ASIPs as accelerators in SoCs
• How to design ASIPs
• Programmable datapath examples
  – WLAN
  – FFT
• Conclusions

How to Design ASIPs?

• IP Designer tool-suite
• Benefits
  – Speed-up design: few weeks per ASIP
  – Design exploration: wide architectural scope, based on processor description language
  – Formal approach increases correctness: 40 production chips, 0 bugs
  – Automatic generation of RTL: competitive to hand-coded RTL
  – Automatic generation of SDK: C compiler, "no-assembly-required"

Tool Comparison

| Approach | Example vendors | Architectural style | Programmable | Architectural specialisation | Resource sharing | Business model |
|---|---|---|---|---|---|---|
| Retargetable ASIP design tools | Target (IP Designer), CoWare (Processor Designer) | Flexible, using processor description language | Yes | High | Yes | EDA license |
| Configurable ASIP templates | Tensilica, ARC, ASIP Solutions, SiliconHive | Configurable ASIP template + extension instructions | Yes | Low (within template boundaries) | Yes | Royalties |
| High-level synthesis from C | Mentor (CatapultC), Forte, Synfora, Cadence (C2S) | Hardwired datapath, no programmability | No | High | Depends on tool | EDA license |

— (*)

(*) No strong focus for CoWare?


Programmable Datapath Examples

[Figure labels: "Examples shown"; "Served by IP Designer"]

What is a Programmable Datapath?

• Hardwired datapath
  – Datapath structure (hardware operators and connectivity) mimics the algorithm's data flow
• Hardwired datapath with resource sharing
  – Superposition of multiple data-flow patterns
  – Hardware saving benefit, if permitted by throughput spec
  – Requires local modifications to datapath structure and addition of small amounts of control
    • Modification of connectivity: multiplexers
    • Modification of operator behaviour: programmable instead of fixed operators
    • Store intermediate data: local register files instead of registers
  – Controlled from FSM
• Programmable datapath
  – Datapath with resource sharing, controlled from software
  – Microcode in ROM (design-time programmable), or RAM/flash (post-silicon programmable)

[Figure labels: SEQ, PM, DEC; s0, s1, s2; example data flows: d+=(a+b)*c; g+=(e-f)*f;]
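The two data flows on this slide, d+=(a+b)*c and g+=(e-f)*f, can share one add/subtract unit, one multiplier and one accumulate path once a small amount of control selects operands and operator behaviour. The following is a minimal C sketch of that idea, not the slide's actual hardware: the `MicroOp` fields reuse the names s0, s1, s2 from the figure, but what each of them selects is an assumption made for illustration, as are the `Regs` type and the `run` function.

```c
#include <stddef.h>

/* Hypothetical micro-instruction word. The roles given to s0, s1 and s2
   are assumptions; the slide only shows the signal names. */
typedef struct {
    int s0; /* operator behaviour: 0 = add, 1 = subtract */
    int s1; /* operand multiplexers: 0 = (a, b, c), 1 = (e, f, f) */
    int s2; /* destination accumulator: 0 = d, 1 = g */
} MicroOp;

typedef struct {
    int a, b, c, e, f; /* inputs */
    int d, g;          /* accumulators */
} Regs;

/* One pass through the single shared add/sub, multiply, accumulate path. */
static void step(Regs *r, MicroOp u)
{
    int x = u.s1 ? r->e : r->a;      /* connectivity multiplexers */
    int y = u.s1 ? r->f : r->b;
    int z = u.s1 ? r->f : r->c;
    int t = u.s0 ? x - y : x + y;    /* programmable operator */
    int p = t * z;                   /* shared multiplier */
    if (u.s2)
        r->g += p;
    else
        r->d += p;
}

/* Execute a microcode program, i.e. the software that controls the datapath. */
void run(Regs *r, const MicroOp *pm, size_t n)
{
    for (size_t i = 0; i < n; i++)
        step(r, pm[i]);
}
```

A two-entry program `{ {0, 0, 0}, {1, 1, 1} }` executes both data flows on the same hardware in two steps, which is the trade of throughput for area that resource sharing makes.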

Prog. Datapath Example: WLAN

• Algorithm
  – Design by Motorola Labs [1]
  – 802.11n, equalisation
  – Characteristics
    • Matrix calculations
    • Specialised operators in complex domain: cmpy, conjugate, sqmod
  – Equalisation matrix: multiple dataflow patterns depending on MIMO scheme
    • SDM
    • Symmetric SDM + STBC
    • SDM + STBC

[Equations on slide, per subcarrier k: matrix H1(k) built from H11, H12, H41, H42 and their complex conjugates; matrix H2(k) built from H13, H14, H43, H44 and their complex conjugates; matrix G(k) expressed through Id and H(k). Annotated operations: matrix inversion, address computations, complex conjugate, square modulus.]

[1] Medea+ project “Uppermost”

Prog. Datapath Example: WLAN

• Programmable datapath design
  – Sample expressions: equalisation matrix
    $a = a + (d_0 \tilde{d}_1)(d_2 \tilde{d}_3)$
    $a = a + (d_1 d_2)$
    $a = a + d_1^2 d_2^2$
  – Sample expression: matrix inversion
    $a = a + d_1 (d_2 d_3 - d_4 d_5)$
  – 4 identical datapaths in SIMD unit

[Figure: channel matrix $H^{(k)} = (H_{ij}^{(k)})$, $i, j = 1, \ldots, 4$, with $k$ the subcarrier index. Block diagram labels: Channel Estimation; ASIP; Common Program Control; GMAC 0 to GMAC 3, each with a Dual Port Memory.]

Prog. Datapath Example: WLAN

• nML code of gmac instruction

Resources:

    reg R[8] <vcmpl> read(tR0, tR1, tR2, tR3, tR4, tR5);
    reg ACC <vcmpl>;

    pipe P0 <vcmpl>;
    pipe P1 <vcmpl>;

    trn tC0 <vcmpl>;
    trn tC1 <vcmpl>;
    trn tM0 <vcmpl>;
    trn tM1 <vcmpl>;

Instruction-set grammar:

    enum gmac_op {mpy_mpy_mac, mac, sq_sq_mac, minv, ...};
    opn gmac(g:gmac_op, r0:c3, r1:c3, r2:c3, r3:c3, r4:c3, r5:c3) {
      action {
        stage E1:
          switch (g) {
            case mpy_mpy_mac:
              tC0 = ccnj(tR2 = R[r2]);
              P0 = cmpy(tR1 = R[r1], tC0);
              tC1 = ccnj(tR3 = R[r3]);
              P1 = cmpy(tR4 = R[r4], tC1);
            case mac:
              P0 = tR0 = R[r0];
              P1 = tR5 = R[r5];
            case sq_sq_mac:
              P0 = cmpy(tR1 = R[r1], tR2 = R[r1]);
              P1 = cmpy(tR4 = R[r4], tR3 = R[r4]);
            case minv:
              P0 = tR0 = R[r0];
              tM0 = cmpy(tR1 = R[r1], tR2 = R[r2]);
              tM1 = cmpy(tR4 = R[r4], tR3 = R[r3]);
              P1 = csub(tM0, tM1);
            case ...
          }
        stage E2:
          tM = cmpy(P0, P1);
          ACC = cadd(tM, ACC);
      }
    }
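To make the four listed gmac variants easier to follow, here is a plain C reference model of what each one accumulates. It is a sketch under assumptions: it models a single scalar lane with integer complex values, whereas the nML resources are of type vcmpl (whose lane count and word lengths are not given here), it ignores pipelining between stages E1 and E2, and the C type and enum names are invented for illustration. The helper names cmpy, ccnj, csub and cadd mirror the primitives used in the nML text.

```c
typedef struct { long re, im; } cplx;

static cplx cmpy(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}
static cplx ccnj(cplx a) { cplx r = { a.re, -a.im }; return r; }
static cplx csub(cplx a, cplx b) { cplx r = { a.re - b.re, a.im - b.im }; return r; }
static cplx cadd(cplx a, cplx b) { cplx r = { a.re + b.re, a.im + b.im }; return r; }

typedef enum { MPY_MPY_MAC, MAC, SQ_SQ_MAC, MINV } GmacOp;

/* Returns the new accumulator value after one gmac instruction.
   R is the 8-entry register file; r0..r5 are register indices (0..7). */
cplx gmac(GmacOp g, const cplx R[8], cplx acc,
          int r0, int r1, int r2, int r3, int r4, int r5)
{
    cplx p0 = { 0, 0 }, p1 = { 0, 0 };

    switch (g) {                                /* stage E1 */
    case MPY_MPY_MAC:
        p0 = cmpy(R[r1], ccnj(R[r2]));
        p1 = cmpy(R[r4], ccnj(R[r3]));
        break;
    case MAC:
        p0 = R[r0];
        p1 = R[r5];
        break;
    case SQ_SQ_MAC:
        p0 = cmpy(R[r1], R[r1]);
        p1 = cmpy(R[r4], R[r4]);
        break;
    case MINV:
        p0 = R[r0];
        p1 = csub(cmpy(R[r1], R[r2]), cmpy(R[r4], R[r3]));
        break;
    }
    return cadd(cmpy(p0, p1), acc);             /* stage E2 */
}
```

All four variants end in the same multiply-accumulate of stage E2; only the first-stage operand preparation differs, which is what lets one datapath serve several dataflow patterns.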

Prog. Datapath Example: WLAN

• C compiler uses advanced graph matching techniques to map dataflow patterns on programmable datapath

[Figure: compiler flow. Inputs: Application (C) through the C front-end to a CDFG; Processor model (nML) through the nML front-end to an ISG. Compilation engine (phase coupling): source-level transformations, code selection, register allocation, scheduling, code emission. Output: machine code (Elf / Dwarf).]

Prog. Datapath Example: FFT

• Algorithm
  – Decimation in time
  – Radix-2, radix-4, mixed radix
  – Coefficients: complex (16,16)
  – Data: complex (24,24)

Prog. Datapath Example: FFT

• Programmable datapath design
  – Datapath structure for CMPY and BFLY can be described in nML and exposed to C compiler
  – CMPY and BFLY each implement a single, fixed dataflow pattern, which can alternatively be hidden in intrinsic function
  – Intrinsic's behaviour is modelled in C, automatically converted to RTL

[Figure labels: memories Mdata and Mcoef; registers A[4] and B[4]; units CMPY (operators * * * * - +) and BFLY (operators + + - -); operations ld A/B, ld C, st A/B]
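As an illustration of what "modelled in C" could look like for the two units, here is a hedged sketch of a complex multiply and a radix-2 butterfly. The operator counts follow the figure (four multipliers plus one subtractor and one adder for CMPY; two adders and two subtractors for BFLY). Everything else is an assumption: the type name, the function signatures, the Q15 scaling of the coefficients, and the reading of "(16,16)" and "(24,24)" as the bit widths of the real and imaginary parts. This is not the tool's actual intrinsic syntax.

```c
#include <stdint.h>

/* Complex value; int32_t is wide enough for 24-bit data and 16-bit coefficients. */
typedef struct { int32_t re, im; } cplx;

/* Assumed coefficient format: Q15 (value = integer / 32768). */
#define COEF_FRAC_BITS 15

/* CMPY: data * coefficient, using 4 multiplies, 1 subtract, 1 add. */
cplx cmpy(cplx d, cplx c)
{
    int64_t re = (int64_t)d.re * c.re - (int64_t)d.im * c.im;
    int64_t im = (int64_t)d.re * c.im + (int64_t)d.im * c.re;
    cplx r = { (int32_t)(re / (1 << COEF_FRAC_BITS)),
               (int32_t)(im / (1 << COEF_FRAC_BITS)) };
    return r;
}

/* BFLY: radix-2 butterfly, using 2 adds and 2 subtracts.
   Overflow handling and rounding are deliberately left out of this sketch. */
void bfly(cplx a0, cplx a1, cplx *sum, cplx *diff)
{
    sum->re  = a0.re + a1.re;
    sum->im  = a0.im + a1.im;
    diff->re = a0.re - a1.re;
    diff->im = a0.im - a1.im;
}
```

For example, multiplying the data value 8 + 4j by the coefficient 0.5j (stored as `{0, 16384}`) yields -2 + 4j.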

Prog. Datapath Example: FFT

• Instruction-level parallelism: ILP=5
  – Efficient register allocation, scheduling and SW pipelining needed
  – E.g. inner-loop for radix-4 FFT
  – Compiled code
    • 4 cycles / iteration
    • 100% resource utilisation

[Figure: software-pipelined schedule of LDA, LDC, MPY, BFLY and STB operations; instruction slots: LDA | STB | LDC | MPY | BFLY]

    /* 0 */ DO cnt,LE
    /* 1 */ /* delay slot */
    /* 2 */ md=*pa(next_bfly) | *pb(+s)=b1        | mc=*pr(next_bfly_rdx4) | a2=md*mc | b3,b2=bfly(a2,a3)
    /* 3 */ md=*pa(+s)        | *pb(+s)=b3        | mc=*pr(+s)             | a3=md*mc | b1,a2=bfly(a1,a2)
    /* 4 */ md=*pa(+s)        | *pb(+s)=b0        | mc=*pr(+s)             | a1=md*mc | b0,a3=bfly(a0,a3)
    /* 5 */ md=*pa(+s)        | *pb(next_bfly)=b2 | mc=*pr(+s)             | a0=md*mc | b1,b0=bfly(b1,b0)
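The "100% resource utilisation" and "ILP=5" figures follow directly from the listing: each of the four loop-body cycles fills all five issue slots (LDA, STB, LDC, MPY, BFLY), giving 20 operations in 4 cycles. A small C sketch of that bookkeeping is shown below; the function names and the flat `busy` table layout are illustrative assumptions, not part of any tool.

```c
#include <stddef.h>

/* Fraction of issue slots that carry an operation.
   busy[c * slots + s] is nonzero when slot s issues an operation in cycle c. */
double utilisation(const int *busy, size_t cycles, size_t slots)
{
    size_t used = 0;
    for (size_t i = 0; i < cycles * slots; i++)
        used += busy[i] ? 1 : 0;
    return (double)used / (double)(cycles * slots);
}

/* Average number of operations issued per cycle. */
double ops_per_cycle(const int *busy, size_t cycles, size_t slots)
{
    return utilisation(busy, cycles, slots) * (double)slots;
}
```

With a 4 x 5 table in which every entry is 1, as in the listing, `utilisation` returns 1.0 and `ops_per_cycle` returns 5.0.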

Prog. Datapath Example: FFT

• C compiler uses advanced graph search techniques to
  – optimise register utilisation
  – schedule instructions
  on programmable datapath

[Figure: compiler flow. Inputs: Application (C) through the C front-end to a CDFG; Processor model (nML) through the nML front-end to an ISG. Compilation engine (phase coupling): source-level transformations, code selection, register allocation, scheduling, code emission. Output: machine code (Elf / Dwarf).]

Prog. Datapath Example: FFT

• Results
  – Performance
    • Radix-4: 4 cycles/butterfly, radix-2: 2 cycles/butterfly
    • 4096-point FFT (radix-4): 24,671 cycles
    • 2048-point FFT (2x 1024-pt radix-4 + 1x 2048-pt radix-2): 12,288 cycles
  – RTL metrics
    • 26K gates, 123 MHz clock, 130 nm, DesignWare Basic
  – 600 lines of nML code
    • Custom data path, complex butterfly unit
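The reported cycle counts can be cross-checked from the per-butterfly costs. An N-point radix-r FFT has log_r(N) stages of N/r butterflies each, so a pure butterfly count gives 6 x 1024 x 4 = 24,576 cycles for the 4096-point radix-4 case and 2 x (5 x 256 x 4) + 1024 x 2 = 12,288 cycles for the 2048-point mixed-radix case. The sketch below does this arithmetic in C; the function names are hypothetical. The second figure matches the slide exactly, while the first is 95 cycles below the reported 24,671, a gap the slides do not break down (loop set-up overhead is a plausible but unconfirmed explanation).

```c
/* Butterfly cycles for a complete radix-`radix` FFT of n points.
   Assumes n is an exact power of radix. */
long fft_cycles(long n, int radix, int cycles_per_bfly)
{
    long stages = 0;
    for (long m = n; m > 1; m /= radix)
        stages++;
    return stages * (n / radix) * cycles_per_bfly;
}

/* Butterfly cycles for a single radix-2 stage over n points. */
long radix2_stage_cycles(long n, int cycles_per_bfly)
{
    return (n / 2) * cycles_per_bfly;
}
```

Here `fft_cycles(4096, 4, 4)` evaluates to 24576, and `2 * fft_cycles(1024, 4, 4) + radix2_stage_cycles(2048, 2)` evaluates to 12288.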


Conclusion

• ASIPs make it possible to build programmable accelerators in SoCs
• With the IP Designer tool-suite, ASIPs can be designed quickly and programmed efficiently
• "Programmable datapath" ASIPs offer performance, area and power comparable to hardwired accelerators
  – IP Designer as an alternative to high-level synthesis
• With ASIPs, multicore SoC architectures become even more prolific