software parallelisation & platform generation for heterogeneous multicore architectures

24
May 2, 2012 1 Software Parallelisation and Platform Generation for Heterogeneous Multicore Architectures Sven Wuytack, Erik Brockmeyer, Wouter Vermaelen, Gert Goossens Target Compiler Technologies Leuven, Belgium [email protected]

Upload: chiportal

Post on 13-Dec-2014

633 views

Category:

Technology


7 download

DESCRIPTION

GertGoossens, TargetCompiler Technologies

TRANSCRIPT

Page 1: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 1

© 2012 Target Compiler Technologies NV

Software Parallelisation and Platform Generation for Heterogeneous

Multicore Architectures

Sven Wuytack, Erik Brockmeyer, Wouter Vermaelen,Gert Goossens

Target Compiler TechnologiesLeuven, Belgium

[email protected]

Page 2: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 2

© 2012 Target Compiler Technologies NV

ASIP: Application-Specific Processor Anything between general-purpose P and hardwired data-path Flexibility through programmability and design-time reconfigurability High throughput, low energy through parallelism and specialization

ASIP is foundation of heterogeneous multi-core SoC Balanced SoC architecture offers best performance at lowest energy and cost

ARMcore

ARMcore

Multi-Core SoCs for Smart Devices

Page 3: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 3

© 2012 Target Compiler Technologies NV

No MPSoC Design Without Tools

Tools at IP level (ASIP cores) Architectural exploration SDK generation: C compiler, ISS, debugger… RTL generation

→ IP Designer** Commercially deployed since 1999

Page 4: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 4

© 2012 Target Compiler Technologies NV

No MPSoC Design Without Tools

Tools at IP level (ASIP cores) Architectural exploration SDK generation: C compiler, ISS, debugger… RTL generation

→ IP Designer** Commercially deployed since 1999

Tools at IP subsystem level (multicore) Code parallelisation Communication and synchronization Multicore platform generation

→ MP Designer **** Beta now; commercial release later in 2012

Page 5: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 5

© 2012 Target Compiler Technologies NV

MP Designer Tool Suite

Page 6: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 6

© 2012 Target Compiler Technologies NV

Multicore Programming Language?

We prefer sequential C Fits human way of thinking Abundantly used

MP Designer performs C source-to-source transformation Formatting preserved as much as possible

MP Designer guarantees dependencies in parallelised code Validation only required on sequential C code Reduced design complexity and validation effort

Page 7: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 7

© 2012 Target Compiler Technologies NV

State-of-the-ArtArchitecture Parallelisation Dependency

analysisPerformance analysis

MP Designer(Target)

• Heterogeneous (multi-ASIP) and homogeneous

• Point-to-point & globally shared memory

• User-guided• Pragmas

separate from C

• Static global analysis (data & pointers)

• Correctness guaranteed

• Processor loads derived from ISS, shown in task graphs

Page 8: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 8

© 2012 Target Compiler Technologies NV

State-of-the-ArtArchitecture Parallelisation Dependency

analysisPerformance analysis

MP Designer(Target)

• Heterogeneous (multi-ASIP) and homogeneous

• Point-to-point & globally shared memory

• User-guided• Pragmas

separate from C

• Static global analysis (data & pointers)

• Correctness guaranteed

• Processor loads derived from ISS, shown in task graphs

OpenMP • Homogeneous• Globally shared

memory

• User-guided• Pragmas in C

• User’s responsability

Page 9: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 9

© 2012 Target Compiler Technologies NV

State-of-the-ArtArchitecture Parallelisation Dependency

analysisPerformance analysis

MP Designer(Target)

• Heterogeneous (multi-ASIP) and homogeneous

• Point-to-point & globally shared memory

• User-guided• Pragmas

separate from C

• Static global analysis (data & pointers)

• Correctness guaranteed

• Processor loads derived from ISS, shown in task graphs

OpenMP • Homogeneous• Globally shared

memory

• User-guided• Pragmas in C

• User’s responsibility

vfEmbedded(Vector Fabrics)

• Homogeneous (multi-ARM or multi-x86)

• Globally shared memory

• User-guided• Via code GUI

• Mixed static/dynamic analysis

• Correctness depends on data-set used

• Processor loads derived from execution, shown in code GUI

MPA(IMEC)

• Heterogeneous• Point-to-point

• User-guided• Pragmas

separate from C

• Static global analysis for arrays (“Clean-C”)

Page 10: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 10

© 2012 Target Compiler Technologies NV

Platform Model

Preferred Point-to-point links between

communicating processors Source processor writes to

destination processor’s local data-memory, avoiding local copies

Write conflicts resolved through local buffering

Address decoding logic

Additionally supported Globally shared memory In case of caches, coherency is

assumed to be resolved in hardware

Core DM Core DM

CoreDM

Localmemory

ITC

Single Tile

Page 11: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 11

© 2012 Target Compiler Technologies NV

VLCQDCTLD

Example: High-Res JPEG Encoding

RGB2YUVRGB2YUV

Sub-Sample(4:2:0)

Sub-Sample(4:2:0)

ColumnDCT

ColumnDCT

RowDCTRowDCT

Quan-tise

Quan-tise VLCVLC Write

OutputWrite

Output

Last Coef

Last Coef

Zigzag ReorderZigzag

ReorderR G B Ctl

Page 12: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 12

© 2012 Target Compiler Technologies NV

int main(int argc,char *argv[]){ init_all(); parsection: { jpg_open: { jpg_fopen(JPG_filename); writeword(0xFFD8); //SOI write_APP0info(); } main_encoder(&in_img);

jpg_close: { writeword(0xFFD9); //EOI jpg_fclose(); } } free(in_img.RGB_buffer); return 0;}

Label C code blocks for parallelisation

User-Guided Parallelisation

void main_encoder(struct image* img) { vlc_init: { DCY=0;DCCb=0;DCCr=0; } for (ypos=0..height) { for (xpos=0....width) { for (blk=0..5) { SBYTE DU[64]; loading: load_data_unit_from_RGB_buffer(img, xpos, ypos, blk, DU); process_DU(DU,blk); } } } vlc_fini: { // Bit-alignment of EOI marker if (bytepos>=0) { writebits((1<<(bytepos+1))-1, bytepos+1); } }}

void process_DU(SBYTE* CDU, int blk) {

SWORD DU[64]; fdct_main: { int* int_fdtbl = blk_fdtbl[blk]; fdct_and_quantization(CDU, int_fdtbl,DU); }

vlc_main: { while (i<=end0pos) { ... } if (end0pos!=63) writebits(EOB); }}

Page 13: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 13

© 2012 Target Compiler Technologies NV

Parallelisation pragmas

Sample parallelisation on 3-core architecture

User-Guided Parallelisation

processor P0 type dlxprocessor P1 type dlxprocessor P2 type dlx

parallel ParRegion lbl main::parsection

task LOAD target P0 include lbl main_encoder::loading

task DCT target P1 include lbl process_DU::fdct_main

task VLC target P2 include lbl main::jpg_open include lbl main_encoder::vlc_init include lbl process_DU::vlc_main include lbl main_encoder::vlc_fini include lbl main::jpg_close

Page 14: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 14

© 2012 Target Compiler Technologies NV

FIFO Communication Model

int A[10];int* pA = A;int* qA = A;int* ptr;int foo(int i) { return qA[i];}parsection: { producer: { A[i] = ...; ptr = (...) ? &A[x] : &A[y]; ... = pA[j]; } consumer: { ... = A[s]; ... = foo(*ptr); }}

int* A; // transformedint* pA = A; // pA = NULL !int* ptr;DEFINE_SRC_FIFO(FIFO_A, int, 10); // src fifo defDEFINE_SRC_FIFO(FIFO_ptr, int, 1); // src fifo defparsection: { producer: { int offs_ptr; // declaration A = FIFO_A.acq_put(); // acq put pA = A; // ptr copy A[i] = ...; ptr = (...) ? &A[x] : &A[y]; offs_ptr = ptr - A; // offset calc FIFO_ptr.put(offs_ptr); // offset put ... = pA[j]; // uses pA, A FIFO_A.rel_put(); // rel put }}

Acquire/release interface enables use of other processor’s data memory for storage of arrays (avoiding local copies)

FIFO communication implemented in comm. library produced by platform generator Synchronisation implemented by polling on FIFO queue’s status (empty/full)

Page 15: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 15

© 2012 Target Compiler Technologies NV

For each parallelisation, MP Designer shows task graph with estimated processor loads

E.g.: task graph for JPEGencoding on 3-DLX architecture

Exploration

Task 0 "LOAD"Proc 0 "P0" (dlx)

main_encoder::loading:35.8 %

IN:<none>

par_region [b0]

main_encoder() [b17]

for [b18]

for [b19]

for [b20]

OUT:<none>

Task 1 "DCT"Proc 1 "P1" (dlx)

process_DU::fdct_main: 22.2 %process_DU::quant_main: 25.0 %*TOTAL* 47.1 %

IN:in_img.heightin_img.width

par_region [b0]

main_encoder() [b17]

for [b18]

for [b19]

for [b20]

process_DU() [b22]

OUT:<none>

[NC dep T0 -> T1 @ b20]DU (64) [FF]

Task 2 "VLC"Proc 2 "P2" (dlx)

main::jpg_open: 0.0 %main_encoder::vlc_init: 0.0 %process_DU::vlc_main: 16.9 %main_encoder::vlc_fini: 0.0 %main::jpg_close: 0.0 %*TOTAL* 16.9 %

IN:JPG_filenameSOF0info.heightSOF0info.widthin_img.heightin_img.width

par_region [b0]

main_encoder() [b17]

for [b18]

for [b19]

for [b20]

process_DU() [b22]

OUT:<none>

[NC dep T1 -> T2 @ b22]DU_ZZ (128) [FF]

end0pos (4)

Page 16: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 16

© 2012 Target Compiler Technologies NV

E.g.: task graph for JPEG encoding on 5-DLX architecture Global dependency analysis automatically ensures correct

communication & synchronisation Manual design would be error-prone

Exploration

Page 17: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 17

© 2012 Target Compiler Technologies NV

E.g.: task graph for H263 encoding on 8 cores Global dependency analysis

automatically ensures correct communication & synchronisation

Manual design would be error-prone

Exploration

Page 18: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 18

© 2012 Target Compiler Technologies NV

JPEG encoding on multi-DLX architecture

Objective: obtain efficient load balancing Entire exploration in only days of time

Exploration

Algorithm #Cores

Parallelisation Mcycles*seqpar

Speedup

Load (%)P0 P1 P2 P3

P4

Effi-ciency

(%)

Original 1 7.1 - 1 100 100

Original 2 ld+dct+q | vlc 7.1 4.1 1.7 100 76 86

Original 3 ld | dct+q | vlc 7.1 3.4 2.1 64 60 91 69

Original 4 ld | dct | q | vlc 7.1 3.4 2.1 77 40 24 91 52

* Cycles for 256x160-pixel image

Optimised 2 ld+dct | q+vlc 4.1 2.4 1.5 100 72 85

Optimised 3 ld | dct+q | vlc 4.1 2.0 2.0 74 100 40 68

Optimised 3 ld | dct | q+vlc 4.1 1.7 2.4 86 56 100 78

Optimised 4 ld | dct | q | vlc 4.1 1.5 2.8 100 65 75 54 69

Split quant 3 ld | dct+q0 | q1+vlc 4.3 1.6 2.6 92 100 79 87

Dual load 5 ld0 | ld1 | dct | q | vlc 4.1 1.1 3.7 71 61 95 9971

74

Page 19: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 19

© 2012 Target Compiler Technologies NV

Exploration

Algorithm #Cores

Parallelisation Mcycles*seqpar

Speedup

Load (%)P0 P1 P2 P3

P4

Effi-ciency

(%)

Original 1 7.1 - 1 100 100

Original 2 ld+dct+q | vlc 7.1 4.1 1.7 100 76 86

Original 3 ld | dct+q | vlc 7.1 3.4 2.1 64 60 91 69

Original 4 ld | dct | q | vlc 7.1 3.4 2.1 77 40 24 91 52

* Cycles for 256x160-pixel image

Optimised 2 ld+dct | q+vlc 4.1 2.4 1.5 100 72 85

Optimised 3 ld | dct+q | vlc 4.1 2.0 2.0 74 100 40 68

Optimised 3 ld | dct | q+vlc 4.1 1.7 2.4 86 56 100 78

Optimised 4 ld | dct | q | vlc 4.1 1.5 2.8 100 65 75 54 69

Split quant 3 ld | dct+q0 | q1+vlc 4.3 1.6 2.6 92 100 79 87

Dual load 5 ld0 | ld1 | dct | q | vlc 4.1 1.1 3.7 71 61 95 9971

74

JPEG encoding on multi-DLX architecture

3-core analysis suggests splitting of “q”

Page 20: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 20

© 2012 Target Compiler Technologies NV

Exploration

Algorithm #Cores

Parallelisation Mcycles*seqpar

Speedup

Load (%)P0 P1 P2 P3

P4

Effi-ciency

(%)

Original 1 7.1 - 1 100 100

Original 2 ld+dct+q | vlc 7.1 4.1 1.7 100 76 86

Original 3 ld | dct+q | vlc 7.1 3.4 2.1 64 60 91 69

Original 4 ld | dct | q | vlc 7.1 3.4 2.1 77 40 24 91 52

* Cycles for 256x160-pixel image

Optimised 2 ld+dct | q+vlc 4.1 2.4 1.5 100 72 85

Optimised 3 ld | dct+q | vlc 4.1 2.0 2.0 74 100 40 68

Optimised 3 ld | dct | q+vlc 4.1 1.7 2.4 86 56 100 78

Optimised 4 ld | dct | q | vlc 4.1 1.5 2.8 100 65 75 54 69

Split quant 3 ld | dct+q0 | q1+vlc 4.3 1.6 2.6 92 100 79 87

Dual load 5 ld0 | ld1 | dct | q | vlc 4.1 1.1 3.7 71 61 95 9971

74

JPEG encoding on multi-DLX architecture

4-core analysis suggests splitting of “ld”

Page 21: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 21

© 2012 Target Compiler Technologies NV

MP Designer - Highlights

Homogeneous and heterogeneous multicore SoCs User-guided parallelization (pragmas) Global dataflow analysis to check correctness of

chosen parallelization Validation only required on sequential C code Graphical feedback (task graphs) enables

exploration C source-to-source transformation Software code for (FIFO) communication and

synchronization inserted automatically Communication fabric (platform) generated

automatically, if needed

Page 22: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 22

© 2012 Target Compiler Technologies NV

Alternative Heterogeneous Implementation(IP Designer)

JEMBJEMA

RGB2YUVRGB2YUV

Sub-Sample(4:2:0)

Sub-Sample(4:2:0)

ColumnDCT

ColumnDCT

RowDCTRowDCT

Quan-tise

Quan-tise VLCVLC Write

OutputWrite

Output

Last Coef

Last Coef

Zigzag ReorderZigzag

ReorderR G B Ctl

CshCVA6x128-bit

VU1 VU2

VB6x128-bit

T8x8x16-bit

P8x16-bit

R8x16-bit

ALU AGU

DM128-bit

8x16-bit

AGU1 AGU2 ALU

R S4x21-bit

WriteBits

outport

shr32shr16

DM324-bit

(ht)

add

r

DM28-bit

(zz)(izz)

add

r

DM116-bit

(data)(end0)

add

rd

i8x16-bit

JEMA: vector ASIP 2 vector units: vadd, vsub, vscale,

vmul, vaddsub Transposable register file – write

columns, read rows

JEMB: scalar ASIP Multiple data memories, special AGUs ALU: bit concat, Huffman addr, code calc Instruction-level parallelism

Source: G. Goossens et al, Multicore Expo 2009

Page 23: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 23

© 2012 Target Compiler Technologies NV

Performance Heterogeneous dual-ASIP (IP Designer): 1

cycle/pixel Gate count: 76K

Homogeneous 5-core (MP Designer): 9 cycles/pixel

Data splitting (i.e. process loop iterations in parallel) will result in 1 cycle/pixel on ~45 cores

Consistent with literature, e.g. Plurality requires 64 cores* Estimated gate count: 500K

*Source: Plurality, MC Expo 2009

Tradeoff between flexibility and area/power

RGB2YUVRGB2YUV

Sub-Sample(4:2:0)

Sub-Sample(4:2:0)

ColumnDCT

ColumnDCT

RowDCTRowDCT

Quan-tise

Quan-tise VLCVLC Write

OutputWrite

Output

Last Coef

Last Coef

Zigzag ReorderZigzag

ReorderR G B Ctl

Alternative Heterogeneous Implementation(IP Designer)

Page 24: Software Parallelisation & Platform Generation for Heterogeneous Multicore Architectures

May 2, 2012 24

© 2012 Target Compiler Technologies NV

Conclusions Multicore SoCs are here to stay

No (efficient) multicore SoC design without tools

Design and programming of individual ASIP cores Architectural exploration, SDK generation, RTL

generation Multicore parallelisation and platform

generation Parallelisation from sequential C code Static global dependency analysis Exploration for efficient load balancing