software parallelisation & platform generation for heterogeneous multicore architectures

May 2, 2012 1

© 2012 Target Compiler Technologies NV

Software Parallelisation and Platform Generation for Heterogeneous

Multicore Architectures

Sven Wuytack, Erik Brockmeyer, Wouter Vermaelen,Gert Goossens

Target Compiler TechnologiesLeuven, Belgium

[email protected]

May 2, 2012 2


ASIP: Application-Specific Processor Anything between general-purpose P and hardwired data-path Flexibility through programmability and design-time reconfigurability High throughput, low energy through parallelism and specialization

ASIP is foundation of heterogeneous multi-core SoC Balanced SoC architecture offers best performance at lowest energy and cost

ARMcore

ARMcore

Multi-Core SoCs for Smart Devices

May 2, 2012 3


No MPSoC Design Without Tools

Tools at IP level (ASIP cores) Architectural exploration SDK generation: C compiler, ISS, debugger… RTL generation

→ IP Designer** Commercially deployed since 1999

May 2, 2012 4


No MPSoC Design Without Tools

Tools at IP level (ASIP cores) Architectural exploration SDK generation: C compiler, ISS, debugger… RTL generation

→ IP Designer** Commercially deployed since 1999

Tools at IP subsystem level (multicore) Code parallelisation Communication and synchronization Multicore platform generation

→ MP Designer **** Beta now; commercial release later in 2012

May 2, 2012 5


MP Designer Tool Suite

May 2, 2012 6


Multicore Programming Language?

We prefer sequential C Fits human way of thinking Abundantly used

MP Designer performs C source-to-source transformation Formatting preserved as much as possible

MP Designer guarantees dependencies in parallelised code Validation only required on sequential C code Reduced design complexity and validation effort

May 2, 2012 7


State-of-the-ArtArchitecture Parallelisation Dependency

analysisPerformance analysis

MP Designer(Target)

• Heterogeneous (multi-ASIP) and homogeneous

• Point-to-point & globally shared memory

• User-guided• Pragmas

separate from C

• Static global analysis (data & pointers)

• Correctness guaranteed

• Processor loads derived from ISS, shown in task graphs

May 2, 2012 8




MP Designer(Target)




separate from C




OpenMP • Homogeneous• Globally shared

memory

• User-guided• Pragmas in C

• User’s responsability

May 2, 2012 9




MP Designer(Target)




separate from C




OpenMP • Homogeneous• Globally shared

memory

• User-guided• Pragmas in C

• User’s responsibility

vfEmbedded(Vector Fabrics)

• Homogeneous (multi-ARM or multi-x86)

• Globally shared memory

• User-guided• Via code GUI

• Mixed static/dynamic analysis

• Correctness depends on data-set used

• Processor loads derived from execution, shown in code GUI

MPA(IMEC)

• Heterogeneous• Point-to-point


separate from C

• Static global analysis for arrays (“Clean-C”)

May 2, 2012 10


Platform Model

Preferred Point-to-point links between

communicating processors Source processor writes to

destination processor’s local data-memory, avoiding local copies

Write conflicts resolved through local buffering

Address decoding logic

Additionally supported Globally shared memory In case of caches, coherency is

assumed to be resolved in hardware

Core DM Core DM

CoreDM

Localmemory

ITC

Single Tile

May 2, 2012 11


VLCQDCTLD

Example: High-Res JPEG Encoding

RGB2YUVRGB2YUV

Sub-Sample(4:2:0)

Sub-Sample(4:2:0)

ColumnDCT

ColumnDCT

RowDCTRowDCT

Quan-tise

Quan-tise VLCVLC Write

OutputWrite

Output

Last Coef

Last Coef

Zigzag ReorderZigzag

ReorderR G B Ctl

May 2, 2012 12


int main(int argc,char *argv[]){ init_all(); parsection: { jpg_open: { jpg_fopen(JPG_filename); writeword(0xFFD8); //SOI write_APP0info(); } main_encoder(&in_img);

jpg_close: { writeword(0xFFD9); //EOI jpg_fclose(); } } free(in_img.RGB_buffer); return 0;}

Label C code blocks for parallelisation

User-Guided Parallelisation

void main_encoder(struct image* img) { vlc_init: { DCY=0;DCCb=0;DCCr=0; } for (ypos=0..height) { for (xpos=0....width) { for (blk=0..5) { SBYTE DU[64]; loading: load_data_unit_from_RGB_buffer(img, xpos, ypos, blk, DU); process_DU(DU,blk); } } } vlc_fini: { // Bit-alignment of EOI marker if (bytepos>=0) { writebits((1<<(bytepos+1))-1, bytepos+1); } }}

void process_DU(SBYTE* CDU, int blk) {

SWORD DU[64]; fdct_main: { int* int_fdtbl = blk_fdtbl[blk]; fdct_and_quantization(CDU, int_fdtbl,DU); }

vlc_main: { while (i<=end0pos) { ... } if (end0pos!=63) writebits(EOB); }}

May 2, 2012 13


Parallelisation pragmas

Sample parallelisation on 3-core architecture

User-Guided Parallelisation

processor P0 type dlxprocessor P1 type dlxprocessor P2 type dlx

parallel ParRegion lbl main::parsection

task LOAD target P0 include lbl main_encoder::loading

task DCT target P1 include lbl process_DU::fdct_main

task VLC target P2 include lbl main::jpg_open include lbl main_encoder::vlc_init include lbl process_DU::vlc_main include lbl main_encoder::vlc_fini include lbl main::jpg_close

May 2, 2012 14


FIFO Communication Model

int A[10];int* pA = A;int* qA = A;int* ptr;int foo(int i) { return qA[i];}parsection: { producer: { A[i] = ...; ptr = (...) ? &A[x] : &A[y]; ... = pA[j]; } consumer: { ... = A[s]; ... = foo(*ptr); }}

int* A; // transformedint* pA = A; // pA = NULL !int* ptr;DEFINE_SRC_FIFO(FIFO_A, int, 10); // src fifo defDEFINE_SRC_FIFO(FIFO_ptr, int, 1); // src fifo defparsection: { producer: { int offs_ptr; // declaration A = FIFO_A.acq_put(); // acq put pA = A; // ptr copy A[i] = ...; ptr = (...) ? &A[x] : &A[y]; offs_ptr = ptr - A; // offset calc FIFO_ptr.put(offs_ptr); // offset put ... = pA[j]; // uses pA, A FIFO_A.rel_put(); // rel put }}

Acquire/release interface enables use of other processor’s data memory for storage of arrays (avoiding local copies)

FIFO communication implemented in comm. library produced by platform generator Synchronisation implemented by polling on FIFO queue’s status (empty/full)

May 2, 2012 15


For each parallelisation, MP Designer shows task graph with estimated processor loads

E.g.: task graph for JPEGencoding on 3-DLX architecture

Exploration

Task 0 "LOAD"Proc 0 "P0" (dlx)

main_encoder::loading:35.8 %

IN:<none>

par_region [b0]

main_encoder() [b17]

for [b18]

for [b19]

for [b20]

OUT:<none>

Task 1 "DCT"Proc 1 "P1" (dlx)

process_DU::fdct_main: 22.2 %process_DU::quant_main: 25.0 %*TOTAL* 47.1 %

IN:in_img.heightin_img.width

par_region [b0]


for [b18]

for [b19]

for [b20]

process_DU() [b22]

OUT:<none>

[NC dep T0 -> T1 @ b20]DU (64) [FF]

Task 2 "VLC"Proc 2 "P2" (dlx)

main::jpg_open: 0.0 %main_encoder::vlc_init: 0.0 %process_DU::vlc_main: 16.9 %main_encoder::vlc_fini: 0.0 %main::jpg_close: 0.0 %*TOTAL* 16.9 %

IN:JPG_filenameSOF0info.heightSOF0info.widthin_img.heightin_img.width

par_region [b0]


for [b18]

for [b19]

for [b20]

process_DU() [b22]

OUT:<none>

[NC dep T1 -> T2 @ b22]DU_ZZ (128) [FF]

end0pos (4)

May 2, 2012 16


E.g.: task graph for JPEG encoding on 5-DLX architecture Global dependency analysis automatically ensures correct

communication & synchronisation Manual design would be error-prone

Exploration

May 2, 2012 17


E.g.: task graph for H263 encoding on 8 cores Global dependency analysis

automatically ensures correct communication & synchronisation

Manual design would be error-prone

Exploration

May 2, 2012 18


JPEG encoding on multi-DLX architecture

Objective: obtain efficient load balancing Entire exploration in only days of time

Exploration

Algorithm #Cores

Parallelisation Mcycles*seqpar

Speedup

Load (%)P0 P1 P2 P3

P4

Effi-ciency

(%)

Original 1 7.1 - 1 100 100

Original 2 ld+dct+q | vlc 7.1 4.1 1.7 100 76 86

Original 3 ld | dct+q | vlc 7.1 3.4 2.1 64 60 91 69

Original 4 ld | dct | q | vlc 7.1 3.4 2.1 77 40 24 91 52

* Cycles for 256x160-pixel image

Optimised 2 ld+dct | q+vlc 4.1 2.4 1.5 100 72 85

Optimised 3 ld | dct+q | vlc 4.1 2.0 2.0 74 100 40 68

Optimised 3 ld | dct | q+vlc 4.1 1.7 2.4 86 56 100 78

Optimised 4 ld | dct | q | vlc 4.1 1.5 2.8 100 65 75 54 69

Split quant 3 ld | dct+q0 | q1+vlc 4.3 1.6 2.6 92 100 79 87

Dual load 5 ld0 | ld1 | dct | q | vlc 4.1 1.1 3.7 71 61 95 9971

74

May 2, 2012 19


Exploration

Algorithm #Cores


Speedup

Load (%)P0 P1 P2 P3

P4

Effi-ciency

(%)

Original 1 7.1 - 1 100 100











74


3-core analysis suggests splitting of “q”

May 2, 2012 20


Exploration

Algorithm #Cores


Speedup

Load (%)P0 P1 P2 P3

P4

Effi-ciency

(%)

Original 1 7.1 - 1 100 100











74


4-core analysis suggests splitting of “ld”

May 2, 2012 21


MP Designer - Highlights

Homogeneous and heterogeneous multicore SoCs User-guided parallelization (pragmas) Global dataflow analysis to check correctness of

chosen parallelization Validation only required on sequential C code Graphical feedback (task graphs) enables

exploration C source-to-source transformation Software code for (FIFO) communication and

synchronization inserted automatically Communication fabric (platform) generated

automatically, if needed

May 2, 2012 22


Alternative Heterogeneous Implementation(IP Designer)

JEMBJEMA

RGB2YUVRGB2YUV

Sub-Sample(4:2:0)

Sub-Sample(4:2:0)

ColumnDCT

ColumnDCT

RowDCTRowDCT

Quan-tise


OutputWrite

Output

Last Coef

Last Coef


ReorderR G B Ctl

CshCVA6x128-bit

VU1 VU2

VB6x128-bit

T8x8x16-bit

P8x16-bit

R8x16-bit

ALU AGU

DM128-bit

8x16-bit

AGU1 AGU2 ALU

R S4x21-bit

WriteBits

outport

shr32shr16

DM324-bit

(ht)

add

r

DM28-bit

(zz)(izz)

add

r

DM116-bit

(data)(end0)

add

rd

i8x16-bit

JEMA: vector ASIP 2 vector units: vadd, vsub, vscale,

vmul, vaddsub Transposable register file – write

columns, read rows

JEMB: scalar ASIP Multiple data memories, special AGUs ALU: bit concat, Huffman addr, code calc Instruction-level parallelism

Source: G. Goossens et al, Multicore Expo 2009

May 2, 2012 23


Performance Heterogeneous dual-ASIP (IP Designer): 1

cycle/pixel Gate count: 76K

Homogeneous 5-core (MP Designer): 9 cycles/pixel

Data splitting (i.e. process loop iterations in parallel) will result in 1 cycle/pixel on ~45 cores

Consistent with literature, e.g. Plurality requires 64 cores* Estimated gate count: 500K

*Source: Plurality, MC Expo 2009

Tradeoff between flexibility and area/power

RGB2YUVRGB2YUV

Sub-Sample(4:2:0)

Sub-Sample(4:2:0)

ColumnDCT

ColumnDCT

RowDCTRowDCT

Quan-tise


OutputWrite

Output

Last Coef

Last Coef


ReorderR G B Ctl

Alternative Heterogeneous Implementation(IP Designer)

May 2, 2012 24


Conclusions Multicore SoCs are here to stay

No (efficient) multicore SoC design without tools

Design and programming of individual ASIP cores Architectural exploration, SDK generation, RTL

generation Multicore parallelisation and platform

generation Parallelisation from sequential C code Static global dependency analysis Exploration for efficient load balancing

software parallelisation & platform generation for heterogeneous multicore architectures

Technology

c compiler

c cleanc

pragmas analysis data

sequential c code

c pointers

tools tools

shared pragmas

usedmpa heterogeneous