software parallelisation & platform generation for heterogeneous multicore architectures
DESCRIPTION
GertGoossens, TargetCompiler TechnologiesTRANSCRIPT
May 2, 2012 1
© 2012 Target Compiler Technologies NV
Software Parallelisation and Platform Generation for Heterogeneous
Multicore Architectures
Sven Wuytack, Erik Brockmeyer, Wouter Vermaelen,Gert Goossens
Target Compiler TechnologiesLeuven, Belgium
May 2, 2012 2
© 2012 Target Compiler Technologies NV
ASIP: Application-Specific Processor Anything between general-purpose P and hardwired data-path Flexibility through programmability and design-time reconfigurability High throughput, low energy through parallelism and specialization
ASIP is foundation of heterogeneous multi-core SoC Balanced SoC architecture offers best performance at lowest energy and cost
ARMcore
ARMcore
Multi-Core SoCs for Smart Devices
May 2, 2012 3
© 2012 Target Compiler Technologies NV
No MPSoC Design Without Tools
Tools at IP level (ASIP cores) Architectural exploration SDK generation: C compiler, ISS, debugger… RTL generation
→ IP Designer** Commercially deployed since 1999
May 2, 2012 4
© 2012 Target Compiler Technologies NV
No MPSoC Design Without Tools
Tools at IP level (ASIP cores) Architectural exploration SDK generation: C compiler, ISS, debugger… RTL generation
→ IP Designer** Commercially deployed since 1999
Tools at IP subsystem level (multicore) Code parallelisation Communication and synchronization Multicore platform generation
→ MP Designer **** Beta now; commercial release later in 2012
May 2, 2012 5
© 2012 Target Compiler Technologies NV
MP Designer Tool Suite
May 2, 2012 6
© 2012 Target Compiler Technologies NV
Multicore Programming Language?
We prefer sequential C Fits human way of thinking Abundantly used
MP Designer performs C source-to-source transformation Formatting preserved as much as possible
MP Designer guarantees dependencies in parallelised code Validation only required on sequential C code Reduced design complexity and validation effort
May 2, 2012 7
© 2012 Target Compiler Technologies NV
State-of-the-ArtArchitecture Parallelisation Dependency
analysisPerformance analysis
MP Designer(Target)
• Heterogeneous (multi-ASIP) and homogeneous
• Point-to-point & globally shared memory
• User-guided• Pragmas
separate from C
• Static global analysis (data & pointers)
• Correctness guaranteed
• Processor loads derived from ISS, shown in task graphs
May 2, 2012 8
© 2012 Target Compiler Technologies NV
State-of-the-ArtArchitecture Parallelisation Dependency
analysisPerformance analysis
MP Designer(Target)
• Heterogeneous (multi-ASIP) and homogeneous
• Point-to-point & globally shared memory
• User-guided• Pragmas
separate from C
• Static global analysis (data & pointers)
• Correctness guaranteed
• Processor loads derived from ISS, shown in task graphs
OpenMP • Homogeneous• Globally shared
memory
• User-guided• Pragmas in C
• User’s responsability
May 2, 2012 9
© 2012 Target Compiler Technologies NV
State-of-the-ArtArchitecture Parallelisation Dependency
analysisPerformance analysis
MP Designer(Target)
• Heterogeneous (multi-ASIP) and homogeneous
• Point-to-point & globally shared memory
• User-guided• Pragmas
separate from C
• Static global analysis (data & pointers)
• Correctness guaranteed
• Processor loads derived from ISS, shown in task graphs
OpenMP • Homogeneous• Globally shared
memory
• User-guided• Pragmas in C
• User’s responsibility
vfEmbedded(Vector Fabrics)
• Homogeneous (multi-ARM or multi-x86)
• Globally shared memory
• User-guided• Via code GUI
• Mixed static/dynamic analysis
• Correctness depends on data-set used
• Processor loads derived from execution, shown in code GUI
MPA(IMEC)
• Heterogeneous• Point-to-point
• User-guided• Pragmas
separate from C
• Static global analysis for arrays (“Clean-C”)
May 2, 2012 10
© 2012 Target Compiler Technologies NV
Platform Model
Preferred Point-to-point links between
communicating processors Source processor writes to
destination processor’s local data-memory, avoiding local copies
Write conflicts resolved through local buffering
Address decoding logic
Additionally supported Globally shared memory In case of caches, coherency is
assumed to be resolved in hardware
Core DM Core DM
CoreDM
Localmemory
ITC
Single Tile
May 2, 2012 11
© 2012 Target Compiler Technologies NV
VLCQDCTLD
Example: High-Res JPEG Encoding
RGB2YUVRGB2YUV
Sub-Sample(4:2:0)
Sub-Sample(4:2:0)
ColumnDCT
ColumnDCT
RowDCTRowDCT
Quan-tise
Quan-tise VLCVLC Write
OutputWrite
Output
Last Coef
Last Coef
Zigzag ReorderZigzag
ReorderR G B Ctl
May 2, 2012 12
© 2012 Target Compiler Technologies NV
int main(int argc,char *argv[]){ init_all(); parsection: { jpg_open: { jpg_fopen(JPG_filename); writeword(0xFFD8); //SOI write_APP0info(); } main_encoder(&in_img);
jpg_close: { writeword(0xFFD9); //EOI jpg_fclose(); } } free(in_img.RGB_buffer); return 0;}
Label C code blocks for parallelisation
User-Guided Parallelisation
void main_encoder(struct image* img) { vlc_init: { DCY=0;DCCb=0;DCCr=0; } for (ypos=0..height) { for (xpos=0....width) { for (blk=0..5) { SBYTE DU[64]; loading: load_data_unit_from_RGB_buffer(img, xpos, ypos, blk, DU); process_DU(DU,blk); } } } vlc_fini: { // Bit-alignment of EOI marker if (bytepos>=0) { writebits((1<<(bytepos+1))-1, bytepos+1); } }}
void process_DU(SBYTE* CDU, int blk) {
SWORD DU[64]; fdct_main: { int* int_fdtbl = blk_fdtbl[blk]; fdct_and_quantization(CDU, int_fdtbl,DU); }
vlc_main: { while (i<=end0pos) { ... } if (end0pos!=63) writebits(EOB); }}
May 2, 2012 13
© 2012 Target Compiler Technologies NV
Parallelisation pragmas
Sample parallelisation on 3-core architecture
User-Guided Parallelisation
processor P0 type dlxprocessor P1 type dlxprocessor P2 type dlx
parallel ParRegion lbl main::parsection
task LOAD target P0 include lbl main_encoder::loading
task DCT target P1 include lbl process_DU::fdct_main
task VLC target P2 include lbl main::jpg_open include lbl main_encoder::vlc_init include lbl process_DU::vlc_main include lbl main_encoder::vlc_fini include lbl main::jpg_close
May 2, 2012 14
© 2012 Target Compiler Technologies NV
FIFO Communication Model
int A[10];int* pA = A;int* qA = A;int* ptr;int foo(int i) { return qA[i];}parsection: { producer: { A[i] = ...; ptr = (...) ? &A[x] : &A[y]; ... = pA[j]; } consumer: { ... = A[s]; ... = foo(*ptr); }}
int* A; // transformedint* pA = A; // pA = NULL !int* ptr;DEFINE_SRC_FIFO(FIFO_A, int, 10); // src fifo defDEFINE_SRC_FIFO(FIFO_ptr, int, 1); // src fifo defparsection: { producer: { int offs_ptr; // declaration A = FIFO_A.acq_put(); // acq put pA = A; // ptr copy A[i] = ...; ptr = (...) ? &A[x] : &A[y]; offs_ptr = ptr - A; // offset calc FIFO_ptr.put(offs_ptr); // offset put ... = pA[j]; // uses pA, A FIFO_A.rel_put(); // rel put }}
Acquire/release interface enables use of other processor’s data memory for storage of arrays (avoiding local copies)
FIFO communication implemented in comm. library produced by platform generator Synchronisation implemented by polling on FIFO queue’s status (empty/full)
May 2, 2012 15
© 2012 Target Compiler Technologies NV
For each parallelisation, MP Designer shows task graph with estimated processor loads
E.g.: task graph for JPEGencoding on 3-DLX architecture
Exploration
Task 0 "LOAD"Proc 0 "P0" (dlx)
main_encoder::loading:35.8 %
IN:<none>
par_region [b0]
main_encoder() [b17]
for [b18]
for [b19]
for [b20]
OUT:<none>
Task 1 "DCT"Proc 1 "P1" (dlx)
process_DU::fdct_main: 22.2 %process_DU::quant_main: 25.0 %*TOTAL* 47.1 %
IN:in_img.heightin_img.width
par_region [b0]
main_encoder() [b17]
for [b18]
for [b19]
for [b20]
process_DU() [b22]
OUT:<none>
[NC dep T0 -> T1 @ b20]DU (64) [FF]
Task 2 "VLC"Proc 2 "P2" (dlx)
main::jpg_open: 0.0 %main_encoder::vlc_init: 0.0 %process_DU::vlc_main: 16.9 %main_encoder::vlc_fini: 0.0 %main::jpg_close: 0.0 %*TOTAL* 16.9 %
IN:JPG_filenameSOF0info.heightSOF0info.widthin_img.heightin_img.width
par_region [b0]
main_encoder() [b17]
for [b18]
for [b19]
for [b20]
process_DU() [b22]
OUT:<none>
[NC dep T1 -> T2 @ b22]DU_ZZ (128) [FF]
end0pos (4)
May 2, 2012 16
© 2012 Target Compiler Technologies NV
E.g.: task graph for JPEG encoding on 5-DLX architecture Global dependency analysis automatically ensures correct
communication & synchronisation Manual design would be error-prone
Exploration
May 2, 2012 17
© 2012 Target Compiler Technologies NV
E.g.: task graph for H263 encoding on 8 cores Global dependency analysis
automatically ensures correct communication & synchronisation
Manual design would be error-prone
Exploration
May 2, 2012 18
© 2012 Target Compiler Technologies NV
JPEG encoding on multi-DLX architecture
Objective: obtain efficient load balancing Entire exploration in only days of time
Exploration
Algorithm #Cores
Parallelisation Mcycles*seqpar
Speedup
Load (%)P0 P1 P2 P3
P4
Effi-ciency
(%)
Original 1 7.1 - 1 100 100
Original 2 ld+dct+q | vlc 7.1 4.1 1.7 100 76 86
Original 3 ld | dct+q | vlc 7.1 3.4 2.1 64 60 91 69
Original 4 ld | dct | q | vlc 7.1 3.4 2.1 77 40 24 91 52
* Cycles for 256x160-pixel image
Optimised 2 ld+dct | q+vlc 4.1 2.4 1.5 100 72 85
Optimised 3 ld | dct+q | vlc 4.1 2.0 2.0 74 100 40 68
Optimised 3 ld | dct | q+vlc 4.1 1.7 2.4 86 56 100 78
Optimised 4 ld | dct | q | vlc 4.1 1.5 2.8 100 65 75 54 69
Split quant 3 ld | dct+q0 | q1+vlc 4.3 1.6 2.6 92 100 79 87
Dual load 5 ld0 | ld1 | dct | q | vlc 4.1 1.1 3.7 71 61 95 9971
74
May 2, 2012 19
© 2012 Target Compiler Technologies NV
Exploration
Algorithm #Cores
Parallelisation Mcycles*seqpar
Speedup
Load (%)P0 P1 P2 P3
P4
Effi-ciency
(%)
Original 1 7.1 - 1 100 100
Original 2 ld+dct+q | vlc 7.1 4.1 1.7 100 76 86
Original 3 ld | dct+q | vlc 7.1 3.4 2.1 64 60 91 69
Original 4 ld | dct | q | vlc 7.1 3.4 2.1 77 40 24 91 52
* Cycles for 256x160-pixel image
Optimised 2 ld+dct | q+vlc 4.1 2.4 1.5 100 72 85
Optimised 3 ld | dct+q | vlc 4.1 2.0 2.0 74 100 40 68
Optimised 3 ld | dct | q+vlc 4.1 1.7 2.4 86 56 100 78
Optimised 4 ld | dct | q | vlc 4.1 1.5 2.8 100 65 75 54 69
Split quant 3 ld | dct+q0 | q1+vlc 4.3 1.6 2.6 92 100 79 87
Dual load 5 ld0 | ld1 | dct | q | vlc 4.1 1.1 3.7 71 61 95 9971
74
JPEG encoding on multi-DLX architecture
3-core analysis suggests splitting of “q”
May 2, 2012 20
© 2012 Target Compiler Technologies NV
Exploration
Algorithm #Cores
Parallelisation Mcycles*seqpar
Speedup
Load (%)P0 P1 P2 P3
P4
Effi-ciency
(%)
Original 1 7.1 - 1 100 100
Original 2 ld+dct+q | vlc 7.1 4.1 1.7 100 76 86
Original 3 ld | dct+q | vlc 7.1 3.4 2.1 64 60 91 69
Original 4 ld | dct | q | vlc 7.1 3.4 2.1 77 40 24 91 52
* Cycles for 256x160-pixel image
Optimised 2 ld+dct | q+vlc 4.1 2.4 1.5 100 72 85
Optimised 3 ld | dct+q | vlc 4.1 2.0 2.0 74 100 40 68
Optimised 3 ld | dct | q+vlc 4.1 1.7 2.4 86 56 100 78
Optimised 4 ld | dct | q | vlc 4.1 1.5 2.8 100 65 75 54 69
Split quant 3 ld | dct+q0 | q1+vlc 4.3 1.6 2.6 92 100 79 87
Dual load 5 ld0 | ld1 | dct | q | vlc 4.1 1.1 3.7 71 61 95 9971
74
JPEG encoding on multi-DLX architecture
4-core analysis suggests splitting of “ld”
May 2, 2012 21
© 2012 Target Compiler Technologies NV
MP Designer - Highlights
Homogeneous and heterogeneous multicore SoCs User-guided parallelization (pragmas) Global dataflow analysis to check correctness of
chosen parallelization Validation only required on sequential C code Graphical feedback (task graphs) enables
exploration C source-to-source transformation Software code for (FIFO) communication and
synchronization inserted automatically Communication fabric (platform) generated
automatically, if needed
May 2, 2012 22
© 2012 Target Compiler Technologies NV
Alternative Heterogeneous Implementation(IP Designer)
JEMBJEMA
RGB2YUVRGB2YUV
Sub-Sample(4:2:0)
Sub-Sample(4:2:0)
ColumnDCT
ColumnDCT
RowDCTRowDCT
Quan-tise
Quan-tise VLCVLC Write
OutputWrite
Output
Last Coef
Last Coef
Zigzag ReorderZigzag
ReorderR G B Ctl
CshCVA6x128-bit
VU1 VU2
VB6x128-bit
T8x8x16-bit
P8x16-bit
R8x16-bit
ALU AGU
DM128-bit
8x16-bit
AGU1 AGU2 ALU
R S4x21-bit
WriteBits
outport
shr32shr16
DM324-bit
(ht)
add
r
DM28-bit
(zz)(izz)
add
r
DM116-bit
(data)(end0)
add
rd
i8x16-bit
JEMA: vector ASIP 2 vector units: vadd, vsub, vscale,
vmul, vaddsub Transposable register file – write
columns, read rows
JEMB: scalar ASIP Multiple data memories, special AGUs ALU: bit concat, Huffman addr, code calc Instruction-level parallelism
Source: G. Goossens et al, Multicore Expo 2009
May 2, 2012 23
© 2012 Target Compiler Technologies NV
Performance Heterogeneous dual-ASIP (IP Designer): 1
cycle/pixel Gate count: 76K
Homogeneous 5-core (MP Designer): 9 cycles/pixel
Data splitting (i.e. process loop iterations in parallel) will result in 1 cycle/pixel on ~45 cores
Consistent with literature, e.g. Plurality requires 64 cores* Estimated gate count: 500K
*Source: Plurality, MC Expo 2009
Tradeoff between flexibility and area/power
RGB2YUVRGB2YUV
Sub-Sample(4:2:0)
Sub-Sample(4:2:0)
ColumnDCT
ColumnDCT
RowDCTRowDCT
Quan-tise
Quan-tise VLCVLC Write
OutputWrite
Output
Last Coef
Last Coef
Zigzag ReorderZigzag
ReorderR G B Ctl
Alternative Heterogeneous Implementation(IP Designer)
May 2, 2012 24
© 2012 Target Compiler Technologies NV
Conclusions Multicore SoCs are here to stay
No (efficient) multicore SoC design without tools
Design and programming of individual ASIP cores Architectural exploration, SDK generation, RTL
generation Multicore parallelisation and platform
generation Parallelisation from sequential C code Static global dependency analysis Exploration for efficient load balancing