
ARM Technology Symposium, Shenzhen, November 30, 2012

© 2012 Target Compiler Technologies

Power Reduction through Software-Programmable Accelerators for ARM-based Subsystems

Gert Goossens, CEO, Target Compiler Technologies

www.retarget.com



Low Power: Back to Basics

Dynamic power: $P_{dyn} = C \cdot (A \cdot f_{clock}) \cdot V_{dd}^2$

Leakage power: $P_{leak} = I_{leak} \cdot V_{dd} \sim (a^{-V_t} \cdot N_{gates} \cdot W_{dev}) \cdot V_{dd}$

["Moore's Law Meets Static Power", Computer, IEEE Comp. Soc., Dec. 2003]

• Low-V technology

• V-scaling

• Locality of reference

• Avoid switching

• f-scaling

• Concurrency: task-, data-, instr.-level

• Power gating

• Minimal logic

• Multi-threshold libraries
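The two power relations on this slide can be sketched in C to see why voltage scaling pays off more than frequency scaling alone. This is a minimal sketch: the function and parameter names are hypothetical, and the leakage function takes the leakage current as an input rather than modelling its dependence on threshold voltage, gate count and device width.

```c
/* Dynamic power in watts: P_dyn = C * A * f_clock * Vdd^2.
 * c_farad:  switched capacitance, activity: switching activity (0..1),
 * f_hz:     clock frequency,      vdd:      supply voltage. */
double dynamic_power(double c_farad, double activity, double f_hz, double vdd)
{
    return c_farad * activity * f_hz * vdd * vdd;
}

/* Leakage power in watts: P_leak = I_leak * Vdd.
 * I_leak is passed in directly (an assumption of this sketch). */
double leakage_power(double i_leak_amp, double vdd)
{
    return i_leak_amp * vdd;
}

/* Fraction of dynamic power that remains after scaling the clock
 * frequency by kf and the supply voltage by kv. */
double dynamic_scaling_ratio(double kf, double kv)
{
    return kf * kv * kv;
}
```

For example, `dynamic_scaling_ratio(0.5, 0.8)` gives 0.5 × 0.8² = 0.32: running two concurrent units at half the clock and 80% of the supply voltage leaves each unit with about a third of its original dynamic power.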



Heterogeneous Multicore SoC

Dual and quad-core ARM

• Concurrency in control processing: task-level parallelism

• big.LITTLE: minimal logic for given task; voltage and frequency scaling




Multiple ASIPs

“Application-Specific Instruction-set Processors”

• Concurrency in data processing: task-, data-, instr.-level parallelism

• Minimal logic for given task: architectural specialization

• Locality of reference: local memories and interconnect

• Power gating of ASIPs when not in use




Hardwired datapath – an endangered species?

• Except in a few cases, market dynamics require programmability

• ASIPs enable programmability and efficiency

• ARM & ASIPs offer a familiar “software approach” to SoC design: quick algorithm mapping from C/Matlab through SDK/debug tools



“No MPSoC Design Without Tools”

Tools at IP level (ASIP cores)

Architectural exploration

SDK generation: C compiler, ISS, debugger…

RTL generation

→ IP Designer™

Tools at IP subsystem level (multicore)

Code parallelisation

Communication and synchronization

Multicore platform generation

→ MP Designer™



IP Designer Tool Suite



Architectural optimisation space

ASIP architectural optimisation space

Parallelism

• Instruction-level parallelism: orthogonal instruction set (VLIW), encoded instruction set

• Data-level parallelism: vector processing (SIMD)

• Task-level parallelism: multi-core, multi-threading

Specialisation

• App.-specific data types: integer, fractional, floating-point, bits, complex, vector…

• App.-specific instructions: any exotic operator; single or multi-cycle

• Connectivity & storage matching application’s data-flow: distributed regs, sub-ranges; multiple mem’s, sub-ranges

• App.-spec. data processing

• App.-spec. memory addressing: direct, indirect, post-modification, indexed, stack indirect…

• App.-spec. control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication…; relative or absolute, address range, delay slots…

• Pipeline

nML and IP Designer…

• Support a wide range of ASIP architectures

• Enable true architectural exploration

• Make ASIP design easy



DSPstone benchmark on TI C55x

Graph-based C compiler technology offers retargetability and efficiency at the same time

Compilable sub-set of TI C55x modelled in nML in 2.5 person-months

Only a few C code modifications made: repeat loop, restrict pointers

                                    Target’s compiler     TI’s compiler        Gain (Target vs TI)
Benchmark          C code          Cycles   Code size    Cycles   Code size    Cycles   Code size

Small-scale examples
FIR                restrict            45          39        61          37       26%         -5%
Convolution        repeat              26          17        26          13        0%        -31%
LMS                original            98          56       117          64       16%         13%
Matrix             repeat            1429          46      2541          53       44%         13%
IIR, N=4           restrict            53          51        66          62       20%         18%
IIR, N=16          restrict           149          51       198          62       25%         18%
Average                                                                           22%          4%

Large-scale examples
FFT bit reverse    original         46361          91     49387          91        6%          0%
FFT butterfly      original        179374         167    187621         177        4%          6%
ADPCM              original        144216        3067    162978        3367       12%          9%
Average                                                                            7%          5%
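The gain figures in the DSPstone table appear to be the relative reduction with respect to TI's compiler. That reading is an assumption, but it reproduces the listed values, for example 26% for FIR cycles and -5% for FIR code size. A minimal sketch with a hypothetical helper name:

```c
/* Relative gain of Target's compiler over TI's compiler, in percent.
 * Positive values mean Target's result is smaller (fewer cycles or
 * less code); negative values mean TI's result is smaller.
 * ASSUMPTION: gain = (TI - Target) / TI, inferred from the table. */
double gain_percent(double target, double ti)
{
    return 100.0 * (ti - target) / ti;
}
```

With the FIR row, `gain_percent(45, 61)` is about 26.2 and `gain_percent(39, 37)` is about -5.4, which round to the 26% and -5% shown on the slide.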

IP Designer’s C Compiler



[Bar chart: area (kGates) and power (W/MHz) for configurations A to E; both scales run from 0 to 100]

IP Designer’s RTL Generator

Example: audio DSP (90 nm, clock 220 MHz, 0.9V)

Low-power optimisations yield 60% savings

Low-power optimisations have small area cost

Area and power within a few percent of hand-optimised design

IP Designer configuration options

A Standard RTL generation

B Clock gating + operand isolation for functional units

C Operand isolation for multiplexers

D Latching of register addresses in instruction decoder

E Manual design by customer



IP Designer Example

“Wolverine” platform

Ultra-low power multi-core platform, optimised for audio coding

Used in hearing instruments and Bluetooth headsets

20-bit precision

4 “micro-DSP” VLIW-ASIPs + 4 filter accelerators + 1 micro-processor core

0.04 mW/M-MAC, 42 MIPS at fclock = 2 MHz (0.13 µm CMOS)

© Sound Design Technologies – Reproduced with permission



Reed-Solomon coding ASIP

FEC for wireless link in personal health-care systems

IEEE 802.15.4a, supports RS(63, 55) encoding/decoding

Concurrency

Data-level: 8 elements SIMD, 6-bit/element

Instruction-level: 2 scalar + 2 vector issue slots

Specialisation

Data and instruction-level parallelism

Hardware for finite-field multiplication and addition

© IMEC – Reproduced with permission
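The finite-field hardware mentioned on this slide operates on 6-bit symbols, since RS(63, 55) is defined over GF(2^6). The sketch below shows what one scalar multiply and add compute in software. It assumes the field polynomial x^6 + x + 1; the slide does not state the polynomial, and the function names are hypothetical. It says nothing about how the ASIP implements these operations.

```c
#include <stdint.h>

/* ASSUMPTION: field polynomial x^6 + x + 1 (binary 1000011). */
#define GF64_POLY 0x43u

/* Addition in GF(2^6) is a bitwise XOR of the two 6-bit symbols. */
uint8_t gf64_add(uint8_t a, uint8_t b)
{
    return (uint8_t)((a ^ b) & 0x3Fu);
}

/* Shift-and-add multiplication, reduced modulo GF64_POLY. */
uint8_t gf64_mul(uint8_t a, uint8_t b)
{
    unsigned acc = 0;
    unsigned x = a & 0x3Fu;   /* running multiple of a */
    unsigned y = b & 0x3Fu;   /* remaining bits of b */

    while (y != 0) {
        if (y & 1u)
            acc ^= x;         /* add the current multiple of a */
        y >>= 1;
        x <<= 1;              /* multiply the running value by x */
        if (x & 0x40u)
            x ^= GF64_POLY;   /* reduce when the degree reaches 6 */
    }
    return (uint8_t)acc;
}
```

An 8-element SIMD unit as described on the slide would apply such an operation to eight 6-bit symbols per instruction, where this scalar loop handles one symbol pair at a time.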

[Figures: performance, gate count, energy. © 2009 IMEC – Reproduced with permission]



IP Designer Market Adoption

Audio

Video & imaging

Wireless

Wireline

Medical

Network processing

Automotive


High-perf. computing

Crypto & identification

Shown are publicly announced IP Designer customers

An estimated 150 unique SoC products based on IP Designer are in the market today






MP Designer Tool Suite



int main(int argc, char *argv[]) {
    init_all();
    parsection: {
        jpg_open: {
            jpg_fopen(JPG_filename);
            writeword(0xFFD8); // SOI
            write_APP0info();
        }
        main_encoder(&in_img);
        jpg_close: {
            writeword(0xFFD9); // EOI
            jpg_fclose();
        }
    }
    free(in_img.RGB_buffer);
    return 0;
}

Label C code blocks for parallelisation

User-Guided Parallelisation

void main_encoder(struct image* img) {
    vlc_init: {
        DCY = 0; DCCb = 0; DCCr = 0;
    }
    for (ypos=0..height) {
        for (xpos=0..width) {
            for (blk=0..5) {
                SBYTE DU[64];
                loading:
                    load_data_unit_from_RGB_buffer(img, xpos, ypos, blk, DU);
                process_DU(DU, blk);
            }
        }
    }
    vlc_fini: {
        // Bit-alignment of EOI marker
        if (bytepos >= 0) {
            writebits((1 << (bytepos + 1)) - 1, bytepos + 1);
        }
    }
}

void process_DU(SBYTE* CDU, int blk) {
    SWORD DU[64];
    fdct_main: {
        int* int_fdtbl = blk_fdtbl[blk];
        fdct_and_quantization(CDU, int_fdtbl, DU);
    }
    vlc_main: {
        // Encode ACs
        while (i <= end0pos) { ... }
        if (end0pos != 63) writebits(EOB);
    }
}



Parallelisation pragmas

Sample parallelisation on 3-core architecture

User-Guided Parallelisation

processor P0 type dlx
processor P1 type dlx
processor P2 type dlx

parallel ParRegion lbl main::parsection
    task LOAD target P0
        include lbl main_encoder::loading
    task DCT target P1
        include lbl process_DU::fdct_main
    task VLC target P2
        include lbl main::jpg_open
        include lbl main_encoder::vlc_init
        include lbl process_DU::vlc_main
        include lbl main_encoder::vlc_fini
        include lbl main::jpg_close



For each parallelisation, MP Designer shows task graph with estimated processor loads

JPEG encoding on 3-DLX architecture

Exploration

Task 0 "LOAD" Proc 0 "P0" (dlx)
  main_encoder::loading: 35.8 %
  IN: <none>
  par_region [b0], main_encoder() [b17], for [b18], for [b19], for [b20]
  OUT: <none>

Task 1 "DCT" Proc 1 "P1" (dlx)
  process_DU::fdct_main: 22.2 %, process_DU::quant_main: 25.0 %, *TOTAL* 47.1 %
  IN: in_img.height, in_img.width
  par_region [b0], main_encoder() [b17], for [b18], for [b19], for [b20], process_DU() [b22]
  OUT: <none>

[NC dep T0 -> T1 @ b20] DU (64) [FF]

Task 2 "VLC" Proc 2 "P2" (dlx)
  main::jpg_open: 0.0 %, main_encoder::vlc_init: 0.0 %, process_DU::vlc_main: 16.9 %, main_encoder::vlc_fini: 0.0 %, main::jpg_close: 0.0 %, *TOTAL* 16.9 %
  IN: JPG_filename, SOF0info.height, SOF0info.width, in_img.height, in_img.width
  par_region [b0], main_encoder() [b17], for [b18], for [b19], for [b20], process_DU() [b22]
  OUT: <none>

[NC dep T1 -> T2 @ b22] DU_ZZ (128) [FF], end0pos (4)



T0 "bs" (dlx)
  bitstream_hdr: 2.0 %, bitstream_cfs: 8.2 %, idct_p_inter: 8.8 %, idct_b: 1.0 %, make_p_mv: 0.2 %, make_b_mv: 0.0 %, *TOTAL* 20.4 %
  par_region [b0], initdec() [b11], while [b19], getpict() [b20], getMBs() [b23], while [b26]

T1 "recon" (dlx)
  rec_b: 3.1 %, rec_p: 20.5 %, *TOTAL* 23.7 %
  IN: framenum, has_startcode
  par_region [b0], initdec() [b11], while [b19], getpict() [b20], getMBs() [b23], while [b26]
  [NC dep T0 -> T1 @ b0] h+w (4)
  [NC dep T0 -> T1 @ b11] h+w (4)
  [NC dep T0 -> T1 @ b26] B_MV (16) [FF], P_MV (48) [FF], comm_mb (4), comm_pict_hdr (2), mvdbxy (4)
  [NC dep T0 -> T1 @ b19] start (4)
  [NC dep T0 -> T1 @ b23] MBAmax (4)

T2 "addblock" (dlx)
  idct_p_intra: 2.8 %, addblock_p_intra: 1.2 %, addblock_p_inter: 3.8 %, reconblock_b: 6.7 %, addblock_b: 0.5 %, *TOTAL* 14.9 %
  IN: framenum, has_startcode, refidct
  par_region [b0], while [b19], getpict() [b20], getMBs() [b23], while [b26]
  [NC dep T0 -> T2 @ b26] B_MV (16) [CF], Mode+pCBP+pCBPB+pCOD (4x4), pblk+bblk (768) [FF], trb+trd (2x4)
  [LC dep T0 -> T2 @ b26] bx (4), by (4), pmvdbxy (4)
  [NC dep T0 -> T2 @ b19] start (4), pb_frame (4)
  [NC dep T0 -> T2 @ b23] MBA (4), MBAmax (4)

T3 "store_bmb" (dlx)
  store_mb_b: 15.1 %, init_idct: 0.1 %, *TOTAL* 15.3 %
  IN: framenum, has_startcode, refidct
  par_region [b0], initdec() [b11], while [b19], getpict() [b20], getMBs() [b23], while [b26], store_mb_b() [b72]
  [NC dep T0 -> T3 @ b26] comm_mb (4), err (4), pb_frame (4)
  [NC dep T0 -> T3 @ b19] start (4), pb_frame (4)
  [NC dep T0 -> T3 @ b23] MBAmax (4)

T4 "out" (dlx)
  store_mb_p: 21.8 %, mb_rgb_copyin: 1.6 %, padding_mb: 2.5 %, *TOTAL* 26.0 %
  IN: framenum, has_startcode, outputname
  par_region [b0], initdec() [b11], while [b19], getpict() [b20], getMBs() [b23], while [b26], store_mb_b() [b72]
  comm_mb (4), comm_pict_hdr (2)
  [NC dep T0 -> T4 @ b19] start (4)
  [NC dep T0 -> T4 @ b23] MBAmax (4)

[NC dep T1 -> T2 @ b26] brec (384) [FC], prec (384) [FF]
[NC dep T2 -> T3 @ b26] brec (384) [CF]
[NC dep T2 -> T4 @ b26] prec (384) [CF]
[NC dep T3 -> T4 @ b72] b_rgb_mb (884) [CF]

Task graph for H263 encoding on 8 cores

Exploration

Global dependency analysis automatically ensures correct communication & synchronisation

Manual design would be error-prone



Exploration

JPEG encoding on multi-DLX architecture

Algorithm    #Cores  Parallelisation            Mcycles*    Speed-up  Load (%)                 Efficiency (%)
                                                seq   par             P0   P1   P2   P3   P4
Original     1                                  7.1   -     1         100                      100
Original     2       ld+dct+q | vlc             7.1   4.1   1.7       100  76                  86
Original     3       ld | dct+q | vlc           7.1   3.4   2.1       64   60   91             69
Original     4       ld | dct | q | vlc         7.1   3.4   2.1       77   40   24   91        52
Optimised    2       ld+dct | q+vlc             4.1   2.4   1.5       100  72                  85
Optimised    3       ld | dct+q | vlc           4.1   2.0   2.0       74   100  40             68
Optimised    3       ld | dct | q+vlc           4.1   1.7   2.4       86   56   100            78
Optimised    4       ld | dct | q | vlc         4.1   1.5   2.8       100  65   75   54        69
Split quant  3       ld | dct+q0 | q1+vlc       4.3   1.6   2.6       92   100  79             87
Dual load    5       ld0 | ld1 | dct | q | vlc  4.1   1.1   3.7       71   61   95   99   71   74

* Cycles for 256x160-pixel image

Entire exploration in only days of time
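The speed-up and efficiency columns can be reproduced from the cycle counts. The sketch below assumes that efficiency is the speed-up divided by the number of cores; the slide does not define it, but this reading matches most rows within rounding. Function names are hypothetical.

```c
/* Speed-up of a parallel mapping relative to the sequential run. */
double speedup(double seq_mcycles, double par_mcycles)
{
    return seq_mcycles / par_mcycles;
}

/* Parallel efficiency in percent.
 * ASSUMPTION: efficiency = speed-up / number of cores. */
double efficiency_percent(double seq_mcycles, double par_mcycles, int cores)
{
    return 100.0 * speedup(seq_mcycles, par_mcycles) / cores;
}
```

For the row "Original, 3 cores" (7.1 Mcycles sequential, 3.4 Mcycles parallel), `speedup(7.1, 3.4)` is about 2.09 and `efficiency_percent(7.1, 3.4, 3)` is about 69.6, close to the 2.1 and 69% in the table.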



Multicore SDK

[Figure: multicore simulation with ISS Core-1 to Core-3; multicore on-chip debugging with HW Core-1 to Core-3, three debug controllers and a JTAG controller]



Conclusion

ASIPs enable low power, acceleration and programmability in ARM-based multicore SoCs

No (efficient) multicore SoC design without tools

Design and programming of individual ASIP cores

Multicore parallelisation and platform generation

Target can be your ASIP and multicore tools partner

More information…

Come to our booth

Brochure in your conference bag

www.retarget.com
