ARM Technology Symposium, Shenzhen, November 30, 2012
© 2012 Target Compiler Technologies
Power Reduction through
Software-Programmable Accelerators
for ARM-based Subsystems
Gert Goossens, CEO, Target Compiler Technologies
www.retarget.com
Low Power: Back to Basics
Dynamic power: P_dyn = C · (A · f_clock) · V_dd^2
Leakage power: P_leak = I_leak · V_dd ~ (a^(-V_t) · N_gates · W_dev) · V_dd
["Moore's Law Meets Static Power",
Computer, IEEE Comp. Soc., Dec. 2003]
• Low-V technology
• V-scaling
• Locality of reference
• Avoid switching
• f-scaling
• Concurrency: task-, data-, instr.-level
• Power gating
• Minimal logic
• Multi-threshold libraries
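The dynamic-power relation on this slide suggests why V-scaling and concurrency reinforce each other. As a rough sketch: if a task is split over several units that each run at a fraction of the clock, throughput stays constant, and the slower clock may allow a lower supply voltage. The function names and the sample values below are illustrative assumptions, not figures from the talk, and duplication overheads are ignored.

```c
/* Dynamic power per the slide: Pdyn = C * (A * fclock) * Vdd^2.
 * c: switched capacitance (F), a: activity factor (0..1),
 * f_clock: clock frequency (Hz), vdd: supply voltage (V). */
double dynamic_power(double c, double a, double f_clock, double vdd)
{
    return c * a * f_clock * vdd * vdd;
}

/* Ratio of dynamic power after duplicating the logic n_units times,
 * dividing the clock by n_units and scaling the supply to vdd_scaled.
 * Capacitance grows by n_units while frequency shrinks by n_units,
 * so only the quadratic voltage term remains in the ratio. */
double parallel_power_ratio(int n_units, double vdd, double vdd_scaled)
{
    /* Arbitrary sample values (assumptions); they cancel in the ratio. */
    const double c = 1e-9, a = 0.2, f = 100e6;
    double before = dynamic_power(c, a, f, vdd);
    double after = dynamic_power(c * n_units, a, f / n_units, vdd_scaled);
    return after / before;
}
```

Under these assumptions, `parallel_power_ratio(2, 1.0, 0.8)` evaluates to about 0.64, i.e. roughly 36% less dynamic power at the same throughput.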
Heterogeneous Multicore SoC
Dual and quad-core ARM
• Concurrency in control processing: task-level parallelism
• Big.LITTLE: minimal logic for given task; voltage and frequency scaling
Heterogeneous Multicore SoC
Multiple ASIPs
“Application-Specific Instruction-set Processors”
• Concurrency in data processing: task-, data-, instr.-level parallelism
• Minimal logic for given task: architectural specialization
• Locality of reference: local memories and interconnect
• Power gating of ASIPs when not in use
Heterogeneous Multicore SoC
Hardwired datapath – an endangered species?
• Except in a few cases, market dynamics require programmability
• ASIPs enable programmability and efficiency
• ARM & ASIPs offer a familiar “software approach” to SoC design: quick algorithm mapping from C/Matlab through SDK/debug tools
“No MPSoC Design Without Tools”
Tools at IP level (ASIP cores)
Architectural exploration
SDK generation: C compiler, ISS, debugger…
RTL generation
→ IP Designer™
Tools at IP subsystem level (multicore)
Code parallelisation
Communication and synchronization
Multicore platform generation
→ MP Designer™
IP Designer Tool Suite
Architectural optimisation space
ASIP architectural optimisation space: Parallelism and Specialisation
Parallelism
• Instruction-level parallelism: orthogonal instruction set (VLIW), encoded instruction set
• Data-level parallelism: vector processing (SIMD)
• Task-level parallelism: multi-core, multi-threading
Specialisation
• App.-specific data types: integer, fractional, floating-point, bits, complex, vector…
• App.-specific instructions
  • App.-spec. data processing: any exotic operator; single or multi-cycle
  • App.-spec. memory addressing: direct, indirect, post-modification, indexed, stack indirect…
  • App.-spec. control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication…; relative or absolute, address range, delay slots…
• Connectivity & storage matching application’s data-flow: distributed regs, sub-ranges; multiple mem’s, sub-ranges
• Pipeline
nML and IP Designer…
• Support a wide range of ASIP architectures
• Enable true architectural exploration
• Make ASIP design easy
IP Designer’s C Compiler
DSPstone benchmark on TI C55x
• Graph-based C compiler technology offers retargetability and efficiency at the same time
• Compilable subset of TI C55x modelled in nML in 2.5 person-months
• Only a few C code modifications made: repeat loop, restrict pointers

Benchmark          Variant    Target’s compiler      TI’s compiler          Gain (Target vs TI)
                              Cycles     Code size   Cycles     Code size   Cycles     Code size
Small-scale examples
FIR                restrict   45         39          61         37          26%        -5%
Convolution        repeat     26         17          26         13          0%         -31%
LMS                original   98         56          117        64          16%        13%
Matrix             repeat     1429       46          2541       53          44%        13%
IIR, N=4           restrict   53         51          66         62          20%        18%
IIR, N=16          restrict   149        51          198        62          25%        18%
Average                                                                     22%        4%
Large-scale examples
FFT bit reverse    original   46361      91          49387      91          6%         0%
FFT butterfly      original   179374     167         187621     177         4%         6%
ADPCM              original   144216     3067        162978     3367        12%        9%
Average                                                                     7%         5%
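The gain columns in the benchmark figures above appear to be computed relative to TI's result, so a positive value means fewer cycles or smaller code for Target's compiler. The slide does not state the formula; the helper below is a sketch under that assumption, and its name is hypothetical.

```c
/* Relative gain of Target's compiler over TI's compiler, in percent.
 * Assumed definition: (TI - Target) / TI, so positive values favour
 * Target's compiler and negative values favour TI's compiler. */
double gain_percent(double ti_value, double target_value)
{
    return 100.0 * (ti_value - target_value) / ti_value;
}
```

For the FIR row, `gain_percent(61, 45)` is about 26.2 for cycles and `gain_percent(37, 39)` is about -5.4 for code size, consistent with the rounded 26% and -5% reported on the slide.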
IP Designer’s RTL Generator
Example: audio DSP (90 nm, clock 220 MHz, 0.9 V)
[Bar chart: area (kGates) and power (W/MHz) for configurations A to E, both on a 0 to 100 scale; the power bars are annotated with -60%]
• Low-power optimisations yield 60% power savings
• Low-power optimisations have a small area cost
• Area and power are within a few percent of the hand-optimized design
IP Designer configuration options
A Standard RTL generation
B Clock gating + operand isolation for functional units
C Operand isolation for multiplexers
D Latching of register addresses in instruction decoder
E Manual design by customer
IP Designer Example
“Wolverine” platform
Ultra-low power multi-core platform, optimised for audio coding
Used in hearing instruments and Bluetooth headsets
20-bit precision
4 “micro-DSP” VLIW-ASIPs + 4 filter accelerators + 1 micro-processor core
0.04 mW/M-MAC, 42 MIPS at fclock = 2 MHz (0.13 µm CMOS)
© Sound Design Technologies – Reproduced with permission
IP Designer Example
Reed-Solomon coding ASIP
FEC for wireless link in personal health-care systems
IEEE 802.15.4a, supports RS(63, 55) encoding/decoding
Concurrency
• Data-level: 8 elements SIMD, 6-bit/element
• Instruction-level: 2 scalar + 2 vector issue slots
Specialisation
• Data and instruction-level parallelism
• Hardware for finite-field multiplication and addition
© IMEC – Reproduced with permission
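The 6-bit elements mentioned for this ASIP correspond to symbols of the finite field GF(2^6), which is what RS(63, 55) operates on. As a software sketch of the two operations the slide says are implemented in hardware: addition is a plain XOR, and multiplication is a shift-and-add loop with polynomial reduction. The primitive polynomial x^6 + x + 1 used here is an assumption (the slide does not name one), and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed field: GF(2^6) with primitive polynomial x^6 + x + 1 (0x43). */
#define GF64_POLY 0x43u

/* Addition in GF(2^m) is a bitwise XOR; no carries propagate. */
uint8_t gf64_add(uint8_t a, uint8_t b)
{
    return (uint8_t)((a ^ b) & 0x3Fu);
}

/* Shift-and-add multiplication with reduction modulo GF64_POLY. */
uint8_t gf64_mul(uint8_t a, uint8_t b)
{
    uint8_t result = 0;
    a &= 0x3Fu;
    b &= 0x3Fu;
    while (b != 0) {
        if (b & 1u)
            result ^= a;        /* add a * x^i when bit i of b is set */
        b >>= 1;
        a <<= 1;                /* multiply a by x */
        if (a & 0x40u)
            a ^= GF64_POLY;     /* reduce using x^6 = x + 1 */
    }
    return result;
}
```

In this representation `gf64_mul(2, 32)` is x · x^5 = x^6, which reduces to x + 1, i.e. the value 3. A dedicated datapath would typically compute the same product combinationally in a single cycle instead of iterating.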
IP Designer Example
Reed-Solomon coding ASIP
[Charts: performance and gate count. © IMEC – Reproduced with permission]
IP Designer Example
Reed-Solomon coding ASIP
[Chart: energy. © 2009 IMEC – Reproduced with permission]
IP Designer Market Adoption
Audio
Video & imaging
Wireless
Wireline
Medical
Network processing
Automotive
High-perf. computing
Crypto & identification
Shown are publicly announced IP Designer customers
An estimated 150 unique SoC products based on IP Designer are on the market today
“No MPSoC Design Without Tools”
Tools at IP level (ASIP cores)
Architectural exploration
SDK generation: C compiler, ISS, debugger…
RTL generation
→ IP Designer™
Tools at IP subsystem level (multicore)
Code parallelisation
Communication and synchronization
Multicore platform generation
→ MP Designer™
MP Designer Tool Suite
User-Guided Parallelisation
Label C code blocks for parallelisation

    int main(int argc, char *argv[]) {
        init_all();
        parsection: {
            jpg_open: {
                jpg_fopen(JPG_filename);
                writeword(0xFFD8); // SOI
                write_APP0info();
            }
            main_encoder(&in_img);
            jpg_close: {
                writeword(0xFFD9); // EOI
                jpg_fclose();
            }
        }
        free(in_img.RGB_buffer);
        return 0;
    }

    void main_encoder(struct image* img) {
        vlc_init: { DCY=0; DCCb=0; DCCr=0; }
        for (ypos=0..height) {
            for (xpos=0..width) {
                for (blk=0..5) {
                    SBYTE DU[64];
                    loading:
                        load_data_unit_from_RGB_buffer(img, xpos, ypos, blk, DU);
                    process_DU(DU, blk);
                }
            }
        }
        vlc_fini: {
            // Bit-alignment of EOI marker
            if (bytepos>=0) {
                writebits((1<<(bytepos+1))-1, bytepos+1);
            }
        }
    }

    void process_DU(SBYTE* CDU, int blk) {
        SWORD DU[64];
        fdct_main: {
            int* int_fdtbl = blk_fdtbl[blk];
            fdct_and_quantization(CDU, int_fdtbl, DU);
        }
        vlc_main: {
            // Encode ACs
            while (i<=end0pos) { ... }
            if (end0pos!=63) writebits(EOB);
        }
    }
User-Guided Parallelisation
Parallelisation pragmas
Sample parallelisation on 3-core architecture

    processor P0 type dlx
    processor P1 type dlx
    processor P2 type dlx
    parallel ParRegion lbl main::parsection
      task LOAD target P0
        include lbl main_encoder::loading
      task DCT target P1
        include lbl process_DU::fdct_main
      task VLC target P2
        include lbl main::jpg_open
        include lbl main_encoder::vlc_init
        include lbl process_DU::vlc_main
        include lbl main_encoder::vlc_fini
        include lbl main::jpg_close
Exploration
JPEG encoding on 3-DLX architecture
For each parallelisation, MP Designer shows task graph with estimated processor loads

Task 0 "LOAD" Proc 0 "P0" (dlx)
  main_encoder::loading: 35.8 %
  IN: <none>
  par_region [b0]
  main_encoder() [b17]
  for [b18]
  for [b19]
  for [b20]
  OUT: <none>

Task 1 "DCT" Proc 1 "P1" (dlx)
  process_DU::fdct_main: 22.2 %
  process_DU::quant_main: 25.0 %
  *TOTAL* 47.1 %
  IN: in_img.height in_img.width
  par_region [b0]
  main_encoder() [b17]
  for [b18]
  for [b19]
  for [b20]
  process_DU() [b22]
  OUT: <none>

Task 2 "VLC" Proc 2 "P2" (dlx)
  main::jpg_open: 0.0 %
  main_encoder::vlc_init: 0.0 %
  process_DU::vlc_main: 16.9 %
  main_encoder::vlc_fini: 0.0 %
  main::jpg_close: 0.0 %
  *TOTAL* 16.9 %
  IN: JPG_filename SOF0info.height SOF0info.width in_img.height in_img.width
  par_region [b0]
  main_encoder() [b17]
  for [b18]
  for [b19]
  for [b20]
  process_DU() [b22]
  OUT: <none>

[NC dep T0 -> T1 @ b20]
  DU (64) [FF]
[NC dep T1 -> T2 @ b22]
  DU_ZZ (128) [FF]
  end0pos (4)
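The estimated loads in this task graph give a quick upper bound on the achievable speed-up. As a sketch, assuming the tasks run as a pipeline whose throughput is limited by the most heavily loaded core, and ignoring communication and synchronisation cost, the bound is the sum of the loads divided by the largest load. The function name is hypothetical, and this is not presented as MP Designer's own estimation method.

```c
#include <stddef.h>

/* Upper-bound speed-up estimate for a task pipeline:
 * sequential work (sum of loads) divided by the heaviest task's load.
 * Loads are percentages of the sequential run time, as in the task graph.
 * Communication and synchronisation overhead is ignored (assumption). */
double pipeline_speedup_bound(const double *load_percent, size_t n_tasks)
{
    double sum = 0.0, max = 0.0;
    for (size_t i = 0; i < n_tasks; ++i) {
        sum += load_percent[i];
        if (load_percent[i] > max)
            max = load_percent[i];
    }
    return (max > 0.0) ? sum / max : 0.0;
}
```

With the three loads reported above (35.8, 47.1 and 16.9), the bound is about 2.1: the DCT task on P1 is the bottleneck, so rebalancing that task is the natural next step in the exploration.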
T0 "bs" (dlx)
bitstream_hdr:2.0 %
bitstream_cfs: 8.2 %
idct_p_inter: 8.8 %
idct_b: 1.0 %
make_p_mv: 0.2 %
make_b_mv: 0.0 %
*TOTAL* 20.4 %
par_region [b0]
initdec() [b11]
while [b19]
getpict() [b20]
getMBs() [b23]
while [b26]
T1 "recon" (dlx)
rec_b: 3.1 %
rec_p: 20.5 %
*TOTAL* 23.7 %
IN:
framenum
has_startcode
par_region [b0]
initdec() [b11]
while [b19]
getpict() [b20]
getMBs() [b23]
while [b26]
[NC dep T0 -> T1 @ b0]
h+w (4)
[NC dep T0 -> T1 @ b11]
h+w (4)
[NC dep T0 -> T1 @ b26]
B_MV (16) [FF]
P_MV (48) [FF]
comm_mb (4)
comm_pict_hdr (2)
mvdbxy (4)
[NC dep T0 -> T1 @ b19]
start (4)
[NC dep T0 -> T1 @ b23]
MBAmax (4)
T2 "addblock" (dlx)
idct_p_intra: 2.8 %
addblock_p_intra: 1.2 %
addblock_p_inter: 3.8 %
reconblock_b: 6.7 %
addblock_b: 0.5 %
*TOTAL* 14.9 %
IN:
framenum
has_startcode
refidct
par_region [b0]
while [b19]
getpict() [b20]
getMBs() [b23]
while [b26]
[NC dep T0 -> T2 @ b26]
B_MV (16) [CF]
Mode+pCBP+pCBPB+pCOD (4x4)
pblk+bblk (768) [FF]
trb+trd (2x4)
[LC dep T0 -> T2 @ b26]
bx (4)
by (4)
pmvdbxy (4)
[NC dep T0 -> T2 @ b19]
start (4)
pb_frame (4)
[NC dep T0 -> T2 @ b23]
MBA (4)
MBAmax (4)
T3 "store_bmb" (dlx)
store_mb_b: 15.1 %
init_idct: 0.1 %
*TOTAL* 15.3 %
IN:
framenum
has_startcode
refidct
par_region [b0]
initdec() [b11]
while [b19]
getpict() [b20]
getMBs() [b23]
while [b26]
store_mb_b() [b72]
[NC dep T0 -> T3 @ b26]
comm_mb (4)
err (4)
pb_frame (4)
[NC dep T0 -> T3 @ b19]
start (4)
pb_frame (4)
[NC dep T0 -> T3 @ b23]
MBAmax (4)
T4 "out" (dlx)
store_mb_p: 21.8 %
mb_rgb_copyin: 1.6 %
padding_mb: 2.5 %
*TOTAL* 26.0 %
IN:
framenum
has_startcode
outputname
par_region [b0]
initdec() [b11]
while [b19]
getpict() [b20]
getMBs() [b23]
while [b26]
store_mb_b() [b72]
comm_mb (4)
comm_pict_hdr (2)
[NC dep T0 -> T4 @ b19]
start (4)
[NC dep T0 -> T4 @ b23]
MBAmax (4)
[NC dep T1 -> T2 @ b26]
brec (384) [FC]
prec (384) [FF]
[NC dep T2 -> T3 @ b26]
brec (384) [CF]
[NC dep T2 -> T4 @ b26]
prec (384) [CF]
[NC dep T3 -> T4 @ b72]
b_rgb_mb (884) [CF]
Task graph for H263 encoding on 8 cores
Exploration
Global dependency analysis automatically ensures correct communication & synchronisation
Manual design would be error-prone
Exploration
JPEG encoding on multi-DLX architecture
Entire exploration in only days of time

Algorithm    Cores  Parallelisation            Mcycles*      Speed-up  Load (%)                 Efficiency (%)
                                               seq    par              P0   P1   P2   P3   P4
Original     1                                 7.1    -      1         100                      100
Original     2      ld+dct+q | vlc             7.1    4.1    1.7       100  76                  86
Original     3      ld | dct+q | vlc           7.1    3.4    2.1       64   60   91             69
Original     4      ld | dct | q | vlc         7.1    3.4    2.1       77   40   24   91        52
Optimised    2      ld+dct | q+vlc             4.1    2.4    1.5       100  72                  85
Optimised    3      ld | dct+q | vlc           4.1    2.0    2.0       74   100  40             68
Optimised    3      ld | dct | q+vlc           4.1    1.7    2.4       86   56   100            78
Optimised    4      ld | dct | q | vlc         4.1    1.5    2.8       100  65   75   54        69
Split quant  3      ld | dct+q0 | q1+vlc       4.3    1.6    2.6       92   100  79             87
Dual load    5      ld0 | ld1 | dct | q | vlc  4.1    1.1    3.7       71   61   95   99   71   74

* Cycles for 256x160-pixel image
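The speed-up and efficiency columns of the exploration results can be related to the cycle counts with two small formulas. The slide does not define them; the sketch below assumes speed-up is sequential cycles over parallel cycles and efficiency is speed-up divided by the number of cores. The function names are hypothetical.

```c
/* Speed-up of a parallel mapping over the sequential run. */
double speedup(double seq_mcycles, double par_mcycles)
{
    return seq_mcycles / par_mcycles;
}

/* Parallel efficiency in percent: speed-up divided by core count.
 * Assumed definition; the slide does not spell it out. */
double efficiency_percent(double seq_mcycles, double par_mcycles, int cores)
{
    return 100.0 * speedup(seq_mcycles, par_mcycles) / cores;
}
```

For the dual-load mapping on 5 cores, `speedup(4.1, 1.1)` is about 3.7 and `efficiency_percent(4.1, 1.1, 5)` is about 74.5. Because the published cycle counts are rounded to one decimal, recomputed values can differ slightly from the figures on the slide.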
Multicore SDK
• Multicore simulation: ISS Core-1, ISS Core-2, ISS Core-3
• Multicore on-chip debugging: HW Core-1, HW Core-2, HW Core-3, each with a debug controller, accessed through a JTAG controller
Conclusion
ASIPs enable low power, acceleration and programmability in ARM-based multicore SoCs
No (efficient) multicore SoC design without tools
Design and programming of individual ASIP cores
Multicore parallelisation and platform generation
Target can be your ASIP and multicore tools partner
More information…
Come to our booth
Brochure in your conference bag
www.retarget.com