performance and power aware c code parallelization for … · 2015. 11. 2. · videos augmented...
TRANSCRIPT
Performance and Power Aware C Code Parallelization for Embedded Devices
Silexica Software Solutions GmbHAll rights reserved
Rainer LeupersChief Scientist
Silexica Software Solutions GmbH
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 2
MPSoCs and the Productivity Gap
Demand for Complex Applications
Mobile HD videos
Augmented reality
Drive-by-Wire
Human-machine Interaction
Multicores Becoming More Complex
Facts
Facts
Complex applications require multicore processors to run smoothly
No automated programmingsolution available in the market
-
3.000
6.000
2014
2017
2020
2023
2026
Max
. #
of
Cor
es
Off-the-shelf platformsSource: ITRS
8 Cores + GPU4 Cores + 3 DSPs + GPU + 3 HW
accelerators
4 Cores + 8 DSPs
Need better support for SW development in the MPSoC era
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 3
Key problem in MPSoC programming today
Lack of “the compiler” in the programming flow for MPSoCsleads to huge productivity loss!
Yesterday Today Tomorrow
Compiler
RISCRISC2
RISC1
HW
DSP1
DSP3
DSP2
Compiler Compiler Compiler Compiler
Manual analysis, partitioning and mapping
SLX
CPN
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 4
SLX Multicore Toolsuite Overview
SLXParallelizer
SLXMapper
SLXGenerator
Expose multiple forms of parallelism
Define an optimized software distribution
Generate code for thetarget platform
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 5
AutomatedPartitioning
• Information to parallelize a sequential application?• Dependencies in a Control and Data Flow Graph
• Hybrid approach: static and dynamic data flow analysis• Performance information
• Partitioning process• Clustering heuristics at the C statement granularity• Search for different parallelism patterns• Understand cost of computation and communication
SLX Parallelizer
Stmt
Stmt
Stmt
Stmt
Stmt
Stmt
Stmt
5%5%
10%
25%15%
30%10%
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 6
Supported Parallelism Types
t
t
Cores
Cores
x = 1;foo(x)
x = 1;foo(x)
bar(x);bar(x);
P1P1
P2P2
P1P1
P2P2RAW[x]
t
Task LevelParallelism
(TLP)
Task LevelParallelism
(TLP)
t
t
Cores
Cores
Loop
P2P2bar(x);bar(x);
P1P1
P2P2 t
Data LevelParallelism
(DLP)
Data LevelParallelism
(DLP)
t
t
Cores
Cores
Loop
bar(x);bar(x);t
x = foo(x)x = foo(x) RAW[x]
P1P1 P2P2 P1P1 P2P2
P1P1 P1P1P2P2 P2P2
Pipeline LevelParallelism
(PLP)
Pipeline LevelParallelism
(PLP)
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 7
SLX Parallelizer Flow
Parsing, tracing, performance estimation, dynamic data flow analysis
Graph clustering and pattern-based parallelism extraction
Feedback to the programmer
Arch. Model
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 8
Displaying Results
• Compiler graphs are difficult to handlefor the programmer
• Therefore intuitive reports are presented to the user
Source levelannotations
Source codehighlighting
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 9
SLX Parallelizer – Code Generation
• Today: Identify parallelism automatically and convert to a shared memory API specification, e.g. OpenMP
• Future: Automatically convert to a Process Network specification
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 10
SLX Mapper
• How to distribute an application based on different optimization criteria and additional user constraints to a given multicore architecture?
• Challenges: Application specificationBottleneck identificationAccurate performance estimationSimultaneous mapping of computations and communicationsHuge search space
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 11
RISC1RISC1 DSP2DSP2DSP1DSP1
RISC2RISC2 DSP3DSP3HWHW
RISC1 DSP2DSP1
RISC2 DSP3HW
SLX Mapper- Programming Model
• Many parallel programming models, but no winner
• Thoughts on programming models• Leveraged by compiler for code generation
(correctness)• Formal properties for optimizations
• Practical considerations• C dominance is unchallenged • Domain-specific (wireless, multimedia):
Dataflow programming / Streaming applications for signal processing, multimedia application and more
ProgrammingModel
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 12
Parallel Specification: CPN
• CPN: C for Process Networks• Intuitive parallel programming
• Based on C• Processes in C• Channels support C types,
structs, typedefs, …
__PNkpn Amp __PNin(short A[2]) __PNout(short B[2]) __PNparam(int boost) {
while (1)__PNin(A) __PNout(B) {for (int i = 0; i < 2; i++)B[i] = A[i]*boost;
}}
__PNprocess AudioAmp1 = Amp __PNin(C) __PNout(F) __PNparam(3);__PNprocess AudioAmp2 = Amp __PNin(D) __PNout(G) __PNparam(10);
PnTransformSdfToKpn(D, S);PnTransformToArrayAccess(D, S);CollectChannelAccessRanges(D, S);PropagateChannelAccessRanges(D, S);
PnStreamFactory streamFactory(BasePath);switch (transTarget) {case TransMVP:PnTransformTemplateInstantiate(D, S);ErasePnProcessTemplates(D);PnPrintForMVP(D, S);break;
case TransPthread:PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;
case TransSystemC:PrintForSystemC(D, S, traces, streamFactory);ErasePnDefs(D);break;
case TransVPUtg:PrintForVPUtg(D, S, streamFactory);ErasePnDefs(D);break;
case TransVPUmap:PrintForVPUmap(D, S, strMappingFileName,
streamFactory);ErasePnDefs(D);break;
case TransInvalid:assert(false);break;
}}
#include "PnTransform.h"#include "PnTVPUtg.h"#include "PnTVPUmap.h"#include "clang/AST/ASTContext.h"#include "PnStreamFactory.h"using namespace clang;
voidclang::PnTransform(TransTarget transTarget, booltraces,
const std::string & strMappingFileName, ASTContext &Ctx,
Sema &S, const llvm::sys::Path &BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =
Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {case TransSystemC:case TransVPUtg:case TransVPUmap:PnTCopying(D, S);break;
default:;
}
PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;
case TransSystemC:PrintForSystemC(D, S, traces, streamFactory);ErasePnDefs(D);break;
case TransVPUtg:PrintForVPUtg(D, S, streamFactory);ErasePnDefs(D);break;
case TransVPUmap:PrintForVPUmap(D, S, strMappingFileName,
streamFactory);ErasePnDefs(D);break;
case TransInvalid:assert(false);break;
&BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =
Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {case TransSystemC:case TransVPUtg:case TransVPUmap:PnTCopying(D, S);break;
default:;
}
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 13
SLX Mapper
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 14
SLX Mapper: Post-Mapping Analysis
Processor Utilization
Task States
Schedule
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 15
SLX Generator
• Target specific code needs to be written manually
• Challenges:Code generation for heterogeneous OS with different APIsSource code is mapping-dependentCross-compilation for a huge variety of target MPSoCs
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 16
SLX Generator Flow
Mapping
CPN-CC Compiler
KPN
Target Compiler
Plain C
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 17
Smartphone case study
Exynos Octa 5 multicore SoC Performance improvement of image
codecs via parallelizing C compiler Side-effect: energy savings Verified via physical measurements
encoder decoder
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 18
Towards system-level
power optimization
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 19
Motivation
ElectronicSystem
Level (ESL)
RegisterTransfer
Level (RTL)
GateLevel
LayoutLevel Chip
Impact of design decisions on power consumption
Timing
Power
High simulation speedEarly availability of power estimates
Level of design detail
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 20
Power estimation methodology
A
B
ESL systemcontaining component
gate-levelnetlist
layout
hardwarechip
power traceof component
automated back-annotation
C
D
ESL power estimationin different scenario
ESL power estimationin different system
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 21
Power model calibration
ESL virtual platform
inputs
ESL model
traces S
Run hardware and ESL model with same inputs will be in same state
Record ESL traces and hardware power as calibration data Assume linear power model Determine factors a with best fit
abstract powermodela = ?
Pest = S · a
determine aso that
Pest ≈ Pref ESL powermodel
a = (a1 … aN)T
Pest = S · a
minimize meansquared errora := S+ · Pref
sam
e in
puts
powercalibrationdata S, Prefinputs
hardware
details
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 22
NoC case study
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 23
Speed and accuracy
ESL power estimation methodology verified for AMBA AXI and complex custom NoC at VLSI layout level (post P&R) Estimation error < 22% Can predict different „power phases“ correctly 900x faster than low-level power simulation
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 24
ESL power estimation for clustered multicore platform Power estimation for entire SoC Focus on ARM A9 core power estimation White box approach using instrumented gem5 simulator Black box approach using OVP simulator TLM and activity traces
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 25
TI Panda Board ES
OMAP4460 2x ARM Cortex A9 2x ARM Cortex M3 DSP GPU …
SW loaded viabootleader fromSD card bare-metal
benchmarkspossible
Image by pandaboard.org
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 26
Power Measurement Setup
Test Automation Board• control power via
serial port
Benchmark Upload• via Ethernet
Standard Output• via serial port
Amplifier Circuit• for power
measurement
Data Logger• USB-DUXfast
• 16 channels• 12 bit• 5 kHz
Benchmark Time Synchronization• via GPIO pin• sampled by power
measurement circuit
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 27
White box estimation results
0
10
20
30
40
50
60
70
dhry
ston
e
lte-b
ench
mar
k
lte-b
ench
mar
k_in
t
mib
ench
/tele
com
m/a
dpcm
mib
ench
/tele
com
m/C
RC
32
mib
ench
/tele
com
m/F
FT
mib
ench
/tele
com
m/g
sm
sim
ple/
fibon
acci
sim
ple/
ins_
sort
sim
ple/
lfsr
sim
ple/
man
delb
rot_
fxp
sim
ple/
sel_
sort
sim
ple/
siev
e
sim
ple/
siev
e_of
_era
tost
hene
s
stre
amit/
bito
nic-
sort
stre
amit/
fft
stre
amit/
fft_i
nt
stre
amit/
filte
rban
k
stre
amit/
fm
stre
amit/
mat
mul
-blo
ck
stre
amit/
mat
mul
-blo
ck_i
nt
stre
amit/
mat
rixm
ult
stre
amit/
mat
rixm
ult_
int
wib
ench
/Cha
nnel
Mai
n
wib
ench
/Equ
aliz
erM
ain
wib
ench
/LTE
Sys
wib
ench
/Mod
Dem
odM
ain
wib
ench
/Rat
eMat
cher
Mai
n
wib
ench
/SC
FDM
AM
ain
wib
ench
/Scr
ambD
escr
ambM
ain
wib
ench
/Sub
Car
rierM
apD
emap
Mai
n
wib
ench
/Tra
nsfo
rmPr
eDec
Mai
n
wib
ench
/Tur
boM
ain
Single Core RMS Power Estimation Error (%)
classic
non-negative
averages: 11.1%5.4%
1∑
RMS: root mean square
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 28
Black box estimation results
0
5
10
15
20
25dh
ryst
one
dhry
ston
e2co
res
lte-b
ench
mar
k
lte-b
ench
mar
k_in
t
lte-b
ench
mar
k_in
t2co
res
mib
ench
/tele
com
m/a
dpcm
mib
ench
/tele
com
m/C
RC
32
mib
ench
/tele
com
m/F
FT
mib
ench
/tele
com
m/g
sm
pthr
ead/
audi
o_fil
ter_
4
pthr
ead/
jpeg
pthr
ead/
lte_b
ench
mar
k
pthr
ead/
man
delb
rot
pthr
ead/
mat
mul
pthr
ead/
sobe
l_co
arse
stre
amit/
bito
nic-
sort
stre
amit/
fft
stre
amit/
fft_i
nt
stre
amit/
filte
rban
k
stre
amit/
fm
stre
amit/
mat
mul
-blo
ck
stre
amit/
mat
mul
-blo
ck_i
nt
stre
amit/
mat
rixm
ult
stre
amit/
mat
rixm
ult_
int
wib
ench
/Cha
nnel
Mai
n
wib
ench
/Equ
aliz
erM
ain
wib
ench
/LTE
Sys
wib
ench
/Mod
Dem
odM
ain
wib
ench
/Rat
eMat
cher
Mai
n
wib
ench
/SC
FDM
AM
ain
wib
ench
/Scr
ambD
escr
ambM
ain
wib
ench
/Sub
Car
rierM
apD
emap
Mai
n
wib
ench
/Tra
nsfo
rmPr
eDec
Mai
n
wib
ench
/Tur
boM
ain
Dual Core Black Box OVP Power Estimation Error (%)
Bare Computed Activity
averages: 4.9%5.3%5.5%
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 29
Platform level power estimation
Multicore Cluster
RAM4MB
RAM4MB
RAM4MB
I-Cache
…
RAM4MB
RAM1MB
…
RAM1MB
Mul
ticor
e C
lust
erD-Cache
I-Cache
D-Cache
RAM4MBUart Exit Int.
Ctrl
ClusterID
Locks
ARMA9
ARMA9
CtrlID
CtrlID
power model from single core power estimation
abstract power model:Pbus = b * <in> * <out> * <width> * (s + d * <freq> * <usage>)
abstract power model:Pmem = m * <capacity> * (s + d * <freq> * 5% * <usage>)
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 30
Optimized SW task mapping (16 cores)
Audio Filter
LTE Benchmark
Observations:Energy of an application rather constant Energy optimization opportunity for heterogeneous systemsHuge optimization opportunity for average and peak powerTrade-off of runtime vs power for multicore systems is a must
Average Power Peak Power Time EnergyBest Mapping 2.52 W 4.82 W 24 us 61.6 uJOne core 0.62 W 0.62 W 92 us 56.8 uJ
Average Power Peak Power Time EnergyBest Mapping 1.98 W 9.70 W 115 us 0.23 mJOne core 0.60 W 0.61 W 245 us 0.15 mJ
x8
x16x3
x4x4
x2
SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 31
Thank you! Questions?