performance and power aware c code parallelization for … · 2015. 11. 2. · videos augmented...

31
Performance and Power Aware C Code Parallelization for Embedded Devices Silexica Software Solutions GmbH All rights reserved Rainer Leupers Chief Scientist Silexica Software Solutions GmbH

Upload: others

Post on 30-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

Performance and Power Aware C Code Parallelization for Embedded Devices

Silexica Software Solutions GmbHAll rights reserved

Rainer LeupersChief Scientist

Silexica Software Solutions GmbH

Page 2: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 2

MPSoCs and the Productivity Gap

Demand for Complex Applications

Mobile HD videos

Augmented reality

Drive-by-Wire

Human-machine Interaction

Multicores Becoming More Complex

Facts

Facts

Complex applications require multicore processors to run smoothly

No automated programmingsolution available in the market

-

3.000

6.000

2014

2017

2020

2023

2026

Max

. #

of

Cor

es

Off-the-shelf platformsSource: ITRS

8 Cores + GPU4 Cores + 3 DSPs + GPU + 3 HW

accelerators

4 Cores + 8 DSPs

Need better support for SW development in the MPSoC era

Page 3: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 3

Key problem in MPSoC programming today

Lack of “the compiler” in the programming flow for MPSoCsleads to huge productivity loss!

Yesterday Today Tomorrow

Compiler

RISCRISC2

RISC1

HW

DSP1

DSP3

DSP2

Compiler Compiler Compiler Compiler

Manual analysis, partitioning and mapping

SLX

CPN

Page 4: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 4

SLX Multicore Toolsuite Overview

SLXParallelizer

SLXMapper

SLXGenerator

Expose multiple forms of parallelism

Define an optimized software distribution

Generate code for thetarget platform

Page 5: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 5

AutomatedPartitioning

• Information to parallelize a sequential application?• Dependencies in a Control and Data Flow Graph

• Hybrid approach: static and dynamic data flow analysis• Performance information

• Partitioning process• Clustering heuristics at the C statement granularity• Search for different parallelism patterns• Understand cost of computation and communication

SLX Parallelizer

Stmt

Stmt

Stmt

Stmt

Stmt

Stmt

Stmt

5%5%

10%

25%15%

30%10%

Page 6: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 6

Supported Parallelism Types

t

t

Cores

Cores

x = 1;foo(x)

x = 1;foo(x)

bar(x);bar(x);

P1P1

P2P2

P1P1

P2P2RAW[x]

t

Task LevelParallelism

(TLP)

Task LevelParallelism

(TLP)

t

t

Cores

Cores

Loop

P2P2bar(x);bar(x);

P1P1

P2P2 t

Data LevelParallelism

(DLP)

Data LevelParallelism

(DLP)

t

t

Cores

Cores

Loop

bar(x);bar(x);t

x = foo(x)x = foo(x) RAW[x]

P1P1 P2P2 P1P1 P2P2

P1P1 P1P1P2P2 P2P2

Pipeline LevelParallelism

(PLP)

Pipeline LevelParallelism

(PLP)

Page 7: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 7

SLX Parallelizer Flow

Parsing, tracing, performance estimation, dynamic data flow analysis

Graph clustering and pattern-based parallelism extraction

Feedback to the programmer

Arch. Model

Page 8: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 8

Displaying Results

• Compiler graphs are difficult to handlefor the programmer

• Therefore intuitive reports are presented to the user

Source levelannotations

Source codehighlighting

Page 9: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 9

SLX Parallelizer – Code Generation

• Today: Identify parallelism automatically and convert to a shared memory API specification, e.g. OpenMP

• Future: Automatically convert to a Process Network specification

Page 10: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 10

SLX Mapper

• How to distribute an application based on different optimization criteria and additional user constraints to a given multicore architecture?

• Challenges: Application specificationBottleneck identificationAccurate performance estimationSimultaneous mapping of computations and communicationsHuge search space

Page 11: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 11

RISC1RISC1 DSP2DSP2DSP1DSP1

RISC2RISC2 DSP3DSP3HWHW

RISC1 DSP2DSP1

RISC2 DSP3HW

SLX Mapper- Programming Model

• Many parallel programming models, but no winner

• Thoughts on programming models• Leveraged by compiler for code generation

(correctness)• Formal properties for optimizations

• Practical considerations• C dominance is unchallenged • Domain-specific (wireless, multimedia):

Dataflow programming / Streaming applications for signal processing, multimedia application and more

ProgrammingModel

Page 12: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 12

Parallel Specification: CPN

• CPN: C for Process Networks• Intuitive parallel programming

• Based on C• Processes in C• Channels support C types,

structs, typedefs, …

__PNkpn Amp __PNin(short A[2]) __PNout(short B[2]) __PNparam(int boost) {

while (1)__PNin(A) __PNout(B) {for (int i = 0; i < 2; i++)B[i] = A[i]*boost;

}}

__PNprocess AudioAmp1 = Amp __PNin(C) __PNout(F) __PNparam(3);__PNprocess AudioAmp2 = Amp __PNin(D) __PNout(G) __PNparam(10);

PnTransformSdfToKpn(D, S);PnTransformToArrayAccess(D, S);CollectChannelAccessRanges(D, S);PropagateChannelAccessRanges(D, S);

PnStreamFactory streamFactory(BasePath);switch (transTarget) {case TransMVP:PnTransformTemplateInstantiate(D, S);ErasePnProcessTemplates(D);PnPrintForMVP(D, S);break;

case TransPthread:PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces, streamFactory);ErasePnDefs(D);break;

case TransVPUtg:PrintForVPUtg(D, S, streamFactory);ErasePnDefs(D);break;

case TransVPUmap:PrintForVPUmap(D, S, strMappingFileName,

streamFactory);ErasePnDefs(D);break;

case TransInvalid:assert(false);break;

}}

#include "PnTransform.h"#include "PnTVPUtg.h"#include "PnTVPUmap.h"#include "clang/AST/ASTContext.h"#include "PnStreamFactory.h"using namespace clang;

voidclang::PnTransform(TransTarget transTarget, booltraces,

const std::string & strMappingFileName, ASTContext &Ctx,

Sema &S, const llvm::sys::Path &BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =

Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {case TransSystemC:case TransVPUtg:case TransVPUmap:PnTCopying(D, S);break;

default:;

}

PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces, streamFactory);ErasePnDefs(D);break;

case TransVPUtg:PrintForVPUtg(D, S, streamFactory);ErasePnDefs(D);break;

case TransVPUmap:PrintForVPUmap(D, S, strMappingFileName,

streamFactory);ErasePnDefs(D);break;

case TransInvalid:assert(false);break;

&BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =

Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {case TransSystemC:case TransVPUtg:case TransVPUmap:PnTCopying(D, S);break;

default:;

}

Page 13: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 13

SLX Mapper

Page 14: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 14

SLX Mapper: Post-Mapping Analysis

Processor Utilization

Task States

Schedule

Page 15: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 15

SLX Generator

• Target specific code needs to be written manually

• Challenges:Code generation for heterogeneous OS with different APIsSource code is mapping-dependentCross-compilation for a huge variety of target MPSoCs

Page 16: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 16

SLX Generator Flow

Mapping

CPN-CC Compiler

KPN

Target Compiler

Plain C

Page 17: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 17

Smartphone case study

Exynos Octa 5 multicore SoC Performance improvement of image

codecs via parallelizing C compiler Side-effect: energy savings Verified via physical measurements

encoder decoder

Page 18: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 18

Towards system-level

power optimization

Page 19: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 19

Motivation

ElectronicSystem

Level (ESL)

RegisterTransfer

Level (RTL)

GateLevel

LayoutLevel Chip

Impact of design decisions on power consumption

Timing

Power

High simulation speedEarly availability of power estimates

Level of design detail

Page 20: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 20

Power estimation methodology

A

B

ESL systemcontaining component

gate-levelnetlist

layout

hardwarechip

power traceof component

automated back-annotation

C

D

ESL power estimationin different scenario

ESL power estimationin different system

Page 21: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 21

Power model calibration

ESL virtual platform

inputs

ESL model

traces S

Run hardware and ESL model with same inputs will be in same state

Record ESL traces and hardware power as calibration data Assume linear power model Determine factors a with best fit

abstract powermodela = ?

Pest = S · a

determine aso that

Pest ≈ Pref ESL powermodel

a = (a1 … aN)T

Pest = S · a

minimize meansquared errora := S+ · Pref

sam

e in

puts

powercalibrationdata S, Prefinputs

hardware

details

Page 22: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 22

NoC case study

Page 23: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 23

Speed and accuracy

ESL power estimation methodology verified for AMBA AXI and complex custom NoC at VLSI layout level (post P&R) Estimation error < 22% Can predict different „power phases“ correctly 900x faster than low-level power simulation

Page 24: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 24

ESL power estimation for clustered multicore platform Power estimation for entire SoC Focus on ARM A9 core power estimation White box approach using instrumented gem5 simulator Black box approach using OVP simulator TLM and activity traces

Page 25: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 25

TI Panda Board ES

OMAP4460 2x ARM Cortex A9 2x ARM Cortex M3 DSP GPU …

SW loaded viabootleader fromSD card bare-metal

benchmarkspossible

Image by pandaboard.org

Page 26: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 26

Power Measurement Setup

Test Automation Board• control power via

serial port

Benchmark Upload• via Ethernet

Standard Output• via serial port

Amplifier Circuit• for power

measurement

Data Logger• USB-DUXfast

• 16 channels• 12 bit• 5 kHz

Benchmark Time Synchronization• via GPIO pin• sampled by power

measurement circuit

Page 27: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 27

White box estimation results

0

10

20

30

40

50

60

70

dhry

ston

e

lte-b

ench

mar

k

lte-b

ench

mar

k_in

t

mib

ench

/tele

com

m/a

dpcm

mib

ench

/tele

com

m/C

RC

32

mib

ench

/tele

com

m/F

FT

mib

ench

/tele

com

m/g

sm

sim

ple/

fibon

acci

sim

ple/

ins_

sort

sim

ple/

lfsr

sim

ple/

man

delb

rot_

fxp

sim

ple/

sel_

sort

sim

ple/

siev

e

sim

ple/

siev

e_of

_era

tost

hene

s

stre

amit/

bito

nic-

sort

stre

amit/

fft

stre

amit/

fft_i

nt

stre

amit/

filte

rban

k

stre

amit/

fm

stre

amit/

mat

mul

-blo

ck

stre

amit/

mat

mul

-blo

ck_i

nt

stre

amit/

mat

rixm

ult

stre

amit/

mat

rixm

ult_

int

wib

ench

/Cha

nnel

Mai

n

wib

ench

/Equ

aliz

erM

ain

wib

ench

/LTE

Sys

wib

ench

/Mod

Dem

odM

ain

wib

ench

/Rat

eMat

cher

Mai

n

wib

ench

/SC

FDM

AM

ain

wib

ench

/Scr

ambD

escr

ambM

ain

wib

ench

/Sub

Car

rierM

apD

emap

Mai

n

wib

ench

/Tra

nsfo

rmPr

eDec

Mai

n

wib

ench

/Tur

boM

ain

Single Core RMS Power Estimation Error (%)

classic

non-negative

averages: 11.1%5.4%

1∑

RMS: root mean square

Page 28: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 28

Black box estimation results

0

5

10

15

20

25dh

ryst

one

dhry

ston

e2co

res

lte-b

ench

mar

k

lte-b

ench

mar

k_in

t

lte-b

ench

mar

k_in

t2co

res

mib

ench

/tele

com

m/a

dpcm

mib

ench

/tele

com

m/C

RC

32

mib

ench

/tele

com

m/F

FT

mib

ench

/tele

com

m/g

sm

pthr

ead/

audi

o_fil

ter_

4

pthr

ead/

jpeg

pthr

ead/

lte_b

ench

mar

k

pthr

ead/

man

delb

rot

pthr

ead/

mat

mul

pthr

ead/

sobe

l_co

arse

stre

amit/

bito

nic-

sort

stre

amit/

fft

stre

amit/

fft_i

nt

stre

amit/

filte

rban

k

stre

amit/

fm

stre

amit/

mat

mul

-blo

ck

stre

amit/

mat

mul

-blo

ck_i

nt

stre

amit/

mat

rixm

ult

stre

amit/

mat

rixm

ult_

int

wib

ench

/Cha

nnel

Mai

n

wib

ench

/Equ

aliz

erM

ain

wib

ench

/LTE

Sys

wib

ench

/Mod

Dem

odM

ain

wib

ench

/Rat

eMat

cher

Mai

n

wib

ench

/SC

FDM

AM

ain

wib

ench

/Scr

ambD

escr

ambM

ain

wib

ench

/Sub

Car

rierM

apD

emap

Mai

n

wib

ench

/Tra

nsfo

rmPr

eDec

Mai

n

wib

ench

/Tur

boM

ain

Dual Core Black Box OVP Power Estimation Error (%)

Bare Computed Activity

averages: 4.9%5.3%5.5%

Page 29: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 29

Platform level power estimation

Multicore Cluster

RAM4MB

RAM4MB

RAM4MB

I-Cache

RAM4MB

RAM1MB

RAM1MB

Mul

ticor

e C

lust

erD-Cache

I-Cache

D-Cache

RAM4MBUart Exit Int.

Ctrl

ClusterID

Locks

ARMA9

ARMA9

CtrlID

CtrlID

power model from single core power estimation

abstract power model:Pbus = b * <in> * <out> * <width> * (s + d * <freq> * <usage>)

abstract power model:Pmem = m * <capacity> * (s + d * <freq> * 5% * <usage>)

Page 30: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 30

Optimized SW task mapping (16 cores)

Audio Filter

LTE Benchmark

Observations:Energy of an application rather constant Energy optimization opportunity for heterogeneous systemsHuge optimization opportunity for average and peak powerTrade-off of runtime vs power for multicore systems is a must

Average Power Peak Power Time EnergyBest Mapping 2.52 W 4.82 W 24 us 61.6 uJOne core 0.62 W 0.62 W 92 us 56.8 uJ

Average Power Peak Power Time EnergyBest Mapping 1.98 W 9.70 W 115 us 0.23 mJOne core 0.60 W 0.61 W 245 us 0.15 mJ

x8

x16x3

x4x4

x2

Page 31: Performance and Power Aware C Code Parallelization for … · 2015. 11. 2. · videos Augmented reality Drive-by-Wire Human-machine Interaction Multicores Becoming More Complex Facts

SILEXICA SOFTWARE SOLUTIONS GMBH, ALL RIGHTS RESERVED 31

Thank you! Questions?