
JSPS manuscript No. (will be inserted by the editor)

He-P2012: Performance and Energy Exploration of Architecturally Heterogeneous Many-Cores

Francesco Conti · Andrea Marongiu · Chuck Pilkington · Luca Benini

Received: date / Accepted: date

Abstract The end of Dennardian scaling in advanced technologies brought about new architectural templates to overcome the so-called utilization wall and provide Moore's Law-like performance and energy scaling in embedded SoCs. One of the most promising templates, architectural heterogeneity, is hindered by high cost due to the design space explosion and the lack of effective exploration tools. Our work provides three contributions towards a scalable and effective methodology for design space exploration in embedded MC-SoCs. First, we present the He-P2012 architecture, augmenting the state-of-art STMicroelectronics P2012 platform with heterogeneous shared-L1 coprocessors called HW processing elements (HWPE). Second, we propose a novel methodology for the semi-automatic definition and instantiation of shared-memory HWPEs from a C source, supporting both simple and structured data types. Third, we demonstrate that the integration of HWPEs can provide significant performance and energy efficiency benefits on a set of benchmarks originally developed for the homogeneous P2012, achieving up to 123x speedup on the accelerated code region (∼98% of Amdahl's law limit) while saving 2/3 of the energy.

This work was supported by the EU Projects P-SOCRATES (FP7-611016) and PHIDIAS (FP7-318013).

Francesco Conti · Andrea Marongiu · Luca Benini
Department of Electrical, Electronic and Information Engineering, University of Bologna
E-mail: {f.conti,a.marongiu,luca.benini}@unibo.it

Andrea Marongiu · Luca Benini
Integrated Systems Laboratory, ETH Zurich
E-mail: {a.marongiu,luca.benini}@iis.ee.ethz.ch

Chuck Pilkington
Synopsys Ottawa (formerly at STMicroelectronics Ottawa)
E-mail: [email protected]

Keywords Many-core · Heterogeneity · HW Acceleration · Exploration Flow

1 Introduction

One of the main challenges for the design of future embedded Systems-on-Chip is the utilization wall [?]: due to the end of Dennardian scaling, the fraction of the chip that can be kept on at a given power budget decreases with every new technology node. Most of the area in future SoCs will therefore be “dark silicon” [?], i.e. composed of transistors that are only seldom powered on; designers must face the challenge of providing more computing performance by exploiting this dark silicon in an effective and sensible way. The utilization wall limits the amount of parallelism that can be exploited by adding more and more general-purpose processors, because it is impossible to keep them simultaneously on. At the same time, though, it also frees chip area for other purposes.

Proposed techniques to make effective use of dark silicon include dim and specialized silicon [?], i.e. respectively silicon that is used seldom or at a lower frequency, and silicon that performs specialized functions. By inserting both general- and special-purpose processors in a single computing fabric we can effectively counteract the loss of potential performance due to the utilization wall, since special-purpose processors are typically not in use simultaneously, nor are they always on [?]. However, heterogeneous architectures increase the size and complexity of the design space along several axes: granularity of the heterogeneous processors, coupling with software cores, communication interfaces, etc. Simulation flows and tools aimed at simplifying design space exploration are thus highly beneficial.


Evolution towards heterogeneity in multi- and many-core SoCs stimulates an evolution also in traditional HW/SW codesign techniques [?][?]. In this work, we extend and complete our previous work in [?], where we augmented the STMicroelectronics P2012 cluster with tightly-coupled shared-L1 memory hardware processing elements (HWPEs). HWPEs can be used to increase performance and/or energy efficiency of the P2012 homogeneous clusters. Shared-L1 HWPEs work in-place on data shared by the SW cores rather than on a copy, therefore avoiding the consistency issues typical of many acceleration schemes and allowing cooperation between SW threads and HW jobs. To support this architectural heterogeneity paradigm, we developed a novel tool flow and design methodology that allows HWPEs to be generated semi-automatically starting from normal C code for the homogeneous P2012, using commercial high-level and RTL synthesis tools. Our methodology supports a significant fraction of the C syntax and respects SW conventions for pointer manipulation and data layout, therefore allowing shared-memory acceleration for non-trivial accelerated code. Generated HWPE models are integrated in the P2012 platform simulator; this facilitates exploration of various acceleration schemes, early performance assessment and generation of HW using commercial HLS tools.

The rest of the paper is organized as follows: in Section 2 we discuss work related to our heterogeneity paradigm and co-design approach. Section 3 describes the homogeneous P2012 cluster and our heterogeneous extension, He-P2012. In Section 4 we detail the tool flow we developed to perform architectural exploration. Section 5 presents the results we obtained by benchmarking a set of applications developed for the homogeneous P2012 platform and augmenting them with tightly-coupled shared-memory accelerators. Finally, in Section 6 we summarize our contributions and findings and discuss future research directions.

2 Related work

Heterogeneous architectures based on HW accelerators work in an asymmetric fashion: they feature a processor (or a group of processors) acting as host running non-accelerable code, and accelerators used to achieve a greater level in some metric (e.g. performance or energy efficiency). This kind of heterogeneity can be present at many scales: for example, P2012 – the platform we augment with internal heterogeneity in this work – is itself meant to be used as an accelerator to an ARM host processor running mostly sequential code. We can group acceleration paradigms regardless of the scale at which they work by considering two main axes: how tight the integration between host and accelerator is, and how data is exchanged between them.

A first approach is to insert accelerators directly inside the processor pipeline, and share data through the register file. This is the case of ASIPs with specialized functional units, such as the ones provided by Tensilica (now Cadence) [?] and Synopsys Processor Designer [?]. Other examples include specialized architectures such as Movidius Myriad [?] and Silicon Hive [?] that rely on VLIW cores augmented with special-purpose units. This acceleration approach can provide significant performance and energy-efficiency gains on very fine-grain workloads, but it is inefficient for coarser-grain functionality and is usually limited to communication through the register file interface. Moreover, as it is implemented inside a processor's pipeline, it is still subject to the Von Neumann fetch-decode-execute bottleneck, which can seriously limit its potential efficiency. The efficacy of this acceleration approach also relies on very deep knowledge of the algorithm and the architecture on the programmer's part (therefore limiting the number of proficient users), or on very efficient compiler technology that is often not available.

The other extreme of the spectrum is to use very coarse-grained accelerators and rely mostly on message-passing channels for host-accelerator communication, adopting a dataflow model of computation. This approach has been leveraged for software-defined radio applications (see for example Ramacher et al. [?], CEA MAGALI [?][?] and StepNP [?]), in high-performance computing (for example by Maxeler [?]) and in the embedded vision domain (for example Vortex [?][?] for biologically-inspired vision acceleration, and NeuFlow [?] and nn-X [?], which focus on Convolutional Neural Network acceleration). Loose coupling of processors and accelerators through message-passing channels is extremely scalable and power-efficient, as accelerators completely bypass the Von Neumann bottleneck, working only when they are fed with input. Many high-level synthesis tools [?] partially relieve the designer of the burden of the HW design; however, from the SW perspective, this approach requires extensive rewriting of the code following a dataflow model that is completely different from the dominant SW imperative paradigm [?]. Moreover, not all algorithms are amenable to being implemented in a dataflow fashion without loss of generality or efficiency, and fine-grain acceleration is typically not possible due to the overhead of message-passing communication. Recent research on loosely-coupled accelerators has concentrated on targeting FPGA devices, often trying to hide the under-the-hood model from the developer by exposing a high-level programming model such as OpenCL [?].


Platform                 Offload latency   Memory latency   Memory bandwidth        Memory size
Dataflow (Maxeler)       1000 cycles       10-100 cycles    64 B/cycle (per link)   tens of GBs
GPGPU (Nvidia)           1000 cycles       10-100 cycles    64 B/cycle              1 GB
FPGA SoC (Zynq)          1000 cycles       10-100 cycles    8 B/cycle               1 GB
He-P2012 (many-core)     1000 cycles       10-100 cycles    8 B/cycle               1 GB
He-P2012 (HWPEs)         10-50 cycles      2-3 cycles       4 B/cycle (per port)    hundreds of KBs
ASIP                     1 cycle           1 cycle          4 B/cycle (per port)    1 KB (reg. file)

Table 1: Orders of magnitude of offload latency and memory parameters for several heterogeneous paradigms.

In between these two extremes, we find many architectures that do not imply insertion within a processor pipeline, but share data with the host through a level of the memory hierarchy. At a high level, this is what is done by APUs [?] and many-core accelerators such as P2012 [?], which share DRAM with an x86 or ARM host processor optimized for sequential execution. At a smaller scale, special-function units in Nvidia GPUs [?] and coprocessors in Plurality HyperCore [?] are examples of this trend. A number of academic groups also advocate this approach to accelerate many-cores. The GreenDroid architecture [?] features clusters with a single general-purpose core, which is tightly coupled to a sea of smaller conservation cores (or c-cores) [?] that communicate with the rest of the system through an L1 coherent cache. C-cores target energy efficiency of commonly executed portions of general-purpose code, such as parts of the Android stack, by leveraging high-level synthesis tools. The EXOCHI architecture, introduced by Wang et al. [?], features hardware accelerators modeled as coarse-grain MIMD functional units, collaborating with IA32 CPU cores by sharing the same virtual memory space. Cong et al. [?] also tackle the utilization wall by developing a heterogeneous multi-core architecture with shared-memory accelerators; their HW IPs communicate by means of shared L2 caches, accessible through NoC nodes. Previous work by our group (Burgio et al. [?], Dehyadegari et al. [?][?], Conti et al. [?]) considers a tightly-coupled multi-core based on RISC32 cores sharing an L1 scratchpad and extends it with hardware processing units (HWPUs). HWPUs are managed by the software through an OpenMP-based programming model designed to mix parallelization and acceleration. This third approach is promising since it overcomes the scalability issues of ASIP-like architectures while not requiring extensive changes to the application code, and makes it easier to integrate accelerator programming into standard shared-memory programming models; it makes efficient use of dark silicon since accelerators are triggered only in correspondence with hot regions of code, and then switched off.

To better position this work with respect to other shared-memory heterogeneity paradigms, in Table 1 we report the orders of magnitude of the offload latency and of the memory access bandwidth, latency and size for some of these paradigms. Orders of magnitude of memory performance metrics for Nvidia GPGPUs and the Zynq FPGA SoC are derived from publicly available information from Nvidia and Xilinx; metrics for the Maxeler dataflow engine are derived from [?]. We list two distinct sets of metrics for the He-P2012 platform, as it exposes two distinct layers of heterogeneity: that between the host and the P2012 fabric, and that between the cores in the fabric and the in-cluster HWPEs; this work focuses exclusively on the latter.

Our approach differs from GreenDroid and similar architectures in that in our case the shared L1 memory is a software-managed scratchpad, and not a cache, which introduces major differences in the communication interface design choices. As we extend the STMicroelectronics P2012 cluster with tightly L1-coupled HW accelerators, our architectural template is similar to that used in Burgio et al. [?] and Dehyadegari et al. [?], but in our case it was implemented in a real, complete many-core platform. Moreover, contrary to previous work in this category of accelerators, we propose a full-fledged flow that allows fast design exploration of heterogeneous clusters, with semi-automatic generation of HWPEs and estimation of power and area consumption based on RTL synthesis results. With respect to our previous work in [?], we improved our design flow by adding #pragmas that directly control not only the high-level synthesis tool, but also the subsequent logic synthesis tool. Moreover, we developed two new benchmarks for P2012 (one based on Convolutional Neural Networks, the other on the Mahalanobis Distance) and accelerated them using our design flow.

3 Hardware architecture

3.1 P2012 homogeneous cluster architecture

Platform 2012 [?] is an effort led by STMicroelectronics to build up a fabric of tightly-coupled homogeneous clusters. In the tightly-coupled cluster paradigm (exemplified by many-core accelerators such as P2012 and Kalray MPPA [?]), each cluster is composed of a relatively small number of simple in-order RISC cores (Processing Elements or PEs) that communicate through a fast, low-latency interconnection to a shared L1 or L2 data memory. Clusters are then connected through a high-bandwidth scalable medium such as a network-on-chip.

Fig. 1: He-P2012 cluster: a P2012 cluster extended for heterogeneous computing

In P2012, each cluster contains up to 16 STxP70-v4 cores (PEs) sharing an L1 scratchpad of 256 KB, called tightly-coupled data memory (TCDM). Each PE has a 16 KB instruction cache; there is no data cache. The logarithmic interconnection between the PEs and the TCDM was designed to be ultra-low latency (see Rahimi et al. [?]), with a banking factor of 2 to minimize memory contention. In the absence of access contention, a load from shared memory through the logarithmic interconnection takes two cycles and a store takes one. The cluster also has a similarly designed peripheral interconnection that is used for communication between PEs and peripherals, DMAs, memory outside of the cluster, etc.

The P2012 fabric is not meant to be used as a stand-alone computing device, but as a programmable accelerator connected to an ARM host processor. Sequential code is mostly executed on the ARM host, while highly parallel workloads are offloaded to the fabric. In its first embodiment, the STHORM board, the P2012 fabric is connected to a Zynq device acting as a host running a full operating system (Linux or Android), while the fabric runs only a light runtime to support the OpenCL and OpenMP programming models.

3.2 He-P2012: Heterogeneous P2012

We propose here He-P2012, a heterogeneous extension of the P2012 architecture. In He-P2012, the tightly-coupled clusters are augmented with a set of HW accelerators that communicate with PEs via the same shared L1 data memory (a scratchpad) used by the PEs themselves. These tightly-coupled shared-memory HW Processing Elements, or HWPEs [?], address the shortcomings of more traditional copy-based accelerator models, such as the necessity to transfer and maintain coherent multiple copies of data between distinct processor and accelerator memory spaces. In addition, they diminish the semantic gap between executing a task in SW or HW: from a programming perspective, dispatching a task on a HW IP is no different from calling a SW function to perform the same task. In fact, the programming model we propose in Section 4.1 exposes a synchronous HW interface call that maintains the same interface and semantics as the accelerated SW function, although asynchronous calls may also be used to enhance performance or energy savings when needed.

Each HWPE is treated as an additional master with one or more ports on the cluster logarithmic interconnection. The logarithmic interconnection manages synchronization on memory accesses via a simple round-robin scheme to avoid starvation [?]: access to a contended memory location is granted to a single master, while the other masters are stalled for one cycle. Contention between accesses from HWPEs and software PEs (or between those from different HWPEs) is treated in the same way as in the homogeneous cluster. The performance loss due to contention depends on the computation-to-communication ratio of PEs and HWPEs and on the architectural parameters of the shared memory, such as its banking factor. In Figure 2 we plot the results of a synthetic benchmark showing the execution time overhead of a 4-master-port HWPE running in a cluster with 16 PEs, while varying the probability of a memory access for each PE and each HWPE port. The maximum overhead is ≈70% when all 16 PEs have P_mem_op_PE = 90% and each HWPE port has P_mem_op_HWPE = 90%; this can be considered a worst-case scenario where all 16 PEs are essentially being used for data movement only. In a more realistic high-contention scenario, with P_mem_op_PE = 50% and P_mem_op_HWPE = 80%, the overhead is reduced to ≈30%. This is consistent with our empirical observations. In many cases, the software computation-to-communication ratio is much higher and, as a consequence, the overhead introduced by memory contention is less than 15%.¹ All results reported in Section 5 include modeling of memory contention.
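To make the arbitration policy concrete, the following fragment is a behavioral sketch (purely illustrative, not the RTL of the logarithmic interconnect) of single-bank round-robin arbitration among the masters requesting the same TCDM bank in a given cycle; the function name and data structures are ours.

#include <vector>

// Illustrative behavioral model of round-robin arbitration on one TCDM bank:
// the requesting master closest to the rotating priority pointer is granted,
// the others are stalled for one cycle (as described in the text above).
int arbitrate_bank(const std::vector<bool> &req, int &rr_ptr) {
  const int n = (int)req.size();
  for (int k = 0; k < n; ++k) {
    int m = (rr_ptr + k) % n;      // scan starting from the priority pointer
    if (req[m]) {
      rr_ptr = (m + 1) % n;        // rotate priority past the winning master
      return m;                    // granted master; contenders stall 1 cycle
    }
  }
  return -1;                       // no request on this bank in this cycle
}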

Fig. 2: Overhead due to memory contention on a synthetic benchmark.

¹ A more detailed discussion regarding the architectural tradeoffs of shared-memory HWPEs versus private-memory ones can be found in Dehyadegari et al. [?].

HWPEs are also connected as targets to the peripheral interconnection for control. Figure 1 shows a diagram of the He-P2012 cluster extended with HWPEs for heterogeneous computing. As detailed in our previous work [?], HWPEs are designed as two separate modules:

1. the HW IP is a datapath that implements the accelerated function. It can be designed using high-level synthesis from a C function.
2. the wrapper provides the HW IP with the capability of accessing shared memory through a set of ports on the logarithmic interconnect and of accepting jobs from the SW processing elements into a job queue.

The HW IP can be developed directly from the original software function by means of standard methodologies (e.g., HLS tools [?][?]) with no additional programmer effort. High-level synthesis produces an IP that can read from and write to an external memory using a custom protocol based on address and handshake. It is possible to build HW IPs with more than one port for enhanced memory bandwidth. An important limitation of current HLS tools is that they do not support IPs dynamically addressing data in the external memory: all transactions happen in a private memory space, where data is placed statically (i.e. at design time). This limitation is overcome by connecting the HW IP to the HWPE wrapper.

The wrapper, shown in Figure 3, was designed to take care of three tasks:

– translating the memory accesses from the custom handshake protocol produced by the HLS tool to the one used in the shared-memory interconnection;
– augmenting the memory transactions so that they support data placed dynamically (i.e. at run time) in the shared memory;
– providing a control interface through which the SW PEs can offload jobs to the HWPE.

The wrapper is divided into a controller submodule, which translates HW IP transactions into shared-memory ones, and a slave module hosting a register file to provide the control interface.

To offload a new job to the HWPE, a SW PE must i) take a lock on the HWPE (so that no other PE can concurrently offload a different job) by accessing a special register in the register file, ii) write the base addresses of I/O data into the HWPE register file and iii) trigger the execution (and release the lock) by accessing another special register. The controller then uses the addresses in its register file to remap memory accesses from the HW IP to the right region in the shared memory, using the address translation block. The register file in the wrapper can host a queue of up to 4 jobs, so that jobs can be offloaded to the HWPE even when the HW IP is busy, minimizing the offload overhead; as soon as one job is completed, the next one is started.
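As a minimal sketch of what steps i)-iii) could look like from the SW side, the following C fragment drives a memory-mapped view of the wrapper register file; the base address and register offsets are hypothetical placeholders, not the actual He-P2012 memory map.

#include <stdint.h>

/* Hypothetical offsets of one HWPE wrapper's registers on the peripheral
 * interconnect (placeholders, not the real memory map). */
#define HWPE_BASE     0x10200000u
#define HWPE_LOCK     0x00u   /* read: 0 means the lock was acquired         */
#define HWPE_IN_PTR   0x04u   /* base address of the input data in TCDM      */
#define HWPE_OUT_PTR  0x08u   /* base address of the output data in TCDM     */
#define HWPE_TRIGGER  0x0Cu   /* write: enqueue the job and release the lock */

static volatile uint32_t *hwpe_reg(uint32_t offset) {
  return (volatile uint32_t *)(uintptr_t)(HWPE_BASE + offset);
}

/* i) lock the HWPE, ii) program the I/O base addresses, iii) trigger. */
static void hwpe_offload(void *in, void *out) {
  while (*hwpe_reg(HWPE_LOCK) != 0)
    ;                                        /* spin until the lock is granted */
  *hwpe_reg(HWPE_IN_PTR)  = (uint32_t)(uintptr_t)in;
  *hwpe_reg(HWPE_OUT_PTR) = (uint32_t)(uintptr_t)out;
  *hwpe_reg(HWPE_TRIGGER) = 1;               /* job enters the wrapper's queue */
}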

The HLS tool produces a cycle-accurate SystemC model for simulation and a Verilog model that can be used for both RTL simulation and synthesis; these can be used in conjunction with a SystemC or SystemVerilog implementation of the wrapper, respectively.

4 Exploration flow

The final goal of the tool flow presented in this paper is to enable fast design space exploration of heterogeneous architectures. To achieve this goal we need an instrument that is easy to use and that leverages a clear, well-defined methodology. GEPOP (Generic Posix Platform), the simulation infrastructure provided with the official P2012 SDK, provides a convenient starting point. Clearly, it only models the original homogeneous P2012 clusters, so we extended it to support accurate simulation of HWPEs, wrapped with the discussed tightly-coupled shared-memory methodology. Models of all HWPEs are encapsulated in a single SystemC model that runs as a separate process. A GEPOP plugin manages communication between this model and the rest of the simulation platform by means of UNIX pipes.

Fig. 3: Simplified architecture of the HWPE wrapper: the controller module (protocol conversion logic and address translation block) connects the HW IP to the shared-memory interconnect, while the slave module hosts the register file and the control interface on the peripheral interconnect.

Fig. 4: He-P2012 flow for cycle-accurate HWPE simulation

For different applications and their associated acceleration opportunities we obviously need to generate different simulation platforms. We simplify simulator generation by automating essentially all of the steps required to create an executable simulation. The user is exclusively responsible for:

1. writing the main application source code, using one of the programming models supported in homogeneous P2012 (i.e., OpenCL [?] or OpenMP [?]);
2. defining the HW/SW interface in a custom language (SIDL, see Section 4.1);
3. extracting the functions to be accelerated into separate C files.

The tool flow uses this information to automatically generate all of the underlying glue code that is required to execute the application on our heterogeneous platform.

HLS and RTL synthesis scripts, SystemC wrappers for the HWPEs and SW API calls to be used in the application code are all generated from the interface, which is described in a custom language called SIDL (Software/Hardware Interface Definition Language). Figure 4 shows how the SIDL compiler takes the HW/SW interface specification and produces information for both the HW and the SW sides of the flow. The compiler produces several TCL scripts to drive the HLS and RTL synthesis tools (Calypto Catapult and Synopsys Design Compiler, respectively). Moreover, it also generates a set of C++ classes that are used to support the shared-memory interface between HW and SW in the Catapult HLS tool, as described in Section 4.2, and the top-level SystemC wrapper that encapsulates all HWPEs. Finally, it also generates the C API (application programming interface) that can be used by the OpenCL or OpenMP kernels in the GEPOP simulation to call for HW-accelerated execution of a function.

It is worth mentioning that our flow also supports a less accurate simulation mode, in which the HWPEs are modeled at a purely functional abstraction level. This can be useful at early design stages to quickly get a feel for the behavior of different SW/HW partitioning schemes.

4.1 Software/Hardware Interface

The tools model the interaction between software and HWPEs as a client/server communication, similarly to what is done in DSOC [?]; in our case, the client is the SW running on the He-P2012 processing elements and the servers are the available HWPEs. The client/server interface is expressed in SIDL (Software/Hardware Interface Definition Language), a subset of C++.

A SIDL source file is composed of a preamble, in which structured data types are declared using C++ struct syntax, and a series of interfaces, which are declared as the virtual methods of a pure virtual class (e.g., class hwpe). Since the HWPEs share data exclusively through the shared TCDM, the inputs and outputs of the HW IP are all expressed by means of pointers to simple or structured data types, as when passing arguments by reference to a C function. The following is the SIDL code describing the interface for two functions called foo and bar:

// library interface function specification (SIDL)
class hwpe {
  #pragma sidl unroll core/inner_loop 4
  #pragma sidl pipeline core/main_loop 2
  virtual void foo(int *in, int *out) = 0;

  #pragma sidl unroll core/main_loop
  virtual void bar(int *in, int *out) = 0;
};

The SIDL source may also contain a small set of pragmas that are used to direct the HLS tools in a relatively easy way. Table 2 lists the supported pragmas.

#pragma sidl ...        Description
nb hwpe N               duplicates the HWPE N times
nb master N             uses N shared-mem ports
unroll LOOP N           unrolls LOOP N times
pipeline LOOP N         pipelines LOOP with init. interval N
directive DIR           passes the DIR directive to Catapult

Table 2: SIDL Pragmas


Following the classical client/server IDL approach, we developed a compiler to process the SIDL interface definition and generate all downstream files shown in Figure 4 from a set of templates. The SIDL compiler is based on open-source tools (flex, bison and libtemplate) and is easily extensible.

On the client (i.e. SW) side, the result of the SIDL compilation is a set of APIs for a light-weight runtime environment for the HWPE, which can be used instead of the native low-level HAL that drives the HWPE wrapper module. These APIs are expressed in pure C, and they are therefore compatible with all the programming paradigms available for the P2012 platform, such as the OpenMP [?] and OpenCL [?] programming models. The tools provide both a synchronous interface call, in which the client blocks until the results are available, and an asynchronous one that returns immediately. In the latter case, the SIDL compiler also generates APIs to wait for the end of the HWPE computation or to check whether it has ended.
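For the foo interface of the previous section, the generated client-side API could look roughly like the sketch below; the identifiers (hwpe_foo, hwpe_foo_async, hwpe_foo_wait, hwpe_foo_done) and the job handle type are illustrative assumptions, not the exact names emitted by the SIDL compiler.

/* Hypothetical shape of the SIDL-generated C API for foo(int*, int*). */
typedef int hwpe_job_t;                        /* assumed opaque job handle  */

void       hwpe_foo(int *in, int *out);        /* synchronous: blocks until
                                                  the HWPE results are ready */
hwpe_job_t hwpe_foo_async(int *in, int *out);  /* returns immediately        */
int        hwpe_foo_done(hwpe_job_t job);      /* poll for completion        */
void       hwpe_foo_wait(hwpe_job_t job);      /* block until completion     */

void example(int *in, int *out) {
  hwpe_job_t job = hwpe_foo_async(in, out);    /* offload, then keep working */
  /* ... other SW work on the same PE, overlapped with the HW job ... */
  hwpe_foo_wait(job);                          /* results are now in out[]   */
}

Since data stays in the shared TCDM, the pointers passed to these calls are the same ones the SW version of foo would receive.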

4.2 High-Level Synthesis

Due to the shared-memory nature of our architecture, argument marshaling is not needed to collect the client arguments, and it is not necessary to copy data from the client to the server (i.e., from the SW-accessible memory to the HWPE). Rather, data is left in place, and only pointers are exchanged between the client and the hardware accelerator. This allows a very low-overhead client/server call.

As a high-level synthesis tool, we decided to use Calypto Catapult System-Level Synthesis, as it supports out of the box most of the C and C++ syntax, as well as SystemC. However, to make better use of the shared-memory nature of He-P2012, we also need to support the conventions used by the SW compiler for data layout: transforming the client data in software to a representation usable by the hardware is not desirable, since the overheads of data reformatting would defeat the advantages of our shared-memory architecture. Manually modifying the HLS code to explicitly deal with the layout used by the software compiler is also undesirable, as it would require discarding all higher-level data structures. It would also lower the general abstraction level, which would defeat part of the purpose of our flow.

Therefore, we introduced a C++ template-based technique that statically maps the simple and structured data types used in the source code of the HW IP to a lower-level, word-based representation that respects the conventions set by the SW compiler. This way, the normal data access syntax in the HW IP source is maintained, while correct data addressing in the shared memory is also guaranteed.

At the lowest level, all inputs/outputs are represented by an array of sc_uint<36> (i.e., 36-bit SystemC unsigned integers) called the backing store. The highest four bits of each backing-store word represent a byte enable signal, whereas the lowest 32 bits are the actual data word. To represent the address of a particular word in the backing store, a hwpe_addr class is used. Note that, as the HLS tool does not support dynamically positioning data in the shared memory, this class merely represents an offset with respect to the base of the input/output array pointed to by the backing store.

To map atomic (i.e. non-structured) types to the underlying backing store, a template C++ class hwpe_atomic<T> is used, where T is the C atomic type (e.g. int16_t). From the moment of its construction, a hwpe_atomic instance features a pointer to an underlying backing store and an offset value. The assignment and access operators of hwpe_atomic are overloaded so that, depending on the byte width of the high-level type T, the correct byte enables are activated (in the case of a store operation) or the correct part of the data word is retrieved (in the case of a load). All standard integer types are supported: signed and unsigned 8-bit, 16-bit, 32-bit and 64-bit integers. The following code snippet shows some parts of the hwpe_atomic class.

template <typename T>
class hwpe_atomic {
  sc_uint<36> *backing_store;
  int byte_offset;
public:
  const T operator=(hwpe_atomic &atomic) {
    return operator=(atomic.operator T());
  }
  operator T() {
    int bs_offset = byte_offset / 4;
    int byte_idx  = (byte_offset % 4) / sizeof(T);
    // for brevity, only the sizeof(T) = 4 case is shown
    return (T) backing_store[bs_offset].range(31,0);
  }
  hwpe_ptr_atomic<T> operator&() {
    return hwpe_ptr_atomic<T>(backing_store, byte_offset);
  }
};
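To complement the load path shown above, the fragment below sketches a possible store path for a 16-bit type, driving the byte enables in bits [35:32] of the backing-store word; the class name, constructor and byte-enable encoding are assumptions made for illustration, not the code produced by the SIDL compiler.

#include <systemc.h>

// Sketch only: store path for a 16-bit atomic type; bits [35:32] of each
// backing-store word carry the byte enables, bits [31:0] the data word.
template <typename T>                 // assume sizeof(T) == 2, e.g. int16_t
class hwpe_atomic16_sketch {
  sc_uint<36> *backing_store;
  int          byte_offset;
public:
  hwpe_atomic16_sketch(sc_uint<36> *bs, int off)
    : backing_store(bs), byte_offset(off) {}

  const T operator=(const T value) {
    int bs_offset = byte_offset / 4;          // word index in the backing store
    int half      = (byte_offset % 4) / 2;    // which 16-bit half is written
    sc_uint<36> w = 0;
    w.range(16*half + 15, 16*half) = (sc_uint<16>) value;   // place the data
    w.range(35, 32) = (half == 0) ? 0x3 : 0xC;               // enable the two written bytes (assumed encoding)
    backing_store[bs_offset] = w;
    return value;
  }
};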

To provide the shared-memory interface, a smart pointer template class hwpe_ptr_atomic<T> is also created corresponding to each hwpe_atomic class, to represent a pointer to the atomic type. Similarly to the hwpe_atomic class, hwpe_ptr_atomic features a pointer to the underlying backing store and an offset. The pointer and array access operators (* and []) are overloaded to return a hwpe_atomic instance; conversely, the & operator of hwpe_atomic is overloaded to return a hwpe_ptr_atomic instance.
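A minimal sketch of such a smart pointer is shown below; the constructor and the granularity of the pointer arithmetic are assumptions (we also assume hwpe_atomic<T> exposes a matching constructor), so this is an illustration of the mechanism rather than the generated code.

// Illustrative sketch of the atomic smart pointer described above.
template <typename T>
class hwpe_ptr_atomic {
  sc_uint<36> *backing_store;    // shared-memory image seen by the HW IP
  int          byte_offset;      // byte offset of the pointed-to element
public:
  hwpe_ptr_atomic(sc_uint<36> *bs, int off)
    : backing_store(bs), byte_offset(off) {}

  // * and [] return a hwpe_atomic<T> bound to the right offset, so that
  // loads and stores go through its overloaded conversion/assignment.
  hwpe_atomic<T> operator*() {
    return hwpe_atomic<T>(backing_store, byte_offset);
  }
  hwpe_atomic<T> operator[](int i) {
    return hwpe_atomic<T>(backing_store, byte_offset + i * (int)sizeof(T));
  }
};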

For each structured type present on the interface of the HWPE, two template C++ classes are generated by the SIDL compiler. Let us suppose that the higher-level data structure is the following struct_t:

typedef struct {
  uint32_t foo;
  int16_t  bar;
  uint16_t who;
} struct_t;

To represent it, the SIDL compiler generates a C++ class that includes hwpe_atomic types for each of the members of the higher-level struct:

class hwpe_struct_t {
  sc_uint<36> *backing_store;
  int byte_offset;
public:
  hwpe_atomic<uint32_t> foo;
  hwpe_atomic<int16_t>  bar;
  hwpe_atomic<uint16_t> who;
};

This class also provides access to its fields through the field access operator (.). Finally, a smart pointer template class hwpe_ptr_struct<T> allows pointer access to the structured data type by overloading the pointer, array access and field access operators (*, [] and -> respectively). For convenience, the SIDL compiler defines new types for all pointers used in the interface of the HWPE:

typedef hwpe_ptr_atomic<uint32_t>      hwpe_ptr_uint32_t;
typedef hwpe_ptr_struct<hwpe_struct_t> hwpe_ptr_struct_t;
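As a hypothetical usage example (not one of the paper's benchmarks), a HW IP function can then traverse an array of struct_t elements in shared memory with the usual pointer syntax:

// Hypothetical HW IP kernel: accumulates the 'foo' field of n struct_t
// elements located in the shared TCDM and writes the sum through a smart pointer.
void sum_foo(hwpe_ptr_struct_t in, hwpe_ptr_uint32_t out, int n) {
  uint32_t acc = 0;
  for (int i = 0; i < n; i++) {
    acc += in[i].foo;   // same access syntax as with plain C structs
  }
  *out = acc;           // the store reaches shared memory via the wrapper
}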

In this way, most features typically required by accelerated hardware are supported: simple data structures, random access to the shared memory, data-dependent control flow, and inlined function calls. The main limitation, imposed by the fact that Catapult does not support dynamic addressing, is that it is not possible to dynamically reference data: double pointers are not supported, and therefore all structures must be purely composed of atomic data types. Calling inlined functions is supported as a means of code organization; it is not possible to call functions to be executed in software from the accelerator, nor is it possible to use function pointers.

From the perspective of the designer, only minimal changes are strictly needed to the HW IP source code to achieve a functional accelerator, though to achieve better performance it is sometimes necessary to perform optimizations. The following code snippet shows an example of the transformation of a simple function from the regular C pointer syntax to the Catapult smart pointer one.

// Function with regular C pointer
void add_one(int *the_value) {
  *the_value = *the_value + 1;
}

// Function with Catapult smart pointer
void add_one(hwpe_ptr_int the_value) {
  *the_value = *the_value + 1;
}

In this example, the_value points to data in the shared memory space. In the Catapult implementation, the hwpe_ptr_int type must be used in the function declaration, as in the second case. The hwpe_ptr_int type generated by the SIDL compiler can be used like an integer pointer, i.e., to the HW IP developer it appears as if this type had been defined as typedef int *hwpe_ptr_int. This modification must be carried out for all pointer types referring to data placed in the shared memory.

To perform the high-level synthesis of a HW IP, the developer must therefore only i) define the SW/HW interface, ii) adapt the HW IP source code so that it uses smart pointers and iii) optionally optimize the HW IP for better performance. The Catapult backend is called automatically from a lightweight make-based wrapper that is used to automate the flow. HWPEs generated with this methodology show only a minor area and power increase with respect to those generated directly inside Catapult. For example, the size of the CSC HWPE (see Section 5) increases by 6.34% with respect to the one that would be obtained directly, while power increases by 3.44%. We believe this increase is well worth the gain in generality and the speedup in design time; moreover, if one desires to further reduce area and power consumption, it is possible to use our exploration flow only for fast assessment of the best acceleration strategy, and afterwards switch to a manually tweaked HWPE. One must note, however, that data types other than 32-bit integers in the shared memory cannot be directly supported in Catapult without manually replicating much of what is automatically generated by our flow.

4.3 Power/Area Estimation and Simulation

The high-level synthesis tool produces two outputs: a synthesizable Verilog model and a cycle-accurate SystemC model of the HW IP. After high-level synthesis, RTL synthesis of the HWPE (including both the HW IP and the wrapper) can be performed using Synopsys Design Compiler. The SIDL compiler generates all the scripts necessary to fully automate the process. Estimations of area and power consumption can be extracted from the reports of the Design Compiler synthesis.

To evaluate the execution time, the SystemC models of all HW IPs are encapsulated in a single executable together with a model of all the wrappers. The SystemC model and the power estimation are loaded by a GEPOP plugin to estimate the energy consumed by the platform in the heterogeneous configuration. We used the following simple power model:

E_{tot} = E_{uncore} + E_{PEs} + E_{HWPEs}

E_{uncore} = P_{uncore} \cdot t_{exec}

E_{PEs} = P_{PE} \cdot \sum_{PE_i} t_{on,PE_i}

E_{HWPEs} = \sum_{HWPE_j} P_{HWPE_j} \cdot t_{on,HWPE_j}

The total energy spent in the cluster is due to three components: the energy spent in uncore devices, which are assumed to be on during the whole execution time; the energy spent in the PEs; and the energy spent in the HWPEs. Both PEs and HWPEs are modeled as consuming no power when not in use; note that this simplistic assumption is pessimistic when evaluating the impact of heterogeneity, as it overestimates the energy efficiency of homogeneous configurations.

5 Results

In this section we explore various versions of the He-P2012 cluster on a set of benchmarks that were originally developed for the homogeneous P2012. Note that the focus here is not on optimizing the HW blocks, which we simply derived from C code with minimal modifications, but on showing the effects of augmenting the P2012 architecture with heterogeneity by adding tightly-coupled accelerators.

For the purpose of this work, we concentrated our analysis of performance and energy on a single P2012 cluster clocked at 400 MHz, where we sweep the number of SW and HW processing elements present in the cluster. In fact, communication between clusters and with the host is handled in the same way in both homogeneous and heterogeneous clusters and is an orthogonal issue with respect to our focus.

For our experiments we consider six applications that were originally developed for the homogeneous P2012 cluster: Viola-Jones Face Detection using Haar features [?], FAST Circular Corner Detection [?][?], Removed Object Detection based on normalized cross-correlation (NCC) [?], a parallel version of Color Tracking from the OpenCV library [?], a Convolutional Neural Network (CNN) forward-propagation benchmark [?][?], and a Mahalanobis Distance (10-dimensional) kernel. Face Detection is parallelized with OpenCL, while the other benchmarks use the OpenMP implementation described in Marongiu et al. [?]. Four of the applications we chose to focus on (Viola-Jones, FAST, ROD, and Color Tracking) are from the computer vision field, while the two remaining ones (CNN, Mahalanobis Distance) are more generally from the machine learning field, though vision is one of their main uses. We chose to focus on applications from this field for several reasons: first, they are typically very demanding from the computational point of view and feature high computation-to-communication ratios, making them well suited for HW acceleration. Second, these kernels are often used in pipelines to build complete applications; this makes them an ideal target for our heterogeneity paradigm, which is highly focused on cooperation between software and hardware cores that could run different stages of a pipeline. Finally, they are among the main applications targeted by the homogeneous platform we built upon (P2012), which allowed us to directly compare the homogeneous and heterogeneous platforms. This set of applications differs significantly in size, execution time and structure, with the objective of demonstrating our flow on a wide range of applications.

Color Tracking (CT) is composed of three kernels: color scale conversion (CSC), color-based thresholding (TH) and moment-based center of gravity computation (MOM). The most computationally intensive kernel is CSC, which we chose for HW acceleration. The application is structured as a software pipeline made up of a DMA-in stage, a CSC stage and a TH+MOM+DMA-out stage. It is possible to use both synchronous and asynchronous calls for the accelerated CSC kernel; in the latter case, the accelerated CSC execution can be almost completely hidden behind the TH+MOM stage.

Viola-Jones object recognition (VJ) is the most complex of our set of benchmarks; it includes two computation-intensive kernels: cascade, whose execution time might vary over three orders of magnitude depending on the input data, and integral image, whose execution time is not strongly dependent on the input. For this benchmark, we developed three HWPEs.

Fig. 5: Accelerator efficacy with 1 PE + HWPEs, in % of Amdahl's limit.

Fig. 6: Execution time for (a) Color Tracking, (b) Viola-Jones Face Detection, (c) FAST Corner Detection, (d) Removed Object Detection, (e) Convolutional Neural Network and (f) Mahalanobis Distance, with 1 to 16 PEs.

Fig. 7: Energy for the same benchmarks and configurations.

The cascade kernel is composed of a series of 13 stages, each of which is more expensive than the previous one. Only matches have to go through the full cascade; non-matching windows usually exit the cascade at an earlier stage. In our experiments, up to 95% of the windows exited the cascade in the first two stages. Therefore, we developed two distinct strategies to accelerate the cascade kernel: the first consists in accelerating the whole cascade in HW (coarse HWPE); the second in accelerating only the stages from the third onwards (finer HWPE), i.e. only the least common cases.

FAST circular corner detection (FAST) is composed of two kernels, detection and score. To accelerate the application, we merged them into a single HWPE, which is called asynchronously.

Fig. 8: Performance/Area/Energy (normalized) for the six benchmarks, with 1 to 16 PEs.

Fig. 9: Energy vs. execution time (normalized) for the six benchmarks, sweeping the number of PEs from 1 to 16.

Removed Object Detection (ROD) spends most of its execution time in the normalized cross-correlation kernel (NCC). We consider two approaches for acceleration: a coarse HWPE that accelerates all the iterations, and a finer HWPE that accelerates only one iteration. In the first case, only one PE is actually used; in the second case the code containing the offload call to a HWPE can be parallelized among all the threads.

Convolutional Neural Network (CNN) is composed of a series of convolutional, max-pooling and linear layers. Convolutional layers are responsible for the great majority of execution time and energy consumption; we therefore concentrated on accelerating their two main kernels: convolutions (conv HWPE) and hyperbolic tangents (tanh HWPE), both used in an asynchronous fashion. We simulated configurations with one of the two kinds of HWPEs, or both.

Mahalanobis Distance (MD) is composed of a single 10-dimensional distance kernel, which was fully HW-accelerated in the heterogeneous configuration.

We collect results for three different metrics: execution time, energy, and performance/area/energy. We measure execution time as seen by the controlling PEs, which is natively supported by GEPOP. To obtain area and power values for the HWPEs, we used Calypto Catapult University Version 2011.a62 for high-level synthesis and synthesized its RTL output together with a SystemVerilog wrapper using Synopsys Design Compiler G-2012.06. We used the 28 nm bulk STMicroelectronics technology libraries as a target, with a clock frequency of 400 MHz. Table 3 reports power and area estimations for all HWPEs used in these benchmarks.

HWPE         Power (mW)   Area (kgates)
CT           5.82         16.74
VJ coarse    12.58        44.98
VJ finer     7.86         27.20
VJ int       4.88         19.79
FAST         4.55         13.88
ROD coarse   4.98         17.74
ROD fine     5.04         17.52
CNN conv     7.99         26.67
CNN tanh     2.43         7.10
MD           5.59         19.06

Table 3: Design Compiler synthesis results

Figure 5 reports the fraction of Amdahl's limit that our accelerators reach in a single-core situation. To assess the amount of effective acceleration that our HWPEs provide versus sequential code, we can compare them with a perfect accelerator according to Amdahl's law, i.e. one that reduces the accelerable fraction of our kernel to 0 time. In Figure 5 we plot the accelerator efficacy as the fraction of this “Amdahl limit” that we can reach with HWPEs, measured in the following way:

E_{hwpe} = \frac{T_{sw} - T_{sw+hwpe}}{T_{sw} - T_{Amdahl}}

We reach an average efficacy of ∼80%, with some lower-efficacy accelerators such as VJ stopping at ∼60% and others, especially those exploiting asynchronicity between SW and HWPE execution, reaching up to 98%.

In Figures 6, 7 and 8 we show results for the execution of our six benchmarks in terms of execution time, energy spent in the cluster and normalized performance/area/energy, respectively. From these results we can draw some interesting observations on the class of accelerators that can be generated from our flow. First, as shown by the VJ benchmark, acceleration of coarse code blocks is best controlled by a small number of PEs. Long-running HWPEs are clearly more likely to generate high HWPE contention in cases where the code parallelized among a large number of PEs does not contain much computation to be done in SW (besides the code to offload computation). HWPE contention can be seen in most benchmarks; when the number of PEs is high, the execution time does not scale down any more, thus limiting performance/area/energy, which shows a peak at 4 or 8 cores in most tests.

A second effect is that when the number of PEs is relatively low, area and power consumption are mostly dominated by the invariant contribution of the other components of the cluster (DMAs and external interfaces). Therefore, adding PEs or HWPEs to the cluster improves performance without significantly worsening area and power, whereas when the number of PEs is higher the reverse may be true, as area and power then scale approximately linearly with the number of PEs and HWPEs. The combination of the effect of HWPE contention and of the linear scaling of power and area with the number of processing elements can be seen in Figure 8, where the performance/area/energy in most tests reaches its maximum with 4 or 8 PEs.

Third, applications structured as pipelines, such as CT, benefit the most from this acceleration model, because it is possible to exploit asynchronous HWPE calls and to hide most of the accelerated execution behind the rest of the application.

In the single-PE CT case, asynchronous execution allowed us to almost fully exploit CSC acceleration, obtaining a speedup of 123x, which is 98% of the maximum permitted by Amdahl's law. The same benchmark shows an increase of almost an order of magnitude in performance/area/energy; the FAST benchmark, which is also relatively efficient in terms of Amdahl's limit, goes even further, achieving a 21x increase in that metric in the measurement with 4 PEs.

Figure 9 helps in understanding the energy-delay tradeoffs implied by the choice of a particular homogeneous or heterogeneous architecture, by reporting normalized execution time versus energy spent in each benchmark while we sweep the number of SW PEs from 1 to 16. Many plots exhibit near-ideal performance scaling in SW; however, it is clear that in most benchmarks only a certain amount of SW parallelism can be extracted in an energy-efficient way, as efficiency drops considerably when using more than a certain number of PEs (between 4 and 8, depending on the benchmark). This is due to the fact that performance scaling is never completely ideal and usually saturates beyond a certain threshold, while power scaling is always linear; this is clearly shown in the plots, where most curves bend rightwards after 4 or 8 PEs.

In the CT and VJ benchmarks, and in the finer configuration of the ROD HWPE, jobs are offloaded concurrently by all PEs, which causes two visible effects. The first is HWPE contention among PEs, which is particularly visible in the coarse configuration of VJ. The second, which is shown in both CT and VJ, is the “constructive interference” of SW and HW co-working, which leads to significant overall speedups even when we are using the most SW parallelism available (up to 6x overall in VJ) or to complete acceleration (i.e. reaching Amdahl's limit) in the case of CT. We also see that in ROD finer there is a hard limit to performance (both in SW and HW) given by memory bandwidth: bringing data in and out of the cluster via DMA transfers becomes the bottleneck.

The ROD coarse configuration and the CNN and MD benchmarks show instead what happens when HWPEs are called by a single PE. In ROD coarse and MD, we have no scaling at all, as the whole application is accelerated. In CNN it is possible to employ SW cores concurrently with HWPE execution, which is started by a single PE. It is interesting to note that, though convolution is responsible for more than 80% of the total execution time, it is difficult to achieve a top-grade accelerator using our HLS flow in this case, because the efficient way of expressing convolutions in SW is fundamentally different from the most efficient way to design them in HW [?]. For this reason, the single accelerator we use in the conv configuration is not able to compete with 16 PEs performance-wise. Conversely, accelerating the hyperbolic tangent in the tanh configuration is very easy and effective using our flow, even if the hyperbolic tangent is not as critical a kernel as the convolution. Accelerating both kernels leads to the best results in terms of energy efficiency, but makes it impossible to exploit SW scaling to reach even better performance and energy; the convolution accelerator acts as the bottleneck in the 16-PE both configuration.

Finally, the FAST benchmark shows what happens with a very unbalanced application. We used a synthetic image (a checkerboard) as input to the benchmark; this tends to exacerbate the imbalance as, depending on the total number of SW threads and the subsequent chunking of the image, some cores will consistently have no corners to find and others many. The plot clearly shows that neither SW-only nor HW-augmented scaling is ideal; however, we also see that using an accelerator is helpful in this case, as it reduces the difference between threads with many corners and threads with none, reducing the imbalance. This results in slightly improved scaling and significant gains in terms of both energy and performance.

6 Conclusions

The novel methodology and tools we developed, oriented towards heterogeneity exploration on the tightly-coupled shared-memory He-P2012 platform, allow fast and easy exploration of a number of hardware acceleration alternatives. We have shown that with our heterogeneous approach it is possible to obtain a significant advantage with respect to pure software in terms of performance, energy consumption and performance/area/energy. This was achieved without major rewriting of any benchmark, since we adopted the viewpoint of augmenting the multicore cluster of P2012 with heterogeneity rather than completely changing its working paradigm.

Design space exploration and early assessment of the benefits of one acceleration solution over the others are critical to help overcome the utilization wall by means of architectural heterogeneity. This work demonstrates a technique to apply HW acceleration to a wide range of SW kernels and exemplifies it in the case of a state-of-art many-core platform, P2012. Although the flow we have shown is specific to P2012, we argue that the underlying heterogeneity approach and the proposed methodology are applicable to the entire class of clustered many-cores (e.g. Kalray MPPA [?] and PULP [?]).

Our results are promising, as we show up to ∼100x gain in performance/area/energy over the equivalent homogeneous benchmarks, with our HWPEs exploiting on average 80.1% of Amdahl's limit. Our future work will focus on widening the range of acceleration mechanisms that can be used.