[ieee 2014 15th international symposium on quality electronic design (isqed) - santa clara, ca, usa...


Page 1: [IEEE 2014 15th International Symposium on Quality Electronic Design (ISQED) - Santa Clara, CA, USA (2014.03.3-2014.03.5)] Fifteenth International Symposium on Quality Electronic Design

PETS: Power and Energy Estimation Tool at System-Level

Santhosh-Kumar RETHINAGIRI∗, Oscar PALOMAR∗, Osman UNSAL∗, Adrian CRISTAL∗, Rabie BEN-ATITALLAH† and Smail NIAR†

∗Barcelona Supercomputing Center, Barcelona, Spain

†LAMIH, Université de Valenciennes, Valenciennes, France

Abstract—In this paper, we introduce PETS, a simulation-based tool to estimate, analyse and optimize the power/energy consumption of an application running on complex state-of-the-art heterogeneous embedded processor-based platforms. The tool is integrated with power and energy models in order to support comprehensive design space exploration for low-power multi-core and heterogeneous multiprocessor platforms such as OMAP, CARMA, Zynq 7000 and Virtex II Pro. Moreover, PETS is equipped with power optimization techniques such as dynamic slack reclamation and workload balancing. The development of PETS involves two steps. The first step is power model generation: functional-level parameters are used to set up generic power models for the different components of the system. So far, seven power models have been developed for different architectures, ranging from the simple low-power ARM9 to the very complex TI C64x DSP. The second step is the development of a simulation-based virtual platform framework, built from SystemC IPs and JIT/ISS compilers, that accurately captures the activities needed to estimate power.

The accuracy of the proposed tool is evaluated using a variety of industrial benchmarks. Estimated power and energy values are compared to real board measurements. The power estimation error is less than 4% for single-core processors, 4.6% for dual-core processors, 5% for quad-core processors and 6.8% for multi-processor based systems, and the tool achieves effective optimisation of power/energy for the applications.

I. INTRODUCTION

Embedded applications are becoming increasingly critical and complex and involve more computing resources, which indirectly increases the power consumption of the underlying hardware. The consequence is twofold: power/thermal dissipation rises with the complexity of the hardware, and the lifetime of battery-operated devices (iWatch and Google Glass) decreases. Facing this issue, power and energy constraints are considered first-order critical pre-design metrics in the embedded community. Designers should therefore calculate and optimize the power dissipation of the embedded system as early as possible in the design flow to reduce the time-to-market cost. In the last decade, much research has been dedicated to power estimation techniques at different abstraction levels, from the electrical level to the system level. Today, system-level power estimation is considered vital for coping with critical design constraints. However, the development of system-level power estimation tools faces extremely challenging requirements, such as a seamless power-aware design methodology, an accurate and fast system power modeling approach, and efficient power- and energy-based Design Space Exploration (DSE). Recently, the International Technology Roadmap for Semiconductors (ITRS)¹ predicted that power and energy issues are expected to get worse as we move to the next technology nodes. Similarly, HiPEAC², a European embedded consortium, released its Roadmap-2013, stating that the increasing number of components on a chip combined with decreasing energy scaling leads to the phenomenon called "dark silicon". This effect motivates energy-efficient systems with the right combination of different components.

In current practice, power estimation and optimization are done using low-level CAD tools, which unfortunately do not scale to the ever-growing complexity of processors supporting complex multimedia applications. Several frameworks address this challenge through the development of Electronic System-Level (ESL) tools. The objective is to develop virtual prototypes, which are fully functional software models of the physical hardware system, to unify them with the application design, and to offer rapid system-level design. Based on the design step and on requirements such as timing accuracy and estimation speed, designers can select an appropriate abstraction level for the software model simulating the system. Unfortunately, most existing system-level simulation-based tools do not consider the power metric for a given abstraction level.

System power modeling, estimation and optimization: Power modeling at the system level is centered around two connected aspects: the granularity of the power model and the characterization of the main activities. The first aspect depends on the granularity of the relevant activities, which ranges from the fine-grain level, such as transistor or gate switching, to the coarse-grain level, such as hardware functional events. For a system-level designer, fine-grain power estimation generally requires considerable modeling effort because it involves low-level implementation details and technological parameters. Coarse-grain power models, in contrast, are easy to develop and depend on a small number of micro-architectural activities, but their accuracy decreases as the complexity of the system increases. The second aspect is the characterization of the activities, which requires a huge number of experimental measurements and thus significant time to extract the power model. Together, these aspects lead to the definition of a power model that can be represented by a set of analytical functions or a table of consumption values. The selected power model granularity depends on the target abstraction level and on the user requirements in terms of estimation accuracy and speed. In this tool, we also address power/energy optimization of an application based on two well-known techniques: Dynamic Slack Reclamation (DSR) and workload balancing for MPSoC platforms. The first technique is static workload balancing by means of DSE, which tunes the application according to the architecture specifications. MPSoCs are generally very heterogeneous and contain several processors, accelerators, etc. This heterogeneity makes DSE one of the most important design challenges. In the second technique, the power consumption of an MPSoC platform is reduced dynamically at run time by two main approaches at the OS level: Dynamic Power Management (DPM), which switches off the power supply of a part of the circuit, and Dynamic Voltage and Frequency Scaling (DVFS), which tunes the processor clock speed and its corresponding voltage according to the workload (actual or expected) or the battery charge. There are many implementations of these approaches in the literature [5]. Most of them are not validated against real measurements in terms of accuracy and, moreover, do not rely on realistic power models.

¹ http://www.itrs.net/Links/2012ITRS/Home2012.htm
² http://www.hipeac.net/system/files/hipeac_roadmap1_0.pdf

978-1-4799-3946-6/14/$31.00 ©2014 IEEE 535 15th Int'l Symposium on Quality Electronic Design

In the power estimation/optimization process, the developed power models interact with the system-level platform to capture strictly the data relevant to the design step. The main challenge is to define a generic power modeling approach that covers the different abstraction levels and guarantees the coherence of the power estimation/optimisation strategy. To answer these challenges, we propose PETS: Power and Energy estimation Tool at the System-level, based on simulation of complex embedded platforms. First, our tool focuses on the functional and transactional levels to deal with the design complexity and the breadth of the architectural solution space. Second, Functional Level Power Analysis (FLPA) is used to elaborate different power models that are then plugged into our tool to evaluate the total consumption of the system. Third, a runtime optimization technique is developed and tested in order to reduce the energy and power consumption of the system.

The rest of this paper is organized as follows. After reviewing the related work in Section 2, we describe the tool flow in Section 3. In Section 4, we describe the PETS power estimation methodology. Section 5 and Section 6 explain in detail the power modeling methodology and the optimization technique used in the tool, respectively. In Section 7, we present the power and energy estimation/optimization results and validate them against state-of-the-art tools. We conclude in Section 8.

II. LITERATURE REVIEW

In the last decade, various research efforts have been devoted to estimating processor power dissipation at different abstraction levels of the design flow. To reduce simulation time, several studies have proposed evaluating system power consumption at higher abstraction levels. These tools use a micro-architectural simulator to evaluate system performance and an analytic power model to estimate the consumption of each platform component. Wattch [3] and SimplePower [19] are examples of tools available at this level. In general, these tools rely on Cycle-Accurate (CA) simulation. Usually, to move from the RTL to the CA level, hardware implementation details are hidden from the processing part of the system, while system behavior is preserved at the clock cycle level. The power consumption of the main internal units is estimated using power macro-models produced from lower-level characterizations. The contributions of the internal unit activities are calculated and added together during the execution of the program on the cycle-accurate micro-architectural simulator. Work such as [15] presents a CA simulation environment for symmetric MPSoCs with power estimation. Although the CA level allows accurate power estimation, MPSoC space exploration at this level is not yet sufficiently rapid compared to RTL. In addition, a system description at this level is difficult to obtain, especially for off-the-shelf processors.

Fig. 1. PETS Tool flow
[Figure blocks. First step: power model development by functional level power analysis (FLPA); real board measurement interface (online module); power model library. Second step: PETS user interface; user input file (Tcl); target compiled application source file (C or C++); application input file; system-level simulation framework (bus protocol, ARM Cortex-A9, ARM Cortex-A8, PowerPC 405, C64x, memory, hardware accelerators); estimation results of the tool; graphical result interface; export of graphical results to a .CSV file. User input fields: platform name (OMAP3530, OMAP4430, Virtex II Pro, Zynq 7000); processor name (ARM, PowerPC, C64x); frequency of the processor; number of cores; number of FPGA blocks; optimization (DSR, work load balancing, DSE based energy, run different processors with different configurations).]

In an attempt to reduce simulation time, recent efforts have been made to build fast simulators using Transaction Level Modeling (TLM) [1]. SystemC and its TLM 2.0 kit have become a de facto standard for the system-level description of Systems-on-Chip (SoC) by offering different coding styles. Nevertheless, power estimation at the TLM level is still under research and is not well established. Lajolo et al. [7] proposed a power estimation methodology for hardware/software co-design based on the concurrent and synchronized execution of various power estimators for the different parts of the SoC. In this methodology, the authors integrate RTL/gate-level power models into the SystemC environment to estimate power.

Lately, McPAT [9] was introduced, which is an improved model of the Cacti [16] tool set. McPAT supports power, area and timing estimation for multi-core processors. It uses an XML interface to interact with the simulator in order to collect the data needed by its power model. In this paper, we compare the power estimation accuracy of the proposed tool with that of McPAT, executed with the help of the Multi2Sim [17] functional simulator, for a dual-core ARM Cortex-A9 processor. Here the major issue is the accuracy of the power model and its estimation. To overcome this drawback, a hybrid power estimation methodology (HSL) was proposed in [13] by combining an interpreted ISS with a functional-level power model, but this methodology suffers in terms of speed because it uses the interpreted ISS for simulation. Moreover, it was proposed for simple processors such as the PowerPC405. Similarly, a signature-based power model for FPGA-based MPSoCs [12] was proposed.

Several studies in the literature aim to optimize the energy/power consumption of an application, for example through workload balancing, scheduling policies and dynamic voltage/frequency management. Recently, the embedded research community has shifted its focus to multi-core processors because applications need a high degree of parallelism. First, workload balancing is one of the prominent ways to increase performance; it increases power in turn but reduces energy. A recent approach explicitly focuses on loosely timed systems and offers the user a set of primitives to express tasks with a duration. The tool exploits this notion of duration to run the simulation in parallel, which allows the user to focus on the performance-critical parts of the program that need to be parallelized [10]. We use a similar approach in PETS. Another tool tackles the same issue at the virtual platform level [6]: Kreku et al. present a compiler-based technique for the automatic generation of workload models for performance simulation, exploiting an overall approach and platform performance capacity models developed previously. Second, much research has been done on dynamic power management (DPM) [18], [2] and dynamic voltage and frequency scaling (DVFS) [4], [11]. In this tool, we integrate an inter-task DVFS technique called Dynamic Slack Reclamation (DSR) into PETS. In our previous work [14], we introduced this methodology for simple mono-processor based platforms; in this work, we propose a simulation-based power and energy estimation tool for complex multiprocessor and multi-core heterogeneous platforms at the system level. We also integrate DSR to dynamically change the power consumption profile according to the different tasks of the application on the different processors, and we implement a workload balancing technique to reduce energy.

III. PETS OVERVIEW

PETS is a hardware/software co-simulation based power and energy estimation/optimisation and DSE tool for complex embedded (multiprocessor and multi-core) processors. The overall flow of the PETS tool is given in Fig. 1. The tool is designed for easy porting of applications and lets the user choose the processor architecture on which to perform hardware/software co-simulation. PETS includes an Eclipse-based environment and a customized graphical output interface programmed in C. To automate the process, we use Tcl/Tk scripts. When PETS starts, a configuration menu appears for choosing the input file (a Tcl/Tk file). The interface is programmed so that the power estimator and the simulator communicate dynamically to estimate the change in power/energy of an application on the corresponding platform. The power estimator in turn passes the power/energy data to the system-level simulator, which automatically balances the workload or applies the DVFS technique, according to the user input configuration, using a Smart DSE for reliable optimisation of the application. This approach makes PETS flexible and makes it easy for software programmers to port their applications to the state-of-the-art processor architectures available in the tool. Thanks to the flexible interface, users can modify the SystemC architecture for future processors and can add their own power models to the tool through a simple extensible interface.

The main units of PETS are as follows: first, the power models, which are elaborated in the upcoming sections; second, the fast system-level virtual platform based simulator; and finally, the Smart DSE for effective optimisation and reliable DSE of the application. The simulator has a set of fixed counters: a cache activity counter, an instructions-per-cycle detector, a bus access counter and a memory access counter.

Fig. 2. PETS simulation framework
[Figure blocks: user input file; application and data interface; mapping interface; activity counter interface; power estimator kernel; power model library; Smart DSE (DVFS & workload); system-level simulation; estimation results.]

IV. PETS ESTIMATION METHODOLOGY

As mentioned earlier, this section presents our power estimation methodology, which is divided into two steps as shown in Fig. 1. The first step concerns power model development for the functional components of the system. In our framework, the FLPA methodology is extended to develop generic power models for different target platforms. The main advantage of this methodology is that it yields power models that rely on the functional parameters of the system with a reduced number of experiments. FLPA comes with a few consumption laws, which are associated with the consumption activity values of the main functional blocks of the system. The generated power models are adapted to system-level design, as the required activities can be obtained from a system-level environment. For a given platform, the generation of the power model is a one-time activity. To estimate the power dissipation of a platform, the first part is to divide the system into different functional blocks based on their activity and then to tune the blocks that are simultaneously activated while a micro-benchmarking code is executed.

The functional parameters are of two types: algorithmic parameters, which determine the power consumed by the running software (typically the instructions per cycle (IPC), the fetch access rate (ρ) for a DSP processor, the cache miss rate (γ) for a processor and the area utilization (α) for a hardware accelerator), and architectural parameters, which are set by the designer (typically the clock frequency, the bus frequency and the number of processor cores). Table I shows the common set of generic parameters used in the power models.

TABLE I. GENERIC POWER MODEL PARAMETERS

Type            Name          Description
Algorithmic     τ             External memory access rate
Algorithmic     γ1            L1 cache miss rate for a processor
Algorithmic     γ2            L2 cache miss rate for a processor
Algorithmic     SCU           Snoop control unit counter for ARM Cortex-A9 multi-core
Algorithmic     ρ             Parallelism rate for DSP processor fetch stage access
Algorithmic     β             DSP processor processing rate between instruction memory unit and processing unit
Algorithmic     PSR           Pipeline stall rate between instruction memory unit and processing unit
Algorithmic     IPC           Instruction Per Cycle
Algorithmic     α             Area utilization for a CLB
Architectural   Fprocessor    Frequency of the processor
Architectural   Fbus          Frequency of the bus
Architectural   N             Number of cores

The second part is the characterization of the embedded system power consumption by varying the parameters. These variations are obtained by using elementary assembly programs (called scenarios) to stimulate each functional block separately. Characterization is performed by taking measurements on real boards. Finally, curve fitting of the measured data allows us to determine the power consumption laws by regression. The resulting power laws are expressed either in analytical form or as a table of values. This power modeling approach has been shown to be fast and precise [8]. In our work, it has been applied to model the power consumption of the processor, memory, reconfigurable hardware and I/O peripherals.
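The regression step can be illustrated with a minimal sketch. It assumes a linear consumption law of the form P = a·F + b·γ + c, which has the same shape as the simpler laws reported later, and fits it by ordinary least squares. The function names and the scenario values are hypothetical: the data are generated from a known law for illustration and are not board measurements from this work.

```python
# Minimal sketch: recover a linear consumption law P = a*F + b*gamma + c
# by ordinary least squares. All numbers below are synthetic placeholders.

def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_power_law(measurements):
    """Fit P = a*F + b*gamma + c to (F_MHz, gamma_percent, P_mW) tuples."""
    # Accumulate the normal equations (X^T X) w = X^T y with rows [F, gamma, 1].
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for freq, gamma, power in measurements:
        row = (freq, gamma, 1.0)
        for i in range(3):
            xty[i] += row[i] * power
            for j in range(3):
                xtx[i][j] += row[i] * row[j]
    return solve_linear_system(xtx, xty)


# Hypothetical scenario results, one tuple per micro-benchmark run.
# Generated from P = 1.0*F + 0.5*gamma + 5.0, not measured on a board.
SCENARIOS = [
    (100.0, 1.0, 105.5),
    (200.0, 5.0, 207.5),
    (300.0, 2.0, 306.0),
    (400.0, 10.0, 410.0),
    (150.0, 20.0, 165.0),
]

a, b, c = fit_power_law(SCENARIOS)
print(f"P(mW) = {a:.3f} F + {b:.3f} gamma + {c:.3f}")
```

With real measurements the fit would not be exact, and the residuals would indicate whether an analytical law is adequate or a table of values is the better representation.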

The second step of the methodology defines the main architecture of our tool, which includes the functional-level power/energy estimator kernel and a fast Instruction-Accurate (IA) simulator or a fast ISS, depending on the architecture, as shown in Fig. 2. The functional power estimator evaluates the consumption of the target system with the help of the power models elaborated in the first step. It takes into account the architectural parameters (e.g. the frequency, the number of processors, the processor cache configuration, etc.) and the application mapping. It also requires the activity values on which the power models rely. To collect these activity values accurately, the functional power estimator kernel communicates with a Just-In-Time (JIT) Open Virtual Platform (OVP) simulator at the IA level. Combining these two components, described at different abstraction levels (functional and IA), leads to a standalone hybrid power estimation tool that gives a better trade-off between accuracy and speed.

The vital function of virtual platform power estimation is to offer a detailed power analysis by means of a complete simulation of the application. This process is initiated by the functional power estimator through the application and data interface (Fig. 2). In this way, the mapping information is transmitted to the OVPsim simulator. Our simulator consists of processors and hardware components that are instantiated from the OVP library, while a few other models are written in SystemC, in order to develop a virtual model of the target platform. We emphasize that processors are described at the IA level, which executes instructions sequentially and has no notion of micro-architectural concurrency.

The power estimation step is carried out on the fly while the application is executed on the simulator. During the simulation, the values of the activities, such as IPC, cache miss rate, area occupied and bus accesses, are passed to the power estimator kernel through the activity counter interface to calculate the global power/energy consumption, as shown in Fig. 2.

As discussed above, the following section elaborates the first step, i.e., the power model development.
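The on-the-fly accumulation described above can be sketched as follows. This is only an illustration under stated assumptions: the `ActivitySample` layout, the windowing of the simulation and the function names are hypothetical and are not the PETS activity counter interface. The power law is the ARM Cortex-A8 entry of Table II, with the processor frequency assumed to be expressed in MHz.

```python
from dataclasses import dataclass


@dataclass
class ActivitySample:
    """Counters reported for one simulation window (hypothetical layout)."""
    duration_s: float     # simulated time covered by the window
    ipc: float            # instructions per cycle, 0 <= IPC <= 2
    l1_miss_pct: float    # gamma1, L1 cache miss rate in percent
    l2_miss_pct: float    # gamma2, L2 cache miss rate in percent


def cortex_a8_power_mw(f_processor, sample):
    """ARM Cortex-A8 law from Table II; frequency unit assumed to be MHz."""
    return (0.79 * f_processor + 38.65 * sample.ipc
            + 0.26 * (sample.l1_miss_pct + sample.l2_miss_pct) + 10.13)


def accumulate_energy(samples, f_processor, power_model):
    """Return (total energy in mJ, average power in mW) over all windows."""
    energy_mj = 0.0
    elapsed_s = 0.0
    for sample in samples:
        power_mw = power_model(f_processor, sample)
        energy_mj += power_mw * sample.duration_s  # mW * s = mJ
        elapsed_s += sample.duration_s
    average_mw = energy_mj / elapsed_s if elapsed_s > 0 else 0.0
    return energy_mj, average_mw


# Two made-up windows: a busy one followed by an idle one.
windows = [ActivitySample(0.5, 1.0, 2.0, 1.0), ActivitySample(0.5, 0.0, 0.0, 0.0)]
energy, average = accumulate_energy(windows, 600.0, cortex_a8_power_mw)
print(f"energy = {energy:.3f} mJ, average power = {average:.3f} mW")
```

Passing the power model as a callable keeps the accumulation loop independent of the processor type, which mirrors the separation between the estimator kernel and the power model library.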

V. POWER MODELING METHODOLOGY

To demonstrate the effectiveness of the proposed tool, we used different multi-core and heterogeneous architectures implemented in the OMAP (3530, 4430 & L138), CARMA (quad-core) and Xilinx (Zynq & Virtex II Pro) platforms. This section elaborates the first step of the proposed tool, which is power model development using the functional level power analysis (FLPA) methodology. As discussed before, we adapted the FLPA methodology to develop generic power models for these platforms. As a first step of power model development, the processor architecture is divided into different functional blocks, such as the clock unit, the pipeline stage unit, the memory unit, the bus unit, the peripheral unit and the hardware accelerator unit. Then, each component is characterized using micro-architectural benchmarks in order to extract the related power consumption models.

A. Power model elaboration

In Table II, we present the power consumption models for the different processor architectures. The input parameters on which the power models rely are the frequency of the processor (Fprocessor), the Instruction Per Cycle (0 ≤ IPC ≤ 2), the parallelism rate for DSP processor fetch stage access (0 ≤ ρ ≤ 1), the DSP processor processing rate between the instruction memory unit and the processing unit (0 ≤ β ≤ 1), the pipeline stall rate (0 ≤ PSR ≤ 1) and the cache miss rates (0 ≤ γ1, γ2 ≤ 100 (%)). The frequencies of the processor and the bus are set by the designer, whereas IPC, cache miss rate, etc. are considered activities of the processor during the execution of the application. These activities can be collected from the system-level simulator.
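As a rough illustration of how such a law is evaluated, the sketch below implements two of the Table II entries (ARM9 and single-core ARM Cortex-A9) and checks the activity values against the ranges stated above. The function names are hypothetical, and the frequency is assumed to be given in MHz, which the source does not state explicitly.

```python
def _require(name, value, low, high):
    """Reject activity values outside the range given for the power models."""
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")


def power_arm9_mw(f_processor, gamma):
    """ARM9 law from Table II: P = 1.03 F + 0.6 gamma + 5.3 (mW)."""
    _require("gamma", gamma, 0.0, 100.0)
    return 1.03 * f_processor + 0.6 * gamma + 5.3


def power_cortex_a9_single_mw(f_processor, ipc, gamma1, gamma2):
    """Single-core Cortex-A9 law from Table II (mW)."""
    _require("IPC", ipc, 0.0, 2.0)
    _require("gamma1", gamma1, 0.0, 100.0)
    _require("gamma2", gamma2, 0.0, 100.0)
    return (0.7 * f_processor + 66.54 * ipc
            + 0.67 * gamma1 + 1.4 * gamma2 + 10.45)


# Example operating points; the activity values are made up for illustration.
print(power_arm9_mw(200.0, 5.0))
print(power_cortex_a9_single_mw(667.0, 1.0, 2.0, 1.0))
```

Checking the ranges before evaluating a law guards against feeding a model activity values outside the interval over which it was characterized.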

FPGA power model: A power model of the hardware accelerator/reconfigurable part is developed for the Zynq and Virtex II Pro boards using a high-level synthesis tool. For any given FPGA, the high-level parameters that can be extracted from the specification are the frequency F, the switching rate β and the utilized area α. For this power model development, we used GAUT³, with which we obtained good power estimates. Because the results cannot be expressed as a linear equation of these parameters, an automatically generated three-input look-up table (LUT) of power dissipation values, written in a C file, is used. The power is estimated by interpolation over the three input parameters.

³ http://www-labsticc.univ-ubs.fr/www-gaut/
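Interpolation over a three-input table can be sketched as below. The source does not say which interpolation scheme is used, so trilinear interpolation on a regular grid is an assumption here. The axis values and table contents are hypothetical placeholders, and the function names are illustrative only.

```python
from bisect import bisect_right


def _bracket(axis, x):
    """Return (lower grid index, fractional position), clamping x to the axis."""
    if x <= axis[0]:
        return 0, 0.0
    if x >= axis[-1]:
        return len(axis) - 2, 1.0
    i = bisect_right(axis, x) - 1
    return i, (x - axis[i]) / (axis[i + 1] - axis[i])


def interpolate_power(lut, f_axis, beta_axis, alpha_axis, f, beta, alpha):
    """Trilinear interpolation of lut[f][beta][alpha]; each axis needs >= 2 points."""
    i, tf = _bracket(f_axis, f)
    j, tb = _bracket(beta_axis, beta)
    k, ta = _bracket(alpha_axis, alpha)
    total = 0.0
    # Weighted sum over the 8 corners of the enclosing grid cell.
    for di, wf in ((0, 1.0 - tf), (1, tf)):
        for dj, wb in ((0, 1.0 - tb), (1, tb)):
            for dk, wa in ((0, 1.0 - ta), (1, ta)):
                total += wf * wb * wa * lut[i + di][j + dj][k + dk]
    return total


# Hypothetical grid: frequency, switching rate and utilized area (percent).
F_AXIS = [50.0, 100.0, 200.0]
BETA_AXIS = [0.0, 0.5, 1.0]
ALPHA_AXIS = [0.0, 50.0, 100.0]
# Placeholder table values, not data from this work.
LUT = [[[2.0 * f + 10.0 * b + 3.0 * a + 1.0 for a in ALPHA_AXIS]
        for b in BETA_AXIS] for f in F_AXIS]

print(interpolate_power(LUT, F_AXIS, BETA_AXIS, ALPHA_AXIS, 150.0, 0.25, 75.0))
```

Queries outside the characterized grid are clamped to the nearest edge rather than extrapolated, which is a conservative design choice for a table built from a limited set of synthesis runs.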

Fig. 3. PETS power estimation of different processor architecture based platform
[Figure: "PETS power estimation for single core processor". Vertical axis: power (mW), 0 to 2500. Series: power estimation and real board measurement for PowerPC, ARM9, ARM Cortex-A9 and ARM Cortex-A8.]

Fig. 4. Cache miss rate activity
[Figure: horizontal axis: MPEG 2 application tasks (VLD, IQ, IDCT, IMC, Rebuilding). Vertical axis: miss rate (%), logarithmic from 0.01 to 100. Series: instruction miss rate, read miss rate, write miss rate and total miss rate.]

Extrapolation for heterogeneous multiprocessor architectures: The developed power models are used in the system-level environment to estimate the consumption of heterogeneous multi-core/multiprocessor architectures, which may contain several different processors and hardware accelerators. This approach allows exploring architecture scalability that cannot be realized in hardware because of hardware limitations or the unavailability of the target component. For instance, we cannot go beyond a dual-core ARM Cortex-A9 based architecture using our Zynq platform. Note that the energy must be computed before the total power consumption can be deduced. Equation 1 gives the total energy consumption of the platform. Its terms are the energy of the processors ($E_{p_i}$) and of the conventional blocks ($E_{CLB_j}$). The equation also involves the energy consumption of the synchronization part ($E_{sync}$) required to access the shared memory ($E_{mem}$) and the shared I/O resources ($E_{I/O}$). Here, i and j range over the processor cores and the hardware accelerators (CLBs), respectively.

TABLE II. GENERIC POWER MODELS

Processor                     Power model
PowerPC                       P(mW) = 4.1 (γ) + 6.3 Fbus + 1599
ARM9                          P(mW) = 1.03 FProcessor + 0.6 (γ) + 5.3
ARM Cortex-A8                 P(mW) = 0.79 FProcessor + 38.65 IPC + 0.26 (γ1 + γ2) + 10.13
ARM Cortex-A9 (single core)   P(mW) = 0.7 FProcessor + 66.54 IPC + 0.67 (γ1) + 1.4 (γ2) + 10.45
ARM Cortex-A9 (dual core)     P(mW) = 0.63 FProcessor + b \sum_{i=1}^{2} (IPC_{c1-c2}) + c \sum_{i=1}^{2} (γ1_{c1-c2}) + 1.4 (γ2) + 12.45
ARM Cortex-A9 (quad core)     P(mW) = 0.59 FProcessor + b \sum_{i=1}^{4} (IPC_{c1-c4}) + c \sum_{i=1}^{4} (γ1_{c1-c4}) + 1.4 (γ2) + 2.5 (SCU) + 16.45
DSP TMS320C64x                P(mW) = 0.6 FProcessor + 0.186 (ρ)(1 − PSR) + 0.12 (γ1) + 0.0257 (γ2)(shared + unified) + 0.0345 (β)(1 − PSR) + 8.45

E_{total} = \sum_{i=1}^{n} E_{p_i} + E_{mem} + E_{sync} + E_{I/O} + \sum_{j=1}^{n} E_{CLB}    (1)
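A minimal sketch of how Equation 1 might be evaluated is given below. The function and argument names are illustrative assumptions, as is the conversion to average power, which simply divides the total energy by the execution time.

```python
def total_energy_mj(e_processors, e_mem, e_sync, e_io, e_clbs):
    """Equation 1: sum of the per-component energies (all in mJ)."""
    return sum(e_processors) + e_mem + e_sync + e_io + sum(e_clbs)


def average_power_mw(energy_mj, exec_time_s):
    """Average power deduced from the energy (mJ / s = mW)."""
    return energy_mj / exec_time_s
```

For example, two processors consuming 10 mJ and 12 mJ, a memory at 3 mJ, synchronization at 1 mJ, I/O at 0.5 mJ and one accelerator at 4 mJ (all made-up values) would be passed as `total_energy_mj([10.0, 12.0], 3.0, 1.0, 0.5, [4.0])`.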

VI. POWER OPTIMIZATION TECHNIQUE

DSR and workload balancing are among the promising techniques to reduce the power/energy consumption of modern processors. The pseudo frequencies are built at start-up, based on the frequencies available on the real board, and are then used at runtime. The algorithm retrieves the CPU utilization and selects a pseudo frequency level at or above the rate of use. Finally, the algorithm matches the pseudo frequency to a frequency supported by the processor. This method is not applicable if the effective execution time of the tasks is less than their worst-case execution time. In PETS, the dynamic slack reclamation (DSR) algorithm is integrated to dynamically reduce the energy of the embedded multimedia application.
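The frequency selection steps above can be sketched as follows. The way the pseudo frequencies are generated (evenly spaced levels up to the highest supported frequency) is an assumption made for illustration, and the function names are hypothetical. The supported frequencies are the ones quoted in Section VII-B.

```python
SUPPORTED_MHZ = [250, 500, 720]  # frequencies quoted for the Cortex-A9 setup


def build_pseudo_frequencies(supported, steps=10):
    """Assumed scheme: evenly spaced levels up to the highest supported frequency."""
    f_max = max(supported)
    return [f_max * k / steps for k in range(1, steps + 1)]


def select_frequency(utilization, pseudo, supported):
    """Map a CPU utilization in [0, 1] to a supported frequency (MHz).

    Step 1: lowest pseudo level at or above the rate of use.
    Step 2: lowest supported frequency at or above that pseudo level.
    """
    demand = utilization * max(supported)
    level = next((p for p in pseudo if p >= demand), pseudo[-1])
    return next((f for f in sorted(supported) if f >= level), max(supported))
```

With these assumptions, a lightly loaded core is mapped to the lowest supported frequency and a fully loaded one to the highest.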


Dynamic slack is produced by the early completion of a task. For this reason, we consider the worst-case completion time of each task in an application, also known as the Longest Processing Time (LPT). We reduce the dynamic power consumption of the processor by lowering its frequency for the tasks, which stretches the execution of tasks that would otherwise complete early. The dynamic slack of a task is identified by comparing its actual execution time with its worst-case execution time.
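The slack identification and its use can be illustrated with the sketch below. It follows a common formulation of slack reclamation in which the slack left by one task is handed to the next task. This formulation, the assumption that execution time scales inversely with clock frequency, and the function names are illustrative and not taken from the PETS implementation.

```python
def dynamic_slack(wcet_s, actual_s):
    """Slack left by a task that finished before its worst-case execution time."""
    return max(0.0, wcet_s - actual_s)


def reclaimed_frequency(next_wcet_s, slack_s, f_nominal_mhz, supported_mhz):
    """Lowest supported frequency that fits the next task into its WCET plus the slack.

    Assumes execution time is inversely proportional to the clock frequency.
    """
    ideal = f_nominal_mhz * next_wcet_s / (next_wcet_s + slack_s)
    candidates = [f for f in sorted(supported_mhz) if f >= ideal]
    return candidates[0] if candidates else max(supported_mhz)
```

Choosing the lowest supported frequency at or above the ideal value keeps the next task within its time budget, at the cost of leaving some slack unused.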

Workload balancing: As mentioned earlier, workload balancing is one of the best ways to increase the performance and reduce the energy consumption of the processor. The objective is to exploit the parallelism in both the architecture and the algorithm. To do so, all the processor cores are equally loaded with the tasks of an application. The weight of each task is obtained by static profiling of the application. From these weights, the tool computes the computational complexity and the feasibility of mapping the tasks of the given application onto a processor core or onto an FPGA. This technique does not consider the communication overhead between the processor cores, which is a critical factor for communication-intensive applications. Similar work has been proposed to achieve load balancing of a JPEG application for MPSoC synthesis [20].
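One simple way to load the cores equally from profiled task weights is a greedy heuristic: heaviest task first, always onto the least-loaded core. The sketch below shows that heuristic as an illustration only; the paper does not state which mapping algorithm PETS uses, and the task weights in the usage note are made up. Like the technique described above, it ignores communication overhead.

```python
import heapq


def balance_tasks(task_weights, n_cores):
    """Greedy mapping: heaviest task first, onto the currently least-loaded core."""
    heap = [(0.0, core) for core in range(n_cores)]  # (accumulated load, core id)
    mapping = {core: [] for core in range(n_cores)}
    for name, weight in sorted(task_weights.items(), key=lambda kv: -kv[1]):
        load, core = heapq.heappop(heap)
        mapping[core].append(name)
        heapq.heappush(heap, (load + weight, core))
    loads = {core: sum(task_weights[t] for t in tasks) for core, tasks in mapping.items()}
    return mapping, loads
```

For instance, with hypothetical weights `{"VLD": 4.0, "IQ": 2.0, "IDCT": 3.0, "IMC": 1.0}` and two cores, both cores end up with the same load.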

So far, power models for the single-core and dual-core processors, the FPGA and the heterogeneous multiprocessor architecture have been developed. The next section presents the second step of the power estimation methodology: the estimation of the overall power consumption of single-core, dual-core and heterogeneous multiprocessor architectures at system level, with results obtained using Multimedia 4 and SPEC2006 5 benchmarks.

VII. POWER ESTIMATION ENVIRONMENT

In the second step, a system-level prototype of PowerPC405, ARM9, ARM Cortex-A8 and multi-core ARM Cortex-A9 based architectures has been programmed. The prototype uses different SystemC models, in particular the JIT and interpreted ISS simulators provided by OVP 6 and SoCLib 7 for the target processor. Furthermore, the cache parameters, pipeline stage unit and bus latencies are set to emulate the real platform behaviour. A set of counters is injected into the simulator to record the IPC, the cache miss rates (read data miss, write data miss and read instruction miss), the memory accesses and the bus accesses.
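The role of the injected counters can be illustrated with the sketch below. The class and field names are hypothetical and do not correspond to the OVP or SoCLib APIs. The definition of the total miss rate (all misses over all accesses) is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class ActivityCounters:
    """Hypothetical counter set filled in by the simulator during a run."""
    cycles: int = 0
    instructions: int = 0
    instr_accesses: int = 0
    instr_misses: int = 0
    read_accesses: int = 0
    read_misses: int = 0
    write_accesses: int = 0
    write_misses: int = 0

    def ipc(self):
        """Instructions per cycle."""
        return self.instructions / self.cycles if self.cycles else 0.0

    @staticmethod
    def _rate(misses, accesses):
        """Miss rate in percent, 0 when there were no accesses."""
        return 100.0 * misses / accesses if accesses else 0.0

    def miss_rates_percent(self):
        """Instruction, read, write and total cache miss rates in percent."""
        total_acc = self.instr_accesses + self.read_accesses + self.write_accesses
        total_miss = self.instr_misses + self.read_misses + self.write_misses
        return {
            "instruction": self._rate(self.instr_misses, self.instr_accesses),
            "read": self._rate(self.read_misses, self.read_accesses),
            "write": self._rate(self.write_misses, self.write_accesses),
            "total": self._rate(total_miss, total_acc),
        }
```

The resulting IPC and miss rates are the kind of activity values that the power models take as inputs.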

A. Power & energy estimation results

1) Power estimation for single core architecture: The first study estimates the power consumption of single-core processor architectures. As an example, we used the MPEG 2 decoder simple profile application as a benchmark to show the data gathered by the activity counters. The MPEG application consists of 4 main tasks: VLD, IQ, IDCT and IMC. Fig. 4 shows the activities fetched by the counter interface of the fast IA simulator for each task of the MPEG-2 Part 2 decoder application. These results help designers to tune the application for better power optimization. Detailed results from multimedia and single processing benchmarks show a behaviour similar to that of the MPEG-2 Part 2 decoder application. In the following step, we estimated the total power dissipation of each task using the single-core power models shown in Table II. Fig. 3 illustrates the results and compares the proposed tool with the real board measurements. To assess the accuracy of our tool, we used several other benchmark programs available in the industry, as shown in Fig. 3. Our tool has a maximum error of 4% compared to the real board measurements.

4 http://www.spec.org/gwpg/pastissues/Dec97
5 http://www.spec.org/cpu2006/
6 http://www.ovpworld.org/
7 http://www.soclib.fr/trac/dev
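As an illustration of how the single-core models of Table II turn activity values into a power figure, the sketch below evaluates the ARM9 and ARM Cortex-A8 rows and computes the relative error against a measurement. The units (frequency in MHz, miss rates in percent) and the reading of γ1 and γ2 as two separate cache miss rates are assumptions made for this example. The input values in the usage note are made up and are not measurements from the paper.

```python
def power_arm9_mw(f_processor_mhz, miss_rate_pct):
    """ARM9 row of Table II: P = 1.03 F + 0.6 gamma + 5.3."""
    return 1.03 * f_processor_mhz + 0.6 * miss_rate_pct + 5.3


def power_cortex_a8_mw(f_processor_mhz, ipc, gamma1_pct, gamma2_pct):
    """Cortex-A8 row of Table II: P = 0.79 F + 38.65 IPC + 0.26 (gamma1 + gamma2) + 10.13."""
    return 0.79 * f_processor_mhz + 38.65 * ipc + 0.26 * (gamma1_pct + gamma2_pct) + 10.13


def relative_error_pct(estimated_mw, measured_mw):
    """Estimation error relative to a board measurement, in percent."""
    return 100.0 * abs(estimated_mw - measured_mw) / measured_mw
```

For example, `power_arm9_mw(100, 2.0)` evaluates the ARM9 model at 100 MHz with a 2% miss rate.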

2) Multi-core & heterogeneous multiprocessor architecture energy estimation: The second study involves processor architectures with identical cores and with heterogeneous processors running the multimedia benchmarks. All the cores/processors execute the same workload. Fig. 5 & Fig. 6 report the total energy consumption in mJ.

Compared to real board measurements, our tool achieves a maximum error of 4.5% for the dual-core architecture, while McPAT has a maximum error of 28%. This accuracy has two main causes. First, the power laws are extracted from real board measurements. Second, additional activities that are intrinsic to parallel processing, such as synchronization and communication overheads, are accurately evaluated by the PETS built-in simulator. To retrieve the activities for the OMAP3530 platform, we used the C64x+ cycle accurate simulator 8 with a special interface from PETS. These results encourage us to consider architectures with a higher number of cores/processors when exploring new complex heterogeneous processor architectures. Fig. 5 & Fig. 6 show that, for the implemented parallel multimedia application, adding cores/processors/hardware accelerators to the system decreases the execution time, which improves the system performance.

B. Power optimization for multi-core & multiprocessor architecture

Power optimization based on DSR: To evaluate the DSR technique implemented in PETS, we used an H.264 application (4 slices, 50 frames) running on an ARM Cortex-A9 (quad core) processor based platform. The platform is configured with 4 processor cores at 500 MHz. Based on the longest processing time, we determine the parameters of each task (release time, actual-case and worst-case execution time, relative deadline and periodicity). These parameters are given as input to the tool during the execution of the H.264 application. At tool start-up, we set up pseudo frequencies based on the frequencies available on the board, so that the frequency can be changed dynamically during the application simulation. To optimize the power consumption of the overall system, we dynamically change the frequency of each processor core according to these pseudo frequencies (250 MHz, 500 MHz and 720 MHz).

To evaluate our tool with this technique, we executed a modified H.264 with different processor configurations. Table III gives the power consumption during the execution of H.264 for different configurations of the multi-core processor based platform. We vary the number of active cores to verify the impact on power consumption with and without the optimization algorithm. This analysis takes the CPU load into account by varying the number of threads with different decoder configurations (2 and 4 slices). Table III shows the power dissipation and execution time of the different H.264 configurations for the active processor cores.

8 http://processors.wiki.ti.com/index.php/C64x%2B_Cycle_Accurate_Simulator

[Figure: PETS energy estimation. Y-axis: Energy (mJ), logarithmic axis from 1 to 10000. X-axis: JPEG I, JPEG II, H264 CIF, H264 QCIF, MPEG-4, JPEG2000, H.263, X264 encoder, Downscalar, H.264 HD, Parallel insertion sort. Series: PowerPC 2 processors, ARM Cortex-A9 dual core, PowerPC 4 processors, ARM Cortex-A9 quad core.]

Fig. 5. Energy estimation variation according to the number of processor cores with different multimedia application (PowerPC, ARM9, ARM Cortex-A8, ARM Cortex-A9)

[Figure: PETS energy estimation for heterogeneous multiprocessor based platform. Y-axis: Energy (mJ), 100 to 14100. X-axis: JPEG I, JPEG II, H264 CIF, H264 QCIF, MPEG-4, JPEG2000, H.263, X264 encoder, Downscalar, H.264 HD, Parallel insertion sort. Series: OMAP (ARM Cortex-A8 + DSP C64x), Virtex II Pro (PowerPC + FPGA), Zynq 7000 (ARM Cortex-A9 + FPGA).]

Fig. 6. Energy estimation variation according to the number of heterogeneous multiprocessor architecture based platform with different multimedia application (OMAP3530, Virtex II Pro, Zynq7000)

TABLE III. POWER OPTIMIZATION BASED ON DSR OF H.264 APPLICATION

Processor core                   Power (mW)   Execution time (s)
2 processor cores without DSR    374          4.5
2 processor cores with DSR       231          4.8
4 processor cores without DSR    391          3.7
4 processor cores with DSR       312          3.4
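Since Table III reports power and execution time separately, the energy of each configuration can be derived as a rough cross-check. The sketch below assumes that the reported power is the average power over the whole run, so that energy is power multiplied by time. This assumption is ours, and the derived energies are not stated in the paper.

```python
# (power in mW, execution time in s), copied from Table III.
TABLE_III = {
    (2, "without DSR"): (374, 4.5),
    (2, "with DSR"): (231, 4.8),
    (4, "without DSR"): (391, 3.7),
    (4, "with DSR"): (312, 3.4),
}


def energy_mj(power_mw, time_s):
    """Energy in mJ, assuming power_mw is the average power over time_s."""
    return power_mw * time_s


def dsr_energy_saving_pct(cores):
    """Relative energy reduction obtained with DSR for a given core count."""
    base = energy_mj(*TABLE_III[(cores, "without DSR")])
    opt = energy_mj(*TABLE_III[(cores, "with DSR")])
    return 100.0 * (base - opt) / base
```

Under this assumption, DSR lowers the energy by about 34% on 2 cores, even though the execution time grows from 4.5 s to 4.8 s, and by about 27% on 4 cores.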

TABLE IV. ENERGY OPTIMIZATION BASED ON WORK LOAD BALANCING OF JPEG AND H.264 APPLICATION

Application   No. of processor cores   Energy without workload balancing (mJ)   Energy with workload balancing (mJ)
JPEG          1                        252.23                                   n/a
JPEG          2                        166.51                                   152.54
JPEG          4                        118.24                                   101.23
H.264         1                        812.12                                   n/a
H.264         2                        542.23                                   469.144
H.264         4                        384.12                                   302.21

Energy optimization based on work load balancing: To implement the workload based energy optimisation in our tool, we used a setup similar to the one used for DSR (H.264 on an ARM Cortex-A9 quad core processor). First, static profiling of the sequential H.264 application is done. From the profiling results, we obtain the dependency graph of the application and its computationally intensive tasks (i.e., motion estimation/intra prediction). We further divided motion estimation and intra prediction to obtain a four-way graph to be implemented on the quad core processor. Table IV presents the energy optimization results obtained by PETS with the workload balancing technique. It compares the energy without and with workload balancing for the 2 and 4 processor core configurations. Table IV shows that this technique reduces the energy consumption of the processor.
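The reduction reported in Table IV can be quantified with the short sketch below, which computes the relative energy saving of each multi-core configuration. The percentages are derived here from the table values and are not stated in the paper.

```python
# (energy without balancing, energy with balancing) in mJ, copied from Table IV.
TABLE_IV_MJ = {
    ("JPEG", 2): (166.51, 152.54),
    ("JPEG", 4): (118.24, 101.23),
    ("H.264", 2): (542.23, 469.144),
    ("H.264", 4): (384.12, 302.21),
}


def saving_pct(without_mj, with_mj):
    """Relative energy saving of workload balancing, in percent."""
    return 100.0 * (without_mj - with_mj) / without_mj


# Saving for every (application, core count) pair of Table IV.
SAVINGS = {key: saving_pct(*values) for key, values in TABLE_IV_MJ.items()}
```

The savings range from roughly 8% (JPEG on 2 cores) to roughly 21% (H.264 on 4 cores), and for both applications they are larger on 4 cores than on 2.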

VIII. CONCLUSION

This paper proposes a simulation based Power Estimation and optimisation Tool for complex processor based platforms at System-level. PETS is the first tool to integrate power models and optimization techniques at the simulation level for state-of-the-art processors. The PETS tool includes an Eclipse environment and a scripting interface to instantiate the processor and the application. Moreover, it contains a graphical interface for displaying the results of the simulation, which is coupled with the Eclipse environment. First, a power modeling methodology for system level has been proposed to address the global system consumption, which includes heterogeneous processors, reconfigurable hardware, etc. Secondly, the developed power models are seamlessly coupled with a fast simulation technique to capture the micro-architectural activities needed by the power models, with a better trade-off between accuracy and speed. In addition, using functional power models brings transparency regarding the low-level implementation, reduces the number of dependent parameters and eases the extraction of the required data. Our proposed system-level power estimation tool explores these two aspects and offers an accurate and fast system-level power prediction. Experimental results show that our tool exhibits less than 4% average error compared with the real measurements and is more accurate than other tools. With the proposed tool, the designer can explore several implementation choices: single core, multi-core and heterogeneous multiprocessor architecture based platforms.

As part of future work, we will focus on the integration of thermal and reliability models based on energy and power at the system level. Furthermore, some refinements of the power models are needed to obtain more accurate power estimations.

ACKNOWLEDGMENT

The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] under the ParaDIME Project (www.paradime-project.eu), grant agreement no 318693.

REFERENCES

[1] G. Beltrame, L. Fossati, and D. Sciuto. ReSP: A Nonintrusive Transaction-Level Reflective MPSoC Simulation Platform for Design Space Exploration. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 28(12):1857–1869, Dec. 2009.

[2] M. K. Bhatti, C. Belleudy, and M. Auguin. Hybrid power management in real time embedded systems: an interplay of dvfs and dpm techniques. Real-Time Syst., 47(2):143–162, Mar. 2011.

[3] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. International Symposium on Computer Architecture ISCA’00, pages 83–94, 2000.

[4] S. Eyerman and L. Eeckhout. Fine-grained dvfs using on-chip regulators. ACM Trans. Archit. Code Optim., 8(1):1:1–1:24, Feb. 2011.

[5] K. Funaoka, A. Takeda, S. Kato, and N. Yamasaki. Dynamic voltage and frequency scaling for optimal real-time scheduling on multiprocessors. In SIES, pages 27–33, 2008.

[6] J. Kreku, K. Tiensyrja, and G. Vanmeerbeeck. Automatic workload generation for system-level exploration based on modified gcc compiler. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’10, pages 369–374, Leuven, Belgium, 2010. European Design and Automation Association.

[7] M. Lajolo, A. Raghunathan, S. Dey, and L. Lavagno. Efficient power co-estimation techniques for system-on-chip design. In Design, Automation and Test in Europe Conference and Exhibition 2000. Proceedings, pages 27–34, 2000.

[8] J. Laurent, N. Julien, and E. Martin. Functional level power analysis: An efficient approach for modeling the power consumption of complex processors. In Proceedings of the Design, Automation and Test in Europe Conference, Munich, 2004.

[9] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 469–480, New York, NY, USA, 2009. ACM.

[10] E. Macii, editor. Design, Automation and Test in Europe, DATE 13, Grenoble, France, March 18-22, 2013. EDA Consortium San Jose, CA, USA / ACM DL, 2013.

[11] A. Mello, I. Maia, A. Greiner, and F. Pecheux. Parallel simulation of systemc tlm 2.0 compliant mpsoc on smp workstations. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’10, pages 606–609, Leuven, Belgium, 2010. European Design and Automation Association.

[12] R. Piscitelli and A. D. Pimentel. A signature-based power model for mpsoc on fpga. VLSI Des., 2012:6:6–6:6, Jan. 2012.

[13] S. Rethinagiri, R. Ben Atitallah, S. Niar, E. Senn, and J. Dekeyser. Hybrid system level power consumption estimation for fpga-based mpsoc. In Computer Design (ICCD), 2011 IEEE 29th International Conference on, pages 239–246, 2011.

[14] S. K. Rethinagiri, R. Ben Atitallah, J.-L. Dekeyser, E. Senn, and S. Niar. An efficient power estimation methodology for complex risc processor-based platforms. In Proceedings of the Great Lakes Symposium on VLSI, GLSVLSI ’12, pages 239–244, New York, USA, 2012. ACM.

[15] S. K. Rethinagiri, O. Palomar, R. Ben Atitallah, S. Niar, O. Unsal, and A. C. Kestelman. System-level power estimation tool for embedded processor based platforms. In Proceedings of the 6th Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, RAPIDO ’14, pages 5:1–5:8, New York, NY, USA, 2014. ACM.

[16] S. Thoziyoor and N. Muralimanohar. Cacti 5.0, 2007.

[17] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli. Multi2sim: a simulation framework for cpu-gpu computing. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques, PACT ’12, pages 335–344, New York, NY, USA, 2012. ACM.

[18] C. Xian, Y.-H. Lu, and Z. Li. Energy-aware scheduling for real-time multiprocessor systems with uncertain task execution time. In Design Automation Conference, 2007. DAC ’07. 44th ACM/IEEE, pages 664–669, 2007.

[19] W. Ye, N. Vijaykrishnam, M. Kandemir, and M. Irwin. The design and use of simplepower: a cycle accurate energy estimation tool. In Proc. Design Automation Conference DAC’00, June 2000.

[20] V. Zadrija and V. Sruk. Mapping algorithms for mpsoc synthesis. In MIPRO, 2010 Proceedings of the 33rd International Convention, pages 624–629, 2010.