
A Power Estimation Technique for Cycle-Accurate Higher-Abstraction SystemC-based CPU Models

Efstathios Sotiriou-Xanthopoulos∗, G. Shalina Percy Delicia∗∗, Peter Figuli∗∗, Kostas Siozios∗, George Economakos∗, and Jurgen Becker∗∗

∗School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece. E-mail: {stasot, ksiop, geconom}@microlab.ntua.gr

∗∗Institute for Information Processing, Karlsruhe Institute of Technology, Karlsruhe, Germany. E-mail: {shalina.ford, peter.figuli, becker}@kit.edu

Abstract: Due to the ever-increasing complexity of embedded system design and the need for rapid system evaluation in early design stages, simulation models known as Virtual Platforms (VPs) have become essential, as they enable system modeling at higher abstraction levels. Since a typical VP features multiple interdependent components, VP libraries are used to provide off-the-shelf models of commonly-used hardware components, such as CPUs. However, CPU power estimation is not adequately supported by existing VP libraries. In addition, existing power characterization techniques require architectural details which are not always available in early design stages. To address this issue, this paper proposes a technique for power annotation of CPU models targeting SystemC/TLM libraries, in order to enable accurate power estimation at higher abstraction levels. By running a set of benchmarks on a power-annotated SystemC/TLM model of the Xilinx Microblaze soft-processor, it is shown that the proposed approach achieves accurate power estimation in comparison to real-system power measurements: the estimation error ranges from 0.47% up to 6.11%, with an average of 2%.

Keywords: Power annotation; Virtual Platform; SystemC; Cycle-accurate

I. INTRODUCTION

Modern embedded systems encompass a very wide range of applications with different requirements and specifications, while aiming to maximize throughput and minimize power consumption. Towards this goal, a vast combination of different architectural decisions is taken into account, such as the number and type of processing elements, the size of caches, the interconnection type, etc. However, this results in an ever-increasing complexity of embedded system design. Thus, new design strategies and methodologies are needed in order to bridge the gap between the designer’s productivity and the number of available design decisions [1]. One of those strategies is the use of Virtual Platforms (VPs), i.e. software representations of the system-under-design, which enable hardware simulation for both early software development and rapid system evaluation. For the latter, system modeling can be achieved at higher abstraction levels by excluding architectural details which are not yet available, especially in early design stages.

This work was partially supported by the “TEAChER: TEach AdvanCEd Reconfigurable architectures and tools” project funded by DAAD (2014) and by the CIDCIP and MENELAOS projects funded by the Greek Ministry of Development under the National Strategic Reference Framework NSRF 2007-2013, action “Creation of innovation clusters” “A GREEK PRODUCT, A SINGLE MARKET: THE PLANET”.

For the development of VPs, a very commonly-used language is SystemC [2], which provides Hardware Description Language (HDL) features like signals, ports, threads, etc. Hence, it enables VP modeling at any abstraction level, from functional-only down to RTL, by combining the easy behavioural representation of C++ source code with the common hardware features of a typical HDL. In addition, the TLM2.0 extension provides Transaction-Level Modeling (TLM) of the communication among VP components without demanding details like bus signals, arbitration, etc. As SystemC/TLM does not provide specific models for commonly-utilized VP components like CPUs, memories, interconnection, debuggers, etc., SystemC/TLM-based VP libraries provide such off-the-shelf modules to facilitate system design.

A strong requirement is the evaluation of power consumption, as the power budget of modern embedded systems is very limited, either due to the need for long battery life or for reducing thermal dissipation. This requirement is more stringent in the case of CPUs, as they consume a significant portion of the overall system power. However, most of the existing VP libraries provide only timing information, while power estimation for CPUs is not addressed. In addition, although a number of power estimation techniques for SystemC modules have been proposed, these techniques require architectural details which are not always available for CPUs, especially in early design stages.

Towards this need, the goal of this paper is to power-characterize cycle-accurate CPU models at higher abstraction levels. Hence, it is possible to create power-annotated CPU libraries for multiple CPU models and for multiple implementation technologies of each CPU. The power-annotated library can be (re)used in numerous SystemC-based VPs which utilize one or more CPUs of the provided library. In particular, this paper extends the approach proposed in [3], where each CPU instruction is power-annotated by executing multiple copies of the instruction on a real system and measuring the power consumption with an oscilloscope. When applying this technique to cycle-accurate CPU models, this measurement also includes the power consumption of the cache hit, the instruction decoding and other activity of the CPU microarchitecture. Thus, the contribution of this paper is to achieve an abstracted but accurate CPU power annotation, where there is no need for characterizing every architectural detail; instead, only a few additional parameters, like the cache miss penalty and the pipeline activity, are required. Such additional power parameters can be obtained from measurements made on multiple sequences of instructions, thus making power characterization much more feasible in early design stages. By running a number of benchmarks on a VP with a power-annotated model of the Xilinx Microblaze soft-processor and comparing the estimated power consumption with real measurements on a Xilinx FPGA board, we show that the absolute power estimation error ranges from 0.47% up to 6.11%, with an average of 2%.

Fig. 1. Achieving a trade-off between speed and accuracy during VP modeling

The rest of the paper is organized as follows: Section II describes in detail the motivation of the presented approach. Section III provides a qualitative comparison of existing power annotation/estimation approaches. Section IV presents the proposed power characterization methodology. Section V presents the experimental results of the proposed approach. Finally, Section VI concludes the paper and outlines future work.

II. MOTIVATION

Following Moore’s law, the number of transistors that can be integrated onto a single chip doubles approximately every 24 months, which has led to tremendous advancement in the embedded industry. However, this has also led to an exponential increase in design complexity. Thus, high-level design techniques were proposed in order to enhance the designer’s productivity [1]. While speed has neared its limits and has barely doubled in the last decade, power consumption has become a serious constraint during logic design, either because of the need for long battery life or due to reliability issues which arise from thermal dissipation. Thus, ongoing technology scaling, the need for high-level integration, and the requirement for low-power design have put emphasis on early energy estimation, as stated by the International Technology Roadmap for Semiconductors (ITRS) of 2013 [4].

VPs meet the above demands by enabling the simulation of the system in early design stages in the absence of actual hardware. Such a simulation can be executed in a very fast, abstract manner with reduced accuracy, or in a more accurate way that results in slower simulations. At these extremes, either the simulation accuracy hardly reflects the real hardware, or higher accuracy is obtained at the expense of execution time. Thus, one of the most important objectives during embedded system design is to achieve a trade-off between speed and accuracy depending on the design stage, by placing emphasis on cycle-accurate VPs, as shown in Figure 1, which are able to accurately model the system behaviour while keeping the simulation speed at decent levels.

The features that a cycle-accurate simulation provides also enable early energy/power estimation of embedded systems. However, towards this aim, the correct characterization/annotation of the VP is required even when numerous architectural details are missing, which is the case at higher abstraction levels. This is the exact requirement that the proposed methodology aims to address by focusing on higher-abstraction SystemC/TLM-based CPU models.

III. RELATED WORK

Table I presents a qualitative comparison between the proposed methodology and the existing power annotation/estimation techniques. The criteria for this comparison involve the speed and accuracy of power evaluation, how easily applicable and configurable each technique is, as well as the financial cost, which involves all the resources and licenses required for measuring the power consumption. This comparison focuses only on simulation-based estimation approaches. An alternative is the power annotation of the target software [5][6]; however, such an approach incurs a high estimation error, as it ignores the effect of various CPU components, e.g. caches, pipeline, etc., while its accuracy highly depends on the software optimization performed by the compiler. Thus, using power-annotated instruction-set simulators for detailed software simulation is considered an important prerequisite for accurate power estimation.

A commonly-used power measurement technique is to map the system-under-evaluation (including the CPU) onto a real FPGA board [7] and measure the power consumption with an oscilloscope via pins which correspond to the core voltage of the FPGA. Although fast and accurate, such an evaluation technique only supports a limited number of soft-processors, depending on the available resources of the FPGA. In addition, a serious disadvantage is that the RTL description of the system (in VHDL or Verilog) is needed, which is not always available in early design stages, especially for CPUs. Moreover, for some processors, using exact architectural details may require special licenses which may highly increase the cost. Thus, for the scope of this paper, such a technique is used only for annotating single CPUs, which can be mapped onto less expensive small-scale FPGAs, and the power measurements incorporated in a CPU library can be reused for multiple and more complex system models.

Another alternative is gate-level power estimation using commercial tools like Synopsys PrimeTimePX [8]. The input is an RTL netlist in VHDL/Verilog and a VCD file created by RTL simulation tools like Modelsim [9]. However, this technique also requires an RTL description of the system-under-evaluation, and this description is not easily parameterizable. Moreover, the creation of the VCD file is a very slow task, which does not facilitate architectural exploration of large design spaces.

At the other extreme, the methodology proposed in [3] provides a very fast power estimation framework for Open Virtual Platform (OVP), which enables rapid system evaluation and coverage of large design spaces, while the only cost involves the use of a real board, e.g. an FPGA, for the CPU power annotation. However, OVP provides instruction-accurate instruction set simulators designed mainly for functional verification, which do not correctly model mandatory CPU functionality like caches, pipeline, etc. This leads to a low-accuracy power estimation with very high errors. This issue is mitigated by an empirical correction factor, which achieves decent power accuracy. However, the estimation error still remains too high for later design stages that require better accuracy.

TABLE I. QUALITATIVE COMPARISON OF THE PROPOSED APPROACH WITH EXISTING POWER ANNOTATION/ESTIMATION TECHNIQUES

| Criterion | Measurement in FPGA board [7] | Gate-level estimation [8] | Annotation to OVP [3] | PowerSim [10] | SimplePower [11] / Wattch [12] | Prediction Models [13] [14] | Proposed |
|---|---|---|---|---|---|---|---|
| Evaluation speed | Fast | Slow | Fast | Medium (a) | Medium | Medium (a) | Medium (a) |
| Estimation accuracy | Very high | Very high | Low | High (a) | High | Medium | High (a) |
| CPU models support | Limited | Limited | Any | Any | Limited | Any | Any |
| Applicable to multiple platforms | N/A (b) | N/A (b) | Yes | Yes | No | Not clearly addressed | Yes |
| Supported CPU abstraction levels | RTL | Gate-level RTL | Functional | Architectural/RTL | Architectural | Architectural | TLM/RTL |
| Configurability | Medium | Low | High | High | High | High | High |
| Cost | High | High | Medium | High | High | Medium | Medium |

(a) Assuming cycle-accurate simulation
(b) No VP is used; the whole system is described in VHDL/Verilog

A technique which supports cycle-accurate power estimation is provided by PowerSim [10], a library that extends SystemC and provides mechanisms for the power annotation of each module of the VP. If PowerSim is applied to CPUs, every part of the CPU, including the fetching and decoding mechanisms, the arithmetic and logic components, the pipeline, the register file, the caches, etc., should be modeled as a power-annotated SystemC module, thus allowing accurate evaluation with decent simulation speed. Similarly, simulators like SimplePower [11] and Wattch [12] provide cycle-accurate power estimation by taking as input the power consumption of each CPU part. However, none of these approaches achieves power annotation at higher abstraction levels; instead, they require the acquisition of power measurements for each CPU part separately, which might be a tedious, expensive and sometimes infeasible task that is preferably avoided in early design stages. Additionally, SimplePower and Wattch cannot be easily combined with VPs featuring other components and peripherals.

A set of techniques which are very close to the proposed methodology involve the use of prediction models for power estimation. In particular, [13] distinguishes a number of instruction categories, each of which corresponds to a measured power consumption. This model is utilized for predicting the power consumption of a specific instruction, according to the categories to which the instruction belongs. Another approach [14][15] is the use of a linear regression model, which is trained using representative software execution scenarios. All the above approaches can be used at higher abstraction levels, as they do not require the power annotation of all CPU parts. However, when higher evaluation accuracy is needed, those techniques hardly cover the design requirements; in particular, power estimation should be based on detailed simulation rather than on predictions. This issue becomes more severe in the case of regression models, e.g. [14], as such approaches use as input average values for the number of cycles per instruction, the cache miss rate, the memory access rate, etc., which are not always representative. For example, an application exhibits different cache miss behaviour at the beginning, in the middle and at the end of its execution. In addition, only a few approaches address the use of VPs for early system evaluation: of the aforementioned approaches, only [15] uses a fully-developed VP; however, its CPU power estimation is based on regression rather than on cycle-accurate evaluation. Last but not least, especially in [13], there is no information about how to accurately handle cache misses.

The methodology proposed in this paper resolves the aforementioned issues by performing cycle-accurate power annotation without demanding architectural details for the CPU-under-characterization. In particular, as compared with approaches like [13], the accuracy of the proposed methodology relies on the fine-grain power annotation of each CPU instruction, while the power estimation is not based on predictions/regression, but is extracted after a detailed software simulation on the CPU model. In addition, in contrast to static software annotations, the cycle-accurate simulation also includes the effect of cache hits/misses, the pipeline, the accesses to main memory, etc. Also, the effect of each activity of the CPU is thoroughly analyzed during every distinct phase of the software execution. The only requirement for this analysis is the power budget of each instruction, accompanied by a few more parameters like cache miss penalty and pipeline power. Thereby, power annotation can be easily applied to every CPU with the same cost as in [3], while the resulting power library can be used in any SystemC/TLM-based system model.

IV. PROPOSED METHODOLOGY

This section presents in detail the proposed cycle-accurate CPU power annotation methodology. Firstly, a technique for measuring the power consumption of every CPU instruction is demonstrated, as this is the basis of the proposed methodology. Afterwards, the overall power profiling framework, which is accompanied by a reference VP template, is analyzed. This analysis also includes the modeling of cache hits/misses for accurate power estimation, as well as how each CPU and cache state is finally translated into power consumption.

Fig. 2. Power Annotation for CPU Instructions

A. Power Annotation for CPU instructions

Figure 2 depicts the proposed power annotation approach for each instruction of the underlying CPU model. For the power annotation, we utilize a real hardware prototyping platform, which can be either a processor board incorporating a hard-processor or an FPGA board onto which a soft-processor is mapped. To facilitate the annotation procedure, the real board is controlled by a system debugger, typically provided by the Integrated Development Environment (IDE) that accompanies the hardware board. By using the debugger, the execution of the software running on the board can be paused/resumed, thereby providing better control over the power measurements.

In order to measure the power of each instruction in the processor’s instruction set, a repeating series of multiple copies of the instruction to be annotated is used. The existence of these multiple copies minimizes the effect of the branch instruction which is used for the loop mechanism. Each instruction copy is inserted as inline assembly code.

The electrical current profile of the instruction during its execution is measured with a digital oscilloscope via pins provided on the hardware board, which has a core voltage of $V_{core}$. The current measured by the oscilloscope might include noise because of variations in the load. Using a load resistor can stabilize the output current, but this leads to a reduction in the input voltage. In order to solve this problem, a high-performance shunt current monitor with a differential amplifier is recommended, in order to convert the load-sensitive current from the board into a stabilized current to be given to the oscilloscope. This shunt current monitor employs a resistor of resistance R and amplifies the measured current by a factor G, hereafter called the amplification gain. In this way, a power waveform is produced and can be exported from the oscilloscope. The instant power consumption at a specific time point t of this waveform is calculated from the instant current intensity $i_{measured}(t)$, according to Equation 1.

$$p_{measured}(t) = \frac{V_{core} \times i_{measured}(t)}{G} \tag{1}$$

The final (effective) power for each specific instruction is defined as the average of the measured power consumption $p_{measured}(t)$ of the waveform over a representative number T of time points, as Equation 2 describes.

$$P^{Instruction}_{Effective} = \frac{\sum_{t=1}^{T} p_{measured}(t)}{T} \tag{2}$$
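As a minimal sketch of Equations 1 and 2, the Python fragment below converts a list of current samples into instantaneous power values and averages them. The function names and the sample values are illustrative assumptions and not part of the original flow; the values 1.2 V and 50 for the core voltage and the gain are those reported in Table III.

```python
def instantaneous_power(i_measured, v_core, gain):
    """Equation 1: p(t) = V_core * i_measured(t) / G for every sample."""
    return [v_core * i / gain for i in i_measured]


def effective_power(i_measured, v_core, gain):
    """Equation 2: mean of the instantaneous power over T samples."""
    p = instantaneous_power(i_measured, v_core, gain)
    if not p:
        raise ValueError("at least one sample is required")
    return sum(p) / len(p)


# Hypothetical amplified current samples (placeholders, not measured data).
samples = [4.0, 4.5, 5.0, 4.5]
p_eff = effective_power(samples, v_core=1.2, gain=50)
```

With a real measurement, `samples` would be replaced by the current column of the waveform exported from the oscilloscope.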

Fig. 3. Reference Virtual Platform with power profiler

The power annotation result for each instruction is a tuple <symbol, power> that maps each instruction symbol to the measured power for the specific instruction. A file with the power annotation of all the instructions is given as input to the VP to enable the overall power profiling.
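The exact layout of the annotation file is not specified here. As a sketch only, assuming a hypothetical plain-text layout with one "symbol power" pair per line and "#" comments, the <symbol, power> tuples could be loaded into a lookup table as follows. The function name and the example values are placeholders.

```python
def parse_annotations(text):
    """Parse '<symbol> <power in W>' lines into a dict; '#' starts a comment."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank lines and comment-only lines
        symbol, power = line.split()
        table[symbol.lower()] = float(power)
    return table


# Hypothetical file contents; the power values are placeholders, not measurements.
example = """
# symbol  power[W]
add   0.120
lw    0.125
"""
annotations = parse_annotations(example)
```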

B. Simulation and Profiling Framework

Figure 3 depicts the reference VP template to be used for the rest of the paper. This VP is based on SoCLib [16], a TLM library which provides abstracted but cycle-accurate CPU models, as well as memories, interconnection, debuggers, terminals, etc. Nonetheless, the proposed approach can be applied to any SystemC/TLM-based library as well.

For the correct modeling of memory accesses and cache hits/misses during cycle-accurate simulation, the CPU model employs a set of Finite-State Machines (FSMs), which control the instruction and data caches. The current state of each FSM, in conjunction with the instruction currently executed by the interpreter of the CPU model, is given to the profiler, i.e. a C++ class which uses this information to estimate the instant power consumption of each CPU clock cycle. Hence, the profiler calculates the total power of the executed algorithm, and it can also produce a power trace including the executed instruction, the number of cycles taken, the status of the caches and the instant power for further power analysis. A very important feature of this profiler is that it can be combined with any CPU model, as it only requires an appropriate interface from the CPU model that makes its inner status visible¹. For the power calculation, the profiler is configured with the instruction set and the power budget for each instruction. Additionally, a set of power parameters, which mainly concern the correct profiling of cache misses, should be included.

¹ This requirement is satisfied by the CPU templates provided by SoCLib.

Fig. 4. Finite-State Machine for read accesses to memory and CPU cache control

Fig. 5. Finite-State Machine for write accesses to memory, assuming write-through cache

To define the minimum possible set of these additional parameters, which is a major contribution of the proposed methodology, a careful analysis of the CPU behaviour is required. To this end, Figures 4 and 5 present a set of FSMs which analyze the CPU and cache control in the case of memory reads and writes, respectively. The actual control is depicted as gray states, each of which takes a single clock cycle, while the white states are included only for the sake of readability. Thus, in the case of a memory read, according to the FSM of Figure 4, if the word to be fetched is in the cache, then a cache hit occurs, which lasts only a single cycle and does not cause the CPU to freeze². Otherwise, a number of words (depending on the cache specifications) is fetched from main memory and, if the fetched instruction or data is to be cached, the cache memory is updated and a cache hit is performed. Figure 4 also analyzes the “FETCH WAIT” state for the case of acquisition from main memory. For memory write accesses, as shown in Figure 5, the focus is on write-through caches; however, the behaviour of write-back caches can be modeled as well. In a write-through cache, if the memory word to be updated does not exist in the cache (i.e. a cache miss happens), then the write access is performed only to the main memory and the cache is left intact. On the contrary, upon a cache hit, not only the main memory but also the cache is updated.

² The term “frozen processor” means that the program counter of the processor does not change because of ongoing memory activity.

TABLE II. ACTIVITY MODES FOR CPU AND CACHES

| Cache FSM state | Instr. cache activity | Data cache activity | CPU status |
|---|---|---|---|
| Instr. cache hit | Read | Idle | Running |
| Instr. cache update | Write | Idle | Frozen |
| Instr. cache hit + Data cache hit | Read | Read | Running |
| Instr. cache hit + Data cache miss | Read | Idle | Running |
| Instr. cache hit + Data cache update | Idle | Write | Frozen |
| Fetch wait | Idle | Idle | Frozen |
| Writing to main memory | Idle | Idle | Frozen |
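The write-through behaviour described for Figure 5 can be sketched as follows. This is a simplified, word-granular illustration in which the cache and the main memory are plain dictionaries; a real cache works on lines and tags, and the function name is an assumption made for this example.

```python
def write_through(cache, memory, address, value):
    """Write-through policy: memory is always written; the cache only on a hit.

    Returns True on a cache hit and False on a cache miss.
    """
    hit = address in cache
    memory[address] = value      # main memory is updated in both cases
    if hit:
        cache[address] = value   # cached copy is updated only on a hit
    return hit


# Hypothetical contents: address 0x10 is cached, address 0x20 is not.
cache = {0x10: 1}
memory = {0x10: 1, 0x20: 2}
hit_result = write_through(cache, memory, 0x10, 7)
miss_result = write_through(cache, memory, 0x20, 9)
```

On the miss, the cache is left intact, which matches the case where the write access is performed only to the main memory.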

By combining the above FSMs for both the instruction and the data cache memory, we can derive a set of activity modes for the CPU and the caches, as presented in Table II, which can be grouped into the following cases:

1) In the case of an instruction cache hit, the power estimation should primarily include the read access to the instruction cache. Additionally, if the instruction reads from data memory and a data cache hit is performed, the power consumption of the data cache read access should be included as well. In all these cases the CPU is in the running state, thus the power consumption of other parts, like the pipeline, should also be taken into consideration.

2) In the case of a data cache miss, although the data is not available, the processor still remains in the running state in order to initiate the fetching from the main memory, while the data cache is idle.

3) After an instruction or data cache miss, the respective cache is updated, thus the power consumption includes the power for writing to the cache. In such a case, the other cache is idle and the CPU is frozen.

4) During both fetching from and writing to the main memory, the CPU and the caches are idle, so a constant for the static power is taken into account, as we focus only on the power consumption of the CPU core.

Below, an analysis is presented of how the CPU activity is addressed by the instruction annotations and the additional power parameters, which are given as input to the VP profiler.

Instruction execution and cache hits: When power-annotating an instruction, it is observed that the annotated power includes not only the power consumption of the instruction itself, but also the power consumption of other CPU activity, e.g. cache hits. In particular, during the first execution of a long series of an instruction INS, numerous (compulsory) cache misses occur. However, when this series is repeated, and provided that the series fits into the instruction cache memory, an instruction cache hit is always performed. Thus, according to Table II, the instruction annotation $P^{INS}_{annotated}$ includes the static power ($P_{static}$), the power for a cache hit ($P_{hit}$), as well as the instruction decoding and execution power ($P_{INS}$), as Equation 3 describes.

$$P^{INS}_{annotated} = P_{hit} + P_{INS} + P_{static} \tag{3}$$

In the methodology proposed in this paper, only $P^{INS}_{annotated}$ is required, without the need for analytically defining $P_{static}$, $P_{hit}$ and $P_{INS}$. Also, in the case of a data load instruction, $P^{INS}_{annotated}$ additionally includes the data cache hit (also equal to $P_{hit}$). We assume that this annotated power also applies to data cache misses, although the data cache is idle, without significant loss of accuracy. Finally, for a data store instruction, no additional power consumption is taken into consideration, because no cache update is performed during the annotation of the instruction, as the stored data is not present in the data cache memory.

Hence, a significant portion of the CPU’s activity is folded into the power annotation of each instruction, without the need for detailed modeling of caches and other CPU parts. This results in a minimum required set of additional power parameters, which are summarized as follows:

• Idle power: This constant applies when the CPU is frozen, i.e. during fetching from or writing to the main memory.

• Miss penalty: This parameter corresponds to the case of an instruction/data cache update and is taken into consideration when the CPU informs the profiler that one of the caches is written.

• Pipeline activity: This value represents the average activity of the CPU pipeline. When a single instruction is executed in multiple copies, all the stages of the pipeline have the same content, thus there is no significant switching activity. However, as a typical algorithm comprises a series of different instructions, significant dynamic power is consumed. This portion of the power consumption is considered when the CPU is not frozen.
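To illustrate how a profiler could combine the instruction annotations with these three parameters, the following Python sketch charges a power value to each simulated clock cycle. It is not the authors' C++ profiler class. The function and key names are hypothetical, the annotated power of `add` is a placeholder, and charging a cache update as idle power plus miss penalty is an assumption that the text does not state explicitly. The three parameter values are those listed in Table IV.

```python
def cycle_power(annotations, params, instruction, cpu_frozen, cache_written):
    """Return the power (in W) charged to one clock cycle.

    Assumption (not stated explicitly in the text): a cache update is
    charged as idle power plus the miss penalty.
    """
    if not cpu_frozen:
        # Running: annotated instruction power plus average pipeline activity.
        return annotations[instruction] + params["pipeline_activity"]
    if cache_written:
        # Frozen while one of the caches is being updated.
        return params["idle_power"] + params["miss_penalty"]
    # Frozen while fetching from or writing to main memory.
    return params["idle_power"]


annotations = {"add": 0.120}  # placeholder annotation, not a measurement
params = {
    "idle_power": 0.104849,
    "miss_penalty": 0.02214228,
    "pipeline_activity": 0.0146,
}

# Hypothetical trace of (instruction, cpu_frozen, cache_written) per cycle.
trace = [("add", True, False), ("add", True, True), ("add", False, False)]
powers = [cycle_power(annotations, params, *cycle) for cycle in trace]
average_power = sum(powers) / len(powers)
```

Averaging the per-cycle values over the whole trace gives the estimated power of the executed program, while the list itself corresponds to a simple power trace.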

Idle power and miss penalty: For the calculation of the idle power and the cache miss penalty, the following assumptions are made without significantly degrading the power estimation accuracy:

• The idle power is equal to the static power $P_{static}$ of the CPU. In this paper, the static power can be obtained with modest accuracy by pausing the execution via the Software Development Kit (SDK) debugger, which controls the real board that is used for the power annotation.

• The power consumption of write accesses to the cache is similar to the consumption of read accesses. Thus, we assume that the cache hit and the cache update consume the same power, which means that the cache miss penalty ($P_{miss}$) is equal to $P_{hit}$.

Therefore, for analyzing the effect of cache misses, we consider a series $S_{INS}$ of multiple copies of an instruction INS that takes a single CPU clock cycle. Before the execution of this series, the instruction cache is disabled. In that case, the measured power consumption is reduced, as the instruction cache is not used. This case is described by Equation 4, where C is the number of cycles spent in pipeline flushing, from instruction fetching to final execution, and can be extracted from the CPU specifications.

$$P^{INS}_{nocache} = \frac{P_{INS} + C \times P_{static}}{C} \;\Rightarrow\; C \times P^{INS}_{nocache} = P_{INS} + C \times P_{static} \tag{4}$$

Equation 5 gives the difference $D_{INS}$ between Equations 3 (with caches) and 4 (without caches).

$$D_{INS} = P^{INS}_{annotated} - C \times P^{INS}_{nocache} = P_{hit} + (1 - C) \times P_{static} \tag{5}$$

Taking into consideration that $P_{miss} = P_{hit}$, Equation 5 is rewritten as follows:

$$D_{INS} = P_{miss} + (1 - C) \times P_{static} \tag{6}$$

We noticed that, when applying Equation 6 to multiple series $S_{INS_1}, S_{INS_2}, S_{INS_3}, \ldots, S_{INS_n}$ of single-cycle instructions, the differences are similar, i.e. $D_{INS_1} \approx D_{INS_2} \approx D_{INS_3} \approx \ldots \approx D_{INS_n}$. By acquiring an average $D_{average}$ of these differences, Equation 6 is rewritten as follows:

$$D_{average} = P_{hit} + (1 - C) \times P_{static} \;\Rightarrow\; P_{miss} = D_{average} - (1 - C) \times P_{static} \tag{7}$$

Thus, by using Equation 7 with the already known $P_{static}$, $P_{miss}$ is extracted in order to be utilized as a parameter of the VP.
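The extraction of the miss penalty through Equations 5 to 7 can be sketched as follows. The function name is hypothetical, and the numbers are synthetic values generated from the model of Equations 3 and 4 purely to check the algebra; they are not measurements.

```python
def miss_penalty(annotated, nocache, cycles, p_static):
    """Estimate P_miss from per-instruction measurements.

    annotated[k] and nocache[k] are the powers of the same single-cycle
    instruction measured with and without the instruction cache.
    """
    if not annotated or len(annotated) != len(nocache):
        raise ValueError("need matching, non-empty measurement lists")
    # Equation 5: D_INS = P_annotated - C * P_nocache for every instruction.
    diffs = [a - cycles * n for a, n in zip(annotated, nocache)]
    d_average = sum(diffs) / len(diffs)
    # Equation 7: P_miss = D_average - (1 - C) * P_static.
    return d_average - (1 - cycles) * p_static


# Synthetic check: assumed C, P_static, P_hit and two instruction powers.
C, P_STATIC, P_HIT = 3, 0.1, 0.02
p_ins = [0.01, 0.03]
annotated = [P_HIT + p + P_STATIC for p in p_ins]   # Equation 3
nocache = [(p + C * P_STATIC) / C for p in p_ins]   # Equation 4
p_miss = miss_penalty(annotated, nocache, C, P_STATIC)
```

Since the synthetic data follows the model exactly, the recovered `p_miss` equals the assumed `P_HIT` up to floating-point rounding.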

Pipeline activity: Let us consider a series $[INS_1, INS_2, INS_3, \ldots, INS_n]$ of n different instructions, each of which occupies a single clock cycle. Assuming that the series fits into the instruction cache, the ideal power consumption ($P_{ideal}$) of this series would be the average of the instant power consumption of each instruction, as Equation 8 describes:

$$P_{ideal} = \frac{P^{INS_1}_{annotated} + P^{INS_2}_{annotated} + \ldots + P^{INS_n}_{annotated}}{n} \tag{8}$$

However, the measured power $P_{measured}$ of this series is higher than $P_{ideal}$, as Equation 9 explains:

$$P_{measured} = P_{ideal} + A, \quad A > 0 \tag{9}$$

Here, A corresponds to additional CPU activity that is attributed to the instruction pipeline. The reason for this difference between $P_{measured}$ and $P_{ideal}$ is that the sequence of different instructions induces significant switching activity in the instruction pipeline. Also, for different lengths of this series or different included instructions, parameter A takes a similar value. Thus, by using multiple series of random instructions together with the power annotation of each utilized instruction, we can obtain a power parameter which represents the average pipeline activity.
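A minimal sketch of this averaging, following Equations 8 and 9, is given below. The function name, the instruction mnemonics and all power values are illustrative placeholders rather than measured data.

```python
def pipeline_activity(series, annotations):
    """Average the extra power A = P_measured - P_ideal over several series.

    series is a list of (instruction_list, measured_power) pairs.
    """
    extras = []
    for instructions, p_measured in series:
        # Equation 8: ideal power is the mean of the annotated powers.
        p_ideal = sum(annotations[i] for i in instructions) / len(instructions)
        # Equation 9: A is what the measurement adds on top of the ideal power.
        extras.append(p_measured - p_ideal)
    return sum(extras) / len(extras)


# Placeholder annotations and measurements (hypothetical values).
annotations = {"add": 0.12, "xor": 0.11, "or": 0.13}
series = [
    (["add", "xor"], 0.130),
    (["add", "xor", "or"], 0.134),
]
a_average = pipeline_activity(series, annotations)
```

In this example the two series yield extras of about 0.015 W and 0.014 W, so the averaged parameter lies between them.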

V. EXPERIMENTAL RESULTS

In this section, the efficiency of the proposed methodology is evaluated. Firstly, the processor, the deployed equipment and the utilized benchmarks are described. Afterwards, the results of the power estimation technique are compared with the power measurements from the real board. This section concludes with a discussion of the improvement in estimation accuracy when compared to the power estimation in OVP [3].

TABLE III. PARAMETERS FOR REAL HARDWARE MEASUREMENT

| Parameter | Value |
|---|---|
| Core voltage (Vcore) | 1.2 V |
| Resistance (R) | 0.5 ohm |
| Amplification Gain (G) | 50 |

TABLE IV. POWER PARAMETERS FOR VP SIMULATION

| Parameter | Value |
|---|---|
| Idle power | 0.104849 W |
| Miss penalty | 0.02214228 W |
| Pipeline activity | 0.0146 W |

A. Experimental Setup

To study the efficiency of the proposed methodology, we power-annotated a Xilinx Microblaze soft-processor, which has been mapped onto a Xilinx Spartan-3AN Starter Kit. The Microblaze soft-processor has been configured with a 4-KB instruction cache and a 2-KB data cache, with a line size of 4 words (i.e. 16 bytes) in both caches. According to the Microblaze Reference Guide [17], the caches are not set-associative and a write-through data cache is deployed. This configuration is applied to both the FPGA-mapped processor and the respective CPU model which is utilized in the VP.
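For orientation, the cache sizes above translate into the following geometry. The sketch assumes that "not set-associative" means direct-mapped, that sizes are powers of two, and that addresses are split into tag, index and offset in the conventional way; the helper names are hypothetical, and the actual Microblaze tag layout is not taken from the text.

```python
def cache_geometry(size_bytes, line_bytes):
    """Return (number of lines, offset bits, index bits) for a direct-mapped cache."""
    lines = size_bytes // line_bytes
    offset_bits = line_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = lines.bit_length() - 1
    return lines, offset_bits, index_bits


def split_address(address, offset_bits, index_bits):
    """Split a byte address into (tag, line index, byte offset)."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


icache = cache_geometry(4 * 1024, 16)   # 4-KB instruction cache, 16-byte lines
dcache = cache_geometry(2 * 1024, 16)   # 2-KB data cache, 16-byte lines
tag, index, offset = split_address(0x1234, icache[1], icache[2])
```

Under these assumptions the instruction cache holds 256 lines and the data cache 128 lines, which bounds how long an instruction series can be while still fitting in the instruction cache.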

Despite the use of a simple CPU model, the effectiveness of the proposed methodology can be demonstrated with models of other embedded CPUs as well. The reason is that the architectural modules of Microblaze are also mandatory modules of most embedded CPUs. Thus, an analysis on Microblaze is a representative case study for this section.

To deploy the FPGA board for power measurements, we first use Xilinx ISE to create an embedded processor project with the netlist, the top-level HDL source and the bitstream for the Microblaze processor. The generated bitstream is exported to Xilinx SDK, which allows the development of application programs for the hardware platform. From SDK, the bitstream is then used to configure the FPGA. After the bitstream is flashed successfully into the hardware platform, the application program, which can be either multiple copies of the instruction-under-annotation or a full benchmark/application, is simulated and debugged in the SDK.

For the current measurement, we used a HAMEG HMO2525 digital oscilloscope, which is connected to the board via the differential shunt current monitor. Table III shows the parameters for the power measurement in real hardware, including the core voltage Vcore of the FPGA fabric, as well as the resistance and amplification gain of the shunt current monitor. The acquired current profile, which is exported from the oscilloscope as a Comma Separated Values (CSV) file, is then given as input to a MATLAB script which calculates the power consumption of the instruction (or application) using Equations 1 and 2.
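As a rough illustration of this post-processing step, the following Python sketch (used here instead of MATLAB) converts an exported trace into an average power value. It is a sketch under assumptions, not the authors' script: Equations 1 and 2 are not repeated in this section, so the usual shunt-monitor relations I = Vout / (G · R) and P = Vcore · I are assumed, and the CSV layout with a `voltage` column is hypothetical. Only the constants come from Table III.

```python
import csv
import io
from statistics import mean

# Parameters from Table III.
V_CORE = 1.2   # core voltage in volts
R_SHUNT = 0.5  # shunt resistance in ohms
GAIN = 50.0    # amplification gain of the shunt current monitor


def average_power(csv_text, column="voltage"):
    """Return the mean power in watts for an exported oscilloscope trace.

    Assumes each row holds the amplified shunt voltage (in volts) in
    `column`; this column name and layout are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumed relation: I = V_out / (G * R).
    currents = [float(row[column]) / (GAIN * R_SHUNT) for row in reader]
    if not currents:
        raise ValueError("the trace contains no samples")
    # Assumed relation: P = V_core * I, averaged over the trace.
    return V_CORE * mean(currents)


# Placeholder trace for illustration only, not data from the paper.
sample_trace = "time,voltage\n0.0,2.4\n1.0,2.6\n"
power_w = average_power(sample_trace)
```

With the placeholder trace, the mean amplified voltage of 2.5 V corresponds to a current of 0.1 A and therefore to about 0.12 W under the assumed relations.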

The benchmarks utilized in this paper are Bubble-sort, Heap-sort, Merge-sort, RSA, Dhrystone, Fibonacci and Peak-speed. First, all the benchmarks are run on the FPGA-mapped Microblaze soft-processor to acquire the measured power consumption of the real hardware. Afterwards, each benchmark is executed on the reference VP, with the VP profiler configured using the additional power parameters stated in Table IV.

Fig. 6. (a) Measured and simulated power consumption; (b) Difference between the two quantities

TABLE V. ABSOLUTE ESTIMATION ERROR STATISTICS

Minimum    Maximum    Average    Median
0.47 %     6.11 %     2.01 %     1.40 %

B. Power Estimation Results

Figure 6a depicts the power consumption of the utilized benchmarks on Microblaze, including both the measurement in real hardware and the power estimated with the reference VP. The difference between the measured and the estimated power is illustrated in Figure 6b. It shows that the proposed power estimation technique achieves very good accuracy for most applications, despite the abstraction of numerous architectural details. This also highlights the importance of cycle-accurate simulation even at higher abstraction levels. In particular, for most applications, the absolute difference reaches up to 3.7 mW. An exception is Dhrystone, which shows a higher deviation of approximately 8.1 mW. This is because the reference VP does not accurately simulate many branch mispredictions when loops with a changing number of iterations are used. However, the average difference remains as low as -0.001691 W (less than 1.7 mW in absolute value), which is only a very small portion of the estimated power consumption and confirms the accuracy of the proposed annotation technique.

It can be observed that the power estimates appear to have quite limited variability (about 11%), ranging from 0.128 W up to 0.144 W. However, the idle power of Table IV is a significant constant offset in all the power estimates, which masks this variability. In particular, excluding the idle power, the power estimates range from 0.023 W to 0.039 W, a variability of about 41%. This shows the importance of cycle-accurate CPU simulation in power estimation.
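The two variability figures can be checked with a short calculation. The paper does not state how variability is defined, so the sketch below assumes (max - min) / max, which reproduces both reported values; the function name `variability` is hypothetical.

```python
IDLE_POWER = 0.104849  # idle power in watts, from Table IV


def variability(low, high):
    """Relative spread of a range, assumed here to be (high - low) / high."""
    return (high - low) / high


# Estimated power range reported in the text, in watts.
p_min, p_max = 0.128, 0.144

total_variability = variability(p_min, p_max)
dynamic_variability = variability(p_min - IDLE_POWER, p_max - IDLE_POWER)

print(f"including idle power: {total_variability:.1%}")
print(f"excluding idle power: {dynamic_variability:.1%}")
```

The absolute spread is the same 0.016 W in both cases; only the baseline it is compared against changes once the idle offset is removed.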

Figure 7 shows in detail the percentage absolute estimation error of each benchmark, together with the average of these error values. Table V also summarizes some statistics on the estimation errors. The absolute error ranges from 0.47% up to 6.11% with an average of about 2%, while the median of the absolute error values is 1.40%.
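For reference, statistics of the kind listed in Table V can be computed as sketched below. The sketch assumes that the error of each benchmark is taken relative to the measured power, which the text does not state explicitly; the function name `error_stats` and the sample values are hypothetical placeholders and are not the per-benchmark results of the paper.

```python
from statistics import mean, median


def error_stats(measured, estimated):
    """Return min, max, mean and median of the absolute error in percent.

    The error of each benchmark is assumed to be
    |estimated - measured| / measured * 100.
    """
    if len(measured) != len(estimated) or not measured:
        raise ValueError("need two non-empty lists of equal length")
    errors = [abs(e - m) / m * 100.0 for m, e in zip(measured, estimated)]
    return {
        "min": min(errors),
        "max": max(errors),
        "mean": mean(errors),
        "median": median(errors),
    }


# Placeholder power values in watts, for illustration only.
stats = error_stats([0.1, 0.2, 0.4], [0.101, 0.196, 0.412])
```

Reporting the median next to the average is useful here because a single outlier, such as Dhrystone in Figure 7, pulls the average upwards while leaving the median largely unaffected.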

Fig. 7. Percentage absolute estimation error and its average

Fig. 8. Comparison of the power estimation error between the proposed approach and the use of power-annotated OVP models [3]: (a) Absolute power estimation error %; (b) Estimation improvement with the proposed technique

Finally, for comparison and to provide a clear view of the trade-off between accuracy and simulation speed that is provided by the reference VP, the estimation errors are compared with those of an OVP-based power-annotated platform incorporating a Microblaze processor, as proposed in [3] (RSA was excluded because [3] did not include such a measurement). This comparison is presented in Figure 8, which shows that the proposed power estimation technique improves the accuracy by a factor ranging from 1.4× up to 32× for most of the applications.

VI. CONCLUSIONS AND FUTURE WORK

This paper presents a power annotation methodology for cycle-accurate SystemC/TLM-based CPU models, targeting accurate power estimation at higher abstraction levels, where architectural details are not available. The proposed methodology achieves this target by power annotating each instruction of the CPU model with the use of real hardware. As this annotation also includes CPU parts like caches, register file and pipeline, a large portion of the CPU is annotated without requiring the architectural details. Thus, only a small number of additional power parameters is needed, which can be easily estimated with decent accuracy. The presented annotation technique has been applied on a Microblaze CPU model, the estimation accuracy of which was compared with a real FPGA board with the use of a number of benchmarks. The experimental results show that the percentage absolute error ranges from 0.47% up to 6.11%, with an average of 2%, thereby demonstrating the efficiency of the proposed methodology.

As future work, this annotation technique can be extended and quantitatively analyzed for more complex CPU models, as well as for the case of CPU interrupts and exceptions, with the use of real-life applications. Also, the accuracy of the proposed methodology can be further analyzed by examining how many architectural details can be abstracted out while keeping the estimation error within low margins. Furthermore, a very important extension is the annotation of data transfers to/from the main memory, which includes the power budget for both the communication and the memory access. Lastly, the power trace of the profiler can be used for early estimation of the thermal dissipation of the processor, which is important for avoiding hot spots on the final chip.

REFERENCES

[1] Madariaga, A.; Jimenez, J.; Martín, J.L.; Bidarte, U.; Zuloaga, A., “Review of electronic design automation tools for high-level synthesis,” International Conference on Applied Electronics (AE), 2010, pp. 1–6, 8-9 Sept. 2010

[2] SystemC download link (from Accellera Systems Initiative): http://www.accellera.org/downloads/standards/systemc

[3] G. Shalina, T. Bruckschloegl, P. Figuli, C. Tradowsky, G. Almeida, J. Becker, “Bringing Accuracy to Open Virtual Platforms (OVP): A Safari from High-Level Tools to Low-Level Microarchitectures”, In International Conference on Intelligent Instrumentation, Optimization and Signal Processing, 2013

[4] International Technology Roadmap for Semiconductors 2013: http://www.itrs.net/LINKS/2013ITRS/Home2013.htm

[5] Bammi, J.R.; Harcourt, E.; Kruitzer, W.; Lavagno, L.; Lazarescu, M.T., “Software performance estimation strategies in a system-level design tool,” CODES 2000, pp. 82–86, 5 May 2000

[6] Brandolese, C.; Corbetta, S.; Fornaciari, W., “Software energy estimation based on statistical characterization of intermediate compilation code,” ISLPED 2011, pp. 333–338, 1-3 Aug. 2011

[7] Becker, J.; Huebner, M.; Ullmann, M., “Power estimation and power measurement of Xilinx Virtex FPGAs: trade-offs and limitations,” SBCCI 2003, pp. 283–288, 8-11 Sept. 2003

[8] Synopsys PrimeTime timing and power estimation tool for RTL: http://www.synopsys.com/Tools/Implementation/SignOff/Pages/PrimeTime.aspx

[9] Mentor Graphics ModelSim: http://www.mentor.com/products/fv/modelsim

[10] Giammarini, M.; Conti, M.; Orcioni, S., “System-level energy estimation with Powersim,” ICECS 2011, pp. 723–726, 11-14 Dec. 2011

[11] Ye, W.; Vijaykrishnan, N.; Kandemir, M.; Irwin, M.J., “The design and use of SimplePower: a cycle-accurate energy estimation tool,” DAC 2000, pp. 340–345, 2000

[12] Brooks, D.; Tiwari, V.; Martonosi, M., “Wattch: a framework for architectural-level power analysis and optimizations,” 27th International Symposium on Computer Architecture, 2000, pp. 83–94, 14 June 2000

[13] Brandolese, C.; Salice, F.; Fornaciari, W.; Sciuto, D., “Static power modeling of 32-bit microprocessors”, IEEE Trans. on CAD of Integrated Circuits and Systems 21(11), pp. 1306–1316, 2002

[14] Kumar Rethinagiri, S.; Palomar, O.; Ben Atitallah, R.; Niar, S.; Unsal, O.; Kestelman, A.C., “System-level power estimation tool for embedded processor based platforms”, RAPIDO 2014, pp. 1–8, Vienna, Austria, 2014

[15] Kumar Rethinagiri, S.; Palomar, O.; Arias Moreno, J.; Unsal, O.; Cristal, A., “VPPET: Virtual platform power and energy estimation tool for heterogeneous MPSoC based FPGA platforms,” PATMOS 2014, pp. 1–8, Sept. 29 2014-Oct. 1 2014

[16] SoCLib TLM Library website: www.soclib.fr

[17] Xilinx Microblaze Reference Guide: http://www.xilinx.com/support/documentation/sw_manuals/mb_ref_guide.pdf
