
Page 1: [IEEE 2012 41st International Conference on Parallel Processing Workshops (ICPPW) - Pittsburgh, PA, USA (2012.09.10-2012.09.13)] 2012 41st International Conference on Parallel Processing

Evaluation of Core Performance when the Node is Power Capped using Intel® Data Center Manager

Joshua McCartney, Patricia J. Teller, and Sarala Arunagiri
Department of Computer Science
The University of Texas at El Paso, El Paso, TX, USA
Email: [email protected], [email protected], [email protected]

Abstract—Power consumption is a growing concern in the design of computing platforms, particularly large-scale HPC systems and computing platforms that assist battlefield operations. Accordingly, Intel® recently introduced a new development platform, Software Development Platform S2R2 Family with Intel® Node Manager technology, that is capable of real-time dynamic power monitoring and capping. Although targeted at data centers, such a tool can be applied to a single compute node. This development opens up new possibilities for managing payloads in fielded computing platforms, which typically have limited power budgets. Towards this end, this paper presents a preliminary study of the effect of node power capping on the execution time of two applications of interest to the U.S. Army. We executed the applications under a range of power caps and employed performance counters and a program that strides through memory, invoking different levels of the hierarchy, to capture execution time as well as other performance metrics that help explain the increase in execution times under heightened restrictions on power consumption. Confirming results of earlier work, we show that, in general, time-to-solution and energy consumption increase as the power cap decreases. In addition, our data indicate that: (1) for fielded systems there is a range of power caps that may result in acceptable increases in execution time and (2) although power capping is achieved mainly by dynamic voltage and frequency scaling (DVFS), when executing at lower power caps other techniques are employed to reduce power consumption.

Keywords-Power capping; power consumption; energy consumption; memory performance

I. INTRODUCTION

For over a decade, power consumption has been an increasingly prominent concern in the design of computing platforms. This is particularly true for fielded and large-scale high-performance computing (HPC) platforms [1]. In the case of fielded platforms, power consumption is an issue when a generator is providing the power, e.g., for the operation of multiple devices, including computing devices. Accordingly, each device is given a power budget, i.e., a power allocation. In particular, a computing device is given a power budget associated with its payload (data) processing. Examples of such environments are unmanned aerial vehicles (UAVs), Humvees®, and manned aircraft, where power is produced from a heavy fuel generator, or a ground station, where power is produced intermittently by a generator. For these types of systems, power management techniques can be very useful. In contrast, large-scale HPC systems have huge power draws. In this case, non-traditional computer cooling technologies and power management strategies are required to prevent failures and operate efficiently.

Additionally, predictions regarding silicon chip trends indicate that maintaining chip throughput growth will be a challenge, and relaxing the Thermal Design Power (TDP) constraint will be necessary to achieve the ideal 2X throughput growth per chip generation. This translates to increased power consumption, which, after three generations, will require more sophisticated cooling solutions and will complicate system-level design, as processor power tends to consume almost all of the power budget at the system level [1]. Alleviating this problem requires solutions along several dimensions, e.g., power-efficient chip and system architectural innovations, non-traditional cooling technologies, and the use of aggressive power management strategies.

A. Motivation

In response to this need for power management in computing systems, Intel Corporation introduced a new development platform, Software Development Platform S2R2 Family with Intel® Node Manager technology, that is capable of real-time dynamic power monitoring and capping. This ability to set and manage a power cap opens up new possibilities for managing payloads in fielded computing platforms that have power budgets. Although it has been shown that on modern CPUs power capping increases energy consumption due to leakage and increased execution time (since energy = power × execution time) [2], in certain situations some increase in execution time and, thus, energy consumption may be acceptable. For example, in battlefield situations where there are soft real-time deadlines for data processing that influence decision making, a specific range of delay in time-to-solution and, thus, energy consumption is tolerable. However, for each target application it is necessary to understand the impact of the various power caps in order to determine the feasibility, in terms of time-to-solution, of enforcing such limits.

2012 41st International Conference on Parallel Processing Workshops
1530-2016/12 $26.00 © 2012 IEEE
DOI 10.1109/ICPPW.2012.35
246

Accordingly, we used the Intel® Data Center Manager software suite with Intel® Node Manager [3] technology to investigate the impact of power capping on a single node executing an application on one core. We used two applications of interest to the U.S. Army for this purpose: Synthetic Aperture Radar (SAR) image formation [4] and computer stereo matching [5], which are executed on fielded computing platforms. Our goal was to understand if power capping can be used effectively to facilitate battlefield operations, i.e., if certain power caps lead to tolerable delays in application time-to-solution. However, it is important to note that, in contrast, to realize economy of scale, Intel® Data Center Manager with Intel® Node Manager is meant to be used to manage a system comprised of a large number of servers with varying workloads. The return on investment is cost avoidance, in the form of down time and data corruption resulting from power outages.

In addition, in an attempt to quantify the performance impact of power capping and understand the techniques used to reduce node power consumption, we employed performance event counters and a program that strides through memory, invoking different levels of the hierarchy [6], to capture different performance metrics for each run. We use this information to try to explain the resultant application execution time behavior.

The remainder of the paper is organized as follows. Section II briefly explains Intel® Data Center Manager and provides an overview of techniques that are used to reduce power consumption. Section III describes our experimental methodology. Next, the results of our experiments are presented and discussed in Section IV. Finally, Section V presents our conclusions and plans for future work.

II. REDUCING POWER CONSUMPTION: INTEL® DCM

Dynamic power reduction at the CPU level is facilitated by CPUs of recent years having different processor performance states, called P-states, and CPU operating states, called C-states. The Advanced Configuration and Power Interface (ACPI) [7] specification defines these power management states. P-states (the number being dependent on the processor) translate to a range of different frequencies and voltages that consume different amounts of power, with higher P-state numbers representing slower processor speeds and, thus, lower power consumption and throughput. When the P-state is configured properly according to system workload, power savings can be realized but, as a result, throughput is diminished. C-states allow an idle processor (in any C-state other than C0) to turn off unused components to save power. Higher C-state numbers represent deeper CPU sleep states (with slower wake-up times) and indicate that components may be shut down to save power.

A. Intel® Data Center Manager

Exploiting this technology, Intel® Data Center Manager (DCM), which runs on a management server, manages the power consumption of the nodes of a data center. DCM power capping services focus on controlling resource usage to safeguard against over-utilization of constrained capacity. Architecturally, the Platform Controller Hub (PCH) has management engine firmware that, using the industry-standard Intelligent Platform Management Interface (IPMI), controls the platform's power and thermal capabilities via the DCM. In turn, the DCM connects to the platform's Baseboard Management Controllers (BMCs), each of which is capable of monitoring and dynamically regulating the power consumption of its node. Because a BMC is connected to its own Network Interface Controller (NIC), this is accomplished out-of-band, i.e., without going through the operating system. If a power cap is currently being enforced on the platform, a BMC monitors its node's power consumption. When consumption rises above the level of the power cap, the BMC attempts to reduce power consumption by changing the P-state of each of its CPUs. Since a particular CPU has only a fixed number of P-states, if the power cap falls between the power consumption associated with two P-states, the BMC switches between the two states in an attempt to honor the power cap.
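This switching between two bracketing P-states amounts to a duty-cycle computation: given the node power measured at each of the two P-states, the fraction of time to spend in the faster state follows from linear interpolation. The sketch below illustrates the idea only; the function name and power values are our assumptions, not part of DCM's documented interface.

```python
def fast_state_fraction(p_fast: float, p_slow: float, cap: float) -> float:
    """Fraction of time to spend in the faster (higher-power) P-state so that
    average power meets the cap: x * p_fast + (1 - x) * p_slow = cap."""
    if not p_slow <= cap <= p_fast:
        raise ValueError("cap must lie between the two P-state power levels")
    return (cap - p_slow) / (p_fast - p_slow)

# Hypothetical bracketing P-states drawing 157 W and 140 W, cap of 150 W:
x = fast_state_fraction(157.0, 140.0, 150.0)  # ~0.59 of the time in the fast state
```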

B. Reducing Power Consumption

Dynamic voltage and frequency scaling (DVFS) is the primary method used by Intel® Power Management technologies to realize a specified power cap. Dynamic frequency scaling (also known as CPU throttling) automatically adjusts the frequency of a microprocessor either to conserve power or to reduce the amount of heat generated by the chip. It is used (1) to prolong the operation of battery-powered mobile devices, where the energy source is limited; (2) to decrease energy and cooling costs for lightly loaded computers, since decreasing heat output allows system cooling fans to be throttled down or turned off, reducing noise levels and further decreasing power consumption; and (3) to reduce heat in insufficiently cooled systems, e.g., in over-clocked systems, when the temperature reaches a certain threshold.

As described in [8], the dynamic (switching) power dissipated by a CMOS chip is P = C × V² × f, where C is the capacitance being switched per clock cycle, V is the voltage, and f is the switching frequency. Based on this equation, assuming voltage and capacitance remain constant, dynamic power dissipation varies linearly with frequency. But dynamic power does not account for the total power of the chip; there also is static power, which is primarily due to various leakage currents. The amount of static power is related to, among other things, the temperature of the processor and, thus, is indirectly affected by frequency scaling.
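As a quick numerical illustration of this equation (the capacitance and operating points below are invented values, chosen only to show the scaling behavior):

```python
def dynamic_power(c_farads: float, v_volts: float, f_hz: float) -> float:
    """Dynamic (switching) CMOS power: P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hz

# Hypothetical operating point: 1 nF switched per cycle, 1.2 V, 2.7 GHz.
p_full = dynamic_power(1e-9, 1.2, 2.7e9)
# Halving frequency alone halves dynamic power linearly...
p_half_f = dynamic_power(1e-9, 1.2, 1.35e9)
# ...while also lowering voltage to 0.9 V exploits the V^2 term.
p_dvfs = dynamic_power(1e-9, 0.9, 1.35e9)
```

The voltage reduction is why frequency scaling alone saves far less power than combined voltage and frequency scaling.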

Dynamic voltage scaling is another power conservation technique that is often used in conjunction with frequency scaling. Because of the V² component of the equation for the dynamic power dissipated by a CMOS chip, the most power can be saved by a combination of dynamic frequency scaling and voltage scaling. In many constant-voltage cases it is more efficient to run briefly at peak speed and stay in a deep idle state for a longer time (called race to idle) than it is to run at a reduced clock rate for a long time and stay only briefly in a light idle state. However, reducing voltage along with clock rate can change those tradeoffs and, as a result, DVFS-driven race-to-idle may not always produce the best energy efficiency. In addition, since dynamic frequency scaling affects the number of instructions a processor can issue in a given amount of time, it can greatly affect the performance of CPU-bound workloads. Thus, the energy efficiency of DVFS also is related to workload resource usage characteristics.
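The race-to-idle tradeoff can be made concrete with a toy energy model; all power levels, frequencies, and the workload size below are invented for illustration and are not measurements from our platform.

```python
def energy_race_to_idle(work_cycles, f_hz, p_busy, p_idle, deadline_s):
    """Run at full speed, then sleep in a deep idle state until the deadline."""
    busy = work_cycles / f_hz
    assert busy <= deadline_s, "workload must fit within the deadline"
    return p_busy * busy + p_idle * (deadline_s - busy)

def energy_slow_and_steady(work_cycles, f_hz, p_busy, deadline_s):
    """Run at a reduced clock rate that just fills the deadline."""
    busy = work_cycles / f_hz
    assert busy <= deadline_s, "workload must fit within the deadline"
    return p_busy * busy

# Hypothetical node: 150 W busy at 2.7 GHz; halving the frequency at constant
# voltage halves only the dynamic share, giving ~125 W busy; a deep sleep
# state draws 20 W. Workload: 60 s of full-speed work, 120 s soft deadline.
work, deadline = 2.7e9 * 60, 120.0
e_race = energy_race_to_idle(work, 2.7e9, 150.0, 20.0, deadline)
e_slow = energy_slow_and_steady(work, 1.35e9, 125.0, deadline)
# With these numbers racing to deep idle wins; a higher idle draw, or a
# voltage drop at the reduced frequency, would shift the comparison.
```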

Dynamic cache reconfiguration (DCR) is another technique used to reduce CPU power consumption. DCR, which shuts off parts of the cache or changes the cache associativity, has been shown to effectively reduce power consumption in embedded as well as real-time systems [9]. Consequently, there are various algorithms that make effective use of DCR [10] [11] [12] [13] [14]. A few researchers also have suggested processor architectures that support DCR [15] [16] [17], and often frequency scaling is combined with DCR [18] [19], resulting in a higher reduction in power consumption.

In the case of very low power caps that are close to a system's idle power consumption, pure DVFS may not be sufficient to reduce power consumption to the desired level. In this case, DCR and other techniques that shut off specific architectural components might be adopted in order to further reduce consumption to the desired level. Our study attempts to identify if and when such techniques are adopted by Intel® Power Management technologies.

III. METHODOLOGY

Our study was conducted on an experimental platform that consists of two Intel® 2.7GHz eight-core (130W) Sandy Bridge Romley E5-2680 processors that are capable of DVFS, which allows the system to run at different levels of power consumption. Each core has:

• 16 P-states,
• a 32KB L1 data cache,
• a 32KB L1 instruction cache, and
• a 256KB unified L2 cache.

The cores of each processor share a 20MB unified L3 cache, and the node has 64GB of RAM. This platform has a BMC chip set (see Section II) mounted on the motherboard. The BMC has its own dedicated Ethernet controller to allow the server running Intel® DCM to communicate directly with the BMC to gather system diagnostics information and to set policies regarding power capping, which the BMC then enforces on its CPUs. We captured the average power consumption of the platform using a Watts Up! meter.

Table I
BASELINE POWER CONSUMPTION AND EXECUTION TIME FOR SIRE/RSM AND STEREO MATCHING

Code                                    Input                      Average Node Power     Execution Time
                                                                   Consumption (Watts)    (seconds)
SIRE/RSM                                Lam Dataset (large image)  157                    6m 17s
Stereo Matching w/ simulated annealing  Three-layer wedding cake   153                    1m 31s

Using PAPI [20] and the Romley's performance counters, we measured the effect of power capping on application execution time (cycle count × cycle time) and collected different performance data, i.e., the number of L1, L2, and L3 cache misses as well as the number of instruction and data TLB misses. This data was used to understand why in some instances power capping significantly increases execution time and to determine if techniques other than DVFS are used to reduce power consumption. For each application, it is, of course, expected that the number of instructions committed remains constant under different power caps. In contrast, due to speculative execution, different power caps may change the number of instructions, including loads and stores, executed, as well as the number of cache and TLB misses. And changes in the number of cache and/or TLB misses would suggest that cache and/or TLB configurations may have been modified.
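The execution-time computation just described reduces to dividing the measured cycle count by the clock frequency; a minimal sketch, in which the cycle count is a made-up value rather than measured data:

```python
def execution_time_s(cycles: int, freq_mhz: float) -> float:
    """Execution time = cycle count * cycle time = cycles / frequency."""
    return cycles / (freq_mhz * 1e6)

# Hypothetical count: 2.4e11 cycles at the platform's nominal 2,701 MHz.
t = execution_time_s(240_000_000_000, 2701.0)  # ~88.9 s
```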

Intel® Data Center Manager software (described in Section II) was used to implement power capping policies on the software development platform. The two applications used in the experiments are: (1) Synthetic Aperture Radar (SAR) image formation (SIRE/RSM) [4] and (2) stereo matching using the simulated annealing algorithm (Stereo Matching) [5]. These two applications are executed on field-deployable computer systems.

Table I shows the baseline behavior (without power capping) of the two applications in terms of node power consumption and execution time. As shown, the execution times of SIRE/RSM and Stereo Matching are 6m 17s and 1m 31s, respectively, and their power consumption is in the range of 153-157 Watts. Thus, we studied their performance at nine different power caps: 160 (greater than the average baseline node power consumption of both applications), 155, 150, 145, 140, 135, 130, 125, and 120 Watts. Each application, given the same input, was executed five times under each power cap and the results, i.e., the data associated with each performance metric, were averaged. The data presented in Table II and Figures 1 and 2 are these averages. Note that the idle power was between 100 and 103 Watts.
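The computed energy consumption reported in Table II follows directly from average node power and execution time; a sketch of that bookkeeping, using the Stereo Matching baseline numbers from the table (the function name is ours):

```python
def energy_joules(avg_power_w: float, exec_time_s: float) -> float:
    """Energy = average power * execution time."""
    return avg_power_w * exec_time_s

# Stereo Matching baseline (Table II, row A0): 153.1 W over 0:01:29 = 89 s.
e = energy_joules(153.1, 89.0)
# ~13,626 J, matching the table's computed energy to rounding.
```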

Since the experimental data presented in Section IV suggests that cache configurations are being modified to manage power consumption, we attempted to confirm this by using the program in [6], which strides through memory to invoke different levels of the memory hierarchy. The code includes a nested loop that reads and writes memory at different strides and cache sizes. The results of running the code can be used to identify the configuration of the memory hierarchy of the processor on which it is executed as well as the access times of the various levels of the hierarchy.
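A minimal Python sketch of such a stride benchmark appears below. The real program in [6] is written in C and times raw loads; Python timings are far noisier due to interpreter overhead, so this only illustrates the access pattern, not a faithful measurement.

```python
import array
import time

def time_per_access_ns(size_bytes: int, stride_bytes: int, iters: int) -> float:
    """Walk an array of size_bytes at a fixed stride, wrapping around, and
    return the average time per access in nanoseconds."""
    n = size_bytes // 8                  # 8-byte elements
    step = max(stride_bytes // 8, 1)
    buf = array.array('q', range(n))
    idx, acc = 0, 0
    start = time.perf_counter()
    for _ in range(iters):
        acc += buf[idx]                  # read at the current stride position
        idx += step
        if idx >= n:
            idx -= n                     # wrap to the start of the array
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e9

# Sweeping array size and stride and plotting this value exposes cache
# capacities, block sizes, and access times as plateaus and steps.
t_small = time_per_access_ns(16 * 1024, 64, iters=200_000)
t_large = time_per_access_ns(64 * 1024 * 1024, 64, iters=200_000)
```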

IV. RESULTS

The results of our experiments are shown in Table II and Figures 1 through 4. As expected, for each application the number of instructions committed is identical. In contrast, due to speculative execution, the numbers of instructions, in particular the numbers of loads and stores, executed differ. However, these differences across experiments with various power caps are small, i.e., at most 0.36%, and, thus, the related data is not provided.

A. Execution Time and Power/Energy Consumption

As shown in Table II, in general, the average node power consumption is under the power cap; this is not the case at 120 (vs. 124.9) Watts for Stereo Matching and 155 (vs. 155.7), 125 (vs. 125.7), and 120 (vs. 124) Watts for SIRE/RSM. From Table II and Figures 1 and 2 we observe the following:

• As the power cap is lowered, in general, the execution time of both applications increases, as does total energy consumption. As expected and consistent with the findings of Sueur and Heiser [2], total energy consumption is lowest at power caps of 155 and 160 Watts, which are closest to the corresponding average baseline node power consumption. From 160 to 140 Watts this growth is relatively small, i.e., less than or equal to 40%, while at 135 Watts execution time and energy consumption begin to grow much more rapidly, the increase in energy consumption always tracking the increase in execution time. The execution time peaks at 1,104% and 3,467% higher than the baseline for Stereo Matching and 193% and 2,583% higher than the baseline for SIRE/RSM at 125 and 120 Watts, respectively.

• When the system is capped at 160 Watts there are small differences with regard to the execution time as well as the metrics related to power and energy. This is likely due to the fact that the average baseline node power consumption of both applications is below 160 Watts. Surprisingly, the differences w.r.t. memory-related metrics are much larger. This could be overhead associated with power capping.

• At the lower power caps of 130, 125, and 120 Watts the frequency remains constant, which implies that at these power caps DVFS is not being employed.

The increase in execution time for SIRE/RSM is bounded by 25% all the way down to a power cap of 140 Watts (140W). Then the increase becomes more substantial: 58% (135W), 93% (130W), 193% (125W), and 2,583% (120W). In contrast, the increase for Stereo Matching is bounded by 25% down to a power cap of 145 Watts. Then the increase becomes even more substantial than the increase for SIRE/RSM: 40% (140W), 107% (135W), 444% (130W), 1,104% (125W), and 3,467% (120W). Thus, the data demonstrate that SIRE/RSM is more amenable to power capping than is Stereo Matching.
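These percentage increases are straightforward to reproduce from Table II; for example, for SIRE/RSM under the 120 W cap (the helper function is ours):

```python
def pct_increase(baseline_s: float, capped_s: float) -> float:
    """Percentage increase of capped execution time over the baseline."""
    return (capped_s - baseline_s) / baseline_s * 100.0

# SIRE/RSM: baseline 0:06:18 (378 s) vs. 2:48:59 (10,139 s) at 120 W.
pct = pct_increase(378.0, 2 * 3600 + 48 * 60 + 59)
# ~2,582%, agreeing with the table's 2,583% (which was computed from
# unrounded timings) to within rounding.
```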

B. Cache and TLB Behavior

For Stereo Matching, as the power cap is lowered, the numbers of L2 cache, L3 cache, and instruction TLB misses generated increase dramatically and likely contribute to the increase in execution time and energy consumption. The execution time of SIRE/RSM is affected in much the same way by instruction TLB misses. However, in general, there is a decrease in the number of L1 cache misses and essentially no increase in the number of L2 or L3 cache misses.

Considering the execution time behavior of both applications when power is capped at or below 135 Watts, it appears that more than DVFS is being employed in order to manage power consumption. The collected performance event counts suggest that, at the least, the configuration of the memory hierarchy changes at the lower power caps. For example, for Stereo Matching executed with power caps of 125 and 120 Watts, the numbers of L2 and L3 cache misses increase significantly, up to 244% and 371% above the baseline, respectively. This also is true for the number of instruction TLB misses, which increases up to 6,395%, while the number of data TLB misses remains fairly constant (bounded by an increase of 6.85%). For SIRE/RSM the numbers of L1, L2, and L3 cache misses are essentially unchanged and the number of data TLB misses never increases more than 1% over the baseline except when the application is executed with power caps of 125 and 120 Watts, which results in the number being 2.25% and 14.67% higher than the baseline, respectively. This more favorable behavior is likely due to the fact that SIRE/RSM processes, in a stream-like fashion, data stored in an array that is too large to fit in any one of the caches. It iteratively loops through the array elements to remove noise, generating a sequence of compulsory misses, followed by sequences of conflict misses. This accounts for the numbers of L2 and L3 cache misses essentially remaining constant as the power cap is reduced. Even if power consumption is being reduced at lower power caps by decreasing cache set associativity, this would not affect the number of cache misses generated by SIRE/RSM. Note, however, that the experiments discussed next reveal that the average access time for each level of the memory hierarchy increases at the lowest cap (120 Watts), which, in turn, increases execution time and energy consumption. In contrast, for SIRE/RSM executed with these power caps (125 and 120 Watts), there is a marked difference in the number of instruction TLB misses, i.e., an increase from the baseline of 1,085% and 8,481%, respectively.

Figure 1. Raw performance data averaged over five runs for SIRE/RSM (normalized). [Plot: x-axis is Power Cap (Watts), from baseline down to 120; normalized series: TLB Instruction Misses, Frequency, Time, Power Consumption, Energy Consumption.]

Figure 2. Raw performance data averaged over five runs for simulated annealing at different power caps (normalized). [Plot: x-axis is Power Cap (Watts), from baseline down to 120; normalized series: L2 Miss Rate, L3 Miss Rate, TLB Instruction Misses, Frequency, Time, Power Consumption, Energy Consumption.]

To attempt to understand the reason for these large increases in the number of cache misses, we ran the program in [6], which strides through memory to invoke different levels of the memory hierarchy, on the experimental platform with no power cap and with a power cap of 120 Watts. Figures 3 and 4 show the results of these two experiments, respectively. The following information is inferred from Figure 3:

1) the L1 data cache size is between 32KB and 64KB (actual size: 32KB);
2) the L2 unified cache size is between 256KB and 512KB (actual size: 256KB);
3) the L3 cache size is between 16MB and 32MB (actual size: 20MB);
4) the L1 data cache access time and miss penalty are 1.5ns and 2.0ns, respectively;
5) the L2 and L3 miss penalties are 5.1ns and 37.1ns, respectively;
6) the main memory access time is 60ns;
7) the block sizes of the L1 data, L2, and L3 caches are identical, i.e., 64B; and
8) the L1 data, L2, and L3 caches are 8-, 8-, and 20-way set associative, respectively.
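The inferred latencies above can be folded into an average memory access time (AMAT) estimate. Only the latencies come from the list; the miss rates below are hypothetical, and treating each reported miss penalty as the full incremental cost of missing at that level is a simplifying modeling assumption of ours.

```python
def amat_ns(m1: float, m2: float, m3: float) -> float:
    """Average memory access time using the latencies inferred above:
    1.5 ns L1 hit time; 2.0, 5.1, and 37.1 ns incremental penalties for
    L1, L2, and L3 misses. m1, m2, m3 are hypothetical local miss rates."""
    return 1.5 + m1 * (2.0 + m2 * (5.1 + m3 * 37.1))

# E.g., 5% of accesses miss L1, 20% of those miss L2, 50% of those miss L3:
t = amat_ns(0.05, 0.20, 0.50)  # ~1.84 ns per access
```

Under this model, even modest increases in the per-level penalties at a low power cap inflate every memory access and, thus, overall execution time.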

A comparison of Figure 4, which depicts the behavior of the code executed with a 120 Watt power cap, with Figure 3, which illustrates its behavior with no power cap, reveals that the average access time associated with each level of the memory hierarchy increases in the 120 Watt power-capped execution environment. However, due to the dynamic nature of how the power cap is enforced, the average access time behaviors are not consistent with what we would expect. Take the 32K array as an example: although the 8K stride produces the lowest average access time, the 16K and 32K strides produce higher average access times, when they should not. (Note that in Figure 3 the 8K, 16K, and 32K strides all produce the lowest average access time.) And this phenomenon is not limited to the 32K array; it occurs for all but the 64K, 128K, 256K, 512K, 16M, and 64M arrays.


Table II
PERFORMANCE DATA FOR EXPERIMENTS (AVERAGED OVER FIVE RUNS) W/O POWER CAP (BASELINE) AND W/ NINE DIFFERENT POWER CAPS, INCLUDING PERCENTAGE DIFFERENCE (ROUNDED TO THE CLOSEST INTEGER) BETWEEN EACH DATUM AND THE BASELINE DATUM

Code       Expt.  Power Cap   Average Node Power   Computed Energy         Average Frequency   Execution Time
           Label  (Watts)     Watts     % Diff     Joules         % Diff   Value     % Diff    h:m:s      % Diff
Stereo     A0     baseline    153.1     0          13,626.2       0        2,701     0         0:01:29    0
Matching   A1     160         153.3     0          13,435.2       -1       2,701     0         0:01:32    3
           A2     155         152.7     0          13,132.7       -4       2,701     0         0:01:29    0
           A3     150         139.9     -9         14,587.3       7        2,699     0         0:01:37    9
           A4     145         142.4     -7         15,239.0       12       2,697     0         0:01:48    21
           A5     140         136.6     -11        17,072.9       25       2,168     -20       0:02:04    40
           A6     135         131.3     -14        24,155.5       77       1,274     -53       0:03:04    107
           A7     130         126.8     -17        58,725.4       331      1,207     -55       0:08:03    444
           A8     125         123.0     -20        131,619.6      866      1,200     -55       0:17:50    1,104
           A9     120         124.9     -18        395,921.2      2,805    1,200     -55       0:52:48    3,467
SIRE/RSM   B0     baseline    156.7     0          59,249.3       0        2,701     0         0:06:18    0
           B1     160         155.5     -1         59,246.0       0        2,701     0         0:06:18    0
           B2     155         155.7     -1         58,997.1       0        2,701     0         0:06:24    2
           B3     150         148.8     -5         60,252.0       2        2,065     -24       0:06:45    7
           B4     145         142.7     -9         61,636.1       4        1,752     -35       0:07:12    14
           B5     140         139.0     -11        63,363.7       7        2,422     -10       0:07:35    21
           B6     135         132.9     -15        79,587.8       34       1,285     -52       0:09:58    58
           B7     130         128.3     -18        93,770.6       58       1,200     -56       0:12:11    93
           B8     125         125.7     -20        101,950.4      72       1,200     -56       0:18:27    193
           B9     120         124.0     -21        1,257,686.5    2,023    1,200     -56       2:48:59    2,583
(baseline = no power cap)

Expt.  L1 Misses                L2 Misses                 L3 Misses                 TLB Data Misses        TLB Instruction Misses
Label  Value            % Diff  Value             % Diff  Value             % Diff  Value          % Diff  Value       % Diff
A0     1,664,150,370    0       69,043,027        0       14,671,610        0       134,056,322    0       61,607      0
A1     1,667,870,256    0       67,249,375        -3      14,872,432        1       135,932,655    1       49,346      -20
A2     1,667,758,785    0       65,026,350        -6      13,740,525        -6      140,350,684    5       105,395     71
A3     1,667,461,417    0       66,184,173        -4      13,434,659        -8      140,630,163    5       360,720     486
A4     1,663,655,662    0       67,455,109        -2      14,024,469        -4      135,774,772    1       224,101     264
A5     1,664,549,387    0       71,503,278        4       17,367,776        18      143,234,324    7       217,765     253
A6     1,670,697,877    0       72,539,830        5       17,704,234        21      127,382,089    -5      303,684     393
A7     1,671,295,201    0       75,748,920        10      17,441,906        19      126,996,878    -5      335,057     444
A8     1,695,641,334    2       209,086,950       203     69,134,669        371     141,528,545    6       1,336,131   2,069
A9     1,705,299,224    2       237,245,886       244     65,978,341        350     142,760,702    6       4,001,531   6,395
B0     2,766,551,199    0       601,071,329,017   0       157,727,205,389   0       244,552,195    0       27,115      0
B1     2,767,961,796    0       601,076,173,975   0       157,727,205,387   0       245,518,133    0       34,491      27
B2     2,746,071,954    -1      601,068,229,603   0       157,727,205,382   0       245,292,453    0       154,275     469
B3     2,749,596,480    -1      601,078,326,391   0       157,727,205,392   0       245,690,611    0       128,642     374
B4     2,725,617,789    -1      601,078,734,429   0       157,727,205,381   0       246,929,317    1       69,617      157
B5     2,707,838,786    -2      601,079,254,577   0       157,727,205,385   0       244,813,158    0       194,965     619
B6     2,683,490,540    -3      601,083,525,796   0       157,727,205,383   0       245,019,983    0       122,605     352
B7     2,677,915,605    -3      601,091,823,236   0       157,727,205,384   0       245,512,154    0       124,709     360
B8     2,672,890,101    -3      601,089,658,390   0       157,727,205,401   0       250,057,440    2       321,388     1,085
B9     2,680,458,268    -3      601,749,296,673   0       157,727,205,442   0       280,430,595    15      2,326,863   8,481

It is likely that memory gating is partially responsible for these increases. However, a reduction in the set associativity of the L1, L2, and/or L3 caches could be a reason for these increases as well. Unfortunately, due to the unexpected behavior discussed above, we cannot use these two experiments to definitively determine either the set associativity of the caches in the compute environment with the 120 Watt power cap or whether cache associativity is reduced to decrease power consumption. More experimentation is needed to determine what is actually happening. However, our experimental data clearly suggests that when the applications are executed with lower power caps, techniques that involve the configuration of the memory hierarchy are being employed to reduce power consumption.

C. Discussion

Figure 3. CACTI micro-benchmark run with no power cap enforced. (Average access time, 10^-1 to 10^3 ns on a log scale, versus stride from 8B to 32M, with one curve per array size from 4K to 64M.)

Figure 4. CACTI micro-benchmark run with a 120 Watt power cap enforced. (Average access time, 10^-1 to 10^6 ns on a log scale, versus stride from 8B to 32M, with one curve per array size from 4K to 64M.)

For battery-powered devices our results indicate that energy reserves will be drained more quickly as the power cap is lowered below 155 Watts, shortening battery life and, thus, device operation. Nonetheless, in environments where power is provided by a generator, certain levels of power capping, even with increases in execution time, may be acceptable.

In terms of systems where power is provided by a battery, power capping has no value when the workload power consumption is constant, i.e., not changing dynamically, predictable, and lower than the capacity of the power supply. Power capping is best used when the workload is unpredictable in terms of its power consumption and in situations where, during phases of the workload that require no or negligible computing resources, there is a substantial difference in power consumption between resting an uncapped system and letting a capped system continue to run. One might come to a different conclusion in such situations; this needs further investigation.

V. CONCLUSIONS AND FUTURE WORK

Although our results are not conclusive in that we only experimented with two applications, the results indicate that (1) the feasibility of using power capping in fielded computing platforms depends on application characteristics and (2) case studies are essential to identify target applications amenable to power capped execution. As expected, as the power cap is lowered, application execution time increases, resulting in no energy savings. However, for fielded systems, where data processing has a power budget, certain ranges of delay in terms of time-to-solution may be acceptable and, thus, power capping can be useful, e.g., to facilitate battlefield decision making.

With respect to techniques used to lower power consumption, our results suggest that: (1) DVFS is not the mechanism used to manage power consumption at the lower power caps; (2) when executing with lower power caps, techniques that involve the configuration of the memory hierarchy are being employed to reduce power consumption; and (3) the methods used to manage power consumption at lower power caps provide small decreases in power consumption at the cost of large increases in execution time.

We would like to extend this study to (1) explore how multi-core applications are affected by power capping; (2) determine, using microbenchmarks, what techniques other than DVFS are being used to manage power consumption; and (3) experiment using unpredictable workloads. More importantly, we would like to develop a methodology for characterizing applications with regard to their amenability to power capped execution.

ACKNOWLEDGMENT

This work was made possible by the support of the Army Research Laboratory via the Army High Performance Computing Research Center (Grant No. W11NF-07-2-0027) and by UTEP being part of an early access program for the new Intel® Xeon® processor E5 family architecture utilizing Intel® Power Management technologies. We thank the U.S. Army and Intel Corporation, as well as our colleagues, Joshua McKee and Ricardo Portillo, for their assistance in this research.

REFERENCES

[1] W. Huang, K. Rajamani, M. R. Stan, and K. Skadron,“Scaling with design constraints: predicting the future of bigchips,” IEEE Micro, vol. 31, no. 4, 2011, pp. 16–29.

[2] E. L. Sueur and G. Heiser, “Dynamic voltage and frequencyscaling: the laws of diminishing returns,” in Proc. 2010 Work-shop on Power Aware Computing and Systems (HotPower’10), 2010.

[3] “Intel R© Cloud Builders Guide to Cloud Design and Deploy-ment on Intel R© Platforms,” white paper, Intel, 2011.

[4] L. Nguyen, Signal and Image Processing Algorithms for theU.S. Army Research Laboratory Ultra-wideband (UWB) Syn-chronous Impulse Reconstruction (SIRE) Radar, tech. report,ARL-TR-4784, U.S. Army Research Laboratory, Adelphi,MD, 2009.

[5] D. Shires, Exploiting Parallelism in a Monte Carlo Image-Matching Algorithm, tech. report, ARL-TR-667, US ArmyResearch Laboratory, Aberdeen Proving Ground, MD, 1995.

[6] D. A. Patterson and J. L. Hennessy, Computer Architecture:A Quantitative Approach, 5th ed. Morgan Kaufmann, 2012.

[7] The ACPI Specification, v. 5.0, joint specification by HewlettPackard, Intel, Microsoft, Phoenix, and Toshiba, Dec. 2011;http://www.acpi.info/spec.htm.

[8] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, DigitalIntegrated Circuits, 2nd ed. Prentice Hall, 2003.

[9] W. Wang, P. Misra, and A. Gordon-Ross, “Dynamiccache reconfiguration for soft real-time systems,” ACMTransactions on Embedded Computing Systems, vol. 0, no. 1,2011.

[10] W. Wang, S. Ranka, and P. Mishra, “A general algorithmfor energy-aware dynamic reconfiguration in multitaskingsystems,” in Proc. 24th Int’l. Conference on VLSI Design(VLSID ’11), 2011, pp. 334–339.

[11] H. Hajimiri and P. Mishra, “Intra-task dynamic cache recon-figuration,” in Proc. 25th Int’l. Conference on VLSI Design(VLSID ’12), 2012, pp. 430–435.

[12] S.-L. Lu, A. Alameldeen, K. Bowman, Z. Chishti, C. Wilker-son, and W. Wu, “Architectural-level error-tolerant techniquesfor low supply voltage cache operation,” in Proc. 2011 IEEEInt’l. Conference on IC Design Technology (ICICDT), 2011,pp. 1–5.

[13] W. Wang and P. Mishra, “Dynamic reconfiguration of two-level cache hierarchy in real-time embedded systems,” J. LowPower Electronics, vol. 7, no. 1, 2011, pp. 17–28.

[14] W. Wang, P. Mishra, and S. Ranka, “Dynamic cache recon-figuration and partitioning for energy optimization in real-time multi-core systems.” in Proc. 2011 Design AutomationConference (DAC 2011), 2011, pp. 948–953.

[15] M. Hubner, C. Tradowsky, D. Gohringer, L. Braun, F. Thoma,J. Henkel, and J. Becker, “Dynamic processor reconfigu-ration,” in Proc. 2011 Int’l. Conference on ReconfigurableComputing and FPGAs (ReConFig), 2011, pp. 123–128.

[16] Y.-J. Chen, C.-L. Yang, J.-W. Chi, and J.-J. Chen, “Taclc:timing-aware cache leakage control for hard real-time sys-tems,” IEEE Transactions on Computers, vol. 60, no. 6, 2011,pp. 767–782.

[17] K. T. Sundararajan, T. M. Jones, and N. Topham, “A recon-figurable cache architecture for energy efficiency,” in Proc.8th ACM Int’l. Conference on Computing Frontiers (CF ’11),2011, pp. 9:1–9:2.

[18] W. Wang and P. Mishra, “Leakage-aware energy minimizationusing dynamic voltage scaling and cache reconfiguration inreal-time systems,” in Proc. 23rd Int’l. Conference on VLSIDesign (VLSID ’10), 2010, pp. 357–362.

[19] W. Wang and P. Mishra,“System-wide leakage-aware energyminimization using dynamic voltage scaling and cache re-configuration in multitasking systems,” IEEE Transactions onVery Large Scale Integration (VLSI) Systems, vol. 20, no. 5,2012, pp. 902–910.

[20] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci,“A portable programming interface for performance eval-uation on modern processors,” The Int’l. Journal of HighPerformance Computing Applications, vol. 14, 2000, pp. 189–204.

253