
PuPPET: A New Power-Performance Modeling Tool

for High Performance Computing

Li Yu,* Zhou Zhou,* Sean Wallace,* Zhiling Lan,* and Michael E. Papka†

*Illinois Institute of Technology, Chicago, IL, USA

†Argonne National Laboratory, Argonne, IL, USA

{lyu17, zzhou1, swallac6, lan}@iit.edu and [email protected]

Abstract—As high performance computing (HPC) continues to grow in scale and complexity, energy becomes a critical constraint in the race to exascale computing. The days of “performance at all cost” are coming to an end. While performance is still a major objective, future HPC will have to deliver desired performance under energy constraints. Among various power management methods, power capping is a widely used approach. Unfortunately, the impact of power capping on system performance, user jobs, and power-performance efficiency is not well studied due to the many interfering factors imposed by system workload and configurations. To fully understand power management in extreme scale systems with a fixed power budget, we introduce a power-performance modeling tool named PuPPET (Power Performance PETri net). Unlike traditional performance modeling approaches such as analytical methods or trace-based simulators, we explore a new approach – colored Petri nets – for the design of PuPPET. PuPPET is fast and extensible for navigating through different configurations. More importantly, it can scale to hundreds of thousands of processor cores while providing high levels of modeling accuracy. We validate PuPPET using system logs (i.e., job traces and power data) collected from the production 48-rack IBM Blue Gene/Q supercomputer at Argonne National Laboratory. Our trace-based validation demonstrates that PuPPET is capable of modeling the dynamic execution of parallel jobs on the machine and provides an accurate approximation of its energy consumption. In addition, we present two case studies that use PuPPET to study power-performance tradeoffs under power-aware job allocation and dynamic voltage and frequency scaling.

I. INTRODUCTION

Production petascale systems are being designed and deployed to meet the increasing demand for computational cycles made by fields within science and engineering. With their growing performance, energy consumption becomes an important concern. It is estimated that the energy cost of a supercomputer during its lifetime can surpass the cost of the equipment itself [7]. This introduces the need for energy-efficient computing. A number of power management technologies have been presented [3, 31], and power capping (i.e., limiting the maximum power a system can consume at any given time) is a well-known approach. For instance, power-aware job allocation and dynamic voltage and frequency scaling (DVFS) are two common power capping mechanisms. To control the peak power within a predefined threshold, the former controls the overall system power by dynamically allocating available resources to the queued jobs according to their expected power requirements [41], while the latter limits the overall system power by adaptively adjusting processor voltage and frequency [21].

While power-aware job allocation and DVFS control the maximum system power through different mechanisms, they both inevitably degrade performance. Understanding the performance impact of different capping methods is critical for the development of effective power management methods. Important questions arising in this context include: for a given system and workload, which is the appropriate method? And what is the potential impact of power management on system performance? These questions must be answered, and their importance grows with the increasing scale and complexity of high performance computing. Nevertheless, these are intrinsically difficult questions for many reasons. The optimal power management scheme depends on many factors, such as job arrival rate, workload characteristics, hardware configuration, power budget, and scheduling policies. For DVFS, other relevant factors include the processor power-to-frequency relationship and the workload time-to-frequency relationship. For power-aware job allocation, perhaps the most important factor is the ratio of power consumption in the idle processor state to that at full processor speed. Moreover, these factors are often interrelated, making the analysis even harder. It is simply impossible to examine all these factors via experiments on production systems. To fully understand the effects of different power capping mechanisms on system performance, we have developed a new analysis tool, described in this paper, that predicts system performance as a function of these factors.

Models are ideal tools for navigating a complex design space, allowing rapid evaluation and exploration of detailed what-if questions. This motivates us to look for a modeling method that satisfies three basic requirements: (1) scalability - it should be capable of modeling large scale systems with low overhead; (2) high-fidelity - it should be accurate enough to quantify the power-performance tradeoffs introduced by different external factors; (3) extensibility - it should be flexible enough for easy expansion, such as adding different system configurations and/or new functionalities.

Existing modeling methods can be broadly classified into three categories: analytical modeling, simulation and/or emulation, and queuing modeling [4]. Unfortunately, none of them meets the above three requirements. Analytical modeling methods are fast, but they only provide rough predictive results and thus do not satisfy the second requirement (i.e., high-fidelity). Further, analytical modeling cannot capture dynamic changes (e.g., dynamic job submission and execution, frequency tuning). Trace-based simulators can provide highly accurate representations of system behaviors such as dynamic job scheduling [27, 40]. However, no existing simulator has been extended to study power-performance tradeoffs. Such a functional extension is not trivial and would require a significant amount of engineering effort. In other words, this approach does not satisfy the third requirement (i.e., extensibility). Conventional queuing methods (e.g., Markov modeling) have been investigated for dealing with complicated programs; however, they suffer from the state explosion problem and thus cannot satisfy the first requirement (i.e., scalability).

In this work, we explore a new modeling approach by presenting a colored Petri net named PuPPET (Power Performance PETri net) for quantitative study and predictive analysis of different power management mechanisms on extreme scale systems. The colored Petri net (CPN) is a recent advance in the field of Petri nets [20]. It combines the capabilities of Petri nets with those of a high-level programming language, where Petri nets provide the primitives for process interaction and the programming language provides the primitives for the definition of data types and the manipulation of data values. A significant benefit of CPNs compared to traditional Petri nets is their scalability, owing to the introduction of hierarchical design and colors.

PuPPET is designed to satisfy the aforementioned requirements. First, the hierarchical support offered by PuPPET makes it possible to model large systems in a modular way. Unlike conventional queuing methods, PuPPET's model size does not grow dramatically as the number of system components increases. Such a feature is critical for the modeling of the large-scale systems targeted in this work. As we will show later in our experiments, PuPPET can effectively model production systems with hundreds of thousands of processor cores and at the same time provide high levels of modeling accuracy. Second, by taking advantage of timed transitions and colored tokens, PuPPET is capable of providing a very precise abstraction of system behaviors, thus enabling an accurate simulation of real systems. By combining the capabilities of Petri nets with those of a high-level programming language, our model can take into consideration various factors that may affect power management. Job-related factors such as job arrival time, job size and job runtime are described naturally by token colors in the net, and the graphical representation allows the model to capture interactions among these factors in an intuitive way. Finally, the graphical form of the model enables us to add, modify or remove various interactions easily, providing high extensibility. In addition, the availability of various well-developed Petri net simulation tools enables us to build models at a high level, making model construction and extension much easier and faster [20].

We validate the accuracy of PuPPET using system traces (i.e., job traces and power data) collected from the production Mira machine, an IBM Blue Gene/Q system at Argonne National Laboratory. Our experiments show that PuPPET can effectively model batch scheduling and system power consumption with high accuracy, e.g., an error of less than 4%. Given that Mira is a 48-rack petascale machine with 786,432 cores, our results also indicate that PuPPET is highly scalable.

We also present two case studies that explore the use of PuPPET for power-performance analysis. We investigate the performance impact of different power capping strategies under a variety of system configurations and workloads. PuPPET enables us to study the effects of a variety of factors, such as workload characteristics, the processor power-to-frequency relationship, the workload time-to-frequency relationship, and the idle-to-full power consumption relationship, in a unified analysis environment. These case studies provide useful insights about using power capping on extreme scale systems. For example:

• Given a fixed power cap, both power-aware allocation and DVFS can effectively restrict the overall system power within the limit. While both mechanisms trade performance for power saving, the degrees of performance degradation are not the same. While DVFS seems to cause less impact on system performance, it could significantly impact hardware lifetime reliability, with up to a 3X higher failure rate [29]. Hence, except in the case of a tight power cap and high system utilization, power-aware allocation is the preferred power capping solution due to its comparable performance with DVFS and lack of impact on system reliability.

• Workload characteristics influence the power-performance efficiency of power capping. In our experiments, greater performance impact is observed in the months where system utilization is relatively high.

• While the batch scheduling policy does not cause substantial power-performance differences when applying power capping, we observe that the scheduling policy adopted by Mira is more robust to the system performance degradation caused by power management than the well-known first-come, first-served (FCFS) policy, especially with respect to job wait time.

While this paper focuses on power-performance analysis of power capping methods, PuPPET is extensible and can be easily augmented to analyze other power management mechanisms, the scaling of other hardware such as memory, or new constraints such as reliability. The extension can be achieved by adding new modules. These modules can be integrated or individually disabled for studying different scenarios, thereby providing a very powerful tool for modeling and analyzing extreme scale systems.

The rest of the paper is organized as follows. Section II presents background information. Section III describes our model design. Section IV presents model validation by means of real traces. Section V presents two case studies of using the model. Section VI discusses related work. Finally, we conclude the work in Section VII.

II. BACKGROUND

Before presenting our model, we provide some background information in this section.


A. Power Consumption

For a compute node in a system, power is mainly consumed by its CMOS circuits, which is captured by

P = V^2 · f · C_E,  (1)

where V is the supply voltage, f is the clock frequency, and C_E is the effective switched capacitance of the circuits. Depending on the environment, the power consumption can be further approximated by

P ∝ f^α,  (2)

where the frequency-to-power relationship α typically falls in the range from one to three [22]. This implies that lowering the CPU speed may significantly reduce the energy consumption of the system. However, lowering the CPU speed also decreases the maximum achievable clock speed, which leads to a longer time to complete an operation. Typically, the time to finish an operation is inversely proportional to the clock frequency and can be represented as

t ∝ 1 / f^β,  (3)

where the frequency-to-time relationship β is often approximated by 1.0.

B. Colored Petri Net

A Petri net is a mathematical modeling language for the formal description of the dynamic behaviors of a system. It is particularly well suited to systems characterized as concurrent, asynchronous, distributed, parallel, nondeterministic, or stochastic. A Petri net consists of a set of places and transitions linked by edges, with tokens residing in some places. Places represent system states, transitions indicate system events, and edges specify the relationships between system states and events. To indicate a change in system state, tokens that reside in one place move to another place according to the firing rules imposed by the transition [25, 8].


Fig. 1. Basic elements used in colored Petri nets.

The colored Petri net (CPN) is a recent advance in the field of Petri nets that adds colors to tokens [20]. It has several extensions over the traditional Petri net. First, CPN allows a hierarchical design, in which a module at a lower level can be represented by a transition at a higher level. Such an extension makes it possible to model large systems in a scalable and modular way. Second, tokens or transitions in a CPN can be associated with time. Timed transitions fire depending on a deterministic delay or a stochastically distributed random variable. This extension provides accurate control of timing in the model. Third, a variety of edges are added to improve the functionality over the traditional Petri net. For example, the inhibitor edge is used to block the transitions it connects to. Figure 1 summarizes the basic elements used in CPNs. Additional details about CPNs can be found in [20].

Although CPNs have been widely used for modeling large scale systems like biological networks [23], to our knowledge this is the first attempt to apply CPNs to power-performance modeling in HPC.
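To make the place/transition/token semantics concrete, here is a minimal sketch of untimed, uncolored Petri-net firing (the data structures and names are our own illustration, not part of PuPPET or CPN Tools):

```python
# A marking maps place names to token counts. A transition is enabled
# when every input place holds enough tokens; firing it consumes tokens
# from the input places and produces tokens in the output places.

def enabled(marking, transition):
    return all(marking[p] >= n for p, n in transition["in"].items())

def fire(marking, transition):
    assert enabled(marking, transition)
    m = dict(marking)
    for p, n in transition["in"].items():
        m[p] -= n
    for p, n in transition["out"].items():
        m[p] = m.get(p, 0) + n
    return m

# A job token moving from 'queuing' to 'running' when a node is free.
allocate = {"in": {"queuing": 1, "nodes": 1}, "out": {"running": 1}}
m0 = {"queuing": 2, "nodes": 1, "running": 0}
m1 = fire(m0, allocate)
print(m1)  # {'queuing': 1, 'nodes': 0, 'running': 1}
```

A CPN enriches this picture by attaching typed data ("colors") and time stamps to each token, which is what lets PuPPET carry job attributes through the net.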

C. Batch Scheduling

Batch scheduling is typically used for resource management and job scheduling in HPC. Figure 2 illustrates typical batch scheduling on supercomputers. The resource manager is responsible for obtaining information about resource availability, the waiting queue, and the running status of compute nodes. It runs several daemon processes on the master nodes and compute nodes to collect such information. The job scheduler communicates with the resource manager to make scheduling decisions based on the current system status. The job waiting queue receives and stores jobs submitted by users. Users can query the resource manager to get the status of their jobs.

Fig. 2. Batch scheduling in HPC.

The job scheduler periodically examines the jobs in the waiting queue and the available resources via the resource manager, and determines the order in which jobs will be executed. The order is decided based on a set of attributes associated with jobs, such as job arrival time, job size (i.e., the number of nodes needed), the estimated runtime, etc. First-come, first-served (FCFS) with EASY backfilling is a commonly used scheduling policy in HPC [12]. Under FCFS-EASY, jobs are served in first-come, first-served order, and subsequent jobs may continuously jump over the first queued job as long as they do not violate the reservation of the first queued job.
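The EASY rule can be made concrete with a small sketch (our own simplified formulation, not a production scheduler): a queued job may jump ahead only if it fits in the currently free nodes and either finishes before the head job's reservation or uses only nodes the head job will not need.

```python
def head_reservation(head_size, free_nodes, running):
    """Earliest start for the head job: walk running jobs in order of
    completion, releasing their nodes until the head job fits.
    running is a list of (size, end_time) pairs.
    Returns (start_time, extra_nodes_free_at_that_moment)."""
    avail, start = free_nodes, 0.0
    for size, end in sorted(running, key=lambda r: r[1]):
        if avail >= head_size:
            break
        avail += size
        start = end
    return start, avail - head_size

def can_backfill(job, now, free_nodes, reservation):
    """EASY rule: a queued (size, runtime) job may jump ahead of the head
    job only if it fits now and does not delay the head job's reservation."""
    size, runtime = job
    start, extra = reservation
    if size > free_nodes:
        return False
    return now + runtime <= start or size <= extra

# Head job needs 8 nodes; 4 are free and a 4-node job ends at t=100.
res = head_reservation(8, 4, [(4, 100.0)])
print(can_backfill((2, 50.0), 0.0, 4, res))   # True: done before t=100
print(can_backfill((2, 200.0), 0.0, 4, res))  # False: would delay the head job
```

This is the predicate that PuPPET's Backfill function (Section III-A) must evaluate against the runtime resource information, whatever its internal encoding.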

In this work we study another job scheduling policy named WFP, which has been used on different generations of Blue Gene systems including the current Blue Gene/Q at Argonne [33, 32]. Unlike FCFS, which determines job ordering based on arrival times, WFP uses the utility function defined below to determine job ordering. It favors large and old jobs, adjusting their priorities based on the ratio of their wait time to their requested wall clock time.

job size × (job queued time / job runtime)^3.  (4)
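In code, the utility score is simply the following (our own sketch, reading "job runtime" as the user's requested wall-clock time, per the text):

```python
def wfp_score(job_size, queued_time, requested_runtime):
    """WFP utility (Eq. 4): favors large jobs and jobs whose wait time
    is long relative to their requested wall-clock time."""
    return job_size * (queued_time / requested_runtime) ** 3

# A large job that has waited half its requested runtime outranks a
# small job that has waited as long as its (shorter) requested runtime.
print(wfp_score(4096, 3600, 7200))  # 512.0
print(wfp_score(128, 3600, 3600))   # 128.0
```

The cubic exponent makes priority grow quickly once a job's wait time approaches its requested runtime, which is how the policy protects short jobs from starvation behind long ones.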

III. PUPPET DESIGN

PuPPET consists of three interacting modules, namely batch scheduling, power-aware allocation and CPU scaling, to model user jobs from their submission to their completion. Figure 3 shows the top level design, in which these modules are connected through five states (i.e., places). Batch scheduling orders the jobs in the queuing state according to a scheduling policy, e.g., job arrival time (FCFS) or utility score (WFP), and allows small jobs to skip ahead provided they do not delay the job at the head of the queue (i.e., backfilling). Power-aware allocation provides coarse-grained power capping by allocating queued jobs onto compute nodes based on the overall system power status and job power requirements. In this work, we develop a net to describe two power-aware allocation strategies for power capping.1 CPU scaling dynamically adjusts the processors' clock frequency for fine-grained power capping. It interacts with jobs in the running state. Dynamic system states are modeled by job submission, job queuing, job allocation, node allocation, and power state changes.

The five states hold different system information. Queuing keeps a list of queued jobs, whose order can be changed dynamically by the batch scheduling module. Running holds jobs in the running state, where job execution is intercepted by the CPU scaling module. Power indicates the current power level of the system, based on which the Power-aware allocation module or the CPU scaling module is able to conduct power capping. Runtime Info provides the batch scheduling module with system resource information. Nodes represents the number of available compute nodes. We give the details of these modules and states in the following subsections, using the inscriptions summarized in Table I.


Fig. 3. Top level design of PuPPET.

A. Batch Scheduling

This module is used to model batch scheduling on HPC systems. Figure 4 presents the net design for FCFS and WFP, along with EASY backfilling. As mentioned in Section II, FCFS-EASY is a widely used batch scheduling policy; it is estimated that more than 90% of batch schedulers use it as the default scheduling configuration [24]. WFP-EASY is the production batch scheduling policy used on a number of production supercomputers at Argonne, including the current 48-rack Blue Gene/Q machine.

1In this paper we distinguish job scheduling from job allocation: job scheduling focuses on ordering user jobs, whereas job allocation is dedicated to assigning the queued jobs onto available resources.


Fig. 4. Batch scheduling module.

As shown in Figure 4, jobs leave the place User in the net according to their arrival times. Every job is described in the form (js, rt, jp, et, pf)@+st, where js is the job size, rt is the job runtime, jp is the job power profile, st is the job arrival time, et records the time when a job enters a new place, and pf indicates the frequency rate of the processor that the job runs on. Here, js and rt are supplied by users, jp is an estimate that can be obtained from historical data [39], and all the others are maintained by PuPPET.
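A direct way to mirror these token attributes in ordinary code is a record type; the field names follow the text, while the class itself and the sample values are our own illustration:

```python
from dataclasses import dataclass

@dataclass
class Job:
    js: int      # job size (nodes), supplied by the user
    rt: float    # job runtime, supplied by the user
    jp: float    # job power profile, estimated from historical data
    et: float    # time the job entered its current place (maintained)
    pf: float    # processor frequency rate the job runs at (maintained)
    st: float    # arrival time: the token's @+st time stamp

job = Job(js=8, rt=31.0, jp=81.0, et=0.0, pf=1.0, st=1.0)
print(job.js, job.st)  # 8 1.0
```

In the CPN itself these fields are the token's color, so the net can pattern-match on them in guards and arc expressions rather than through explicit field access.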

The job list in the place Queuing accepts the jobs submitted by users. Once a job enters or leaves the system (i.e., the firing of Submit or Finish), the place Trigger receives a signal and launches a job scheduling process. For FCFS-EASY, the transition EASY fires and uses the function Backfill to “backfill” jobs according to runtime resource information from the place Runtime Info. For WFP-EASY, jobs are sorted by their utility scores (i.e., the firing of WFP) before backfilling starts.

B. Power-Aware Allocation

This module models coarse-grained power capping by allocating queued jobs onto available nodes based on the system power status and job power requirements. Specifically, for each job at the head of the queue, if its estimated power requirement would make the overall system power exceed the power cap, the allocator either blocks job allocation or opportunistically allocates other less power-hungry jobs:

• BLOCK: the first queued job and its successors are blocked until there is enough power available to execute it (e.g., when another job finishes and leaves the system).

• WAIT: the first queued job is held in a wait queue and yields to other less power-hungry jobs, until its wait time exceeds a predefined value or the wait queue is full.

Inscription                Type      Description
job                        variable  A job with six attributes: js, rt, jp, et, pf and st.
jobs, ws                   variable  A job list.
pow, old_pow               variable  System power level at the current and the previous time steps.
P1, ..., Pm                variable  Processor power rates.
f, pf, new_f, f1, ..., fm  variable  Processor frequency rates.
[expr]                     symbol    A guard on a transition, which is able to fire if expr evaluates to true.
@+expr                     symbol    A time delay specified by expr.
T()                        function  The current model time.
WFP()                      function  Schedule jobs using WFP.
Backfill()                 function  Schedule jobs using backfilling.
PowAllocate()              function  Search for jobs satisfying the power cap from the wait queue.
Update()                   function  Update a job's remaining execution time when the CPU speed changes.

TABLE I. MAJOR INSCRIPTIONS USED IN THE PUPPET MODULES.

Fig. 5. Power-aware allocation module.

Figure 5 presents the net design. We use Pcap to represent the power cap imposed on the system, w to indicate the maximum wait time of a job in the wait queue, and l to restrict the length of the wait queue. The module is centered around three transitions: Allocate, Wait and Allocate W. The head job in the place Queuing is either allocated onto nodes or held in the wait queue. The transition Allocate fires if the job's power requirement does not make the overall system power exceed the power cap; otherwise, the transition Wait fires as long as the wait queue is not full. Jobs in the wait queue can be allocated onto compute nodes by the firing of the transition Allocate W, which is set to a higher firing priority than the transition Allocate.

To model the BLOCK strategy, we set the wait queue length l to zero. If the head job in the place Queuing would make the total system power exceed the power cap, it and its successors are blocked until the power cap is satisfied. Similarly, the WAIT strategy is modeled by setting l to a positive integer. If the head job would break the power cap, it is moved to the place Wait Queue. Once the system power changes, the transition Power Change fires and starts a selection process for the jobs in the wait queue. The function PowAllocate selects jobs that can be allocated under the current system power and puts them at the head of the wait queue. However, if any job in the wait queue exceeds the maximum wait time (denoted by w), that job is kept at the head and a signal is sent to the place Block. Its successors are also blocked until the job at the head of the wait queue is allocated.

Note that the original batch scheduling is modeled by simply setting Pcap to ∞, indicating that the transition Wait will never fire and the jobs in the place Queuing are allocated as long as there are sufficient compute nodes.
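The BLOCK/WAIT decision for the head job can be sketched as follows (a simplified formulation of the rules above; the function and its names are ours, not PuPPET inscriptions):

```python
def allocate_step(head_job, wait_queue, pow_now, pcap, free_nodes, now, w, l):
    """Decide the fate of the head job under a power cap.

    head_job and wait-queue entries are (size, power, enqueue_time) tuples;
    l = 0 models BLOCK, l > 0 models WAIT with maximum wait time w.
    """
    js, jp, et = head_job
    if jp + pow_now < pcap and js <= free_nodes:
        return "allocate"              # fits under the cap: run it
    if l == 0:
        return "block"                 # BLOCK: head and successors stall
    if wait_queue and now - wait_queue[0][2] > w:
        return "block"                 # WAIT: head of wait queue timed out
    if len(wait_queue) < l:
        return "wait"                  # WAIT: yield to less power-hungry jobs
    return "block"                     # wait queue is full

# BLOCK (l=0): a 300 W job under a 1000 W cap with 800 W in use blocks.
print(allocate_step((8, 300, 0.0), [], 800, 1000, 16, 5.0, 60.0, 0))  # block
# WAIT (l=4): the same job is parked in the wait queue instead.
print(allocate_step((8, 300, 0.0), [], 800, 1000, 16, 5.0, 60.0, 4))  # wait
```

Setting pcap to infinity makes the first branch always win (given free nodes), which recovers the uncapped batch scheduling behavior noted above.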

C. CPU Scaling

This module models fine-grained power capping through dynamic voltage and frequency scaling (DVFS). Specifically, in the case that a job arrival makes the total system power exceed the power cap, DVFS is applied to make the processors run at a lower power state, thus keeping the power within the cap.

Figure 6 presents the detailed net design. When the system power changes due to a job arriving or leaving, the power indicated by the place New Power is updated. According to this new system power and the power cap, one of the transitions in {T1, T2, ..., Tm} fires, meaning that the processors are going to run at power rate Pi.

In addition, once Ti fires, the processor frequency rate is updated according to Equation 2, and the remaining job execution time is modified according to Equation 3. Note that we assume there is no latency involved in CPU scaling.
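The role of Update() from Table I when Ti fires can be sketched directly from Equation 3 (our own formulation, with β defaulting to 1.0 as in the text):

```python
def update_remaining(rt, pf, new_f, beta=1.0):
    """Rescale a job's remaining execution time when its processor
    frequency rate changes from pf to new_f, using t ~ 1/f^beta (Eq. 3)."""
    return rt * (pf / new_f) ** beta

# 100 time units remain at full speed; halving the clock doubles that.
print(update_remaining(100.0, 1.0, 0.5))  # 200.0
```

Because the model assumes zero scaling latency, this rescaling is the only cost of a frequency change; the remaining work itself is unchanged.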

D. Tool Implementation

We use CPN Tools to construct PuPPET. CPN Tools is a widely used tool for editing, simulating, and analyzing colored Petri nets [5]. It is free of charge and has been in active development since 1999. More importantly, it provides much faster simulation capabilities than the other tools we investigated. PuPPET can be downloaded from our server at http://bluesky.cs.iit.edu/puppet/ and is directly executable by an appropriate tool such as CPN Tools. Currently, PuPPET accepts job traces in the standard workload format adopted by the community [1]. PuPPET can be simulated interactively or automatically. In an interactive simulation the user is in control.
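As a sketch of how such a trace can be consumed, the snippet below extracts the fields a simulator like PuPPET typically needs from a standard workload format (SWF) file. In the SWF specification, lines beginning with ';' are header/comment lines, and each job record lists (among other fields) the submit time, run time, and number of allocated processors; the helper name is ours:

```python
def read_swf_jobs(lines):
    """Return (submit_time, run_time, num_procs) tuples from an SWF trace."""
    jobs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(';'):
            continue  # skip header/comment lines
        fields = line.split()
        # SWF fields (1-indexed in the spec): 2=submit time, 4=run time,
        # 5=number of allocated processors
        jobs.append((int(fields[1]), int(fields[3]), int(fields[4])))
    return jobs
```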

Fig. 6. CPU scaling module.

It provides a way to walk through the model and check whether the model works as expected. Automatic simulations are similar to actual job execution.

IV. MODEL VALIDATION

Validating a power-performance model like PuPPET is challenging because system logs (e.g., workload trace and power data) of production systems are not easily accessible. Fortunately, Mira at Argonne provided us a unique opportunity to validate our work by means of system traces collected from the 48-rack system. Each rack consists of two mid-planes, eight link cards, and two service cards. A mid-plane contains 16 node boards, and each node board holds 32 compute cards. The entire system contains 786,432 cores, offering a peak performance of 10 petaflops. Cobalt is used on Mira for resource management and job scheduling, employing the WFP policy described in Section II.

We collected job log data from Mira between January and April 2013. In these four months, 16,044 jobs were executed on the machine. A summary of these jobs is presented in Figure 7. A distinguishing workload difference across these months is that system utilization is low in January and February and high in March and April. As we will show in this paper, this workload difference affects the impact of power management. Furthermore, Mira was deployed with power monitoring tools that collect and store power-related system data in an environmental database [37].

From the job log data, we extracted job attributes (job size, job runtime, and job arrival time) for all 16,044 jobs and correlated that information with their respective job power profiles found in the environmental database. We measured job size in number of racks, job time in minutes, and job power profile in watts. We also rounded these measured numbers to the nearest integers for the purpose of modeling.

Fig. 7. Summary of job traces (job runtime in minutes versus job size in nodes, for each month from January to April 2013).

The default model simulation time step is set to one minute. The energy consumption within the time step [t, t + 1] is approximated as the number of tokens in the place Power at time t. For example, if at the 5th minute the number of tokens in System Power is 600, then the energy usage during the following time step is estimated as 600/1000 × 1/60 = 0.01 kWh.
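The per-step energy approximation above can be written directly; this is a small sketch, and the function name is ours:

```python
def step_energy_kwh(power_watts, step_minutes=1):
    """Energy (kWh) consumed over one simulation step, treating the power
    reading (the token count in the place Power, in watts) as constant."""
    return power_watts / 1000.0 * (step_minutes / 60.0)
```

With the default one-minute step, a reading of 600 tokens gives 600/1000 × 1/60 = 0.01 kWh, matching the example above.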

Fig. 8. Daily energy consumption (actual vs. simulation, in kWh, over 120 days).

For model validation, we compared the modeled energy consumption to the actual energy consumption extracted from the environmental data. Figure 8 plots the daily energy consumption obtained from PuPPET against the actual power data from the environmental database. The plot clearly shows that PuPPET is highly accurate, with an average relative error of less than 0.5%. We also compared scheduling metrics (e.g., utilization, average job wait time, etc.) obtained from PuPPET and the job log, which turned out to be very similar. Together these results indicate that PuPPET can effectively model dynamic job scheduling and allocation with high fidelity.

Model accuracy is influenced by the simulation time step. A smaller time step means higher accuracy; nevertheless, a smaller time step also increases simulation overhead. We conducted a set of experiments to evaluate the impact of different simulation time steps; the results are shown in Table II, in which we increased the simulation time step from 1 minute to 60 minutes. We believe that model accuracy can be further improved by setting a smaller simulation time step, at the cost of higher modeling overhead.

PuPPET is fast. All the experiments were conducted on a local PC equipped with an Intel Quad 2.83 GHz processor and 8 GB of memory, running the Windows 7 Professional 64-bit operating system. The 4-month trace simulation took a couple of minutes.

TABLE II. IMPACT OF SIMULATION TIME STEP ON MODEL ACCURACY

Simulation Time Step (min)    Maximum Energy Error
 1                            3.84%
 5                            5.03%
15                            8.85%
30                            15.32%
60                            18.09%
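The trend in Table II can be reproduced qualitatively with a toy experiment: estimate energy from a per-minute power series sampled at a coarser step and compare it against the per-minute estimate. This sketch (names ours) assumes the series length is a multiple of the step:

```python
def relative_energy_error(power_watts_per_min, step_min):
    """Relative error of the coarse-step energy estimate (kWh) versus the
    1-minute estimate, holding each sampled reading constant over a step."""
    fine = sum(p / 1000.0 / 60.0 for p in power_watts_per_min)
    coarse = sum(p / 1000.0 * (step_min / 60.0)
                 for p in power_watts_per_min[::step_min])
    return abs(coarse - fine) / fine
```

A constant power series yields zero error at any step, while a bursty series sampled coarsely can badly over- or under-estimate energy, which is consistent with the error growing with the step size in Table II.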

V. CASE STUDIES

In this section, we give two case studies to illustrate the use of PuPPET for quantitative analysis of power-performance tradeoffs under different power capping mechanisms. We analyze both fine-grained (e.g., DVFS) and coarse-grained (e.g., power-aware allocation) power capping strategies. Our study evaluates these strategies by navigating through different configuration spaces. Specifically, three power-performance metrics are used for analysis:

• System Utilization Rate. This metric represents the ratio of node hours used by user jobs to the total elapsed system node hours. It is a commonly used metric for evaluating system performance.

• Average Job Wait Time. This metric denotes the average time elapsed between the moment a job is submitted and the moment it is allocated to run. It measures scheduling performance from the user's perspective.

• Energy Delay Product (EDP). This metric is defined as the product of the total energy consumption and the makespan of all the jobs (i.e., the total length of the schedule when all the jobs have finished processing). It is commonly used to measure power-performance efficiency; a lower value means better power-performance efficiency [19].

In these case studies, the default setting is as follows: the power cap is set to 45,720 kW (i.e., 70% of the maximum power of 65,314 kW in January), α is set to 2, β is set to 1, and the idle processor power is set to 30% of the power at full speed.
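The three metrics can be made concrete with a small sketch. The helper below is our own, assuming per-job tuples of node count, runtime, and wait time in consistent units:

```python
def power_perf_metrics(jobs, total_nodes, makespan, total_energy):
    """jobs: list of (nodes, run_time, wait_time) tuples.
    Returns (system utilization, average job wait time, EDP)."""
    node_hours = sum(nodes * run for nodes, run, _ in jobs)
    utilization = node_hours / (total_nodes * makespan)
    avg_wait = sum(wait for _, _, wait in jobs) / len(jobs)
    edp = total_energy * makespan   # energy delay product: lower is better
    return utilization, avg_wait, edp
```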

A. Case Study 1: Power Capping via DVFS

In this study, we present a case study of using PuPPET to analyze fine-grained power capping via DVFS. Note that in our experiments, we assume there is no latency involved in DVFS tuning. The power-performance efficiency of DVFS is influenced by many factors such as workload characteristics, hardware configuration, and scheduling policies. In order to investigate the impact of these factors on DVFS, we conduct three sets of experiments.

In the first set, we consider various scenarios in which the processors have different frequency-to-power relationships (i.e., α in Equation 2). According to the literature and industry reports [14, 2], for most modern processors the relationship typically falls between linear and cubic. Hence, we examine three different cases: linear (α = 1), square (α = 2), and cubic (α = 3). Figure 9 presents the results. We make several interesting observations. First, as expected, in the case of a cubic relationship, voltage and frequency tuning has the smallest impact on job execution time, thereby leading to less impact on system performance. Second, with both batch scheduling policies (FCFS and WFP), DVFS improves system utilization and energy delay product, while increasing average job wait time. In order to limit the overall system power within the cap, DVFS extends job execution times (as the processors do not run at full speed), thereby prolonging job wait time and improving system utilization. Although extending job execution times may increase the scheduling makespan, DVFS simultaneously reduces the amount of energy required by jobs (note that for the linear relationship, the energy consumed by jobs does not change). Further, as DVFS improves the system utilization rate, the energy consumed by idle nodes is reduced. This offsets the cost introduced by the makespan extension, thus resulting in better power-performance efficiency. Third, the amount of system performance impact caused by DVFS is greatly influenced by workload characteristics. Specifically, more performance change is observed in March and April than in January and February. As stated earlier, the system utilization rate is higher in March and April. Finally, by comparing the top figures with the bottom figures, it is clear that the increase in average job wait time using WFP is not as significant as that using FCFS; in other words, WFP is more robust to the system performance impact caused by DVFS than FCFS. One reason is that, as shown in Equation 4, WFP considers job queued time when ordering jobs for execution. In general, we observe that the trend of performance change is similar under FCFS and WFP. Hence, in the rest of this subsection, we only present the results using WFP due to space limits.

In the second set, we study the impact of different power caps, i.e., from 100% to 40% of 65,314 kW (the maximum power in January). Figure 10 presents the results. Here, we use the default square frequency-to-power relationship. DVFS tends to increase the system utilization rate when a tighter power cap is imposed. Similar to the first set of experiments, workload characteristics influence the power-performance efficiency achieved by DVFS (e.g., higher performance impact in March and April than in January and February). An important observation from this set of experiments is that a tighter power budget (e.g., 40%) can lead to worse EDP, especially in the case of high system utilization.

In the third set, we consider various scenarios in which the workload has different frequency-to-time relationships (i.e., β in Equation 3). Figure 11 presents the results. As we can see, a higher β value does not increase EDP as expected. As shown in Equation 3, a higher β value means a longer execution delay when the CPU frequency is lowered, thus increasing the scheduling makespan. On the other hand, a longer execution delay increases system utilization and hence saves the energy consumed by idle nodes. This observation implies that the idle power (which is set to 30% of the node power in the active state) is a dominant factor for system power-performance efficiency (e.g., EDP here).


Fig. 9. Power-performance impact under varied frequency-to-power relationships (i.e., varying α of Equation 2 from 1.0 to 3.0) by using DVFS for power capping. Top: FCFS-EASY, and all the results are normalized to FCFS-EASY w/o capping. Bottom: WFP-EASY, and all the results are normalized to WFP-EASY w/o capping.

Fig. 10. Power-performance impact under varied power caps (i.e., varying the power budget from 80% to 40% of the maximum power) by using DVFS for power capping. The results are normalized to the case w/o capping.

Fig. 11. Power-performance impact under varied frequency-to-time relationships (i.e., varying β of Equation 3 from 0.5 to 2) by using DVFS for power capping. All the results are normalized to the case w/o capping.

B. Case Study 2: Power Capping via Power-Aware Allocation

In this part, we present a case study of using PuPPET to analyze coarse-grained power-aware allocation. Similar to DVFS, the power-performance efficiency of power-aware job allocation is influenced by many factors including workload characteristics, hardware configuration, and scheduling policies. We conduct three sets of experiments to examine performance changes under a variety of factors relevant to power-aware job allocation.

As described earlier, in the current PuPPET we have developed two power-aware allocation strategies: BLOCK and WAIT. In the first set of experiments, we compare BLOCK and WAIT using the same power cap. The parameters for WAIT are set to w = 500 and l = 10. Figure 12 presents our results. First of all, it is clear that WAIT always outperforms BLOCK across all the metrics. This is expected, because the WAIT strategy seeks to improve scheduling performance by opportunistically allocating less power-hungry jobs without


Fig. 12. Power-performance impact under different power-aware allocation strategies. Top: FCFS-EASY, and all the results are normalized to FCFS-EASY w/o capping. Bottom: WFP-EASY, and all the results are normalized to WFP-EASY w/o capping.

Fig. 13. Power-performance impact under varied power caps (i.e., varying the power budget from 80% to 40% of the maximum power) by using power-aware job allocation for power capping. All the results are normalized to the case w/o capping.

violating the power cap. Unlike DVFS, we find that power-aware allocation results in lower system utilization and higher EDP, especially in March and April. Power-aware allocation intentionally delays the allocation of some power-hungry jobs, thus leading to lower system utilization and a longer scheduling makespan. A lowered system utilization rate indicates more energy consumed by idle nodes. Because power-aware allocation does not change the total energy required by the jobs, the extended makespan and the additional energy for idle nodes result in an increased EDP. When the system utilization rate is low (e.g., in Jan. and Feb.), the probability that the power budget will be violated is low, which implies it is less likely for the job allocator to delay job execution. Further, a lower system utilization rate often implies a longer interval between job arrivals; thus delaying a job allocation hardly affects its successors. Similar to Case Study 1, we observe that WFP is more robust to the system performance degradation caused by power management than FCFS. Hence, in the rest of this subsection, we only present the results using WFP due to space limits.

In the second set, we study the impact of different power caps, i.e., from 100% to 40% of 65,314 kW (the maximum power in January). Figure 13 presents the results.

Fig. 14. Energy delay product under varied idle power rates by using power-aware job allocation for power capping. All the results are normalized to the case w/o capping.

By comparing this figure with the DVFS results shown in Figure 10, we find that a tighter power budget has a more severe impact on system performance when using power-aware allocation. When the power cap is set to 80%-60%, although system utilization, average job wait time, and EDP are all affected by power-aware allocation, the impact is minor. When the cap is lowered to 40%, the impact becomes very obvious. The higher EDP value resulting from power-aware allocation indicates that it is not a good choice in the case of a tight power budget.

In the third set, we study the impact of idle power consumption on power management. We vary the idle power consumption from 0% to 50% of the node power in the active state. Given that this factor only affects EDP, Figure 14 presents the results regarding energy delay product. First, the use of power-aware allocation does not decrease EDP, which is also depicted in Figure 12. The reason is that while power-aware allocation does not change the total energy required by the jobs, it may cause an extended makespan and more energy consumed by idle nodes. Furthermore, we observe that this idle power factor has a more significant impact on EDP when the system utilization rate is high (e.g., in Mar. and Apr.). As shown in Figure 12, when the system utilization rate is high, power-aware allocation may decrease system utilization for the purpose of limiting the peak power consumption. A lower system utilization means more energy consumed by idle nodes and a longer makespan, thus leading to a higher EDP.

VI. RELATED WORK

Modern systems are designed with various hardware sensors that collect power-related data and store these data for system analysis. System-level tools like LLview [6] and PowerPack [16] have been developed to integrate power monitoring capabilities, facilitating systematic measurement, modeling, and prediction of performance. Goiri et al. used external meters to measure the power consumption of a node at runtime [17]. Feng et al. presented a general framework for direct, automatic profiling of power consumption on high-performance distributed systems [13]. The Performance API (PAPI) has been extended to access internal power sensor readings on several architectures including Intel Sandy Bridge chips and Nvidia GPUs [38]. In our recent work, we have developed a power profiling library called MonEQ for accessing internal power sensor readings on IBM Blue Gene/Q systems [36].

Modeling power, performance, and their tradeoffs has been done on various systems. Analytical modeling is a commonly used method, which mainly focuses on building mathematical correlations between power and performance metrics of the system. Chen et al. proposed a system-level power model for online estimation of power consumption using linear regression [9]. Curtis-Maury et al. presented an online performance prediction framework to address the problem of simultaneous runtime optimization of DVFS and DCT on multi-core systems [10]. Dwyer et al. designed and evaluated a machine learning model that estimates performance degradation of multicore processors in HPC centers [11]. Subramaniam et al. built a regression model for the power and performance of scientific applications and used this model to optimize energy efficiency [30]. Tiwari et al. developed CPU and DIMM power and energy models of three widely used HPC kernels by training artificial neural networks [35]. While these models provide good estimates of power and/or performance metrics, they cannot capture the dynamic, complicated power-performance interactions exhibited in large-scale systems.

There are several studies applying stochastic models for performance or power analysis. Guenter et al. adopted a Markov model for idleness prediction and also proposed power state transitions to remove idle servers [18]. Qiu et al. introduced a continuous-time and controllable Markov process model of a power-managed system [26]. Rong et al. presented a stochastic model for a power-managed, battery-powered electronic system, and formulated a policy optimization problem to maximize the capacity utilization of battery-powered systems [28]. Unlike these studies relying on a Markov process, our model is built on colored Petri nets, and is thus more robust to the potential state-space explosion problem commonly observed in Markov models.

The studies most closely related to ours are [15] and [34]. In [15], Gandhi et al. used queueing theory to obtain the optimal energy-performance tradeoff in server farms, and in [34] Tian et al. proposed a model using stochastic reward nets (SRN) to analyze performance and power consumption under different power states. In contrast to these studies, our work targets supercomputers with their unique batch scheduling and parallel workloads. Moreover, our model adopts the advanced features of colored Petri nets, which provide a more accurate representation of the dynamic system being analyzed. To the best of our knowledge, this is the first colored Petri net modeling work to study the power-performance tradeoffs for HPC.

VII. CONCLUSION

In this paper, we have presented PuPPET, a colored Petri net for modeling and analyzing power management on extreme-scale systems. By using the advanced features of colored Petri nets, PuPPET is capable of capturing the complicated interactions among different factors that can affect power-performance efficiency on HPC systems. Our trace-based validation demonstrates that it can effectively model the dynamic execution of parallel jobs on a system by providing an accurate approximation of energy consumption. A salient feature of PuPPET is that it can scale to hundreds of thousands of processor cores while providing high levels of modeling accuracy. Such a feature is crucial for power-performance tradeoff analysis of extreme-scale systems. Moreover, we have explored the use of PuPPET to study the power-performance tradeoffs under two well-known power capping methods, i.e., power-aware job allocation and DVFS. Our case studies provide useful insights about using power capping on extreme-scale systems.

PuPPET provides a convenient modeling tool for users to set different parameters and study power-performance tradeoffs on large-scale systems. We believe it has many other potential uses in addition to the case studies presented in this work. For instance, it can be used to find the optimal power policy that minimizes power consumption under a time constraint. It can also be extended to incorporate resiliency analysis by adding a module describing failure behaviors. The resulting model can be used to study the key tradeoffs among performance, power, and reliability for supercomputing. All of these are part of our ongoing work.

ACKNOWLEDGMENTS

This work is supported in part by US National Science Foundation grant CNS-1320125 and Department of Energy contract DE-AC02-06CH11357.


REFERENCES

[1] The standard workload format. http://www.cs.huji.ac.il/labs/parallel/workload/swf.html.
[2] Enhanced Intel SpeedStep technology for the Intel Pentium M processor. Technical report, Intel Corporation, 2004.
[3] Mira: Next-generation supercomputer. https://www.alcf.anl.gov/mira, 2012.
[4] ASCR workshop on modeling and simulation of exascale systems and applications. Technical report, DOE Office of Science, 2013.
[5] CPN Tools. http://cpntools.org/, 2013.
[6] LLview: graphical monitoring of LoadLeveler controlled cluster. http://www.fz-juelich.de/jsc/llview/, 2013.
[7] S. Ashby. The opportunities and challenges of exascale computing. Technical report, DOE Office of Science, 2010.
[8] G. Balbo. Introduction to generalized stochastic Petri nets. In Proceedings of SFM, pages 83-131, 2007.
[9] X. Chen, C. Xu, R. P. Dick, and Z. M. Mao. Performance and power modeling in a multi-programmed multi-core environment. In Proceedings of DAC, pages 813-818, 2010.
[10] M. Curtis-Maury, A. Shah, F. Blagojevic, D. S. Nikolopoulos, B. R. de Supinski, and M. Schulz. Prediction models for multi-dimensional power-performance optimization on many cores. In Proceedings of PACT, pages 250-259, 2008.
[11] T. Dwyer, A. Fedorova, S. Blagodurov, M. Roth, F. Gaud, and J. Pei. A practical method for estimating performance degradation on multicore processors, and its application to HPC workloads. In Proceedings of SC, page 83, 2012.
[12] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong. Theory and practice in parallel job scheduling. In Proceedings of JSSPP, pages 1-34, 1997.
[13] X. Feng, R. Ge, and K. W. Cameron. Power and energy profiling of scientific applications on distributed systems. In Proceedings of IPDPS, pages 658-671, 2005.
[14] M. Floyd, M. Allen-Ware, K. Rajamani, B. Brock, C. Lefurgy, et al. Introducing the adaptive energy management features of the POWER7 chip. IEEE Micro, 31:60-75, 2011.
[15] A. Gandhi, M. Harchol-Balter, and I. Adan. Server farms with setup costs. Perform. Eval., 67:1123-1138, 2010.
[16] R. Ge, X. Feng, S. Song, H.-C. Chang, D. Li, and K. Cameron. PowerPack: Energy profiling and analysis of high-performance systems and applications. IEEE Trans. Parallel Distrib. Syst., 21:658-671, 2010.
[17] I. Goiri, K. Le, M. Haque, R. Beauchea, T. Nguyen, J. Guitart, J. Torres, and R. Bianchini. GreenSlot: Scheduling energy consumption in green datacenters. In Proceedings of SC, pages 1-11, 2011.
[18] B. Guenter, N. Jain, and C. Williams. Managing cost, performance, and reliability tradeoffs for energy-aware server provisioning. In Proceedings of INFOCOM, pages 1332-1340, 2011.
[19] J. H. L. III, K. T. Pedretti, S. M. Kelly, W. Shu, K. B. Ferreira, J. V. Dyke, and C. T. Vaughan. Energy-Efficient High Performance Computing - Measurement and Tuning. Springer, 2012.
[20] K. Jensen and L. M. Kristensen. Coloured Petri Nets - Modelling and Validation of Concurrent Systems. Springer, 2009.
[21] C. Lefurgy, X. Wang, and M. Ware. Server-level power control. In Proceedings of ICAC, 2007.
[22] T. Martin and D. Siewiorek. Non-ideal battery and main memory effects on CPU speed-setting for low power. IEEE Trans. VLSI Systems, 9:29-34, 2001.
[23] W. Marwan, C. Rohr, and M. Heiner. Petri nets in Snoopy: A unifying framework for the graphical display, computational modelling, and simulation of bacterial regulatory networks. Humana Press, 2012.
[24] A. Mualem and D. Feitelson. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Transactions on Parallel and Distributed Systems, 12(6):529-543, 2001.
[25] T. Murata. Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77:541-580, 1989.
[26] Q. Qiu and M. Pedram. Dynamic power management based on continuous-time Markov decision processes. In Proceedings of DAC, pages 555-561, 1999.
[27] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, et al. The structural simulation toolkit. SIGMETRICS Perform. Eval. Rev., 38:37-42, 2011.
[28] P. Rong and M. Pedram. Battery-aware power management based on Markovian decision processes. In IEEE TCAD, pages 707-713, 2006.
[29] J. Srinivasan, S. Adve, P. Bose, and J. Rivers. The impact of technology scaling on lifetime reliability. In Proceedings of DSN, 2004.
[30] B. Subramaniam and W. Feng. Statistical power and performance modeling for optimizing the energy efficiency of scientific computing. In Proceedings of GREENCOM-CPSCOM, pages 139-146, 2010.
[31] K. Sugavanam, C.-Y. Cher, J. A. Gunnels, R. A. Haring, P. Heidelberger, et al. Design for low power and power management in IBM Blue Gene/Q. IBM Journal of Research and Development, 57:1-11, 2013.
[32] W. Tang, N. Desai, D. Buettner, and Z. Lan. Analyzing and adjusting user runtime estimates to improve job scheduling on Blue Gene/P. In Proceedings of IPDPS, 2010.
[33] W. Tang, Z. Lan, N. Desai, and D. Buettner. Fault-aware, utility-based job scheduling on Blue Gene/P systems. In Proceedings of Cluster, 2009.
[34] Y. Tian, C. Lin, and M. Yao. Modeling and analyzing power management policies in server farms using stochastic Petri nets. In Proceedings of e-Energy, page 26, 2012.
[35] A. Tiwari, M. A. Laurenzano, L. Carrington, and A. Snavely. Modeling power and energy usage of HPC kernels. In Proceedings of IPDPSW, pages 990-998, 2012.
[36] S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, and M. Papka. Measuring power consumption on IBM Blue Gene/Q. In Proceedings of HPPAC, 2013.
[37] S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, and M. Papka. Profiling benchmarks on IBM Blue Gene/Q. In Proceedings of Cluster, 2013.
[38] V. Weaver, M. Johnson, K. Kasichayanula, J. Ralph, P. Luszczek, D. Terpstra, and S. Moore. Measuring energy and power with PAPI. In International Workshop on Power-Aware Systems and Architectures, 2012.


[39] X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. Papka. Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems. In Proceedings of SC, 2013.
[40] G. Zheng, G. Kakulapati, and L. V. Kale. BigSim: A parallel simulator for performance prediction of extremely large parallel machines. In Proceedings of IPDPS, 2004.
[41] Z. Zhou, Z. Lan, W. Tang, and N. Desai. Reducing energy costs for IBM Blue Gene/P via power-aware job scheduling. In Proceedings of JSSPP, 2013.