
Energy Management for Commercial Servers

Charles Lefurgy, Karthick Rajamani, Freeman Rawson, Wes Felter, Michael Kistler, and Tom W. Keller
IBM Austin Research Lab

As power increasingly shapes commercial systems design, commercial servers can conserve energy by leveraging their unique architecture and workload characteristics.

In the past, energy-aware computing was primarily associated with mobile and embedded computing platforms. Servers—high-end, multiprocessor systems running commercial workloads—typically included extensive cooling systems and resided in custom-built rooms for high-power delivery. In recent years, however, as transistor density and demand for computing resources have rapidly increased, even high-end systems face energy-use constraints. Moreover, conventional computers are currently air cooled, and systems are approaching the limits of what manufacturers can build without introducing additional techniques such as liquid cooling. Clearly, good energy management is becoming important for all servers.

Power management challenges for commercial servers differ from those for mobile systems. Techniques for saving power and energy at the circuit and microarchitecture levels are well known,1 and other low-power options are specialized to a server's particular structure and the nature of its workload. Although there has been some progress, a gap still exists between the known solutions and the energy-management needs of servers.

In light of the trend toward isolating disk resources in separate cabinets and accessing them through some form of storage networking, the main focus of energy management for commercial servers is conserving power in the memory and microprocessor subsystems. Because their workloads are typically structured as multiple-application programs, system-wide approaches are more applicable to multiprocessor environments in commercial servers than techniques that are primarily applicable to single-application environments, such as those based on compiler optimizations.

COMMERCIAL SERVERS

Commercial servers comprise one or more high-performance processors and their associated caches; large amounts of dynamic random-access memory (DRAM) with multiple memory controllers; and high-speed interface chips for high memory bandwidth, I/O controllers, and high-speed network interfaces.

Servers with multiple processors typically are designed as symmetric multiprocessors (SMPs), which means that the processors share the main memory and any processor can access any memory location. This organization has several advantages:

• Multiprocessor systems can scale to much larger workloads than single-processor systems.

• Shared memory simplifies workload balancing across servers.

• The machine naturally supports the shared-memory programming paradigm that most developers prefer.

• Because it has a large-capacity and high-bandwidth memory, a multiprocessor system can efficiently execute memory-intensive workloads.

In commercial servers, memory is hierarchical. These servers usually have two or three levels of cache between the processor and the main memory. Typical high-end commercial servers include IBM's p690, HP's 9000 Superdome, and Sun Microsystems' Sun Fire 15K.

Figure 1 shows a high-level organization of processors and memory in a single multichip module (MCM) in an IBM Power4 system. Each Power4 processor contains two processor cores; each core executes a single program context.

The processor's two cores contain L1 caches (not shown) and share an L2 cache, which is the coherence point for the memory hierarchy. Each processor connects to an L3 cache (off-chip), a memory controller, and main memory. In some configurations, processors share the L3 caches. The four processors reside on an MCM and communicate through dedicated point-to-point links. Larger systems such as the IBM p690 consist of multiple connected MCMs.

Table 1 shows the power consumption of two configurations of an IBM p670 server, which is a midrange version of the p690. The top row gives the power breakdown for a small four-way server (single MCM with four single-core chips) with a 128-Mbyte L3 cache and a 16-Gbyte memory. The bottom row gives the breakdown for a larger 16-way server (dual MCM with four dual-core chips) with a 256-Mbyte L3 cache and a 128-Gbyte memory.

The power consumption breakdowns include

• the processors, including the MCMs with processor cores and L1 and L2 caches, cache controllers, and directories;
• the memory, consisting of the off-chip L3 caches, DRAM, memory controllers, and high-bandwidth interface chips between the controllers and DRAM;
• I/O and other nonfan components;
• fans for cooling processors and memory; and
• fans for cooling the I/O components.

We measured the power consumption at idle. A high-end commercial server typically focuses primarily on performance, and the designs incorporate few system-level power-management techniques. Consequently, idle and active power consumption are similar.

We estimated fan power consumption from product specifications. For the other components of the small configuration, we measured DC power. We estimated the power in the larger configuration by scaling the measurements of the smaller configuration based on relative increases in the component quantities. Separate measurements were made to obtain dual-core processor power consumption.

In the small configuration, processor power is greater than memory power: Processor power accounts for 24 percent of system power, memory power for 19 percent. In the larger configuration, the processors use 28 percent of the power, and memory uses 41 percent. This suggests the need to supplement the conventional, processor-centric approach to energy management with techniques for managing memory energy.

The high power consumption of the computing components generates large amounts of heat, requiring significant cooling capabilities. The fans driving the cooling system consume additional power. Fan power consumption, which is relatively fixed for the system cabinet, dominates the small configuration at 51 percent, and it is a big component of the large configuration at 28 percent. Reducing the power of computing components would allow a commensurate reduction in cooling capacity, therefore reducing fan power consumption.


Figure 1. A single multichip module in a Power4 system. Each processor has two processor cores, each executing a single program context. Each core contains L1 caches, shares an L2 cache, and connects to an off-chip L3 cache, a memory controller, and main memory.

Table 1. Power consumption breakdown for an IBM p670.

IBM p670 server               Processors   Memory   I/O and other   Processor and memory fans   I/O fans   Total watts
Small configuration (watts)       384         318        90                   676                  144        1,614
Large configuration (watts)       840       1,223        90                   676                  144        2,972



We did not separate disk power because the measured system chiefly used remote storage and because the number of disks in any configuration varies dramatically. Current high-performance SCSI disks typically consume 11 to 18 watts each when active.

Fortunately, this machine organization suggests several natural options for power management. For example, using multiple, discrete processors allows for mechanisms to turn a subset of the processors off and on as needed. Similarly, multiple cache banks, memory controllers, and DRAM modules provide natural demarcations of power-manageable entities in the memory subsystem. In addition, the processing capabilities of memory controllers, although limited, can accommodate new power-management mechanisms.

Energy-management goals

Energy management primarily aims to limit maximum power consumption and improve energy efficiency. Although generally consistent with each other, the two goals are not identical. Some energy-management techniques address both goals, but most implementations focus on only one or the other.

Addressing the power consumption problem is critical to maintaining reliability and reducing cooling requirements. Traditionally, server designs countered increased power consumption by improving the cooling and packaging technology. More recently, designers have used circuit and microarchitectural approaches to reduce thermal stress.

Improving energy efficiency requires either increasing the number of operations per unit of energy consumed or decreasing the amount of energy consumed per operation. Increased energy efficiency reduces the operational costs for the system's power and cooling needs. Energy efficiency is particularly important in large installations such as data centers, where power and cooling costs can be sizable.

A recent energy-management challenge is leakage current in semiconductor circuits, which causes transistors designed for high frequencies to consume power even when they don't switch. Although we discuss some technologies that turn off idle components and reduce leakage, the primary approaches to tackling this problem center on improvements to circuit technology and microarchitecture design.

Server workloads

Commercial server workloads include transaction processing—for Web servers or databases, for example—and batch processing—for noninteractive, long-running programs. Transaction and batch processing offer somewhat different power management opportunities.

As Figure 2 illustrates, transaction-oriented servers do not always run at peak capacity because their workloads often vary significantly depending on the time of day, day of the week, or other external factors. Such servers have significant buffer capacity to maintain performance goals in the event of unexpected workload increases. Thus, much of the server capacity remains unutilized during normal operation.

Transaction servers that run at peak throughput can impact the latency of individual requests and, consequently, fail to meet response time goals. Batch servers, on the other hand, often have less stringent latency requirements, and they might run at peak throughput in bursts. However, their more relaxed latency requirements mean that sometimes running a large job overnight is sufficient. As a result, both transaction and batch servers have idleness—or slack—that designers can exploit to reduce the energy used.

Server workloads can comprise multiple applications with varying computational and performance requirements. The server systems' organization often matches this variety. For example, a typical e-commerce Web site consists of a first tier of simple page servers organized in a cluster, a second tier of higher performance servers running Web applications, and a third tier of high-performance database servers.

Although such heterogeneous configurations primarily offer cost and performance benefits, they also present a power advantage: Machines optimized for particular tasks and amenable to workload-specific tuning for higher energy efficiencies serve each distinct part of the workload.


Figure 2. Load variation by hour at (a) a financial Web site and (b) the 1998 Olympics Web site in a one-day period. The number of requests received varies widely depending on time of day and other factors.



PROCESSORS

Processor power and energy management relies primarily on microarchitectural enhancements, but realizing their potential often requires some form of systems software support.

Frequency and voltage scaling

CMOS circuits in modern microprocessors consume power in proportion to their frequency and to the square of their operating voltage. However, voltage and frequency are not independent of each other: A CPU can safely operate at a lower voltage only when it runs at a low enough frequency. Thus, reducing frequency and voltage together reduces energy per operation quadratically, but only decreases performance linearly.

Early work on dynamic frequency and voltage scaling (DVS)—the ability to dynamically adjust processor frequency and voltage—proposed that instead of running at full speed and sitting idle part of the time, the CPU should dynamically change its frequency to accommodate the current load and eliminate slack.2 To select the proper CPU frequency and voltage, the system must predict CPU use over a future time interval.
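A minimal sketch of such an interval-based policy follows. The operating points, smoothing weight, and utilization trace are hypothetical; published heuristics are considerably more sophisticated.

```python
# Interval-based DVS sketch: predict the next interval's utilization from
# recent history, then pick the slowest operating point that covers it.
OPERATING_POINTS = [        # (GHz, volts), ascending; assumed values
    (0.6, 0.9), (1.0, 1.1), (1.4, 1.2), (2.0, 1.5),
]
F_MAX = OPERATING_POINTS[-1][0]

def select_point(predicted_util):
    demand = predicted_util * F_MAX   # work rate, in "GHz of computation"
    for f, v in OPERATING_POINTS:
        if f >= demand:               # slowest frequency that removes slack
            return f, v
    return OPERATING_POINTS[-1]

def run_governor(utilization_samples, alpha=0.5):
    predicted = 1.0                   # start conservatively at full speed
    for measured in utilization_samples:
        f, v = select_point(predicted)
        # ... program f and v into the processor here ...
        predicted = alpha * measured + (1 - alpha) * predicted

run_governor([0.9, 0.4, 0.3, 0.35, 0.8])
```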

Much of the recent work in DVS has sought to develop prediction heuristics,3,4 but further research on predicting processor use for server workloads is needed. Other barriers to implementing DVS on SMP servers remain. For example, many cache-coherence protocols assume that all processors run at the same frequency. Modifying these protocols to support DVS is nontrivial.

Standard DVS techniques attempt to minimize the time the processor spends running the operating system idle loop. Other researchers have explored the use of DVS to reduce or eliminate the stall cycles that poor memory access latency causes. Because memory access time often limits the performance of data-intensive applications, running the applications at reduced CPU frequency has a limited impact on performance.

Offline profiling or compiler analysis can determine the optimal CPU frequency for an application or application phase. A programmer can insert operations to change frequency directly into the application code, or the operating system can perform the operations during process scheduling.

Simultaneous multithreading

Unlike DVS, which reduces the number of idle processor cycles by lowering frequency, simultaneous multithreading (SMT) increases processor use at a fixed frequency. Single programs exhibit idle cycles because of stalls on memory accesses or an application's inherent lack of instruction-level parallelism. Therefore, SMT maps multiple program contexts (threads) onto the processor concurrently to provide instructions that use the otherwise idle resources. This increases performance by increasing processor use. Because the threads share resources, single threads are less likely to issue highly speculative instructions.

Both effects improve energy efficiency because functional units stay busy executing useful, nonspeculative instructions.5 Support for SMT chiefly involves duplicating the registers describing the thread state, which requires much less power overhead than adding a processor.

Processor packing

Whereas DVS and SMT effectively reduce idle cycles on single processors, processor packing operates across multiple processors. Research at IBM shows that average processor use of real Web servers is 11 to 50 percent of their peak capacity.4 In SMP servers, this presents an opportunity to reduce power consumption.

Rather than balancing the load across all processors, leaving them all lightly loaded, processor packing concentrates the load onto the smallest number of processors possible and turns the remaining processors off. The major challenge to effectively implementing processor packing is the current lack of hardware support for turning processors in SMP servers on and off.
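The sizing decision at the heart of packing can be sketched as follows. The headroom value is an assumption, and the set_online hook is hypothetical, standing in for the hardware support the previous paragraph notes is missing.

```python
import math

# Processor packing sketch: size the active set to the offered load plus
# headroom instead of spreading the load across every processor.
def processors_needed(total_utilization, headroom=0.25, n_total=16):
    # total_utilization: summed per-processor utilizations, 0..n_total
    return max(1, min(n_total, math.ceil(total_utilization * (1 + headroom))))

def repack(cpus, total_utilization):
    want = processors_needed(total_utilization, n_total=len(cpus))
    for i, cpu in enumerate(cpus):
        cpu.set_online(i < want)      # hypothetical hot-plug interface

print(processors_needed(2.2))         # light load on a 16-way -> 3 active
```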

Throughput computing

Certain workload types might offer additional opportunities to reduce processor power. Servers typically process many requests, transactions, or jobs concurrently. If these tasks have internal parallelism, developers can replace high-performance processors with a larger number of slower but more energy-efficient processors providing the same throughput.

Piranha6 and our Super-Dense Server7 prototype are two such systems. The SDS prototype, operating as a cluster, used half of the energy that a conventional server used on a transaction-processing workload. However, throughput computing is not suitable for all applications, so server makers must develop separate systems to provide improved single-thread performance.



MEMORY POWER

Server memory is typically double-data rate synchronous dynamic random access memory (DDR SDRAM), which has two low-power modes: power-down and self-refresh. In a large server, the number of memory modules typically surpasses the number of outstanding memory requests, so memory modules are often idle. The memory controllers can put this idle memory into low-power mode until a processor accesses it.

Switching to power-down mode or back to active mode takes only one memory cycle and can reduce idle DRAM power consumption by more than 80 percent. This savings may increase with newer technology.

Self-refresh mode might achieve even greater power savings, but it currently requires several hundred cycles to return to active mode. Thus, realizing this mode's potential benefits requires more sophisticated techniques.
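A threshold policy along these lines might look like the sketch below. The power fractions and thresholds are illustrative; only the rough transition costs follow the figures quoted above.

```python
# Idle-threshold sketch for DRAM low-power modes. Power fractions are
# assumed; transition costs roughly follow the text (one cycle to enter
# or leave power-down, several hundred cycles to exit self-refresh).
MODES = {
    # name: (fraction of active-idle power, exit latency in cycles)
    "active":       (1.00,   0),
    "power-down":   (0.20,   1),     # >80% idle-power reduction
    "self-refresh": (0.05, 300),     # deeper savings, costly exit
}

def pick_mode(idle_cycles):
    # The longer a rank has been idle, the deeper the mode worth the
    # exit penalty. The break-even threshold here is a made-up tunable.
    if idle_cycles > 10_000:
        return "self-refresh"
    if idle_cycles > 0:
        return "power-down"          # near-free transition, use eagerly
    return "active"

print(pick_mode(50), pick_mode(20_000))   # power-down self-refresh
```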

Data placement

Data distribution in physical memory determines which memory devices a particular memory access uses and consequently the devices' active and idle periods. Memory devices with no active data can operate in a low-power mode.

When a processor accesses memory, memory controllers can bring powered-down memory into an active state before proceeding with the access. To optimize performance, the operating system can activate the memory that a newly scheduled process uses during the context switch period, thus largely hiding the latency of exiting the low-power mode.8,9

Better page allocation policies can also save energy. Allocating new pages to memory devices already in use helps reduce the number of active memory devices.9,10 Intelligent page migration—moving data from one memory device to another to reduce the number of active memory devices—can further reduce energy consumption.9,11
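A simplified allocator illustrates the idea. The device bookkeeping is invented; a real policy would live inside the operating system's virtual memory subsystem.

```python
# Energy-aware page allocation sketch: prefer devices that already hold
# pages so untouched devices can remain in a low-power mode.
class MemoryDevice:
    def __init__(self, name, capacity_pages):
        self.name, self.free, self.used = name, capacity_pages, 0

def allocate_page(devices):
    active = [d for d in devices if d.used > 0 and d.free > 0]
    pool = active or [d for d in devices if d.free > 0]
    if not pool:
        raise MemoryError("no free pages")
    dev = max(pool, key=lambda d: d.used)   # pack onto the fullest device
    dev.free -= 1
    dev.used += 1
    return dev

devs = [MemoryDevice("DIMM0", 4), MemoryDevice("DIMM1", 4)]
for _ in range(3):
    print(allocate_page(devs).name)         # all three land on DIMM0
```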

However, minimizing the number of active memory devices also reduces the memory bandwidth available to the application. Some accesses previously made in parallel to several memory devices must be made serially to the same memory device.

Data placement strategies originally targeted low-end systems; hence, developers must carefully evaluate them in the context of commercial server environments, where reduced memory bandwidth can substantially impact performance.

Memory address mapping

Server memory systems provide configurable address mappings that can be used for energy efficiency. These systems partition memory among memory controllers, with each controller responsible for multiple banks of physical memory.

Configurable parameters in the system memory organization determine the interleaving—the mapping from a memory address to its location in a physical memory device. This mapping often occurs at the granularity of the higher-level cache line size.

One possible interleaving allocates consecutive cache lines to DRAMs that different memory controllers manage, striping the physical memory across all controllers. Another approach involves mapping consecutive cache lines to the same controller, using all the physical memory under one controller before moving to the next. Under each controller, memory address interleaving can similarly occur across the multiple physical memory banks.
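The two mappings can be contrasted in a small sketch; the controller count, cache-line size, and per-controller capacity are assumptions.

```python
# Two cache-line-to-controller mappings for an assumed configuration.
N_CONTROLLERS = 4
LINE = 128                      # cache-line size in bytes (assumed)
LINES_PER_CONTROLLER = 2**20    # lines behind each controller (assumed)

def striped(addr):
    # Consecutive lines rotate across controllers: high bandwidth, but
    # every controller and its DRAM tends to stay active.
    return (addr // LINE) % N_CONTROLLERS

def sequential(addr):
    # Consecutive lines fill one controller before the next: lower
    # bandwidth, but idle controllers' memory can power down.
    return (addr // LINE) // LINES_PER_CONTROLLER % N_CONTROLLERS

for a in range(0, 4 * LINE, LINE):
    print(striped(a), sequential(a))   # 0 1 2 3 versus 0 0 0 0
```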

Spreading consecutive accesses—or even a single access—across multiple devices and controllers can improve memory bandwidth, but it may consume more power because the server must activate more memory devices and controllers.

The interactions of interleaving schemes with configurable DRAM parameters such as page-mode policies and burst length have an additional impact on memory latencies and power consumption. Contrary to current practice, in which systems come with a set interleaving scheme, a developer can tune an interleaving scheme to obtain the desired power and performance tradeoff for each workload.

Memory compression

An orthogonal approach to reducing the active physical memory is to use a system with less memory. IBM researchers have demonstrated that they can compress the data in a server's memory to half its size.12 A modified memory controller compresses and decompresses data when it accesses the memory. However, compression adds an extra step to the memory access, which can be time-consuming.

Adding the 32-Mbyte L3 cache used in the IBM compression implementation significantly reduces the traffic to memory, thereby reducing the performance loss that compression causes. In fact, a compressed memory server can perform at nearly the same level as a server with no compression and double the memory.


Although the modified memory controller and larger caches need additional power, high-memory-capacity servers should expect a net savings. However, for workloads with the highest memory bandwidth requirements and working sets exceeding the compression buffer size, this solution may not be energy efficient. Quantifying compression's power and performance benefit requires further analysis.

Cache coherence

Reducing overhead due to cache coherence traffic can also improve energy efficiency. Shared-memory servers often use invalidation protocols to maintain cache coherency. One way to implement these protocols is to have one cache level (typically the L2 cache) track—or snoop—memory requests from all processors.

The cache controllers of all processors in the server share a bus to memory and the other caches. Each cache controller arrives at the appropriate coherence action to maintain consistency between the caches based on its ability to observe memory traffic to and from all caches, and to snoop remote caches. Snoop accesses from other processors greatly increase L2 cache power consumption.

Andreas Moshovos and colleagues introduced a small cache-like structure they term a Jetty at each L2 cache to filter incoming snoop requests.13 The Jetty predicts whether the local cache contains a requested line, with no false negatives. Querying the Jetty first and forwarding the snoop request to the L2 cache only when it predicts the line is in the local cache reduces L2 cache energy for all accesses by about 30 percent.
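The sketch below captures only the filter's key invariant (no false negatives) with a counting-filter scheme; the actual Jetty hardware organization differs in its details.

```python
# Snoop-filter sketch in the spirit of Jetty: a small counter array that
# can prove "not cached here" and therefore never returns a false negative.
N_BUCKETS = 256

class SnoopFilter:
    def __init__(self):
        self.counts = [0] * N_BUCKETS

    def _bucket(self, line_addr):
        return hash(line_addr) % N_BUCKETS

    def on_fill(self, line_addr):      # line allocated into the local L2
        self.counts[self._bucket(line_addr)] += 1

    def on_evict(self, line_addr):     # line removed from the local L2
        self.counts[self._bucket(line_addr)] -= 1

    def may_contain(self, line_addr):
        # Zero proves absence; nonzero may be a false positive, and only
        # then is the power-hungry L2 tag lookup actually performed.
        return self.counts[self._bucket(line_addr)] != 0

f = SnoopFilter()
f.on_fill(0x1234)
print(f.may_contain(0x1234), f.may_contain(0x5678))   # True False
```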

Craig Saldanha and Mikko Lipasti propose replacing parallel snoop accesses with serial accesses.14 Many snoop cache implementations broadcast the snoop request in parallel to all caches on the SMP bus. Snooping other processors' caches one at a time reduces the total number of snoop accesses. Although this method can reduce the energy and maximum power used, it increases the average cache miss access latency. Therefore, developers must balance the energy savings in the memory system against the energy used to keep other system components active longer.

Energy Conservation in Clustered Servers
Ricardo Bianchini, Rutgers University
Ram Rajamony, IBM Austin Research Lab

Power and energy conservation have recently become key concerns for high-performance servers, especially when deployed in large cluster configurations as in data centers and Web-hosting facilities.1

Research teams from Rutgers University2 and Duke University3 have proposed similar strategies for managing energy in Web-server clusters. The idea is to dynamically distribute the load offered to a server cluster so that, under light load, the system can idle some hardware resources and put them in low-power modes. Under heavy load, the system should reactivate the resources and redistribute the load to eliminate performance degradation. Because Web-server clusters replicate file data at all nodes, and traditional server hardware has very high base power—the power consumed when the system is on but idle—these systems dynamically turn entire nodes on and off, effectively reconfiguring the cluster.

Figure A illustrates the behavior of the Rutgers system for a seven-node cluster running a real, but accelerated, Web trace. The figure shows the evolution of the cluster configuration and offered loads on each resource as a function of time. It plots the load on each resource as a percentage of the nominal throughput of the same resource in one node. The figure shows that the network interface is the bottleneck resource throughout the entire experiment.

Figure B shows the cluster power consumption for two versions of the same experiment, again as a function of time. The lower curve (dynamic configuration) represents the reconfigurable version of the system, whereas the higher curve (static configuration) represents a configuration fixed at seven nodes. The figure shows that reconfiguration reduces power consumption significantly for most of the experiment, saving 38 percent in energy.


Figure A. Cluster evolution and per-resource offered loads. The network interface is the bottleneck resource throughout the entire experiment.

Figure B. Power under static and dynamic configurations. Reconfiguration reduces power consumption significantly for most of the experiment, saving 38 percent in energy.



Karthick Rajamani and Charles Lefurgy4 studied how to improve the cluster reconfiguration technique's energy-saving potential by using spare servers and history information about peak server loads. They also modeled the key system and workload parameters that influence this technique.

Mootaz Elnozahy and colleagues5 evaluated different combinations of cluster reconfiguration and two types of dynamic voltage scaling—independent and coordinated—for clusters in which the base power is relatively low. In independent voltage scaling, each server node makes its own independent decision about what voltage and frequency to use, depending on the load it is receiving. In coordinated voltage scaling, nodes coordinate their voltage and frequency settings. Their simulation results showed that, while either coordinated voltage scaling or reconfiguration is the best technique depending on the workload, combining them is always the best approach.

In other research, Mootaz Elnozahy and colleagues6 proposed using request batching to conserve processor energy under a light load. In this technique, the network interface processor accumulates incoming requests in memory, while the server's host processor is in a low-power state. The system awakens the host processor when an accumulated request has been pending for longer than a threshold.

Taliver Heath and colleagues7 considered reconfiguration in the context of heterogeneous server clusters. Their experimental results for a cluster of traditional and blade nodes show that their heterogeneity-conscious server can conserve more than twice as much energy as a heterogeneity-oblivious reconfigurable server.

References
1. R. Bianchini and R. Rajamony, Power and Energy Management for Server Systems, tech. report DCS-TR-528, Dept. Computer Science, Rutgers Univ., 2003.
2. E. Pinheiro et al., “Dynamic Cluster Reconfiguration for Power and Performance,” Compilers and Operating Systems for Low Power, L. Benini, M. Kandemir, and J. Ramanujam, eds., Kluwer Academic, 2003. Earlier version published in Proc. Workshop Compilers and Operating Systems for Low Power, 2001.
3. J.S. Chase et al., “Managing Energy and Server Resources in Hosting Centers,” Proc. 18th ACM Symp. Operating Systems Principles, ACM Press, 2001, pp. 103-116.
4. K. Rajamani and C. Lefurgy, “On Evaluating Request-Distribution Schemes for Saving Energy in Server Clusters,” Proc. 2003 IEEE Int'l Symp. Performance Analysis of Systems and Software, IEEE Press, 2003, pp. 111-122.
5. E.N. Elnozahy, M. Kistler, and R. Rajamony, “Energy-Efficient Server Clusters,” Proc. 2nd Workshop Power-Aware Computing Systems, LNCS 2325, Springer, 2003, pp. 179-196.
6. E.N. Elnozahy, M. Kistler, and R. Rajamony, “Energy Conservation Policies for Web Servers,” Proc. 4th Usenix Symp. Internet Technologies and Systems, Usenix Assoc., 2003, pp. 99-112.
7. T. Heath et al., “Self-Configuring Heterogeneous Server Clusters,” Proc. Workshop Compilers and Operating Systems for Low Power, 2003, pp. 75-84.

Ricardo Bianchini is an assistant professor in the Department of Computer Science, Rutgers University. Contact him at [email protected].

Ram Rajamony is a research staff member at IBM Austin Research Lab. Contact him at [email protected].

ENERGY MANAGEMENT MECHANISMS

Most energy management techniques consist of one or more mechanisms that determine the power consumption of system hardware components and a policy that determines the best use of these mechanisms.

Although developers can implement some energy-management techniques completely in hardware, combining hardware mechanisms with software policy is beneficial. Placing the policy in software allows for easier modification and adaptation, as well as more natural interfaces for users and administrators. Software policies can also use higher-level information such as application priorities, workload characteristics, and performance goals.

The most commonly used power-management mechanisms support several operating states with different levels of power consumption. We broadly categorize these operating states as

• active, in which the processor or device continues to operate, but possibly with reduced performance and power consumption. Processors might have a range of active states with different frequency and power characteristics.

• idle, in which the processor or device is not operating. Idle states vary in power consumption and the latency for returning the component to an active state.

The Advanced Configuration and Power Interface specification (www.acpi.info) provides a common terminology and a standard set of interfaces for software power and energy management mechanisms. ACPI defines up to 16 active—or performance—states, P0 through P15, and three idle—or power—states, C1 through C3 (or D1 through D3 for devices). The specification defines a set of tables that describe the power/performance/latency characteristics of these states, which are both processor- and device-dependent.

A simple but powerful approach to defining power-management policies is to specify a set of system operating states—compositions of system component states and modes—and then write a policy as a mapping from current system state and application activity to a system operating state in accordance with desired power and performance characteristics. Defining policies based on existing system states such as idle, processing-application, and processing-interrupt is an efficient and transparent way to introduce power-management policies without overhauling the entire operating system.
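As an illustration, such a mapping can be as simple as a lookup table. The state names, activity levels, and chosen ACPI-style targets here are assumptions made for the sketch.

```python
# Policy-as-mapping sketch: (system state, activity) pairs map to an
# operating state, loosely following ACPI P-states (active) and C-states
# (idle). All entries are illustrative.
POLICY = {
    ("idle",                   "none"): "C2",   # deeper idle state
    ("processing-interrupt",   "low"):  "P2",   # slow active state
    ("processing-application", "low"):  "P2",
    ("processing-application", "high"): "P0",   # full performance
}

def target_state(system_state, activity):
    return POLICY.get((system_state, activity), "P0")  # default: full speed

print(target_state("processing-application", "low"))   # -> P2
```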

IBM and MontaVista Software took this approach in the design of Dynamic Power Management, a software architecture that supports dynamic voltage and frequency scaling for system-on-chip environments.15 Developers can use DPM to manipulate processor cores and related bus frequencies according to high-level policies they specify with optional policy managers.

Although the initial work focuses on dynamic voltage and frequency scaling in system-on-chip designs, developing a similar architecture for managing other techniques in high-end server systems is quite possible.

Early work on software-managed power consumption focused on embedded and laptop systems, trading performance for energy conservation and thus extending battery life. Recently, researchers have argued that operating systems should implement power-conservation policies and manage power like other system resources such as CPU time, memory allocation, and disk access.16

Subsequently, researchers developed prototype systems that treat energy as a first-class resource that the operating system manages.17 These prototypes use an energy abstraction to unify management of the power states and system components including the CPU, disk, and network interface.

Extending such operating systems to manage significantly more complex high-end machines and developing policies suitable for server system requirements are nontrivial tasks. Key questions include the frequency, overhead, and complexity of power measurement and accounting and the latency and overhead associated with managing power in large-scale systems.

Although it increases overall complexity, a hypervisor—a software layer that lets a single server run multiple operating systems concurrently—may facilitate solving the energy-management problems of high-end systems. The hypervisor, by necessity, encapsulates the policies for sharing system resources among execution images. Incorporating energy-efficiency rules into these policies gives the hypervisor control of system-wide energy management. It also allows developers to introduce energy-management mechanisms and policies that are specific to the server without requiring changes in the operating systems running on it.

Because many servers are deployed in clusters, and because it's relatively easy to turn entire systems on and off, several researchers have considered energy management at the cluster level. For example, the Muse prototype uses an economic model that measures the performance value of adding a server versus its energy cost to determine the number of active servers the system needs.18

Power-aware request distribution (PARD) attempts to minimize the number of servers needed by concentrating load on a few servers and turning the rest off.19 As server load changes, PARD turns machines on and off as needed.
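Reduced to its sizing decision, a policy in the spirit of these systems might look like the following sketch; the node capacity, headroom, and scale-down test are invented values, not parameters from Muse or PARD.

```python
import math

# Cluster reconfiguration sketch: keep just enough nodes for the load,
# with hysteresis so brief dips don't cycle nodes on and off.
NODE_CAPACITY = 1000            # requests/second per node (assumed)

def nodes_needed(request_rate, active_now, headroom=0.2):
    target = max(1, math.ceil(request_rate * (1 + headroom) / NODE_CAPACITY))
    if target >= active_now:
        return target           # scale up immediately on rising load
    # Scale down only if the load also fits in the smaller set at no
    # more than 70 percent utilization (the hysteresis band).
    if request_rate <= 0.7 * target * NODE_CAPACITY:
        return target
    return active_now

print(nodes_needed(5200, active_now=4))   # -> 7: bring nodes online
print(nodes_needed(2100, active_now=7))   # -> 3: safe to power down
```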

Ricardo Bianchini and Ram Rajamony discuss energy management for clusters further in the "Energy Conservation in Clustered Servers" sidebar.

FUTURE DIRECTIONS

Three research areas will profoundly impact server system energy management:

• power-efficient architectures,
• system-wide management of power-management techniques and policies, and
• evaluation of power and performance tradeoffs.

Heterogeneous systems have significant potential as a power-efficient architecture. These systems consist of multiple processing elements, each designed for power- and performance-efficient processing of particular workload types. For example, a heterogeneous system for Internet applications combines network processors with a set of energy-efficient processors to provide both efficient network-protocol processing and application-level computation.


With the increase in transistor densities, heterogeneity also extends into the microarchitectural arena, where a single chip combines several cores, each suited for efficient processing of a specific workload. Early work in this area combines multiple generations of the Alpha processor on a single chip, using only the core requiring the lowest power while still offering sufficient performance for the current workload.20 As the workload changes, system software shifts the executing program from core to core, trying to match the application's required performance while minimizing power consumption.

Server systems can employ several energy-management techniques concurrently. A coordinated high-level management approach is critical to handling the complexity of multiple techniques and applications and achieving the desired power and performance. Specifying high-level power-management policies and mapping them to available mechanisms is a key issue that researchers must address.

Another question concerns where to implement the various policies and mechanisms: in dedicated power-management applications, other applications, middleware, operating systems, hypervisors, or individual hardware components. Clearly, none of these alternatives can achieve comprehensive energy management in isolation. At the very least, arriving at the right decision requires efficiently communicating information between the system layers.

The real challenge is to make such directed autonomy work correctly and at reasonable overhead. Heng Zeng and colleagues suggest using models based on economic principles.17 Another option is to apply formal feedback-control and prediction techniques to energy management. For example, Sivakumar Velusamy and colleagues used control theory to formalize cache power management.21 The Clockwork project22 applies predictive methods to systems-level performance management, but less work has been done on using these methods.
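As a generic illustration (not the controller in the cited work), a proportional-integral loop can nudge a power knob, such as normalized CPU frequency, until a measured signal settles at its target; the gains and trace below are arbitrary.

```python
# Feedback-control sketch: a PI controller trims a clamped power knob so
# a measured quality-of-service signal tracks a target. Values are made up.
class PIController:
    def __init__(self, kp=0.5, ki=0.1, target=0.95):
        self.kp, self.ki, self.target = kp, ki, target
        self.integral = 0.0

    def step(self, measured, knob, lo=0.25, hi=1.0):
        error = self.target - measured        # negative: QoS above target
        self.integral += error
        knob += self.kp * error + self.ki * self.integral
        return min(hi, max(lo, knob))         # clamp to the knob's range

ctl = PIController()
freq = 1.0                                    # normalized CPU frequency
for qos in [0.99, 0.97, 0.96, 0.95]:
    freq = ctl.step(qos, freq)                # excess QoS -> lower frequency
    print(round(freq, 3))                     # 0.976 0.96 0.948 0.941
```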

Designers and implementers need techniques to correctly and conveniently evaluate energy management solutions. This requires developing benchmarks that focus not just on peak system performance but also on delivered performance with associated power and energy costs. Closely tied to this is the use of metrics that incorporate power considerations along with performance. Another key component is the ability to closely monitor system activity and correlate it to energy consumption.

Performance monitoring has come a long way in the past few years, with high-end processors supporting an array of programmable event-monitoring counters. But system-level power management requires information about bus-level transactions and activity in the memory hierarchy. Much of this information is currently missing. Equally important is the ability to relate the values of these counters to both energy consumption and the actual application-level performance. The ultimate goal, after all, is to achieve a good balance between application performance and system energy consumption. ■

Acknowledgment

This work was supported, in part, by the US Defense Advanced Research Projects Agency (DARPA) under contract F33615-01-C-1892.

References
1. T. Mudge, “Power: A First-Class Architectural Design Constraint,” Computer, Apr. 2001, pp. 52-57.
2. M. Weiser et al., “Scheduling for Reduced CPU Energy,” Proc. 1st Symp. Operating Systems Design and Implementation, Usenix Assoc., 1994, pp. 13-23.
3. K. Flautner and T. Mudge, “Vertigo: Automatic Performance-Setting for Linux,” Proc. 5th Symp. Operating Systems Design and Implementation (OSDI), Usenix Assoc., 2002, pp. 105-116.
4. P. Bohrer et al., “The Case for Power Management in Web Servers,” Power-Aware Computing, R. Graybill and R. Melhem, eds., Series in Computer Science, Kluwer/Plenum, 2002.
5. J. Seng, D. Tullsen, and G. Cai, “Power-Sensitive Multithreaded Architecture,” Proc. 2000 Int'l Conf. Computer Design, IEEE CS Press, 2000, pp. 199-208.
6. L.A. Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” Proc. 27th ACM Int'l Symp. Computer Architecture, ACM Press, 2000, pp. 282-293.
7. W. Felter et al., “On the Performance and Use of Dense Servers,” IBM J. Research and Development, vol. 47, no. 5/6, 2003, pp. 671-688.
8. V. Delaluz et al., “Scheduler-Based DRAM Energy Management,” Proc. 39th Design Automation Conf., ACM Press, 2002, pp. 697-702.
9. H. Huang, P. Pillai, and K.G. Shin, “Design and Implementation of Power-Aware Virtual Memory,” Proc. Usenix 2003 Ann. Technical Conf., Usenix Assoc., 2003, pp. 57-70.
10. A.R. Lebeck et al., “Power-Aware Page Allocation,” Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), ACM Press, 2000, pp. 105-116.
11. V. Delaluz, M. Kandemir, and I. Kolcu, “Automatic Data Migration for Reducing Energy Consumption in Multi-Bank Memory Systems,” Proc. 39th Design Automation Conf., ACM Press, 2002, pp. 213-218.
12. R.B. Tremaine et al., “IBM Memory Expansion Technology (MXT),” IBM J. Research and Development, vol. 45, no. 2, 2001, pp. 271-286.
13. A. Moshovos et al., “Jetty: Filtering Snoops for Reduced Energy Consumption in SMP Servers,” Proc. 7th Int'l Symp. High-Performance Computer Architecture, IEEE CS Press, 2001, pp. 85-96.
14. C. Saldanha and M. Lipasti, “Power Efficient Cache Coherence,” High-Performance Memory Systems, H. Hadimiouglu et al., eds., Springer-Verlag, 2003.
15. B. Brock and K. Rajamani, “Dynamic Power Management for Embedded Systems,” Proc. IEEE Int'l SOC Conf., IEEE Press, 2003, pp. 416-419.
16. A. Vahdat, A. Lebeck, and C. Ellis, “Every Joule Is Precious: The Case for Revisiting Operating System Design for Energy Efficiency,” Proc. 9th ACM SIGOPS European Workshop, ACM Press, 2000, pp. 31-36.
17. H. Zeng et al., “Currentcy: A Unifying Abstraction for Expressing Energy Management Policies,” Proc. General Track: 2003 Usenix Ann. Technical Conf., Usenix Assoc., 2003, pp. 43-56.
18. J. Chase et al., “Managing Energy and Server Resources in Hosting Centers,” Proc. 18th Symp. Operating Systems Principles (SOSP), ACM Press, 2001, pp. 103-116.
19. K. Rajamani and C. Lefurgy, “On Evaluating Request-Distribution Schemes for Saving Energy in Server Clusters,” Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software, IEEE Press, 2003, pp. 111-122.
20. R. Kumar et al., “A Multi-Core Approach to Addressing the Energy-Complexity Problem in Microprocessors,” Proc. Workshop on Complexity-Effective Design (WCED 2003), 2003; www.ece.rochester.edu/~albonesi/wced03.
21. S. Velusamy et al., “Adaptive Cache Decay Using Formal Feedback Control,” Proc. Workshop on Memory Performance Issues (held in conjunction with the 29th Int'l Symp. Computer Architecture), ACM Press, 2002; www.cs.virginia.edu/~skadron/Papers/wmpi_decay.pdf.
22. L. Russell, S. Morgan, and E. Chron, “Clockwork: A New Movement in Autonomic Systems,” IBM Systems J., vol. 42, no. 1, 2003, pp. 77-84.

Charles Lefurgy is a research staff member at the IBM Austin Research Lab. His research interests include computer architecture and operating systems. Lefurgy received a PhD in computer science and engineering from the University of Michigan. He is a member of the ACM and the IEEE. Contact him at [email protected].

Karthick Rajamani is a research staff member in the Power-Aware Systems Department at the IBM Austin Research Lab. His research interests include the design of computer systems and applications with the focus on power and performance optimizations. Rajamani received a PhD in electrical and computer engineering from Rice University. Contact him at [email protected].

Freeman Rawson is a senior technical staff member at the IBM Austin Research Lab. His research interests include operating systems, middleware, and systems management. Rawson received a PhD in philosophy from Stanford University. He is a member of the IEEE Computer Society, the ACM, and AAAI. Contact him at [email protected].

Wes Felter is a researcher at the IBM Austin Research Lab. His interests include operating systems, networking, peer-to-peer, and the sociopolitical aspects of computing. Felter received a BS in computer sciences from the University of Texas at Austin. He is a member of the ACM. Contact him at [email protected].

Michael Kistler is a senior software engineer at the IBM Austin Research Lab and a PhD candidate in computer science at the University of Texas at Austin. His research interests include operating systems and fault-tolerant computing. Contact him at [email protected].

Tom W. Keller manages the Power-Aware Systems Department at the IBM Austin Research Lab. Keller received a PhD in computer sciences from the University of Texas at Austin. Contact him at [email protected].