
Source: sapyc.espe.edu.ec/evcarrera/papers/ISBN1-4020-7573-1.pdf

COMPILERS AND OPERATING SYSTEMS FOR LOW POWER


Edited by

LUCA BENINI, University of Bologna

MAHMUT KANDEMIR, The Pennsylvania State University

J. RAMANUJAM, Louisiana State University

Kluwer Academic Publishers, Boston/Dordrecht/London


Contents

List of Figures xi

List of Tables xv

Contributing Authors xvii

Preface xix

1 Low Power Operating System for Heterogeneous Wireless Communication System 1
Suet-Fei Li, Roy Sutton, Jan Rabaey
  1 Introduction 2
  2 Event-driven versus General-purpose OS 3
    2.1 PicoRadio II Protocol Design 3
    2.2 General-purpose Multi-tasking OS 4
    2.3 Event-driven OS 8
    2.4 Comparison Summary 9
  3 Low Power Reactive OS for Heterogeneous Architectures 12
    3.1 Event-driven Global Scheduler and Power Management 12
    3.2 TinyOS Limitations and Proposed Extensions 14
  4 Conclusion and Future Work 15
  References 16

2 A Modified Dual-Priority Scheduling Algorithm for Hard Real-Time Systems to Improve Energy Savings 17
M. Angels Moncusí, Alex Arenas, Jesus Labarta
  1 Introduction 17
  2 Dual-Priority Scheduling 19
  3 Power-Low Modified Dual-Priority Scheduling 21
  4 Experimental Results 28
  5 Summary 36
  References 36

3 Toward the Placement of Power Management Points in Real-Time Applications 37
Nevine AbouGhazaleh, Daniel Mossé, Bruce Childers, Rami Melhem
  1 Introduction 37
  2 Model 39
  3 Sources of Overhead 40
    3.1 Computing the New Speed 40
    3.2 Setting the New Speed 40
  4 Speed Adjustment Schemes 41
    4.1 Proportional Dynamic Power Management 41
    4.2 Dynamic Greedy Power Management 42
    4.3 Evaluation of Power Management Schemes 43
  5 Optimal Number of PMPs 44
    5.1 Evaluation of the Analytical Model 45
  6 Conclusion 48
  Appendix: Derivation of Formulas 48
  References 51

4 Energy Characterization of Embedded Real-Time Operating Systems 53
Andrea Acquaviva, Luca Benini, Bruno Riccó
  1 Introduction 53
  2 Related Work 55
  3 System Overview 56
    3.1 The Hardware Platform 56
    3.2 RTOS Overview 57
  4 Characterization Strategy 59
  5 RTOS Characterization Results 60
    5.1 Kernel Services 60
    5.2 I/O Drivers 62
      5.2.1 Burstiness Test 62
      5.2.2 Clock Speed Test 63
      5.2.3 Resource Contention Test 64
    5.3 Application Example: RTOS vs Stand-alone 65
    5.4 Cache Related Effects in Thread Switching 66
  6 Summary of Findings 66
  7 Conclusions 67
  References 72

5 Dynamic Cluster Reconfiguration for Power and Performance 75
Eduardo Pinheiro, Ricardo Bianchini, Enrique V. Carrera, Taliver Heath
  1 Motivation 77
  2 Cluster Configuration and Load Distribution 78
    2.1 Overview 78
    2.2 Implementations 81
  3 Methodology 83
  4 Experimental Results 84
  5 Related Work 89
  6 Conclusions 91
  References 91

6 Energy Management of Virtual Memory on Diskless Devices 95
Jerry Hom, Ulrich Kremer
  1 Introduction 96
  2 Related Work 97
  3 Problem Formulation 98
  4 EELRM Prototype Compiler 100
    4.1 Phase 1 - Analysis 100
    4.2 Phase 2 - Code Generation 101
    4.3 Performance Model 102
    4.4 Example 102
    4.5 Implementation Issues 103
  5 Experiments 105
    5.1 Benchmark Characteristics 106
    5.2 Simulation Results 107
  6 Future Work 110
  7 Conclusion 111
  References 111

7 Propagating Constants Past Software to Hardware Peripherals on Fixed-Application Embedded Systems 115
Greg Stitt, Frank Vahid
  1 Introduction 116
  2 Example 119
  3 Parameters in Cores 120
  4 Propagating Constants from Software to Hardware 123
  5 Experiments 125
    5.1 8255A Programmable Peripheral Interface 126
    5.2 8237A DMA Controller 127
    5.3 PC16550A UART 128
    5.4 Free-DCT-L Core 128
    5.5 Results 131
  6 Future Work 133
  7 Conclusions 134
  References 134

8 Constructive Timing Violation for Improving Energy Efficiency 137
Toshinori Sato, Itsujiro Arita
  1 Introduction 137
  2 Low Power via Fault-Tolerance 139
  3 Evaluation Methodology 143
  4 Simulation Results 143
  5 Related Work 147
  6 Conclusion and Future Work 151
  References 151


9 Power Modeling and Reduction of VLIW Processors 155
Weiping Liao, Lei He
  1 Introduction 155
  2 Cycle-Accurate VLIW Power Simulation 156
    2.1 IMPACT Architecture Framework 156
    2.2 Power Models 157
    2.3 PowerImpact 158
  3 Clock Ramping 159
    3.1 Clock Ramping with Hardware Prescan (CRHP) 160
    3.2 Clock Ramping with Compiler-based Prediction (CRCP) 162
      3.2.1 Basic CRCP Algorithm 162
      3.2.2 Reduction of Redundant Ramp-up Instructions 164
      3.2.3 Control Flow 165
      3.2.4 Load Instructions 165
  4 Experimental Results 165
  5 Conclusions and Discussion 169
  References 170

10 Low-Power Design of Turbo Decoder with Exploration of Energy-Throughput Trade-off 173
Arnout Vandecappelle, Bruno Bougard, K.C. Shashidhar, Francky Catthoor
  1 Introduction 173
  2 Data Transfer and Storage Exploration Methodology 176
  3 Global Data Flow and Loop Transformations 178
    3.1 Removal of Interleaver Memory 178
    3.2 Enabling Parallelism 179
  4 Storage Cycle Budget Distribution 180
    4.1 Memory Hierarchy Layer Assignment 181
    4.2 Data Restructuring 182
    4.3 Loop Transformations for Parallelization 183
      4.3.1 Loop Merging 183
      4.3.2 Loop Pipelining 184
      4.3.3 Partial Loop Unrolling 184
      4.3.4 Loop Transformation Results 185
    4.4 Storage Bandwidth Optimization 185
  5 Memory Organization 186
    5.1 Memory Organization Exploration 186
    5.2 Memory Organization Decision 188
  6 Conclusions 190
  References 190

11 Static Analysis of Parameterized Loop Nests for Energy Efficient Use of Data Caches 193
Paolo D'Alberto, Alexandru Nicolau, Alexander Veidenbaum, Rajesh Gupta
  1 Introduction 193
  2 Energy and Line Size 195
  3 Background 195
  4 The Parameterized Loop Analysis 197
    4.1 Reduction to Single Reference Interference 199
    4.2 Interference and Reuse Trade-off 200
  5 STAMINA Implementation Results 200
    5.1 Swim from SPEC 2000 201
    5.2 Self Interference 201
    5.3 Tiling and Matrix Multiply 202
  6 Summary and Future Work 203
  References 203

12 A Fresh Look at Low-Power Mobile Computing 209
Michael Franz
  1 Introduction 209
  2 Architecture 211
  3 Handover and the Quantization of Computational Resources 212
    3.1 Standardization of Execution Environment's Parameters 214
    3.2 A Commercial Vision: Impact on Billing, Customer Loyalty and Churn 215
  4 Segmentation of Functionality: The XU-MS Split 215
    4.1 Use of Field-Programmable Hardware in the Mobile Station 217
    4.2 Special End-To-End Application Requirements 217
  5 Status and Research Vision 218
  References 219

Index 221


List of Figures

1.1 Model of computation for PicoRadio protocol stack 5

1.2 Implementing PicoRadio II with VCC 6

1.3 Code generation with general-purpose eCOS 7

1.4 PicoRadio II chip floorplan. Xtensa is the embedded microprocessor 7

1.5 Implementing PicoRadio II Protocol stacks in TinyOS. Arrows show events/commands propagation in the system 9

1.6 Total cycle count comparison: General-purpose versus event-driven OS. Key at right identifies system components 10

1.7 Percentage breakdown comparison: General-purpose versus event-driven OS. Key at right identifies system components 11

1.8 Behavior diagram of the PicoRadio sensor node 13

1.9 Architectural diagram for PicoRadio sensor node 14

2.1 Pseudo code for Power Low Modified Dual-Priority (PLMDP) Scheduling 22

2.2 Maximum extension time in three different situations 24

2.3 Execution time in LPFPS when all tasks use 100% WCET 25

2.4 Execution time in PLMDP when all tasks use 100% WCET 25

2.5 Execution time in LPFPS when all tasks use 50% WCET 27

2.6 Execution time in PLMDP when all tasks use 50% WCET 28

2.7 Comparison of both algorithms in the task set proposed by Shin and Choi [4] 28

2.8 Comparison of both algorithms when the workload of the system is 80% 30

2.9 System workload variation when all tasks consume the 100% of WCET 31

2.10 System workload variation when all tasks consume the 50% of WCET 31

2.11 System workload and harmonicity of the tasks periods variation 32

2.12 Maximum task workload variation, non-harmonic periods 32

2.13 Tmin/Tmax variation 33

2.14 Comparison of both algorithms in the avionics task set [9] 34

2.15 Comparison of both algorithms in the INS task set [10] 35

2.16 Comparison of both algorithms in the CNC task set [11] 35

3.1 Actual execution times of a task set using the Static, Proportional and Dynamic Greedy schemes 42

3.2 Total energy consumption for different schemes versus the number of PMPs 43

3.3 Total energy consumption for the Proportional scheme versus the number of PMPs 46

3.4 Total energy consumption for the Dynamic Greedy scheme versus the number of PMPs 46

4.1 The hardware platform: HP SmartBadgeIV 57

4.2 The software layer: eCos structure 58

4.3 Thread switch experiment: Energy consumption for different clock frequencies at the maximum switching frequency 62

4.4 Energy consumption of the audio driver for different clock speeds at fixed data burstiness 64

5.1 Cluster evolution and resource demands for the WWW server 85

5.2 Power consumption for the WWW server under static and dynamic cluster configurations 86

5.3 Cluster evolution and resource demands for the WWW server 86

5.4 Cluster evolution and resource demands in the power-aware OS 88

5.5 Power consumption for the power-aware OS under static and dynamic cluster configurations 88

5.6 Cluster evolution and resource demands in the power-aware OS 89

6.1 Comparison of compiler vs. OS directed power management 99

6.2 Sample code 103

6.3 Partial view of tomcatv's page fault behavior during execution 108

6.4 One iteration of tomcatv's primary, outermost loop 109

7.1 Core-based embedded system design 116

7.2 A simple example of propagating constants to hardware (a) soft core, (b) synthesized core structure, (c) synthesized core structure after propagating constants contreg(0)=0 and contreg(1)=1 121

7.3 The Intel 8255A parallel peripheral interface 122

7.4 Method for propagating constants to peripheral cores 124

Page 12: COMPILERS AND OPERATING SYSTEMS FOR LOW POWERsapyc.espe.edu.ec/evcarrera/papers/ISBN1-4020-7573-1.pdfcompilers and operating systems for low power

List of Figures xiii

7.5 Block diagram of DCT core 129

8.1 ALU utilizing proposed technique 141

8.2 Clock signals 142

8.3 Energy consumption (Squash) 164.gzip 144

8.4 Energy consumption (Squash) 175.vpr 144

8.5 Energy consumption (Squash) 176.gcc 144

8.6 Energy consumption (Squash) 186.crafty 145

8.7 Energy consumption (Squash) 197.parser 145

8.8 Energy consumption (Squash) 252.eon 145

8.9 Energy consumption (Squash) 255.vortex 146

8.10 Energy consumption (Squash) 256.bzip2 146

8.11 Energy consumption (Reissue) 164.gzip 148

8.12 Energy consumption (Reissue) 175.vpr 148

8.13 Energy consumption (Reissue) 176.gcc 148

8.14 Energy consumption (Reissue) 186.crafty 149

8.15 Energy consumption (Reissue) 197.parser 149

8.16 Energy consumption (Reissue) 252.eon 149

8.17 Energy consumption (Reissue) 255.vortex 150

8.18 Energy consumption (Reissue) 256.bzip2 150

9.1 Flow diagram for IMPACT 156

9.2 Overall structure of PowerImpact 159

9.3 The relationship of states 161

9.4 Utilization rate for FPUs 161

9.5 Distribution of instruction numbers in bundles, with maximum bundle width = 6 162

9.6 Insert ramp-up instructions 163

9.7 Insertion of ramp-up instructions beyond the current hyperblock 164

9.8 Performance loss (in percentage as the Z-axis variable) of CRHP and CRCP approaches for equake 166

9.9 Power reduction (in percentage as the Z-axis variable) of CRHP and CRCP approaches for equake 166

9.10 Performance loss (in percentage as the Z-axis variable) of CRHP and CRCP approaches for art 166

9.11 Power reduction (in percentage as the Z-axis variable) of CRHP and CRCP approaches for art 167

9.12 Performance loss (in percentage) for Tr = 10 and Ta = 16 167

9.13 Power reduction (in percentage) for Tr = 10 and Ta = 16 168

9.14 Performance loss (in percentage) before and after the amendment for load instruction, for Tr = 10, Ta = 16 and Tp = 9 169

10.1 Turbo coding-decoding scheme 174

10.2 Energy-performance trade-off 177

10.3 Transformed data flow of turbo decoding scheme 178

10.4 Parallelization of the MAP algorithm 179

10.5 Turbo decoding data flow and timing 180

10.6 Dependencies between memory accesses of two loops 183

10.7 Dependencies after merging the two loops of Figure 10.6 184

10.8 Dependencies after pipelining the merged loop of Figure 10.7 185

10.9 Pareto curves for 7 workers, for two and for seven dual-port memories per worker 187

11.1 Grid cells and band cells in a plane 198

11.2 Tiling of Matrix Multiply. 6 parameters: loop bounds and A, B and C offsets 205

11.3 SWIM: calc1() in C code 206

11.4 Matrix Multiply. Two parameters: loop bounds and A offset 206

11.5 Self interference and analysis results 207

12.1 System architecture 212


List of Tables

1.1 General comparison 9

1.2 Memory requirements comparison 10

2.1 Benchmark task set used by Shin and Choi [4] 24

2.2 Avionics benchmark task set [9] 33

2.3 INS benchmark task set [10] 33

2.4 CNC benchmark task set [11] 34

3.1 Theoretical versus Simulation choice of optimal number of PMPs for the Proportional scheme 47

3.2 Theoretical versus Simulation choice of optimal number of PMPs for the Dynamic Greedy scheme 47

4.1 Thread switch experiment: Energy variation due to different switching frequencies with a fixed clock frequency (103.2 MHz) 62

4.2 Audio driver average power consumption due to different levels of data burstiness at a fixed clock frequency 63

4.3 Average power consumption of the wireless LAN driver due to different levels of data burstiness at a fixed clock frequency 64

4.4 Variation of the energy consumed by the audio driver in presence of device contention for different switch frequencies 65

4.5 Comparison between the energy consumed by two versions of the speech enhancer: OS based and stand-alone 65

4.6 Testing parameters for the experiment related to Tables 4.7 thru 4.9 65

4.7 Energy consumption of thread management and scheduler functions at minimum and maximum clock frequencies 68

4.8 Energy consumption of thread communication and synchronization functions at minimum and maximum clock frequencies 69

4.9 Energy consumption of time management functions at minimum and maximum clock frequencies 71

4.10 Energy cost of thread switching in presence of cache-related effects 72

6.1 Page faults for different memory sizes in terms of pages, assuming that each array requires 4 pages of memory space 103

6.2 Dynamic page hit/miss prediction accuracy 105

6.3 Benchmark parameters 106

6.4 Relative energy consumption of benchmark programs with EELRM energy management. Energy values are percentages of OS approach. Active WaveLAN card contributes 40% to overall energy budget 107

6.5 Relative performance of benchmark programs under OS or EELRM energy management. Reported values are percentages of ∞ threshold (card always awake) 110

7.1 Comparison of cores before and after constant propagation 132

8.1 Processor configuration 143

8.2 Benchmark programs 146

9.1 Partitions in our power models 158

9.2 System configuration for experiments 163

10.1 Data structures, sizes and memory hierarchy layer assignment. N is the window size, M is the number of workers, 2NM is the size of one frame which is iteratively decoded 181

10.2 Data structures, sizes and memory hierarchy layer assignment after data restructuring. 2N is the size of one worker. Each of these data structures exists M times, i.e. once for each worker 182

10.3 Effect of parallelizing loop transformations on maximally achievable throughput and latency 186

10.4 Memory architecture with simulated access energy and number of accesses per frame 189

11.1 Self interference example 201

11.2 Interference table, for the procedure in Figure 11.4 202

11.3 Interference table for the procedure ijk matrix multiply 4 in Figure 11.2 202

12.1 Different classes of execution units and applicable usage scenarios 213


Contributing Authors

Nevine AbouGhazaleh, University of Pittsburgh, USA
Andrea Acquaviva, University of Bologna, Italy
Alex Arenas, Universitat Rovira i Virgili, Spain
Itsujiro Arita, Kyushu Institute of Technology, Japan
Luca Benini, University of Bologna, Italy
Ricardo Bianchini, Rutgers University, USA
Bruno Bougard, IMEC, Belgium
Francky Catthoor, IMEC, Belgium
Bruce Childers, University of Pittsburgh, USA
Paolo D'Alberto, University of California–Irvine, USA
Michael Franz, University of California–Irvine, USA
Rajesh Gupta, University of California–Irvine, USA
Taliver Heath, Rutgers University, USA
Lei He, University of California–Los Angeles, USA
Jerry Hom, Rutgers University, USA
Ulrich Kremer, Rutgers University, USA
Jesus Labarta, Universitat Politecnica de Catalunya, Spain
Weiping Liao, University of California–Los Angeles, USA
Suet-Fei Li, University of California–Berkeley, USA
Rami Melhem, University of Pittsburgh, USA
M. Angels Moncusí, Universitat Rovira i Virgili, Spain
Daniel Mossé, University of Pittsburgh, USA
Alexandru Nicolau, University of California–Irvine, USA
Eduardo Pinheiro, Rutgers University, USA
Jan Rabaey, University of California–Berkeley, USA
Bruno Riccó, University of Bologna, Italy
Toshinori Sato, Kyushu Institute of Technology, Japan
K.C. Shashidhar, IMEC, Belgium
Greg Stitt, University of California–Riverside, USA
Roy Sutton, University of California–Berkeley, USA
Enrique V. Carrera, Rutgers University, USA
Frank Vahid, University of California–Riverside, USA
Arnout Vandecappelle, IMEC, Belgium
Alexander Veidenbaum, University of California–Irvine, USA


Preface

In the last ten years, power dissipation has emerged as one of the most critical issues in the development of large-scale integrated circuits, and electronic systems in general. Technology scaling is not the only cause for this trend: in fact, we are moving toward a world of pervasive electronics, where our cars, houses, and even our environment and our bodies will be linked in a finely-knit network of communicating electronic devices capable of complex computational tasks, materializing a vision of "ambient intelligence," the ultimate goal of embedded computing. Today, power consumption is probably the main obstacle in the realization of this vision: current electronic systems still require too much power to perform critical ambient intelligence tasks (e.g., voice processing, vision, wireless communication). For this reason, power, or energy (i.e., power-performance ratio), minimization is now aggressively targeted in all the phases of electronic system design.

While early low-power (or energy-efficient) design focused on technology and hardware optimization, it is now clear that software power optimization is an equally critical goal. Most complex integrated systems are highly programmable. In fact, the new millennium has seen the rapid diffusion of embedded processor cores as the basic computational workhorse for large-scale integrated systems on silicon, and today we are witnessing the rebirth of multiprocessor architectures, fully integrated on a single silicon substrate. It is therefore obvious that the power consumption of integrated systems dominated by core processors and memories is heavily dependent on the applications they run and the middleware supporting them.

In general, we can view the software infrastructure as layered in applications and run-time support middleware (often called "operating system"). Applications control the user-level functionality of the system, but they interface to the SoC platform via hardware abstraction layers provided by the middleware. Software energy minimization can be tackled with some hope of success only if application-level software and middleware are both optimized for maximum energy efficiency. The Compilers and Operating Systems for Low Power (COLP) Workshop aims at creating a forum that brings together researchers operating in both application-level energy optimization and low-power operating systems. The main objective of this initiative is to create opportunities for cross-fertilization between closely related areas that can greatly benefit from a tighter interaction. Papers presented at COLP are work-in-progress and are selected based on their potential for stimulating thoughts and creative discussions.

This book is the result of a careful (and sometimes painful) process of selection and refinement of the most significant contributions to the 2001 edition of COLP. The editors have first selected the papers based both on reviewer evaluations and on feedback from the audience at the oral presentation. They have then solicited an extended version of the papers, in a format more suitable for archival publication. The extended versions have then been reviewed by the editors to ensure consistency. The results of this "distillation" process have been collected in this book, which we hope will bring the reader a wealth of fresh and valuable ideas for further research as well as technology transfer.

Organization

The book is divided into twelve chapters. The first six chapters focus on low energy operating systems, or more in general, energy-aware middleware services. The following five chapters are centered on compilation and code optimization. Finally, the last chapter takes a more general viewpoint on mobile computing.

Chapter 1, entitled "Low Power Operating System for Heterogeneous Wireless Communication Systems," is contributed by Suet-Fei Li, Roy Sutton, and Jan Rabaey, from UC Berkeley. The chapter describes an ultra-low overhead operating system for wireless microsensors and compares it with more traditional embedded operating systems.

Chapter 2, "Low Power Approach in a Modified Dual Priority Scheduling for Hard Real-Time Systems" (by M. Angels Moncusí, A. Arenas, and J. Labarta from Universitat Rovira i Virgili and Universitat Politecnica de Catalunya), deals with task scheduling, one of the most classical problems in real-time operating systems, and investigates a novel dual-priority algorithm with high energy efficiency.

The third chapter, contributed by Nevine AbouGhazaleh, D. Mossé, R. Melhem, and B. Childers (from University of Pittsburgh), entitled "A Restricted Model for the Optimal Placement of Power Management Points in Real Time Applications," deals with an important issue at the boundary between applications and operating systems, namely the optimal insertion of system calls that dynamically change the supply voltage (and operating frequency) during the execution of an application.
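The core idea behind such power management points (PMPs) is that, at each point, slack accumulated so far can be spread over the remaining work, lowering the clock and supply voltage. The sketch below illustrates a proportional speed-setting rule; the function name, clamping behavior, and units are illustrative assumptions, not the chapter's exact formulation:

```python
def pmp_speed(wcet_left: float, time_left: float,
              f_max: float, f_min: float) -> float:
    """Pick the lowest clock frequency that still finishes the remaining
    worst-case work (wcet_left, expressed in seconds when running at
    f_max) before the deadline, time_left seconds away. The result is
    clamped to the processor's [f_min, f_max] range."""
    if time_left <= 0:
        return f_max  # no slack left: run as fast as possible
    # Distribute the accumulated slack proportionally over remaining work.
    required = f_max * (wcet_left / time_left)
    return min(f_max, max(f_min, required))

# Example: half the worst-case work remains and a full unit of time is
# left before the deadline, so the processor can run at half speed.
print(pmp_speed(0.5, 1.0, 1000.0, 100.0))  # 500.0
```

A task that finishes early leaves more slack for later PMPs, which is exactly the tradeoff the chapter analyzes: more PMPs exploit slack at finer grain but add speed-change overhead, hence the question of the optimal number of PMPs.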

The fourth chapter, by A. Acquaviva, L. Benini and B. Riccó (Università di Bologna), is entitled "Energy Characterization of Embedded Real-Time Operating Systems." The chapter describes a methodology for characterizing the energy cost of most primitives and function calls in embedded operating systems.

Chapter 5, by E. Pinheiro, R. Bianchini, E. Carrera and T. Heath (Rutgers University), is entitled "Load Balancing and Unbalancing for Power and Performance in Cluster-Based Systems" and it deals with an important emerging topic, namely low-energy multiprocessors. The chapter gives a fresh look at load balancing issues in cluster-based systems when energy constraints are tight.

Chapter 6 closes the first group. It is entitled "Energy Management of Virtual Memory on Diskless Devices" (by J. Hom and U. Kremer, Rutgers University) and it deals with virtual memory, one of the basic hardware abstraction layers provided by standard operating systems.

The next chapter, entitled "Propagating Constants Past Software to Hardware Peripherals in Fixed-Application Embedded Systems," contributed by G. Stitt and F. Vahid, discusses how propagating application-level constants to hardware improves both power and form factor, leading to up to 2-3 times reductions in peripheral size.

In Chapter 8, entitled "Constructive Timing Violation for Improving Energy Efficiency," T. Sato and I. Arita present a technique that relies on a fault-tolerance mechanism and speculative execution to save power. Their technique, called constructive timing violation, guarantees that the timing constraints for critical paths are not violated.

In the next chapter, entitled "Power Modeling and Reduction of VLIW Processors," the authors W. Liao and L. He present an in-depth study of the power behavior of a VLIW architecture, and develop an infrastructure which can be used for architecture-based as well as compiler studies.

Chapter 10, entitled "Low Power Design of Turbo Decoder Module with Exploration of Power-Performance Tradeoffs," demonstrates how a systematic data transfer and storage exploration methodology helps characterize the energy and performance behavior of Turbo Coding. Vandecappelle, Bougard, Shashidhar, and Catthoor also discuss the cycle budget-energy tradeoff.

In the next chapter, "Static Analysis of Parameterized Loop Nests for Energy Efficient Use of Data Caches," P. D'Alberto, A. Nicolau, A. Veidenbaum, and R. Gupta demonstrate that compiler analysis of loops with regular access patterns can reveal useful information for optimizing power.

Finally, in Chapter 12, entitled "A Fresh Look at Low-Power Mobile Computing," M. Franz presents a technique that allows large portions of applications to be offloaded to a base station for execution.

We believe that, with the proliferation of power-constrained devices, energy optimizations will become even more important in the future. Consequently, it is hard to imagine that architectural and circuit-level optimizations alone will provide the required level of energy efficiency for demanding applications of next generation computing. The research papers presented here do not only demonstrate the state of the art, but they also prove that, to obtain the best energy/performance characteristics, compiler, system software, and architecture must work together.

Acknowledgments

This book grew out of the Workshop on Compilers and Operating Systems, 2001 (COLP 01). We acknowledge the active contribution of the program committee of COLP 01: Eduard Ayguade, R. Chandramouli, Bruce Childers, Marco Cornero, Rudi Eigenmann, Manish Gupta, Rajiv Gupta, Mary Jane Irwin, Uli Kremer, Rainer Leupers, Diana Marculescu, Enric Musoll, Anand Sivasubramaniam, Mary Lou Soffa, Vamsi K. Srikantam, Chau-Wen Tseng, Arnout Vandecappelle, and N. Vijaykrishnan. In addition, we thank the following reviewers for their thoughtful reviews of the initial submissions to the workshop: Bharadwaj Amrutur, Eui Young Chung, Anoop Iyer, Miguel Miranda, Phillip Stanley-Marbell, Emil Talpes, Chun Wong, and Peng Yang. The feedback from the audience at the COLP 01 workshop is greatly appreciated.

We sincerely thank Alex Greene and Melissa Sullivan, and the editorial team at Kluwer for their invaluable help, enthusiasm and encouragement throughout this project. We gratefully acknowledge the support of the U.S. National Science Foundation through grants CCR–9457768, CCR–0073800, and CCR–0093082 during this project.

LUCA BENINI, MAHMUT KANDEMIR, J. RAMANUJAM


Chapter 1

DYNAMIC CLUSTER RECONFIGURATION FOR POWER AND PERFORMANCE*

Eduardo Pinheiro, Ricardo Bianchini, Enrique V. Carrera, and Taliver Heath
Department of Computer Science
Rutgers University
{edpin, ricardob, vinicio, taliver}@cs.rutgers.edu

*This research has been supported by NSF under grant # CCR-9986046.

Abstract: In this paper we address power conservation for clusters of workstations or PCs. Our approach is to develop systems that dynamically turn cluster nodes on, to be able to handle the load imposed on the system efficiently, and off, to save power under lighter load. The key component of our systems is an algorithm that makes cluster reconfiguration decisions by considering the total load imposed on the system and the power and performance implications of changing the current configuration. The algorithm is implemented in two common cluster-based systems: a network server and an operating system for clustered cycle servers. Our experimental results are very favorable, showing that our systems conserve both power and energy in comparison to traditional systems.

Keywords: Load balancing, load concentration, power and energy conservation.

Introduction

Power and energy consumption have always been critical concerns for laptop and hand-held devices, as these devices generally run on batteries and are not connected to the electrical power grid. Over the years, a large amount of research has been devoted to low-power and low-energy


2design and onservation (e.g. [Halfhill, 2000; Weiser et al., 1994; Lebe ket al., 2000; Douglis and Krishnan, 1995; Flinn and Satyanarayanan,1999℄).In ontrast with this line of resear h, in this paper we fo us on powerand energy onservation for lusters of workstations or PCs, su h asthose that support most Internet ompanies and a large number of re-sear h and tea hing organizations. Our approa h to onserving powerand energy is to develop systems that an leverage the widespread repli- ation of resour es in lusters. In parti ular, we develop systems that an dynami ally turn luster nodes on { to be able to handle the loadimposed on the system eÆ iently { and o� { to save power under lighterload.This resear h is inspired by previous work in luster-wide load balan -ing (e.g. [Barak and La'adan, 1998; Ghormley et al., 1998; Litzkow andSolomon, 1992; Pinheiro and Bian hini, 1999; Cis o, 2000; Bestavroset al., 1998℄). When performing load balan ing, the goal is to evenlyspread the work over the available luster resour es in su h a way thatidle nodes an be used and performan e an be promoted. The inverseof the load balan ing operation on entrates work in fewer nodes, idlingother nodes that an be turned o�. This load on entration or unbal-an ing operation saves the power onsumed by the powered-down nodes,but an degrade the performan e of the remaining nodes and potentiallyin rease their power onsumption. Thus, load on entration involves aninteresting performan e vs. power tradeo�.Our systems exploit load on entration to onserve power. Their key omponent is an algorithm that makes load balan ing and on entra-tion de isions by onsidering both the total load imposed on the lusterand the power and performan e of di�erent luster on�gurations. Inmore detail, the algorithm uses a ontrol-theoreti approa h to deter-mine whether nodes should be added to or removed from the luster,and de ides how the existing load should be re-distributed in ase ofa on�guration hange. 
To be able to understand the implications of our algorithm, we implemented it in two popular types of cluster-based systems: a network server and an operating system (OS) for clustered cycle servers. The implementations were performed in two ways: (1) at the application level for the network server; and (2) at the OS level for the cycle server. In a previous technical report [Pinheiro et al., 2001a], we also considered implementations that rely on application/OS interaction.

Even though we target power conservation primarily, our experimental results show that our secondary goal of saving energy is achieved as well. Our results show that the modified network server can reduce the total


Dynamic Cluster Reconfiguration for Power and Performance

power consumption by as much as 71% and the energy consumption by 45% in comparison to the original server running on a static cluster configuration. The modified OS can reduce power consumption by as much as 88% for a synthetic workload, while attempting to avoid any performance degradation, again in comparison to the original system on a static cluster. The energy savings it accrues in this case is 32%. When a 20% performance degradation is acceptable, our system conserves 88% of the power and 40% of the energy consumed by the static system.

The remainder of this paper is organized as follows. The next section discusses our motivation. Section 2 describes our cluster configuration and load distribution algorithm and its different implementations. Section 3 describes our experimental set-up and the methodology used. Section 4 discusses our experimental results. Section 5 discusses the related work. Finally, section 6 concludes the paper.

1. Motivation

Our motivation in pursuing this research is that large clusters consume significant amounts of power and energy. Power consumption is an important concern for clusters as it directly influences their cooling requirements. In fact, a medium to large number of high-performance nodes racked closely together in the same room, as is usually the case with clusters, requires a significant investment in cooling, both in terms of sophisticated racks and heavy-duty air conditioning systems. Besides cooling under normal operation, power consumption also influences the required investments in backup cooling and backup power-generation equipment for clusters that can never be unavailable, such as those of companies that provide services on the Internet.
The recent trend towards ultra-dense clusters [RLX Technologies, 2001] will only worsen the cooling problem.

Taking a broader perspective, the power requirements of clusters became a major issue for several states, such as California and New York, at the height of the economic boom in the United States. Even if these states make a tremendous investment in new power plants in the next several years, power conservation should still be an important goal, in that most power-generation technologies (such as nuclear and coal-based generation) have a negative impact on the environment.

Energy consumption is also an important concern for clusters, in that both the computational and the air conditioning infrastructures consume energy. This energy consumption is reflected in the electricity bill, which can be significant for a large and/or dense cluster in a heavily air-conditioned room. Research and teaching organizations, in particular, may find it difficult to cover high energy costs.

The bottom line is that conserving the power and energy consumed by clusters eases deployment and installation, protects the environment, and can potentially save a lot of money. In fact, even when it is not possible to reduce the maximum power requirements of a cluster (i.e. it is not possible to cut down the one-time cost of cooling and backup power-generation systems), reducing the common-case power and energy consumption reduces the operational cost of these systems and the electricity cost.

2. Cluster Configuration and Load Distribution

2.1 Overview

Power vs. performance. We consider the tradeoff between power and two types of performance, namely throughput and execution time performance. Throughput is the key issue for systems such as modern network servers, in which the goal is to service as many requests as possible; the latency of each request at the server is usually a small fraction of the overall latency of wide-area client-server communication. Execution time is key for systems such as cycle servers, as users may object to significant delays in the execution of their jobs.

The cluster configuration and load distribution algorithm we propose decides whether to add (turn on) or remove (turn off) nodes, according to the expected performance and power implications of the decision. Decisions are made dynamically for each cluster configuration and currently offered load.

For simplicity, the algorithm assumes that the cluster is comprised of homogeneous machines. Furthermore, the algorithm assumes that the removal of a node does not cripple the file system. This is a valid assumption, since: (1) in certain environments it is possible to replicate files at all nodes; and (2) when this is not the case, the file servers can transparently be run on machines that do not strictly belong to the cluster or that are not subject to the algorithm.

Addition/removal decision.
To make node addition or removal decisions, the algorithm requires the ability to predict the performance and the power consumption of different cluster configurations. Exact power consumption predictions are not straightforward. The problem is that it is difficult to predict the power to be consumed by a node after it receives some arbitrary load. Conversely, it is difficult to predict the power to be consumed by a node after some of its load is moved elsewhere.


Nevertheless, exact power consumption predictions are not really necessary for the algorithm to achieve its main goal, namely to conserve power. The reason for this is that each of our cluster nodes consumes approximately 70 Watts when idle and approximately 94 Watts when all resources, i.e. CPU, caches, memory, network interface, and disk, are stretched to the maximum. These measurements mean that: (a) there is a relatively small difference in power consumption between an idle node and a fully utilized node; and (b) the penalty for keeping a node powered on is high, even if it is idle. Thus, we find in practice that turning a node off always saves power, even if its load has to be moved to one or more other nodes. Accordingly, our algorithm always decreases the number of nodes, provided that the expected performance of applications is acceptable.

Performance predictions can also be difficult to make. We predict performance by keeping track of the demand for (not the utilization of) resources on all cluster nodes. With this information, our algorithm can estimate the performance degradation that would be imposed on a node when new load is sent to it. There is a caveat here, though. A degradation prediction is made based on the past resource demand history of the load to be moved on its current node, so the prediction does not consider demand changes due to unexpected future behavior. In particular, the initial settle-down period during which the caches are warmed up with the new load is disregarded; we are more interested in steady-state performance.

A throughput prediction can easily be made based on the resource demand information. To see how this works, let us consider the throughput of a cluster-based network server. Suppose a scenario with 3 cluster nodes, each with demands for disk of 80%, 30%, and 20% of their nominal bandwidth.
By adding up all of these disk demands (and disregarding other resources to simplify the example), we find that the server could run with no throughput degradation on 2 nodes (130 < 200) and with a 30% throughput degradation on 1 node (130 - 100 = 30). Our algorithm should decide to remove at least one node; two nodes if a 30% degradation is acceptable.

Execution time predictions are much more complex, as they depend heavily on the specific characteristics of the applications and on the amount and timing of the demand imposed on the different resources. Therefore, we have to settle for optimistic execution time predictions based on the demand for resources. The predictions are optimistic because they assume that the use of resources is fully pipelined and overlapped. To see how this works, let us consider the execution time performance of applications running on a cluster of cycle servers. Suppose a scenario with 2 cluster nodes with demands for their CPUs of 80% and


40%. Our optimistic prediction strategy says that these applications could run with a 20% execution time degradation on 1 node (120 - 100 = 20). Our algorithm should decide to remove one of the nodes, if a 20% degradation is acceptable. (In reality, 20% is a lower bound on the degradation.)

It is clear then that a key component of our algorithm is the demand for resources at each point in time. However, the decisions made by the algorithm must not be based solely on instantaneous demands, to avoid reconfigurations triggered by transient load variations. The algorithm should also take into account the past history of demands and the speed of change in demands. Control theory provides a formal and well-understood approach to considering these properties. Thus, we use a Proportional-Integral-Differential (PID) feedback controller for each resource as the basis for our algorithm's decisions. The controller with the largest output (in absolute value) is used to determine the ideal cluster configuration at each point in time. The formula that describes the output, o(t), of each controller is:

    o(t) = kp e(t) + ki Σ(τ=0..t) e(τ) + kd Δe(t)

Each controller calculates the current excess demand e(t) (with respect to the current cluster configuration) for a resource, accumulates excess demand over time, and computes the rate of change in excess demands. These are the proportional, integral, and differential components of the controller, respectively. Each of these components is weighted with a tunable constant, which should reflect how much importance we want to give to each component. In our experiments, we used kp = 0.7, ki = 0.15, and kd = 0.15. Furthermore, we saturate the integral component at the resource capacity of a single node, i.e. 100% plus the acceptable performance degradation. (These constants and saturation value were chosen after some experimentation with our systems.) The output of the controller is the sum of the weighted components.
The controller is executed every 10 seconds.

To guarantee stability, for a configuration with N nodes our algorithm computes excess demands with respect to the arithmetic mean of the resource capacities of N and N-1 nodes. Moreover, the algorithm only triggers a reconfiguration if the absolute value of the controller's output is greater than half of the resource capacity of a single node plus 10% of this value. To decide how many nodes to add or remove, the algorithm divides the output of the controller minus half of the resource capacity of a node by the resource capacity of a node. For instance, for


a 5-node configuration and a 20% acceptable performance degradation, excess demands would be computed with respect to 540% (the average of 5 x 120 and 4 x 120). In this scenario, a controller output of -65% means that the current configuration should not be altered (65 < 66). An output of -300% should trigger the removal of two nodes (ceil((|-300| - 60) / 120) = 2).

We refer to the acceptable degradation and the minimum time between reconfigurations as the degrad and elapse parameters of our algorithm. The degrad parameter can be specified by the cluster administrator or by each application (i.e. user). Ideally, the algorithm could also try to guarantee a maximum performance degradation. This is clearly not possible for execution time performance, but is conceivable for throughput performance. However, even in the case of throughput, such a strong guarantee cannot be made, given that the load on the cluster may increase faster than the system can react to the increase. Rather, we use our acceptable performance degradation parameter to trigger actions that can reduce or eliminate any degradation.

Load (re-)distribution decision. After an addition or removal decision is made, the load may have to be re-distributed. If the decision is to add one or more nodes, the algorithm must determine what part of the current load should be sent to the added nodes. Obviously, the load to be migrated should come from nodes undergoing excessive demand for resources.

If the decision is to remove one or more nodes, the algorithm must determine which nodes should be removed and, if necessary, where to send the load currently assigned to the soon-to-be-removed nodes.
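Pulling the overview together, the optimistic performance predictions and the controller-driven reconfiguration rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the names and interfaces (min_nodes, DemandController, nodes_to_change) are ours, while the constants and worked numbers are the ones reported in the text.

```python
import math

KP, KI, KD = 0.7, 0.15, 0.15  # controller weights used in the experiments

def min_nodes(demands, accept_degrad=0):
    """Smallest number of nodes whose predicted degradation is acceptable.
    demands: per-node demands for the bottleneck resource, in % of one
    node's nominal capacity; accept_degrad: tolerated degradation in %."""
    total = sum(demands)
    for n in range(1, len(demands) + 1):
        # Optimistic model: fully pipelined/overlapped resource use, so
        # the predicted degradation is simply the demand above capacity.
        if max(0, total - n * 100) <= accept_degrad:
            return n
    return len(demands)

class DemandController:
    """One PID controller per resource:
    o(t) = kp*e(t) + ki*sum(e) + kd*delta(e),
    with the integral term saturated at a single node's capacity
    (100% plus the acceptable degradation)."""
    def __init__(self, node_capacity):
        self.cap = node_capacity
        self.acc = 0.0   # saturated integral of excess demand
        self.prev = 0.0  # previous excess demand, for the delta term

    def update(self, excess):
        self.acc = max(-self.cap, min(self.cap, self.acc + excess))
        out = KP * excess + KI * self.acc + KD * (excess - self.prev)
        self.prev = excess
        return out

def nodes_to_change(output, degrad):
    """Map a controller output (in % of one node's capacity) to a node
    delta: positive = add nodes, negative = remove nodes, 0 = keep."""
    cap = 100 + degrad            # per-node capacity incl. degradation
    threshold = (cap / 2) * 1.1   # half a node's capacity plus 10% of it
    if abs(output) <= threshold:
        return 0
    n = math.ceil((abs(output) - cap / 2) / cap)
    return n if output > 0 else -n

# Worked examples from the text:
print(min_nodes([80, 30, 20]))                    # 2 (disk: 130 < 200)
print(min_nodes([80, 30, 20], accept_degrad=30))  # 1 (130 - 100 = 30)
print(nodes_to_change(-65, 20))   # 0  (65 < 66: keep the configuration)
print(nodes_to_change(-300, 20))  # -2 (ceil((300 - 60) / 120) = 2)
```

Note that, as in the text, no exact power model is needed here: with an idle node already drawing roughly 70 of the 94 Watts of a fully loaded node, any configuration that passes the degradation check with fewer nodes also saves power.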
Obviously, the algorithm should give preference to lightly loaded victim nodes and to destination nodes that would not undergo excessive demand for resources after receiving the new load.

The details of how to select victim nodes and of how to migrate load around the cluster depend on the system for which the algorithm is implemented, so we leave the description of these decisions for the next subsection.

2.2 Implementations

Our algorithm has been implemented with minor variations in two different environments: (1) at the application level for a network server that runs alone on a cluster; and (2) at the system level for an OS for clustered cycle servers.

In both implementations, the algorithm is run by a master node (node 0), which is a regular node except that it receives periodic resource demand messages from all other nodes and it cannot be turned off. We chose centralized implementations of the algorithm due to their simplicity and the fact that load messages can be infrequent. For fault tolerance, a distributed implementation would be best, but that is beyond the scope of this paper.

Power-aware cluster-based network server. We modified PRESS [Carrera and Bianchini, 2001], a cluster-based, event-driven WWW server, to implement our algorithm completely at the application level. The server is based on the observation that serving a request from any memory cache, even a remote cache, is substantially more efficient than serving it from a disk, even a local disk. Essentially, the server distributes HTTP requests across nodes based on cache locality and load balancing considerations, so that files are unlikely to be read from disk if there is a cached copy somewhere in the cluster. Since the cacheable files are static, each node stores a copy of all files on its local disk.

We implemented the cluster configuration and load distribution algorithm in the server by making all nodes periodically inform the master node about their CPU, disk, and network interface demands. The CPU demand is computed by reading information from the /proc directory, whereas network and disk demands are computed based on internal server information. To smooth out short bursts of activity, each of these demands is exponentially amortized over time using the following formula: α · old_demand + (1 - α) · current_demand. For our experiments, α = 0.8 and the interval between demand computations is 10 seconds. In the case of the server, we are interested in throughput performance.

With information from all nodes, the master runs the cluster configuration and load distribution algorithm described in the previous section. If a removal decision is made, the master determines the maximum demand for any resource at each node and picks the node(s) with the lowest of these maximum demands as the victim(s).
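The demand smoothing and victim selection just described amount to a few lines of code. A minimal sketch, with illustrative names and a made-up set of per-resource demands:

```python
ALPHA = 0.8  # smoothing constant used in the experiments

def smooth(old_demand, current_demand, alpha=ALPHA):
    """Exponentially amortized demand: alpha*old + (1 - alpha)*current."""
    return alpha * old_demand + (1 - alpha) * current_demand

def pick_victim(node_demands):
    """node_demands maps a node to its per-resource demands (CPU, disk,
    network interface) in %.  The victim is the node whose most-demanded
    resource is the smallest across the cluster."""
    return min(node_demands, key=lambda node: max(node_demands[node]))

# A burst to 100% only nudges a 50% estimate up to 60%.
print(round(smooth(50.0, 100.0), 6))  # 60.0
# Max demands are 80, 60, and 20, so "n3" is the cheapest node to drop.
print(pick_victim({"n1": [80, 30], "n2": [20, 60], "n3": [10, 20]}))  # n3
```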
For the WWW server, it is not necessary to migrate load from a node to be excluded from the cluster. The load can be naturally redistributed among the remaining nodes by the server's own HTTP request distribution algorithm and/or a load balancing front-end. Similarly, the addition of a new node to the cluster does not require migrating any load from other nodes to it.

Note that at the application level it is impossible to determine the demand for network interface (due to buffering in the kernel) and CPU (due to the fact that the server is single-process) resources, so our server cannot deal with a throughput degradation greater than 0%. In fact, in our experiments we assume that the resource capacity of a single node is either 70% or 85% of its actual capacity, i.e. we study degrad parameters of -30% and -15%. These values provide some slack


Dynami Cluster Re on�guration for Power and Performan e 9to ompensate for the time it takes for a node to be booted, approxi-mately 100 se onds. We set the default value of the elapse parameterto 120 se onds. Given that the interval between demand omputationsis 10 se onds, this setting for elapse allows the server two observationsof the state of a re on�gured luster before another re on�guration ispermitted.Power-aware OS for lusters. We modi�ed Nomad [Pinheiro andBian hini, 1999℄, a Linux-based single-system-image OS for lusters ofuni and/or multipro essor y le servers. For the purposes of this paper,the most important hara teristi s of the OS are that (a) it has a shared�le system; (b) it starts ea h appli ation on the most lightly loaded nodeof the luster at the moment; and ( ) it performs dynami he kpointingand migration of whole appli ations (with all its pro esses and state,in luding open �le des riptors, stati libraries, data, sta k, registers andthe like) between nodes to balan e load. Resour e demand is omputedfor ea h node in the OS, by he king the resour e queues every se ond.Whenever the average CPU demand, the memory onsumption, or theI/O demand observed lo ally at a node remains higher than a thresholdfor 5 se onds, the OS onsiders the node to be undergoing ex essivedemand and attempts to migrate some of its load out to a more lightly-loaded node with respe t to the heavily demanded resour e.To avoid ex essive migration a tivity, the migration of an appli ation an only happen if a few onditions are veri�ed. First, an appli ation an only be migrated if it has already exe uted at least as long as theestimated time to migrate a pro ess of its size. Se ond, a node that hasjust migrated an appli ation elsewhere will not migrate another one untila period of stabilization, urrently set to 30 se onds, has elapsed. 
Third, no incoming migration will be accepted by a node that has been either the source or the destination of a migration during the stabilization period. Finally, the OS was designed for clustered cycle servers, i.e. time-shared execution of sequential applications on uniprocessor nodes and of parallel applications on multiprocessor nodes, so applications that do not conform to these restrictions cannot be migrated by the system.

Again, we implemented the cluster configuration and load distribution algorithm in the OS by making all nodes periodically inform the master node about their CPU, memory, and I/O demands. The CPU demand and the memory consumption are computed by reading information from /proc, whereas I/O demands are determined by instrumenting read and write system calls and by getting swap information from /proc. To smooth out short bursts of activity, the demands are amortized using the same formula used by the WWW server. For our experiments, α = 0.8 and the interval between demand computations is 10 seconds. In the case of the


OS implementation of our algorithm, we are interested in execution time performance.

With information from all nodes, the master can run our algorithm. If a removal decision is made, the master selects the nodes with the lowest demands for each resource as candidate victims. Unlike the WWW server, in the OS case the load of the victim must be migrated to other nodes, so the master selects the two nodes with the lightest load with respect to each resource (CPU, I/O, and memory) and picks the source/destination pair that would lead to the lowest overall demand for resources. To simplify our prototype implementation, the destination node receives all applications that are running on the victim node. Any load imbalances are later corrected by the OS according to its load balancing policy.

In the modified OS, a node addition is not effected if only one application is responsible for the excessive demand. After a new node is turned on, the OS will start migrating applications to it, so that the load will be balanced again. Given that adding nodes takes a significant amount of time (about 100 seconds), it might take a while after a long-lasting surge of activity before the demand for resources becomes acceptable again. We experiment with two values for degrad: 0% and 20%. The elapse parameter is set to 150 seconds. This setting allows the system time to reconfigure, migrate applications to balance the load, and re-evaluate the resource demands before another reconfiguration is allowed.

3. Methodology

To study the performance of our algorithm and systems, we performed experiments with a cluster of 8 PCs connected by a Fast Ethernet switch and a Giganet switch. Each of the nodes contains an 800-MHz Pentium III processor, 512 MBytes of memory, two 7200 rpm disks (only one disk is used in our experiments), and two network interfaces.
Shutting a node down takes approximately 45 seconds and bringing it back up takes approximately 100 seconds.

All machines are connected to a power strip that allows for remote control of the outlets. Machines can be turned on and off by sending commands to the IP address of the power strip. The total amount of power consumed by the cluster nodes is then monitored by a multimeter connected to the power strip. The multimeter collects instantaneous power measurements 3-4 times per second and sends these measurements to another computer, which stores them in a log for later use. We obtain the power consumed by different cluster configurations by aligning the log with our systems' statistics.
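The energy figures reported later come from the area under such a power log. A minimal sketch of that computation, using trapezoidal integration (the helper name and the sample layout are ours, for illustration only):

```python
def energy_joules(samples):
    """Trapezoidal integration of (time_s, power_W) samples into joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (p0 + p1) / 2.0 * (t1 - t0)
    return total

# 10 seconds of one idle node at a steady 70 W is 700 J.
log = [(0.0, 70.0), (5.0, 70.0), (10.0, 70.0)]
print(energy_joules(log))  # 700.0
```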


Network server experiments. Besides the main cluster, we use another 14 Pentium-based machines to generate load for the modified WWW server. For simplicity, we did not experiment with a front-end device that would hide the powering down of cluster nodes from clients. Instead, clients poll all servers every 10 seconds and can thus detect cluster reconfigurations and adapt their behavior accordingly. The clients send requests to the available nodes of the server in randomized fashion, according to a trace of the accesses to the World Cup '98 site from 12pm on 07/12 to 12pm on 07/14. The trace includes the day of the championship match. To shorten the duration of the experiments, we accelerated the trace 20 times.

Distributed OS experiments. The synthetic workload used for our modified OS experiments draws applications from a number of sources: all integer applications from the SPEC2000 benchmark, the Berkeley MPEG movie encoder, and two I/O benchmarks, IOcall and IOzone. IOcall is a benchmark that measures OS performance on I/O calls, especially file read system calls. IOzone is a file system benchmarking tool [IOzone, 2000]; it generates and measures the performance of a variety of file operations. Applications are arbitrarily assigned to nodes and are run in arbitrary groups. Because the cluster size varies dynamically according to the resource demand imposed on it, we start with only one machine powered on (the master), which is responsible for launching all applications in the workload. The offered demand conforms to a bell-shaped curve. To shorten the length of the experiments, we generate significant changes in offered demand in very little time.

4. Experimental Results

Power-aware cluster-based network server. We describe two experiments with our server.
In the first experiment, the parameters for the cluster reconfiguration algorithm are set to guarantee quick reaction to fluctuations in load, so that we can tackle significant increases in load without performance degradation. More specifically, the algorithm tries to keep 30% spare resources for any cluster configuration. Figure 1.1 presents the evolution of the cluster configuration and the demands for each resource as a function of time in seconds for this experiment. The demand for each resource is plotted as a percentage of the nominal throughput of the same resource in one node.

The figure shows that for this particular workload the network interface is the bottleneck resource throughout the whole execution of the experiment (140 minutes). We started the experiment with a two-node configuration. The traffic directed to the server initially increases


Figure 1.1. Cluster evolution and resource demands for the WWW server. elapse = 120 seconds; degrad = -30%.

Figure 1.2. Power consumption for the WWW server under static and dynamic cluster configurations. elapse = 120 seconds; degrad = -30%.

slowly, triggering the addition of a node, before increasing substantially and triggering the addition of several new nodes in quick sequence. The traffic then subsides until another period of high traffic occurs, which is followed by a substantial decline in traffic. Note that throughout the experiment the system reacts quickly to increases in traffic because of the spare capacity it consistently retains. As a result of this spare capacity, the performance of the server is not affected by the dynamic reconfiguration of the cluster.


Figure 1.3. Cluster evolution and resource demands for the WWW server. elapse = 120 seconds; degrad = -15%.

Figure 1.2 presents the power consumption of the whole cluster for two versions of the same experiment as a function of time. The lower curve (labeled "Dynamic Configuration") represents the version in which we run the power-aware server, i.e. the cluster configuration is dynamically adapted to respond to variations in resource demand. The higher curve (labeled "Static Configuration") represents a situation where we run the original server, i.e. the cluster configuration is fixed at 7 nodes. As can be seen in the figure, our modified WWW server can reduce power consumption significantly for most of the execution of our experiment. Power savings actually reach 71% when the resource demands require only two nodes. Our energy savings are also significant. Calculating the area below the two curves, we find that the power-aware server saves 38% in energy. Thus, the load on the cooling infrastructure over time is also reduced by 38%.

Even though these are significant gains, we can do better. The reason is that keeping spare capacity promotes performance at the cost of higher power and energy consumption. If we can estimate how fast the offered traffic can increase, we can reduce the spare capacity to the minimum required to avoid excessively long request latencies. For our trace, this minimum is 15%. Thus, figure 1.3 presents the evolution of the cluster configuration when the system attempts to retain this much spare capacity. Comparing figures 1.1 and 1.3, we can see that during most of the experiment the system now requires fewer active nodes to handle the offered load. In this case, the power and energy gains that


can be achieved in comparison to a static system with 7 nodes are 71% and 45%, respectively.

In general, it might not be possible to determine the maximum rate of workload change a priori. In these cases, mismatches between the rate of workload change and cluster reconfiguration delays can be alleviated by adjusting the elapse and/or degrad parameters dynamically. We believe, however, that in practice values of a few minutes for elapse and a few percent for degrad should work just fine, since real network server workloads usually change more slowly than in our experiments.

Power-aware OS for clusters. Figure 1.4 presents the evolution of the cluster configuration and the demands for each resource with elapse = 150 seconds and degrad = 0%, as a function of time. The experiment lasted about 46 minutes. The CPU is always the bottleneck resource during the experiment. The experiment starts with a single-node configuration. This node is responsible for starting all the applications in the workload. As new applications are started, the CPU demand increases and eventually triggers the addition of a new node. When the new node is added by the master, the OS attempts to balance the load by migrating some applications to the new node. As the number of applications started increases, they trigger the addition of other nodes, one at a time. The OS is able to track the demand increases fairly well by increasing the size of the cluster. About half way through the experiment, the demand for CPU becomes much higher than can be managed by an 8-node cluster. Right after this peak in demand, however, some applications start to finish and the demand for resources drops quickly. The master responds to this change in load by excluding the now idle nodes, one at a time. Again, the system does a good job of tracking the decrease in resource demand.

Figure 1.5 presents the power consumption of the whole cluster for two versions of the same experiment as a function of time.
As can be seen in the figure, our power-aware OS can reduce power consumption significantly for most of the execution time of the experiment. Power savings actually reach 88% when the resource demands require only a single node. Energy savings are also significant. The area below the two curves indicates that the power-aware OS saves 35% in energy for this workload.

It is interesting to note that the workload used in this experiment finishes a little earlier on the static configuration (around 45 minutes) than on the dynamic one (around 46 minutes). If we compare the energy consumed by the static configuration during the first 45 minutes of the experiment against that of the dynamic configuration for the whole experiment, we find that our energy savings are only slightly smaller, 32%.


Dynamic Cluster Reconfiguration for Power and Performance

[Plot omitted: load (% of one node) and number of nodes vs. time (s); curves for CPU load, Mem load, I/O load, and number of nodes.]

Figure 1.4. Cluster evolution and resource demands in the power-aware OS. elapse = 150 seconds; degrad = 0%.

[Plot omitted: power (W) vs. time (s); curves for the static and dynamic configurations.]

Figure 1.5. Power consumption for the power-aware OS under static and dynamic cluster configurations. elapse = 150 seconds; degrad = 0%.

(This comparison is not really fair, however, since real, i.e. static, cycle servers are never turned off.) In any case, it is clear that the load on the cooling infrastructure is reduced by at least 32% under the dynamic system.

To investigate the tradeoff between performance and power, we also performed experiments in which our intended performance degradation is 20%. We kept elapse at 150 seconds. Figure 1.6 illustrates the evolution of the cluster configuration in this case. As one would expect,


allowing for some performance degradation has the effect of slowing down the addition of new nodes and speeding up the removal of unnecessary nodes. As a result, the system decides to jump directly from 6 to 8 nodes when ramping up and to jump directly from 7 to 5 nodes when down-sizing. Overall, our system conserves 88% of the power and 42% of the energy consumed by its static counterpart in this experiment. If we consider that the workload finishes 2 minutes later on the dynamic than on the static configuration, the energy gains are 40%.

[Plot omitted: load (% of one node) and number of nodes vs. time (s); curves for CPU load, Mem load, I/O load, and number of nodes.]

Figure 1.6. Cluster evolution and resource demands in the power-aware OS. elapse = 150 seconds; degrad = 20%.

5. Related Work

Most of the previous work on conservation has focused on laptop computers and embedded and hand-held devices. Research on these devices has included optimizations for the processor (e.g. [Weiser et al., 1994; Halfhill, 2000; Hsu et al., 2000]), for the memory (e.g. [Lebeck et al., 2000; Vijaykrishnan et al., 2000; Delaluz et al., 2001]), for the disk (e.g. [Li et al., 1994; Douglis and Krishnan, 1995; Helmbold et al., 1996]), and for offloading computation from these devices to non-battery-operated computers (e.g. [Rudenko et al., 1998; Kremer et al., 2000]).

Some of this research can be used to optimize each node of a cluster independently, so we can also benefit from it. However, our research is orthogonal to these contributions in the sense that we focus on cluster-wide power and energy conservation, i.e. conservation that considers all of the cluster resources and the load offered to the cluster as a whole.


We originally proposed the ideas and systems described here in [Pinheiro et al., 2001a]. A shorter and revised version of that paper appears in [Pinheiro et al., 2001b]. This paper extends our original work in several ways: (a) our original cluster configuration and load distribution algorithm did not consider the past behavior and speed of change of the offered workload when making its decisions; (b) our original cluster reconfiguration decisions were limited to adding or removing a single node at a time; and (c) our original evaluation of the power-aware server assumed a synthetic workload. Extensions (a) and (b) increased our ability to track and quickly adjust to variations in offered load, whereas extension (c) allows us to demonstrate the usefulness of our approach in realistic scenarios.

Two recent papers [Chase et al., 2001; Elnozahy et al., 2002] also deal with power and energy research for clusters. Chase et al. [Chase et al., 2001] tackled the general problem of resource allocation in hosting centers using market-based policies. In terms of power and energy conservation, they evaluated a resource allocation policy for a clustered WWW server that is similar to the cluster configuration algorithm we study here. Elnozahy et al. [Elnozahy et al., 2002] evaluated different combinations of cluster reconfiguration and dynamic voltage scaling for clusters. They showed that the benefits of our technique can be increased by coupling it with coordinated (cluster-wide) voltage scaling.

As mentioned above, load concentration is inspired by previous work on cluster-wide load balancing (e.g. [Barak and La'adan, 1998; Ghormley et al., 1998; Litzkow and Solomon, 1992; Douglis and Ousterhout, 1991; Pinheiro and Bianchini, 1999; Cisco, 2000; Bestavros et al., 1998]). Some systems do use some form of load concentration, but only as a remedial technique, as in systems that harvest idle workstations (e.g. [Barak and La'adan, 1998; Litzkow and Solomon, 1992]), or as a management technique for manually excluding a cluster node. We use load concentration as a first-class technique for conserving power and energy in clusters.

The technique that is closest in spirit to load concentration for power and energy is offloading computation from a battery-operated device to a remote non-battery-operated computer (e.g. [Rudenko et al., 1998; Kremer et al., 2000]). However, load concentration as described here involves different challenges and tradeoffs, mainly because the load on the cluster and the effect of applying the technique must be determined before any action can be taken.

A few other projects deal with cluster reconfiguration (e.g. [Fox et al., 1997; Appleby et al., 2001; Van Renesse et al., 1998; Goldszmidt and Hunt, 1999]). Even though these projects do not consider power and energy issues, they lend themselves nicely to the powering down of unused systems.

In terms of our algorithm, the most closely related work is that of Skadron et al. [Skadron et al., 2002]. They proposed the use of control-theoretic techniques for the dynamic thermal management of microprocessors. We apply similar techniques to cluster reconfiguration for power and performance.

6. Conclusions

In this paper we addressed power and energy conservation for clusters. We proposed a control-theoretic cluster configuration and load distribution algorithm and applied it under two different scenarios. Our experiments showed that it is indeed possible to conserve significant power and energy in the context of clusters. Based on our experimental results, we conclude that exploiting periods of light load can provide tremendous gains for organizations and companies that rely on large clusters of servers.

Acknowledgements

We would like to thank Carla Ellis, Brett Fleisch, and Liviu Iftode for comments on the topic of this research. We also thank Uli Kremer, Mike Hsiao, and the rest of the people in the Programming Languages reading group, who helped us improve the quality of this paper. Finally, we would like to thank Uli Kremer for letting us use the power measurement infrastructure of the Energy Efficiency and Low-Power (EEL) lab at Rutgers.

References

Appleby, K., Fakhouri, S., Fong, L., Goldszmidt, G., Kalantar, M., Krishnakumar, S., Pazel, D., Pershing, J., and Rochwerger, B. (2001). Oceano - SLA Based Management of a Computing Utility. In Proceedings of the 7th IFIP/IEEE International Symposium on Integrated Network Management.

Barak, A. and La'adan, O. (1998). The MOSIX Multicomputer Operating System for High Performance Cluster Computing. Journal of Future Generation Computer Systems, 13(4-5):361-372.

Bestavros, A., Crovella, M., Liu, J., and Martin, D. (1998). Distributed Packet Rewriting and its Application to Scalable Server Architectures. In Proceedings of the International Conference on Network Protocols.

Carrera, E. V. and Bianchini, R. (June 2001). Efficiency vs. portability in cluster-based network servers. In Proceedings of the 8th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

Chase, J., Anderson, D., Thakar, P., Vahdat, A., and Doyle, R. (October 2001). Managing energy and server resources in hosting centers. In Proceedings of the 18th Symposium on Operating Systems Principles.

Cisco (2000). Cisco LocalDirector. http://www.cisco.com/.

Delaluz, V., Kandemir, M., Vijaykrishnan, N., Sivasubramaniam, A., and Irwin, M. J. (January 2001). DRAM energy management using software and hardware directed power mode control. In Proceedings of the International Symposium on High-Performance Computer Architecture.

Douglis, F. and Krishnan, P. (1995). Adaptive disk spin-down policies for mobile computers. Computing Systems, 8(4):381-413.

Douglis, F. and Ousterhout, J. (1991). Transparent Process Migration: Design Alternatives and the Sprite Implementation. Software: Practice and Experience, 21(8):757-785.

Elnozahy, E. N., Kistler, M., and Rajamony, R. (February 2002). Energy-Efficient Server Clusters. In Proceedings of the 2nd Workshop on Power-Aware Computing Systems.

Flinn, J. and Satyanarayanan, M. (1999). Energy-aware adaptation for mobile applications. In Proceedings of the 17th Symposium on Operating Systems Principles, pages 48-63.

Fox, A., Gribble, S., Chawathe, Y., Brewer, E., and Gauthier, P. (1997). Cluster-Based Scalable Network Services. In Proceedings of the International Symposium on Operating Systems Principles, pages 78-91.

Ghormley, D., Petrou, D., Rodrigues, S., Vahdat, A., and Anderson, T. (1998). GLUnix: A Global Layer Unix for a Network of Workstations. Software: Practice and Experience.

Goldszmidt, G. and Hunt, G. (1999). Scaling Internet Services by Dynamic Allocation of Connections. In Proceedings of the 6th IFIP/IEEE International Symposium on Integrated Network Management, pages 171-184.

Halfhill, T. (February 2000). Transmeta breaks the x86 low-power barrier. In Microprocessor Report.

Helmbold, D. P., Long, D. D. E., and Sherrod, B. (1996). A dynamic disk spin-down technique for mobile computing. In Proceedings of the 2nd International Conference on Mobile Computing (MOBICOM 96), pages 130-142.

Hsu, C.-H., Kremer, U., and Hsiao, M. (November 2000). Compiler-directed dynamic frequency and voltage scaling. In Proceedings of the Workshop on Power-Aware Computer Systems.

IOzone (November 2000). IOzone filesystem benchmark. http://www.iozone.org.

Page 41: COMPILERS AND OPERATING SYSTEMS FOR LOW POWERsapyc.espe.edu.ec/evcarrera/papers/ISBN1-4020-7573-1.pdfcompilers and operating systems for low power

Kremer, U., Hicks, J., and Rehg, J. (October 2000). Compiler-directed remote task execution for power management. In Proceedings of the Workshop on Compilers and Operating Systems for Low Power.

Lebeck, A. R., Fan, X., Zeng, H., and Ellis, C. S. (2000). Power aware page allocation. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), pages 105-116.

Li, K., Kumpf, R., Horton, P., and Anderson, T. (1994). A quantitative analysis of disk drive power management in portable computers. In Proceedings of the 1994 Winter USENIX Conference, pages 279-291.

Litzkow, M. J. and Solomon, M. (1992). Supporting Checkpoint and Process Migration Outside the UNIX Kernel. In Usenix Conference Proceedings, pages 283-290, San Francisco, CA.

Pinheiro, E. and Bianchini, R. (December 1999). Nomad: A scalable operating system for clusters of uni- and multiprocessors. In Proceedings of the 1st IEEE International Workshop on Cluster Computing.

Pinheiro, E., Bianchini, R., Carrera, E. V., and Heath, T. (2001a). Load Balancing and Unbalancing for Power and Performance in Cluster-Based Systems. Technical Report DCS-TR-440, Department of Computer Science, Rutgers University.

Pinheiro, E., Bianchini, R., Carrera, E. V., and Heath, T. (2001b). Load Balancing and Unbalancing for Power and Performance in Cluster-Based Systems. In Proceedings of the International Workshop on Compilers and Operating Systems for Low Power.

RLX Technologies (June 2001). ServerBlade. http://www.rlxtechnologies.com/.

Rudenko, A., Reiher, P., Popek, G. J., and Kuenning, G. H. (1998). Saving portable computer battery power through remote process execution. Mobile Computing and Communications Review, 2(1):19-26.

Skadron, K., Stan, M., and Abdelzaher, T. (February 2002). Control-theoretic techniques and thermal-RC modeling for accurate and localized dynamic thermal management. In Proceedings of the International Symposium on High-Performance Computer Architecture.

Van Renesse, R., Birman, K., Hayden, M., Vaysburd, A., and Karr, D. (1998). Building adaptive systems using Ensemble. Software Practice and Experience, 28(9):963-979.

Vijaykrishnan, N., Kandemir, M., Irwin, M. J., Kim, H. S., and Ye, W. (2000). Energy-driven integrated hardware-software optimizations using SimplePower. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 95-106.

Weiser, M., Welch, B., Demers, A., and Shenker, S. (1994). Scheduling for reduced CPU energy. In Proceedings of the 1st Symposium on Operating System Design and Implementation.