Energy Efficiency in Large Scale Systems
DESCRIPTION
June 2010 presentation by GreenLight researcher Tajana Rosing on her group's research on system energy efficiency.
TRANSCRIPT
Gaurav Dhiman, Raid Ayoub
Prof. Tajana Šimunić Rosing
Dept. of Computer Science
System Energy Efficiency Lab
seelab.ucsd.edu
Large scale systems: Clusters
Power consumption is a critical design parameter: operational costs
o Compute equipment
o Cooling
By 2010, the US electricity bill for powering and cooling data centers is ~$7B [1]
Electricity input to data centers in the US exceeds the electricity consumption of Italy!
[1] Meisner et al., ASPLOS 2009
Energy Savings with DVFS
Reduction in CPU power (E_R) vs. extra system energy from longer execution (E_E)
Effectiveness of DVFS
For energy savings: E_R > E_E
Factors in modern systems affecting this equation:
Performance delay (t_delay)
Idle CPU power consumption (contributes to P_E)
Power consumption of other devices (contributes to P_E)
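One way to write this condition out; the decomposition below is a sketch, with t_exec and the per-setting CPU powers as illustrative symbols rather than notation from the slides:

```latex
% Sketch: energy balance for one run under DVFS.
% t_exec: runtime at the high v-f setting; t_delay: extra time at the low one.
E_R \approx \bigl(P_{CPU}^{high} - P_{CPU}^{low}\bigr)\, t_{exec}
\qquad
E_E \approx P_E \, t_{delay}
% P_E lumps idle-CPU power plus the power of memory and other devices
% that stay active during the extra time.
\mathrm{DVFS\ saves\ energy} \iff E_R > E_E
```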
Performance Delay
Lower t_delay => higher energy savings
Depends on memory/CPU intensiveness
Experiments with SPEC CPU2000:
mcf: highly memory intensive (expect low t_delay)
sixtrack: highly cache/CPU intensive (expect high t_delay)
Two state-of-the-art processors:
AMD quad-core Opteron: on-die memory controller (2.6GHz), DDR3
Intel quad-core Xeon: off-chip memory controller (1.3GHz), DDR2
Performance Delay
[Figure: measured %delay for mcf and sixtrack on both platforms]
On the AMD Opteron, the on-die memory controller and fast DDR3 memory keep mcf fed, so mcf is much closer to the worst case
On the Intel Xeon, the slower off-chip memory controller and memory stall mcf, so mcf is much closer to the best case
Idle CPU power consumption
Low-power idle CPU states are common now
C1 state used by default: zero dynamic power consumption
Support for deeper C-states appearing: C6 on Nehalem (zero dynamic + leakage power)
The deeper the idle state, the higher the extra CPU power consumption under DVFS for modern CPUs => lower DVFS benefits
Device power consumption
DVFS makes other devices consume power for a longer time (t_delay)
Memory (4GB DDR3): idle -> 5W, active -> 10W
Higher extra device power consumption => lower DVFS benefits for memory-intensive benchmarks
Evaluation Setup
Assume a simple static-DVFS policy
AMD Opteron (four v-f settings): 1.25V/2.6GHz, 1.15V/1.9GHz, 1.05V/1.4GHz, 0.9V/0.8GHz
Compare against a base system with no DVFS and three simple idle PM policies:

Policy   Description
PM-1     switch CPU to ACPI state C1 (remove clock supply) and move to lowest voltage setting
PM-2     switch CPU to ACPI state C6 (remove power)
PM-3     switch CPU to ACPI state C6 and switch the memory to self-refresh mode
Methodology
Run SPEC CPU2000 benchmarks at all v-f settings
Estimate savings baselined against a system with the PM-1 through PM-3 policies
E_{PM-i} varies based on the policy
DVFS beneficial if: %Esavings_{PM-i} > 0
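As a minimal sketch of this check (all energies below are hypothetical, not measured values from the study):

```python
def pct_energy_savings(e_baseline_j: float, e_dvfs_j: float) -> float:
    """%Esavings_PM-i: positive means DVFS beats the PM-i baseline."""
    return 100.0 * (e_baseline_j - e_dvfs_j) / e_baseline_j

# Hypothetical per-run energies (joules) for one benchmark at one
# v-f setting, compared against baselines using PM-1..PM-3.
e_dvfs = 995.0
for policy, e_base in {"PM-1": 1050.0, "PM-2": 1000.0, "PM-3": 980.0}.items():
    s = pct_energy_savings(e_base, e_dvfs)
    print(f"{policy}: {s:+.1f}% -> DVFS beneficial: {s > 0}")
```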
Results

Benchmark   Freq (GHz)   %delay   %Esavings PM-1   PM-2    PM-3
mcf         1.9          29       5.2              0.7     -0.5
mcf         1.4          63       8.1              0.1     -2.1
mcf         0.8          163      8.1              -6.3    -10.7
bzip2       1.9          37       4.7              -0.6    -2.1
bzip2       1.4          86       7.4              -2.4    -5
bzip2       0.8          223      7.8              -9.0    -14
art         1.9          32       6                1       -0.1
art         1.4          76       7.3              -1.7    -4
art         0.8          202      8                -8      -13
sixtrack    1.9          37       5                -0.5    -2
sixtrack    1.4          86       6                -4.3    -7.2
sixtrack    0.8          227      7                -11     -16.1

Observations:
• High average delay, due to the on-die memory controller
• Max average savings of ~7% (vs. PM-1), at a high performance delay
• The lowest v-f setting is not useful: avg 7% savings for avg 200% delay
• With lower system idle power consumption (PM-2, PM-3), DVFS is energy inefficient
Conclusion
Simple power management policies provide better energy/performance tradeoffs
Lower v-f settings offer worse e/p tradeoffs due to high performance delay
DVFS still useful for:
Power reduction: thermal management
Systems with simpler memory controllers and low-power system components
Server Power Breakdown
Energy Proportional Computing
Figure 2. Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server still consumes about half its full power when doing virtually no work.
"The Case for Energy-Proportional Computing," Luiz André Barroso, Urs Hölzle, IEEE Computer, December 2007
Doing nothing well … NOT!
Energy Efficiency = Utilization / Power
Energy Proportional Computing
Figure 1. Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum utilization levels.
It is surprisingly hard to achieve high levels of utilization of typical servers (and your home PC or laptop is even worse)
"The Case for Energy-Proportional Computing," Luiz André Barroso, Urs Hölzle, IEEE Computer, December 2007
Energy Proportional Computing
Figure 4. Power usage and energy efficiency in a more energy-proportional server. This server has a power efficiency of more than 80 percent of its peak value for utilizations of 30 percent and above, with efficiency remaining above 50 percent for utilization levels as low as 10 percent.
"The Case for Energy-Proportional Computing," Luiz André Barroso, Urs Hölzle, IEEE Computer, December 2007
Design for a wide dynamic power range and active low-power modes
Doing nothing VERY well
Energy Efficiency = Utilization / Power
Why not consolidate servers?
Obstacles to consolidating directly onto one OS:
Security
Isolation
Must use the same OS
Solution: use virtualization!
Virtualization
Benefits:
• Isolation and security
• Different OS in each VM
• Better resource utilization
• Improved manageability
• Dynamic load management
• Energy savings through VM consolidation!
How to Save Energy?
VM consolidation is a common practice: increases resource utilization
Put idle machines into sleep mode
What about active machines?
Active power management (e.g. DVFS) is less effective in newer server processors: leakage, faster memories, narrow voltage range
Make the workload run faster
Similar average power across machines
Exploit workload characteristics to share resources efficiently
Motivation: Workload Characterization
[Figure: VM1 running mcf and VM2 running eon, hosted on physical machines PM1 and PM2; 60% callout]
Motivation: Workload Characterization
Workload characteristics determine:
Power/performance profile
Power distribution
Co-schedule/consolidate heterogeneous VMs
Motivation: Workload Characterization
Co-schedule/consolidate heterogeneous VMs
What about DVFS?
Poor performance
Energy inefficient
Only good for homogeneously high-MPC workloads
vGreen
A system for VM scheduling across a cluster of physical machines
Dynamic VM characterization:
Memory accesses
Instruction throughput
CPU utilization
Co-schedule VMs with heterogeneous characteristics for better:
Performance
Energy efficiency
Balanced thermal profile
Scheduling with VMs
[Figure: VM1 and VM2, each with VCPU1 and VCPU2, running above the Xen scheduler alongside Dom0]
Dom-0: privileged VM
Management
I/O
VM creation: specify CPU, memory, and I/O config
The CPU of a VM is referred to as a VCPU: the fundamental unit of execution
The OS inside a VM schedules onto VCPUs
Xen schedules VCPUs across PCPUs
vGreen Architecture
Main components:
vgnodes
vgxen: characterizes the running VMs
vgdom: exports information to vgserv
vgserv
Collects and analyzes the characterization information
Issues scheduling commands based on the balancing policy
[Figure: vgserv runs vgpolicy and exchanges updates/commands with vgnode1 and vgnode2; on each vgnode, vgxen runs in Xen and vgdom in Dom0 alongside the VMs]
vgnode (client physical machine)
[Figure: inside each VM, per-VCPU metrics (wMPC, wIPC, util) are aggregated into per-VM metrics (vMPC, vIPC, vutil)]
vgxen: characterizes the VMs
Uses performance counters to estimate:
IPC (instructions per cycle)
MPC (memory accesses per cycle)
Weighted by CPU utilization (wIPC, wMPC)
vgdom:
Reads metrics from vgxen
Exports them to vgserv
wMPC_cur = MPC_cur · util
wMPC = α · wMPC_cur + (1 - α) · wMPC_prev
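A minimal sketch of this characterization update (the class name, sample values, and α = 0.5 are illustrative; the slide specifies only the two equations):

```python
class VcpuMetrics:
    """Tracks utilization-weighted MPC for one VCPU, smoothed with an
    exponentially weighted moving average per the slide's equations."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha      # smoothing factor (illustrative value)
        self.wmpc = 0.0         # smoothed, utilization-weighted MPC

    def update(self, mpc_cur: float, util: float) -> float:
        # wMPC_cur = MPC_cur * util
        wmpc_cur = mpc_cur * util
        # wMPC = alpha * wMPC_cur + (1 - alpha) * wMPC_prev
        self.wmpc = self.alpha * wmpc_cur + (1 - self.alpha) * self.wmpc
        return self.wmpc

m = VcpuMetrics()
for sample in [(0.02, 0.9), (0.05, 1.0), (0.04, 0.7)]:  # (MPC, utilization)
    print(m.update(*sample))
```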
Hierarchical Workload Characterization
[Figure: three-level metric hierarchy inside a vgnode]
VCPU-level metrics, maintained by vgxen: wMPC, wIPC, util
VM-level metrics, maintained by vgpolicy and vgxen: vMPC, vIPC, vutil (aggregated from the wMPC, wIPC, util of the VM's VCPUs)
Node-level metrics, maintained by vgpolicy: nMPC, nIPC, nutil (aggregated from the vMPC, vIPC, vutil of the node's VMs)
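A compact sketch of the roll-up, assuming (as the diagram's arrows suggest) that each level sums the level below; the data layout is illustrative:

```python
# VCPU-level metrics: (wMPC, wIPC, util) per VCPU, keyed by VM name.
vcpus = {
    "VM1": [(0.010, 0.8, 0.9), (0.012, 0.7, 0.8)],
    "VM2": [(0.001, 1.5, 0.6), (0.002, 1.4, 0.7)],
}

def vm_metrics(vcpu_list):
    """VM level: vMPC, vIPC, vutil aggregated over the VM's VCPUs."""
    return tuple(sum(col) for col in zip(*vcpu_list))

def node_metrics(vms):
    """Node level: nMPC, nIPC, nutil aggregated over the node's VMs."""
    return tuple(sum(col) for col in zip(*vms.values()))

vms = {name: vm_metrics(v) for name, v in vcpus.items()}
nmpc, nipc, nutil = node_metrics(vms)
print(vms, (nmpc, nipc, nutil))
```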
vgserv (VM scheduler)
vgpolicy: schedules VMs
Balances the overall MPC and IPC across vgnodes:
MPC: performance and energy efficiency
IPC: power distribution
[Figure: vgpolicy in vgserv receives periodic node- and VM-level MPC/IPC updates from vgnode1-vgnode4 and issues migration commands, e.g. moving VM2 to another node]
If decisions conflict, priority is given to MPC balancing
vgpolicy
Example: MPC Balance Algorithm
Triggered when a vgnode's nMPC exceeds the threshold nMPC_th relative to the node with minimum nMPC (vgnode_nMPCmin)
Find the VM with minimum MPC on the overloaded node
Migrate it to vgnode_nMPCmin only if the move does not reverse the imbalance
[Figure: VM2 migrating across vgnodes to balance nMPC]
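A sketch of one balancing step under this reading of the slide; the threshold value, node selection, and overshoot check are interpretations of the fragments above, not code from vGreen:

```python
def mpc_balance_step(nodes: dict, nmpc_th: float):
    """One MPC-balance pass. `nodes` maps node name -> {vm name: vMPC}.
    Returns a (vm, src, dst) migration or None."""
    nmpc = {n: sum(vms.values()) for n, vms in nodes.items()}
    src = max(nmpc, key=nmpc.get)            # most memory-intensive node
    dst = min(nmpc, key=nmpc.get)            # vgnode_nMPCmin
    if nmpc[src] - nmpc[dst] <= nmpc_th:
        return None                          # already balanced
    vm = min(nodes[src], key=nodes[src].get)  # VM with minimum MPC
    moved = nodes[src][vm]
    # Migrate only if the move does not reverse the imbalance.
    if (nmpc[src] - moved) >= (nmpc[dst] + moved):
        return vm, src, dst
    return None

nodes = {"vgnode1": {"VM1": 0.02, "VM2": 0.015}, "vgnode2": {"VM3": 0.001}}
print(mpc_balance_step(nodes, nmpc_th=0.005))  # ('VM2', 'vgnode1', 'vgnode2')
```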
Implementation
Xen 3.3.1 as the hypervisor
vgxen implemented as part of the stock Xen credit scheduler
vgdom implemented as a driver and application in Domain0
Communicates with vgxen through a shared page
No modifications required to the guest OS!
Used a testbed of dual quad-core Intel Xeon machines as vgnodes
A Linux-based desktop used as vgserv
Methodology
Create VMs running Linux as the guest OS:
4 VCPUs and 4GB memory for each VM
Populate with 18 workloads with varying characteristics: mcf, eon, art, equake, swim, bzip2, gcc, etc.
Initial assignment of VMs based on Eucalyptus [Nurmi, CCA'08]
Compare against 'E+': Eucalyptus + state-of-the-art dynamic VM scheduling algorithms
Perform VM consolidation based on CPU utilization
Weighted Speedup vs E+
[Chart: weighted speedup over E+ for each workload mix; y-axis 0%-50%]
20% speedup on average
Average 40% weighted speedup
Energy Savings vs E+
Average 35% energy savings
Balanced Thermal Profile
Average power variance reduction of 30W
Cooling subsystem challenges
Cooling systems maintain safe temperatures and avoid reliability issues
The fan subsystem consumes significant energy (up to 30% of total server power)
Challenges:
How to minimize the cooling costs within a single machine?
How to further reduce the cooling costs by creating a better temperature distribution across the physical machines?
Fan subsystem
[Figure: a 1U server with two CPUs, each cooled by fans driven by a fan controller]
Fan controller (e.g. a PI controller):
Thermal sensors report the maximum temperature of the CPU cores on the die
The controller compares it against a temperature threshold
An actuator adjusts the fan speed (e.g. 8 discrete speeds)
An efficient fan controller can be constructed based on control theory
Traditionally, cooling optimizations focus only on the fan controller, without including workload management
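A toy sketch of such a PI controller (gains, setpoint, and the 8-speed quantization are illustrative assumptions):

```python
class PIFanController:
    """Proportional-integral fan control: drive the max core temperature
    toward a threshold by picking one of 8 discrete fan speeds."""

    def __init__(self, t_threshold=70.0, kp=0.08, ki=0.02, n_speeds=8):
        self.t_threshold = t_threshold  # deg C setpoint (illustrative)
        self.kp, self.ki = kp, ki       # PI gains (illustrative)
        self.n_speeds = n_speeds
        self.integral = 0.0

    def step(self, t_max: float) -> int:
        """Map the current max core temperature to a fan speed index 0..7."""
        error = t_max - self.t_threshold
        self.integral = max(0.0, self.integral + error)  # anti-windup clamp
        u = self.kp * error + self.ki * self.integral    # control effort
        return min(self.n_speeds - 1, max(0, round(u)))

ctl = PIFanController()
for t in [60, 68, 75, 82, 78]:          # sampled max core temps (deg C)
    print(t, "->", ctl.step(t))
```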
Cooling aware workload scheduling
State-of-the-art load balancing algorithms in operating systems do not consider cooling costs
[Figure: two sockets, the hotter one with its fan at high speed, the other at moderate speed]
Assume a case of two sockets where one runs an intensive workload while the other executes a moderately active workload
The workload is balanced from the performance point of view
One fan runs at high speed while the other runs at moderate speed
Where does the cooling inefficiency come from?
fan power ∝ (fan speed)^3, a nonlinear relation
How and when should the workload be scheduled to minimize the cooling costs?
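The cubic relation is what makes balancing pay off; a worked example with illustrative normalized speeds:

```latex
P_{fan} \propto s^3.
\quad \text{One fan at } s = 1.0 \text{ and one at } s = 0.5:\;
1.0^3 + 0.5^3 = 1.125.
\quad \text{Both balanced at } s = 0.75:\;
2 \times 0.75^3 \approx 0.84,
\quad \text{about } 25\% \text{ less fan power for the same total airflow.}
```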
Handling cooling inefficiency
Migrate computations to minimize the cooling cost
Migration overhead is acceptable:
Heat sink temperature time constant (tens of seconds) >> migration latency (tens to hundreds of µs)
[Figure: migrating threads from the socket with its fan at high speed to the one at low speed]
Challenge: which threads to migrate?
A thermal and cooling model is needed to estimate the benefits of workload reassignment
Triggering workload rescheduling
[Figure: fan speed over time, with workload rescheduling points under reactive vs. predictive triggering]
Reactive approach:
Lowers cooling savings
Cannot minimize the noise level
Impacts fan stability
Predictive approach:
Improves energy savings
Lowers the noise level
Provides better stability
Challenge: design of an efficient, proactive, cooling-aware dynamic workload management technique
Socket level strategies: spreading
Fan speed can be reduced by creating a better temperature distribution:
Migrate some of the active threads from the sockets with high fan speed to sockets with lower speed
Swap some of the hot threads from sockets with high fan speed with colder threads from sockets with lower speed
[Figure: before: one fan at high speed, one at low speed; after: both at moderate speed]
Socket level strategies: consolidation
Concentrate the hot threads into fewer sockets while keeping the fan at the same speed:
Migrate hot threads to a socket whose fan is spinning higher than required
If fan speed_M ≥ fan speed_N, we can swap the hot thread from socket N with colder threads from socket M
[Figure: before: both fans at moderate speed; after: one at moderate speed, one at low speed]
P_W ≤ P_C + P_D
Multi-tier workload management
Multi-tier workload management to enhance the temperature distribution in the entire system:
VM management at the physical machine level
VP management at the CPU socket level
[Figure: VM migration between Server i (fan at high speed) and Server j, plus thread migration between sockets, moving the system toward moderate fan speeds everywhere]
Scheduling policy
Multi-tier scheduling algorithm:
VMs at the machine level
VPs at the socket level
Each tier evaluates spreading savings and consolidation savings: traverse the VMs/VPs, mark the ones for which savings exist, then schedule (see the sketch below)
Run periodically:
Period ~ minutes at the VM level
Period ~ seconds at the VP level
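A skeleton of how such a two-tier loop could be organized; the function names, period values, and savings estimators are placeholders, since the slide specifies only the evaluate/mark/schedule structure and the two periods:

```python
import time

def evaluate_spreading(units):
    """Traverse VMs/VPs and mark those whose estimated spreading savings
    are positive; a real system would query the thermal/cooling model."""
    return [u for u in units if u.get("spread_savings", 0) > 0]

def evaluate_consolidation(units):
    return [u for u in units if u.get("consolidate_savings", 0) > 0]

def scheduling_pass(units, level):
    for move in evaluate_spreading(units) + evaluate_consolidation(units):
        print(f"[{level}] schedule move: {move['name']}")  # placeholder

def run(vms, vps, vm_period_s=120.0, vp_period_s=5.0, duration_s=600.0):
    """Two-tier loop: VP (socket) tier every few seconds, VM (machine)
    tier every few minutes, as on the slide."""
    start = last_vm = time.monotonic()
    while time.monotonic() - start < duration_s:
        if time.monotonic() - last_vm >= vm_period_s:
            scheduling_pass(vms, "VM")       # machine-level tier
            last_vm = time.monotonic()
        scheduling_pass(vps, "VP")           # socket-level tier
        time.sleep(vp_period_s)
```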
Experimental setup
Platform: a cluster of 4 servers, each with 2 Xeon sockets
Simulation: actual power measurements of the jobs
HotSpot [Skadron 03] used for thermal/cooling simulations
K. Skadron, et al., Temperature-aware microarchitecture, ISCA 2003.
Policies:
We compare our results against state-of-the-art dynamic load balancing
Dynamic load balancing minimizes the differences in task queues across the various levels
We use benchmark combinations from the SPEC 2000 suite
We assume live migration for migrating the VMs across the PMs

VM assignment (PM1-PM4)                                                Ambient temp (°C)
1: PM1(3perl + 2bzip2), PM2 = PM1, PM3(3gcc + 2mcf), PM4 = PM3         41 each
2: PM1(3perl + 2bzip2), PM2 = PM1, PM3(3gcc + 2mcf), PM4 = PM3         43 each
3: PM1(4perl + 4bzip2), PM2 = PM1, PM3(4gcc + 4mcf), PM4 = PM3         43 each
4: PM1(4perl + 4bzip2), PM2 = PM1, PM3(4gcc + 4mcf), PM4 = PM3         45, 41, 38, 41
5: PM1(2perl + 4gcc + 2mcf), PM2(2perl + 4gcc), PM3 = PM1, PM4 = PM2   43 each
6: PM1(2perl + 6gcc), PM2 = PM3 = PM4 = PM1                            43 each
Energy savings
[Chart: cooling power savings (%) for configurations 1-6 above; y-axis 0-90%]
Average cooling energy savings of 72%
What next?
Energy management must take both computation and cooling costs into account
What about priority/QoS?
The current approach maximizes overall throughput/speedup
What if someone pays more or has higher priority?
Closed-loop design for performance monitoring