Energy Efficiency in Large Scale Systems

Gaurav Dhiman, Raid Ayoub, and Prof. Tajana Šimunić Rosing
Dept. of Computer Science, System Energy Efficiency Lab, seelab.ucsd.edu


DESCRIPTION

June 2010 presentation by GreenLight researcher Tajana Rosing on her project's research on system energy efficiency.

TRANSCRIPT

Page 1: Energy Efficiency in Large Scale Systems

Gaurav Dhiman, Raid Ayoub

Prof. Tajana Šimunić Rosing

Dept. of Computer Science

System Energy Efficiency Lab

seelab.ucsd.edu

Page 2: Energy Efficiency in Large Scale Systems

Large scale systems: Clusters

Power consumption is a critical design parameter: operational costs

o Compute equipment

o Cooling

By 2010, US electricity bill for powering and cooling data centers ~$7B[1]

Electricity input to data centers in the US exceeds electricity consumption of Italy!


[1]: Meisner et al, ASPLOS 2008

Page 3: Energy Efficiency in Large Scale Systems

Energy Savings with DVFS

Reduction in CPU power vs. extra system power (the run takes longer, so the rest of the system stays on longer)

Page 4: Energy Efficiency in Large Scale Systems

Effectiveness of DVFS

For energy savings: E_R > E_E (the energy reduced must exceed the extra energy consumed)

Factors in modern systems affecting this equation:

Performance delay (t_delay)

Idle CPU power consumption

Power consumption of other devices (both contribute to E_E)
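One simplified way to write this condition, assuming the extra energy comes mainly from the CPU's idle power and the other devices staying active during the stretched execution (the symbols below are assumed names, not taken from the slides):

    E_R > E_E,   where   E_R ≈ (P_CPU,high − P_CPU,low) · t_exec
                 and     E_E ≈ (P_CPU,idle + P_dev) · t_delay

Here t_exec is the original execution time and t_delay is the extra time added by running at the lower v-f setting; the next slides look at each term.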

Page 5: Energy Efficiency in Large Scale Systems

Performance Delay

Lower t_delay => higher energy savings

Depends on memory/CPU intensiveness

Experiments with SPEC CPU2000:

mcf: highly memory intensive, so expect low t_delay

sixtrack: highly cache/CPU intensive, so expect high t_delay

Two state-of-the-art processors:

AMD quad-core Opteron: on-die memory controller (2.6 GHz), DDR3

Intel quad-core Xeon: off-chip memory controller (1.3 GHz), DDR2

Page 6: Energy Efficiency in Large Scale Systems

Performance Delay

On the AMD, mcf is much closer to the worst case, due to the on-die memory controller and fast DDR3 memory.

On the Xeon, mcf is much closer to the best case, due to the slower memory controller and memory.

Page 7: Energy Efficiency in Large Scale Systems

Idle CPU power consumption

Low-power idle CPU states are common now

C1 state used by default: zero dynamic power consumption

Support for deeper C-states appearing: C6 on Nehalem, zero dynamic + leakage power

Higher extra CPU power consumption for modern CPUs => lower DVFS benefits

Page 8: Energy Efficiency in Large Scale Systems

Device power consumption

DVFS makes other devices consume power for a longer time (t_delay)

Memory (4 GB DDR3): idle -> 5 W, active -> 10 W

Higher extra device power consumption => lower DVFS benefits for memory-intensive benchmarks
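A rough, hypothetical illustration using the numbers above: if a 100 s memory-intensive run is stretched by 63% (mcf at 1.4 GHz in the results that follow), the DIMM stays out of its low-power state for roughly an extra 63 s:

    extra memory energy ≈ (10 W − 5 W) · 63 s ≈ 315 J   (vs. an idle baseline)
                          up to 10 W · 63 s = 630 J     (if the baseline could power the memory down)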

Page 9: Energy Efficiency in Large Scale Systems

Evaluation Setup

Assume a simple static DVFS policy

AMD Opteron (four v-f settings): 1.25V/2.6GHz, 1.15V/1.9GHz, 1.05V/1.4GHz, 0.9V/0.8GHz

Compare against a base system with no DVFS and three simple idle PM policies:

Policy   Description
PM-1     Switch the CPU to ACPI state C1 (remove clock supply) and move to the lowest voltage setting
PM-2     Switch the CPU to ACPI state C6 (remove power)
PM-3     Switch the CPU to ACPI state C6 and switch the memory to self-refresh mode

Page 10: Energy Efficiency in Large Scale Systems

Methodology

Run SPEC CPU2000 benchmarks at all v-f settings

Estimate savings baselined against a system with the PM-(1:3) policies

E_PM-i varies based on the policy

DVFS is beneficial if: %Esavings_PM-i > 0
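The slide does not spell the metric out, but a natural reading is the relative energy difference against the PM-i baseline:

    %Esavings_PM-i = (E_PM-i − E_DVFS) / E_PM-i × 100

so DVFS pays off against policy PM-i only when the DVFS run consumes less total system energy than racing to idle under PM-i.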

Page 11: Energy Efficiency in Large Scale Systems

Results

Benchmark   Freq (GHz)   %delay   %Energy savings vs PM-i
                                  PM-1    PM-2    PM-3
mcf         1.9          29       5.2     0.7     -0.5
            1.4          63       8.1     0.1     -2.1
            0.8          163      8.1     -6.3    -10.7
bzip2       1.9          37       4.7     -0.6    -2.1
            1.4          86       7.4     -2.4    -5
            0.8          223      7.8     -9.0    -14
art         1.9          32       6       1       -0.1
            1.4          76       7.3     -1.7    -4
            0.8          202      8       -8      -13
sixtrack    1.9          37       5       -0.5    -2
            1.4          86       6       -4.3    -7.2
            0.8          227      7       -11     -16.1

• High average delay
• On-die memory controller

Page 12: Energy Efficiency in Large Scale Systems

Results

(Results table repeated from Page 11.)

• Max avg ~7% savings
• High performance delay

Page 13: Energy Efficiency in Large Scale Systems

Results

(Results table repeated from Page 11.)

• Lowest v-f setting not useful
• Avg 7% savings
• Avg 200% delay

Page 14: Energy Efficiency in Large Scale Systems

Results

(Results table repeated from Page 11.)

• DVFS energy inefficient
• Lower system idle power consumption

Page 15: Energy Efficiency in Large Scale Systems

Conclusion

Simple power management policies provide better energy-performance tradeoffs

Lower v-f settings offer worse e/p tradeoffs due to high performance delay

DVFS is still useful for:

Power reduction: thermal management

Systems with simpler memory controllers and low-power system components

Page 16: Energy Efficiency in Large Scale Systems

Server Power Breakdown

Page 17: Energy Efficiency in Large Scale Systems

Energy Proportional Computing

Figure 2. Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server still consumes about half its full power when doing virtually no work.

“The Case for Energy-Proportional Computing,” Luiz André Barroso and Urs Hölzle, IEEE Computer, December 2007

Doing nothing well … NOT!

Energy Efficiency = Utilization / Power
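To make "doing nothing well … NOT!" concrete, assume as a rough model (not data from the paper) that power grows linearly from half of peak at idle to full power at 100% utilization, P(u) ≈ P_peak · (0.5 + 0.5·u). Normalizing the efficiency metric above to its peak value gives:

    Eff(u) = u / (0.5 + 0.5·u)   →   Eff(0.1) ≈ 0.18,  Eff(0.3) ≈ 0.46,  Eff(1.0) = 1.0

i.e. efficiency is worst exactly in the 10–50% utilization band where, per the next slide, servers spend most of their time.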

Page 18: Energy Efficiency in Large Scale Systems

Energy Proportional Computing

Figure 1. Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum utilization levels.

It is surprisingly hard to achieve high levels of utilization of typical servers (and your home PC or laptop is even worse).

“The Case for Energy-Proportional Computing,” Luiz André Barroso and Urs Hölzle, IEEE Computer, December 2007

Page 19: Energy Efficiency in Large Scale Systems

Energy Proportional Computing

Figure 4. Power usage and energy efficiency in a more energy-proportional server. This server has a power efficiency of more than 80 percent of its peak value for utilizations of 30 percent and above, with efficiency remaining above 50 percent for utilization levels as low as 10 percent.

“The Case for Energy-Proportional Computing,” Luiz André Barroso and Urs Hölzle, IEEE Computer, December 2007

Design for wide dynamic power range and active low-power modes

Doing nothing VERY well

Energy Efficiency = Utilization / Power

Page 20: Energy Efficiency in Large Scale Systems

Why not consolidate servers?

Security

Isolation

Must use the same OS

Solution:

Use virtualization!

Page 21: Energy Efficiency in Large Scale Systems

Virtualization


Benefits:

•Isolation and security

•Different OS in each VM

•Better resource utilization

Page 22: Energy Efficiency in Large Scale Systems

Virtualization


Benefits:

•Improved manageability

•Dynamic load management

•Energy savings through VM

consolidation!

Page 23: Energy Efficiency in Large Scale Systems

How to Save Energy?

VM consolidation is a common practice: it increases resource utilization

Put idle machines into sleep mode

What about the active machines?

Active power management (e.g. DVFS) is less effective in newer lines of server processors: leakage, faster memories, low voltage range

Make the workload run faster

Similar average power across machines

Exploit workload characteristics to share resources efficiently

Page 24: Energy Efficiency in Large Scale Systems

Motivation: Workload Characterization

[Figure: two physical machines, PM1 and PM2, hosting VM1 (running mcf) and VM2 (running eon); chart callout: 60%]

Page 25: Energy Efficiency in Large Scale Systems

Motivation: Workload Characterization

Workload characteristics determine:

Power/performance profile

Power distribution

Co-schedule/consolidate heterogeneous VMs


Page 26: Energy Efficiency in Large Scale Systems

Motivation: Workload Characterization

Co-schedule/consolidate heterogeneous VMs


Page 27: Energy Efficiency in Large Scale Systems

What about DVFS?

Poor performance

Energy inefficient

Only good if homogeneously high MPC workload


Page 28: Energy Efficiency in Large Scale Systems

vGreen

A system for VM scheduling across a cluster of physical machines

Dynamic VM characterization:

Memory accesses

Instruction throughput

CPU utilization

Co-schedule VMs with heterogeneous characteristics for better:

Performance

Energy efficiency

Balanced thermal profile

Page 29: Energy Efficiency in Large Scale Systems

Scheduling with VMs

[Diagram: the Xen scheduler runs Dom0, VM1, and VM2; each VM contains VCPU1 and VCPU2]

Dom-0: privileged VM for management and I/O

VM creation: specify the CPU, memory, and I/O config

The CPU of a VM is referred to as a VCPU: the fundamental unit of execution

The OS inside the VM schedules on VCPUs

Xen schedules VCPUs across PCPUs

Page 30: Energy Efficiency in Large Scale Systems

vGreen Architecture

Main components:

vgnodes

vgxen: characterizes the running VMs

vgdom: exports information to vgserv

vgserv

Collects and analyzes the characterization information

Issues scheduling commands based on the balancing policy

[Diagram: each vgnode runs Xen with vgxen, plus vgdom in Dom0 alongside the VMs; vgnodes send updates to vgpolicy in vgserv, which sends migration commands back]

Page 31: Energy Efficiency in Large Scale Systems

vgnode (client physical machine)

[Diagram: inside a vgnode, each VCPU carries wMPC, wIPC, and util metrics that roll up into per-VM metrics: vMPC = Σ over VCPUs of wMPC, vIPC = Σ over VCPUs of wIPC, vutil = Σ over VCPUs of util]

vgxen: characterizes the VMs

Uses performance counters to estimate:

IPC (instructions per cycle)

MPC (memory accesses per cycle)

Weighted by CPU utilization (wIPC, wMPC):

wMPC_cur = MPC_cur · util

wMPC = α · wMPC_cur + (1 − α) · wMPC_prev (see the sketch below)

vgdom:

Reads the metrics from vgxen

Exports them to vgserv
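A toy version of this update in Python (a sketch only, not the vGreen source; ALPHA and the function name are placeholders, the slide only names the smoothing constant "alpha"):

    # Minimal sketch of the per-VCPU metric update described on this slide.
    ALPHA = 0.5   # assumed smoothing constant

    def update_wmpc(mpc_cur, util, wmpc_prev):
        """wMPC_cur = MPC_cur * util; wMPC = a*wMPC_cur + (1 - a)*wMPC_prev."""
        wmpc_cur = mpc_cur * util   # weight the raw MPC by CPU utilization
        return ALPHA * wmpc_cur + (1 - ALPHA) * wmpc_prev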

Page 32: Energy Efficiency in Large Scale Systems

Hierarchical Workload

Characterization

[Diagram: the metric hierarchy inside a vgnode]

VCPU-level metrics (maintained by vgxen): wMPC, wIPC, util

VM-level metrics (maintained by vgpolicy and vgxen): vMPC = Σ over VCPUs of wMPC, vIPC = Σ over VCPUs of wIPC, vutil = Σ over VCPUs of util

Node-level metrics (maintained by vgpolicy): nMPC = Σ over VMs of vMPC, nIPC = Σ over VMs of vIPC, nutil = Σ over VMs of vutil
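A minimal sketch in Python of how these metrics roll up, assuming the simple sums suggested by the formulas above; the class names and data layout are illustrative, not vGreen's actual structures:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class VCpu:                      # VCPU-level metrics, as produced by vgxen
        wmpc: float
        wipc: float
        util: float

    @dataclass
    class Vm:                        # VM-level metrics are sums over the VM's VCPUs
        vcpus: List[VCpu] = field(default_factory=list)

        @property
        def vmpc(self): return sum(v.wmpc for v in self.vcpus)
        @property
        def vipc(self): return sum(v.wipc for v in self.vcpus)
        @property
        def vutil(self): return sum(v.util for v in self.vcpus)

    def node_metrics(vms: List[Vm]):
        """Node-level metrics (vgpolicy's view): sums over the node's VMs."""
        return (sum(vm.vmpc for vm in vms),
                sum(vm.vipc for vm in vms),
                sum(vm.vutil for vm in vms))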

Page 33: Energy Efficiency in Large Scale Systems

vgserv (VM scheduler)

vgpolicy: schedules VMs

Balances the overall MPC and IPC across vgnodes:

MPC: performance and energy efficiency

IPC: power distribution

[Diagram: vgnodes send periodic updates of node- and VM-level MPC and IPC to vgpolicy in vgserv, which issues migration commands]

If the decisions conflict, priority is given to MPC balancing

Page 34: Energy Efficiency in Large Scale Systems

vgpolicy

Example: MPC balance algorithm

Triggered when a node's nMPC exceeds a threshold (nMPC > nMPC_th)

Find the VM with minimum MPC and the vgnode with minimum nMPC

Migrate the VM to that vgnode if it does not reverse the imbalance

[Diagram: a VM migrating from the overloaded vgnode to the vgnode with minimum nMPC; a sketch of this step follows below]
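A hedged Python sketch of that balance step; the data layout, threshold test, and tie-breaking are assumptions rather than vGreen's exact policy:

    def mpc_balance_step(nodes, nmpc_threshold):
        """nodes: dict mapping node id -> list of (vm_id, vMPC) pairs.
        Returns a (vm_id, src, dst) migration command or None."""
        nmpc = {n: sum(m for _, m in vms) for n, vms in nodes.items()}
        src = max(nmpc, key=nmpc.get)            # most memory-bound vgnode
        dst = min(nmpc, key=nmpc.get)            # vgnode with minimum nMPC
        if not nodes[src] or nmpc[src] - nmpc[dst] <= nmpc_threshold:
            return None                          # already balanced
        vm_id, vm_mpc = min(nodes[src], key=lambda x: x[1])  # VM with minimum MPC
        # Migrate only if moving this VM does not reverse the imbalance.
        if nmpc[dst] + vm_mpc <= nmpc[src] - vm_mpc:
            return vm_id, src, dst
        return None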

Page 35: Energy Efficiency in Large Scale Systems

Implementation

Xen 3.3.1 as the hypervisor

vgxen implemented as part of the stock Xen credit scheduler

vgdom implemented as a driver and application in Domain-0

Communicates with vgxen through a shared page

No modifications required to the guest OS!

Used a testbed of dual Intel quad-core Xeon based machines as vgnodes

A Linux based desktop used as the vgserv

Page 36: Energy Efficiency in Large Scale Systems

Methodology

Create VMs running Linux as the guest OS: 4 VCPUs and 4 GB of memory for each VM

Populate them with 18 workloads with varying characteristics: mcf, eon, art, equake, swim, bzip2, gcc, etc.

Initial assignment of VMs based on Eucalyptus [Nurmi, CCA’08]

Compare against ‘E+’: Eucalyptus + state-of-the-art dynamic VM scheduling algorithms that perform VM consolidation based on CPU utilization

Page 37: Energy Efficiency in Large Scale Systems

Weighted Speedup vs E+

[Chart: weighted speedup over E+ for each workload combination; y-axis: weighted speedup, 0–50%; callout: 20% speedup on average]

Average 40% Weighted Speedup

Page 38: Energy Efficiency in Large Scale Systems

Energy Savings vs E+


Average 35% Energy Savings

Page 39: Energy Efficiency in Large Scale Systems

Balanced Thermal Profile


Average power variance reduction of 30W

Page 40: Energy Efficiency in Large Scale Systems

Cooling subsystem challenges

Cooling systems maintain temperature and avoid reliability issues

The fan subsystem consumes significant energy (up to 30% of total server power)

Challenges:

How to minimize the cooling costs within a single machine?

How to further reduce the cooling costs by creating a better temperature distribution across the physical machines?

[Diagram: the fan subsystem in a 1U server with two CPUs]

Page 41: Energy Efficiency in Large Scale Systems

Fan controller

[Diagram: a fan controller (e.g. a PI controller) reads per-core thermal sensors on the chip die, compares the maximum temperature against a temperature threshold, and drives an actuator that adjusts the fan speed (e.g. 8 discrete speeds)]

An efficient fan controller can be constructed based on control theory (see the sketch below)

Traditionally, cooling optimizations focus only on the fan controller, without including workload management
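A purely illustrative PI fan controller along the lines of this slide; the gains, the anti-windup, and the 8-level quantization are made-up values, not the controller used by the authors:

    KP, KI = 0.5, 0.05        # assumed proportional/integral gains
    MAX_LEVEL = 7             # e.g. 8 discrete fan speeds: levels 0..7

    class PIFanController:
        def __init__(self, temp_threshold):
            self.threshold = temp_threshold
            self.integral = 0.0

        def step(self, core_temps):
            """One control step: map the hottest core's excess temperature
            over the threshold to a discrete fan speed level."""
            error = max(core_temps) - self.threshold
            self.integral = max(0.0, self.integral + error)   # crude anti-windup
            u = KP * error + KI * self.integral               # PI control law
            return int(max(0, min(MAX_LEVEL, round(u))))      # quantize and clamp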

Page 42: Energy Efficiency in Large Scale Systems

Cooling aware workload scheduling

State-of-the-art load balancing algorithms in operating systems do not consider cooling costs

Assume a case of two sockets where one runs an intensive workload while the other executes a moderately active workload:

The workload is balanced from the performance point of view

One fan runs at high speed while the other runs at moderate speed

[Diagram: the high-temperature socket with its fan at high speed, the other socket at moderate speed]

Sources of inefficiency in cooling costs?

fan power ~ (fan speed)^3, a nonlinear relation (see the illustration below)

How and when should the workload be scheduled to minimize the cooling costs?
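A purely illustrative calculation of why the cubic relation favors an even temperature distribution; the normalized speeds below are arbitrary and ignore the thermal model that sets the real required speeds:

    def fan_power(speed):
        return speed ** 3                    # fan power ~ (fan speed)^3

    unbalanced = fan_power(1.0) + fan_power(0.6)   # one fast fan + one moderate fan
    balanced = 2 * fan_power(0.8)                  # same total speed, spread evenly
    print(round(unbalanced, 3), round(balanced, 3))  # 1.216 vs 1.024 (arbitrary units)

Even with the same total fan speed, spreading the heat cuts cooling power in this toy example by roughly 15%.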

Page 43: Energy Efficiency in Large Scale Systems

Handling cooling inefficiency

Migrate computations to minimize the cooling cost

Migration overhead is acceptable: the heat sink temperature time constant (tens of seconds) >> migration latency (tens to hundreds of µs)

[Diagram: a candidate migration from the socket with a high-speed fan to the socket with a low-speed fan]

Challenge: which threads to migrate?

A thermal and cooling model is needed to estimate the benefits of workload reassignment

Page 44: Energy Efficiency in Large Scale Systems

Triggering workload rescheduling

[Plots: fan speed over time, with workload rescheduling triggered reactively vs. predictively]

Reactive approach:

Lowers cooling savings

Cannot minimize the noise level

Impacts fan stability

Predictive approach:

Improves energy

Lowers the noise level

Provides better stability

Challenge: design of an efficient, proactive, cooling-aware dynamic workload management technique

Page 45: Energy Efficiency in Large Scale Systems

Socket level strategies: spreading

Fan speed can be reduced by creating a better temperature distribution:

Migrate some of the active threads from the sockets with high fan speed to sockets with lower speed

Swap some of the hot threads from sockets with high fan speed with colder threads from sockets with lower speed

[Diagram: before spreading, one fan at high speed and one at low speed; after spreading, both at moderate speed]

Page 46: Energy Efficiency in Large Scale Systems

Socket level strategies: consolidation

Concentrate the hot threads into fewer sockets while keeping the fans at the same speed:

Migrate hot threads to the socket whose fan is spinning higher than required

If fan speed_M ≥ fan speed_N, we can swap the hot thread from socket N with colder threads from socket M

[Diagram: before consolidation, both fans at moderate speed; after, one at moderate speed and one at low speed; condition: P_W ≤ P_C + P_D]

Page 47: Energy Efficiency in Large Scale Systems

Multi-tier workload management

Multi-tier workload management to enhance the temperature distribution in the entire system:

VM management at the physical machine level

VP management at the CPU socket level

[Diagram: VM migration between Server i and Server j, and thread migration between sockets, shifting load away from sockets with high-speed fans toward sockets with low- and moderate-speed fans]

Page 48: Energy Efficiency in Large Scale Systems

Scheduling policy

Multi-tier scheduling algorithm (see the sketch below):

VMs at the machine level

VPs at the socket level

Run periodically:

Period ~ minutes at the VM level

Period ~ seconds at the VP level

[Flowchart: traverse the VMs/VPs and evaluate the spreading savings, marking candidates if savings exist; traverse the VMs/VPs and evaluate the consolidation savings, marking candidates if savings exist; then schedule]
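A schematic Python version of this two-tier loop; the period values, the evaluate_* callables, and the migrate hook are placeholders standing in for the cooling-savings estimates and migration mechanisms discussed on the previous slides:

    import time

    VM_PERIOD_S = 300    # "~ minutes" at the VM (machine) level; value assumed
    VP_PERIOD_S = 5      # "~ seconds" at the VP (socket) level; value assumed

    def schedule_once(evaluate_spreading, evaluate_consolidation, migrate):
        """Traverse candidate moves, keep those with positive estimated savings."""
        for plan in list(evaluate_spreading()) + list(evaluate_consolidation()):
            if plan["savings"] > 0:
                migrate(plan)

    def run(vp_tier, vm_tier):
        """Each tier is (evaluate_spreading, evaluate_consolidation, migrate)."""
        last_vm_pass = 0.0
        while True:
            schedule_once(*vp_tier)                      # socket-level VP moves
            if time.time() - last_vm_pass >= VM_PERIOD_S:
                schedule_once(*vm_tier)                  # machine-level VM moves
                last_vm_pass = time.time()
            time.sleep(VP_PERIOD_S)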

Page 49: Energy Efficiency in Large Scale Systems

Experimental setup

Platform: a cluster of 4 servers, each with 2 Xeon sockets

Simulation: actual power measurements of the jobs; HotSpot [Skadron 03] for thermal/cooling simulations

K. Skadron, et al., "Temperature-aware microarchitecture," ISCA 2003.

Policies: we compare our results against state-of-the-art dynamic load balancing, which minimizes the differences in task queues across the various levels

We use benchmark combinations from the SPEC CPU2000 suite

We assume live migration for migrating the VMs across the PMs

VM assignment (PM 1 – 4)                                               Ambient temp (°C)
PM1(3perl + 2bzip2), PM2 = PM1, PM3(3gcc + 2mcf), PM4 = PM3            41 each
PM1(3perl + 2bzip2), PM2 = PM1, PM3(3gcc + 2mcf), PM4 = PM3            43 each
PM1(4perl + 4bzip2), PM2 = PM1, PM3(4gcc + 4mcf), PM4 = PM3            43 each
PM1(4perl + 4bzip2), PM2 = PM1, PM3(4gcc + 4mcf), PM4 = PM3            45, 41, 38, 41
PM1(2perl + 4gcc + 2mcf), PM2(2perl + 4gcc), PM3 = PM1, PM4 = PM2      43 each
PM1(2perl + 6gcc), PM2 = PM3 = PM4 = PM1                               43 each

Page 50: Energy Efficiency in Large Scale Systems

Energy savings

Average cooling energy savings of 72%

[Chart: cooling power savings (%) for workload configurations 1 through 6; y-axis 0 to 90%]

Configurations 1–6 are the VM assignments and ambient temperatures listed in the setup table on Page 49.

Page 51: Energy Efficiency in Large Scale Systems

What next?

Energy management must take both computation and cooling costs into account

What about priority/QoS?

The current approach maximizes overall throughput/speedup

What if someone pays more or has higher priority?

Closed-loop design for performance monitoring