Energy Efficiency in Large Scale Systems
DESCRIPTION
June 2010 presentation by GreenLight researcher Tajana Rosing on her group's research on system energy efficiency.
TRANSCRIPT
Gaurav Dhiman, Raid Ayoub
Prof. Tajana Šimunić Rosing
Dept. of Computer Science
System Energy Efficiency Lab
seelab.ucsd.edu
Large scale systems: Clusters
Power consumption is a critical design parameter: operational costs
o Compute equipment
o Cooling
By 2010, the US electricity bill for powering and cooling data centers is ~$7B [1]
Electricity input to data centers in the US exceeds the electricity consumption of Italy!
[1] Meisner et al., ASPLOS 2009
Energy Savings with DVFS
Reduction in CPU power (E_R) vs. extra system energy from longer execution (E_E)
Effectiveness of DVFS
For energy savings: E_R > E_E
Factors in modern systems affecting this equation:
Performance delay (t_delay)
Idle CPU power consumption (contributes to P_E)
Power consumption of other devices (contributes to P_E)
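One way to write this condition out; the decomposition below is a sketch, with t_exec and the per-setting CPU powers as illustrative symbols rather than notation from the slides:

```latex
% Sketch: energy balance for one run under DVFS.
% t_exec: runtime at the high v-f setting; t_delay: extra time at the low one.
E_R \approx \bigl(P_{CPU}^{high} - P_{CPU}^{low}\bigr)\, t_{exec}
\qquad
E_E \approx P_E \, t_{delay}
% P_E lumps idle-CPU power plus the power of memory and other devices
% that stay active during the extra time.
\mathrm{DVFS\ saves\ energy} \iff E_R > E_E
```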
Performance Delay
Lower t_delay => higher energy savings
Depends on memory/CPU intensiveness
Experiments with SPEC CPU2000:
mcf: highly memory intensive (expect low t_delay)
sixtrack: highly cache/CPU intensive (expect high t_delay)
Two state-of-the-art processors:
AMD quad-core Opteron: on-die memory controller (2.6GHz), DDR3
Intel quad-core Xeon: off-chip memory controller (1.3GHz), DDR2
Performance Delay
[Figure: measured %delay for mcf and sixtrack on both platforms]
On the AMD Opteron, the on-die memory controller and fast DDR3 memory keep mcf fed, so mcf is much closer to the worst case
On the Intel Xeon, the slower off-chip memory controller and memory stall mcf, so mcf is much closer to the best case
Idle CPU power consumption
Low-power idle CPU states are common now
C1 state used by default: zero dynamic power consumption
Support for deeper C-states appearing: C6 on Nehalem (zero dynamic + leakage power)
The deeper the idle state, the higher the extra CPU power consumption under DVFS for modern CPUs => lower DVFS benefits
Device power consumption
DVFS makes other devices consume power for a longer time (t_delay)
Memory (4GB DDR3): idle -> 5W, active -> 10W
Higher extra device power consumption => lower DVFS benefits for memory-intensive benchmarks
Evaluation Setup
Assume a simple static-DVFS policy
AMD Opteron (four v-f settings): 1.25V/2.6GHz, 1.15V/1.9GHz, 1.05V/1.4GHz, 0.9V/0.8GHz
Compare against a base system with no DVFS and three simple idle PM policies:

Policy   Description
PM-1     switch CPU to ACPI state C1 (remove clock supply) and move to lowest voltage setting
PM-2     switch CPU to ACPI state C6 (remove power)
PM-3     switch CPU to ACPI state C6 and switch the memory to self-refresh mode
Methodology
Run SPEC CPU2000 benchmarks at all v-f settings
Estimate savings baselined against a system with the PM-1 through PM-3 policies
E_{PM-i} varies based on the policy
DVFS beneficial if: %Esavings_{PM-i} > 0
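As a minimal sketch of this check (all energies below are hypothetical, not measured values from the study):

```python
def pct_energy_savings(e_baseline_j: float, e_dvfs_j: float) -> float:
    """%Esavings_PM-i: positive means DVFS beats the PM-i baseline."""
    return 100.0 * (e_baseline_j - e_dvfs_j) / e_baseline_j

# Hypothetical per-run energies (joules) for one benchmark at one
# v-f setting, compared against baselines using PM-1..PM-3.
e_dvfs = 995.0
for policy, e_base in {"PM-1": 1050.0, "PM-2": 1000.0, "PM-3": 980.0}.items():
    s = pct_energy_savings(e_base, e_dvfs)
    print(f"{policy}: {s:+.1f}% -> DVFS beneficial: {s > 0}")
```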
Results

Benchmark   Freq (GHz)   %delay   %Esavings PM-1   PM-2    PM-3
mcf         1.9          29       5.2              0.7     -0.5
mcf         1.4          63       8.1              0.1     -2.1
mcf         0.8          163      8.1              -6.3    -10.7
bzip2       1.9          37       4.7              -0.6    -2.1
bzip2       1.4          86       7.4              -2.4    -5
bzip2       0.8          223      7.8              -9.0    -14
art         1.9          32       6                1       -0.1
art         1.4          76       7.3              -1.7    -4
art         0.8          202      8                -8      -13
sixtrack    1.9          37       5                -0.5    -2
sixtrack    1.4          86       6                -4.3    -7.2
sixtrack    0.8          227      7                -11     -16.1

Observations:
• High average delay, due to the on-die memory controller
• Max average savings of ~7% (vs. PM-1), at a high performance delay
• The lowest v-f setting is not useful: avg 7% savings for avg 200% delay
• With lower system idle power consumption (PM-2, PM-3), DVFS is energy inefficient
Conclusion
Simple power management policies provide better energy/performance tradeoffs
Lower v-f settings offer worse e/p tradeoffs due to high performance delay
DVFS still useful for:
Power reduction: thermal management
Systems with simpler memory controllers and low-power system components
Server Power Breakdown
Energy Proportional Computing
Figure 2. Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server still consumes about half its full power when doing virtually no work.
"The Case for Energy-Proportional Computing," Luiz André Barroso, Urs Hölzle, IEEE Computer, December 2007
Doing nothing well … NOT!
Energy Efficiency = Utilization / Power
Energy Proportional Computing
Figure 1. Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum utilization levels.
It is surprisingly hard to achieve high levels of utilization of typical servers (and your home PC or laptop is even worse)
"The Case for Energy-Proportional Computing," Luiz André Barroso, Urs Hölzle, IEEE Computer, December 2007
Energy Proportional Computing
Figure 4. Power usage and energy efficiency in a more energy-proportional server. This server has a power efficiency of more than 80 percent of its peak value for utilizations of 30 percent and above, with efficiency remaining above 50 percent for utilization levels as low as 10 percent.
"The Case for Energy-Proportional Computing," Luiz André Barroso, Urs Hölzle, IEEE Computer, December 2007
Design for a wide dynamic power range and active low-power modes
Doing nothing VERY well
Energy Efficiency = Utilization / Power
Why not consolidate servers?
Obstacles to consolidating directly onto one OS:
Security
Isolation
Must use the same OS
Solution: use virtualization!
Virtualization
Benefits:
• Isolation and security
• Different OS in each VM
• Better resource utilization
• Improved manageability
• Dynamic load management
• Energy savings through VM consolidation!
How to Save Energy?
VM consolidation is a common practice: increases resource utilization
Put idle machines into sleep mode
What about active machines?
Active power management (e.g. DVFS) is less effective in newer server processors: leakage, faster memories, narrow voltage range
Make the workload run faster
Similar average power across machines
Exploit workload characteristics to share resources efficiently
Motivation: Workload Characterization
[Figure: VM1 running mcf and VM2 running eon, hosted on physical machines PM1 and PM2; 60% callout]
Motivation: Workload Characterization
Workload characteristics determine:
Power/performance profile
Power distribution
Co-schedule/consolidate heterogeneous VMs
Motivation: Workload Characterization
Co-schedule/consolidate heterogeneous VMs
What about DVFS?
Poor performance
Energy inefficient
Only good for homogeneously high-MPC workloads
vGreen
A system for VM scheduling across a cluster of physical machines
Dynamic VM characterization:
Memory accesses
Instruction throughput
CPU utilization
Co-schedule VMs with heterogeneous characteristics for better:
Performance
Energy efficiency
Balanced thermal profile
Scheduling with VMs
[Figure: VM1 and VM2, each with VCPU1 and VCPU2, running above the Xen scheduler alongside Dom0]
Dom-0: privileged VM
Management
I/O
VM creation: specify CPU, memory, and I/O config
The CPU of a VM is referred to as a VCPU: the fundamental unit of execution
The OS inside a VM schedules onto VCPUs
Xen schedules VCPUs across PCPUs
vGreen Architecture
Main components:
vgnodes
vgxen: characterizes the running VMs
vgdom: exports information to vgserv
vgserv
Collects and analyzes the characterization information
Issues scheduling commands based on the balancing policy
[Figure: vgserv runs vgpolicy and exchanges updates/commands with vgnode1 and vgnode2; on each vgnode, vgxen runs in Xen and vgdom in Dom0 alongside the VMs]
vgnode (client physical machine)
[Figure: inside each VM, per-VCPU metrics (wMPC, wIPC, util) are aggregated into per-VM metrics (vMPC, vIPC, vutil)]
vgxen: characterizes the VMs
Uses performance counters to estimate:
IPC (instructions per cycle)
MPC (memory accesses per cycle)
Weighted by CPU utilization (wIPC, wMPC)
vgdom:
Reads metrics from vgxen
Exports them to vgserv
wMPC_cur = MPC_cur · util
wMPC = α · wMPC_cur + (1 - α) · wMPC_prev
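A minimal sketch of this characterization update (the class name, sample values, and α = 0.5 are illustrative; the slide specifies only the two equations):

```python
class VcpuMetrics:
    """Tracks utilization-weighted MPC for one VCPU, smoothed with an
    exponentially weighted moving average per the slide's equations."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha      # smoothing factor (illustrative value)
        self.wmpc = 0.0         # smoothed, utilization-weighted MPC

    def update(self, mpc_cur: float, util: float) -> float:
        # wMPC_cur = MPC_cur * util
        wmpc_cur = mpc_cur * util
        # wMPC = alpha * wMPC_cur + (1 - alpha) * wMPC_prev
        self.wmpc = self.alpha * wmpc_cur + (1 - self.alpha) * self.wmpc
        return self.wmpc

m = VcpuMetrics()
for sample in [(0.02, 0.9), (0.05, 1.0), (0.04, 0.7)]:  # (MPC, utilization)
    print(m.update(*sample))
```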
Hierarchical Workload Characterization
[Figure: three-level metric hierarchy inside a vgnode]
VCPU-level metrics, maintained by vgxen: wMPC, wIPC, util
VM-level metrics, maintained by vgpolicy and vgxen: vMPC, vIPC, vutil (aggregated from the wMPC, wIPC, util of the VM's VCPUs)
Node-level metrics, maintained by vgpolicy: nMPC, nIPC, nutil (aggregated from the vMPC, vIPC, vutil of the node's VMs)
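A compact sketch of the roll-up, assuming (as the diagram's arrows suggest) that each level sums the level below; the data layout is illustrative:

```python
# VCPU-level metrics: (wMPC, wIPC, util) per VCPU, keyed by VM name.
vcpus = {
    "VM1": [(0.010, 0.8, 0.9), (0.012, 0.7, 0.8)],
    "VM2": [(0.001, 1.5, 0.6), (0.002, 1.4, 0.7)],
}

def vm_metrics(vcpu_list):
    """VM level: vMPC, vIPC, vutil aggregated over the VM's VCPUs."""
    return tuple(sum(col) for col in zip(*vcpu_list))

def node_metrics(vms):
    """Node level: nMPC, nIPC, nutil aggregated over the node's VMs."""
    return tuple(sum(col) for col in zip(*vms.values()))

vms = {name: vm_metrics(v) for name, v in vcpus.items()}
nmpc, nipc, nutil = node_metrics(vms)
print(vms, (nmpc, nipc, nutil))
```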
vgserv (VM scheduler)
vgpolicy: schedules VMs
Balances the overall MPC and IPC across vgnodes:
MPC: performance and energy efficiency
IPC: power distribution
[Figure: vgpolicy in vgserv receives periodic node- and VM-level MPC/IPC updates from vgnode1-vgnode4 and issues migration commands, e.g. moving VM2 to another node]
If decisions conflict, priority is given to MPC balancing
vgpolicy
Example: MPC Balance Algorithm
Triggered when a vgnode's nMPC exceeds the threshold nMPC_th relative to the node with minimum nMPC (vgnode_nMPCmin)
Find the VM with minimum MPC on the overloaded node
Migrate it to vgnode_nMPCmin only if the move does not reverse the imbalance
[Figure: VM2 migrating across vgnodes to balance nMPC]
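A sketch of one balancing step under this reading of the slide; the threshold value, node selection, and overshoot check are interpretations of the fragments above, not code from vGreen:

```python
def mpc_balance_step(nodes: dict, nmpc_th: float):
    """One MPC-balance pass. `nodes` maps node name -> {vm name: vMPC}.
    Returns a (vm, src, dst) migration or None."""
    nmpc = {n: sum(vms.values()) for n, vms in nodes.items()}
    src = max(nmpc, key=nmpc.get)            # most memory-intensive node
    dst = min(nmpc, key=nmpc.get)            # vgnode_nMPCmin
    if nmpc[src] - nmpc[dst] <= nmpc_th:
        return None                          # already balanced
    vm = min(nodes[src], key=nodes[src].get)  # VM with minimum MPC
    moved = nodes[src][vm]
    # Migrate only if the move does not reverse the imbalance.
    if (nmpc[src] - moved) >= (nmpc[dst] + moved):
        return vm, src, dst
    return None

nodes = {"vgnode1": {"VM1": 0.02, "VM2": 0.015}, "vgnode2": {"VM3": 0.001}}
print(mpc_balance_step(nodes, nmpc_th=0.005))  # ('VM2', 'vgnode1', 'vgnode2')
```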
Implementation
Xen 3.3.1 as the hypervisor
vgxen implemented as part of the stock Xen credit scheduler
vgdom implemented as a driver and application in Domain0
Communicates with vgxen through a shared page
No modifications required to the guest OS!
Used a testbed of dual quad-core Intel Xeon machines as vgnodes
A Linux-based desktop used as vgserv
Methodology
Create VMs running Linux as the guest OS:
4 VCPUs and 4GB memory for each VM
Populate with 18 workloads with varying characteristics: mcf, eon, art, equake, swim, bzip2, gcc, etc.
Initial assignment of VMs based on Eucalyptus [Nurmi, CCA'08]
Compare against 'E+': Eucalyptus + state-of-the-art dynamic VM scheduling algorithms
Perform VM consolidation based on CPU utilization
Weighted Speedup vs E+
[Chart: weighted speedup over E+ for each workload mix; y-axis 0%-50%]
20% speedup on average
Average 40% weighted speedup
Energy Savings vs E+
Average 35% energy savings
Balanced Thermal Profile
Average power variance reduction of 30W
Cooling subsystem challenges
Cooling systems maintain safe temperatures and avoid reliability issues
The fan subsystem consumes significant energy (up to 30% of total server power)
Challenges:
How to minimize the cooling costs within a single machine?
How to further reduce the cooling costs by creating a better temperature distribution across the physical machines?
Fan subsystem
[Figure: a 1U server with two CPUs, each cooled by fans driven by a fan controller]
Fan controller (e.g. a PI controller):
Thermal sensors report the maximum temperature of the CPU cores on the die
The controller compares it against a temperature threshold
An actuator adjusts the fan speed (e.g. 8 discrete speeds)
An efficient fan controller can be constructed based on control theory
Traditionally, cooling optimizations focus only on the fan controller, without including workload management
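A toy sketch of such a PI controller (gains, setpoint, and the 8-speed quantization are illustrative assumptions):

```python
class PIFanController:
    """Proportional-integral fan control: drive the max core temperature
    toward a threshold by picking one of 8 discrete fan speeds."""

    def __init__(self, t_threshold=70.0, kp=0.08, ki=0.02, n_speeds=8):
        self.t_threshold = t_threshold  # deg C setpoint (illustrative)
        self.kp, self.ki = kp, ki       # PI gains (illustrative)
        self.n_speeds = n_speeds
        self.integral = 0.0

    def step(self, t_max: float) -> int:
        """Map the current max core temperature to a fan speed index 0..7."""
        error = t_max - self.t_threshold
        self.integral = max(0.0, self.integral + error)  # anti-windup clamp
        u = self.kp * error + self.ki * self.integral    # control effort
        return min(self.n_speeds - 1, max(0, round(u)))

ctl = PIFanController()
for t in [60, 68, 75, 82, 78]:          # sampled max core temps (deg C)
    print(t, "->", ctl.step(t))
```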
Cooling aware workload scheduling
State-of-the-art load balancing algorithms in operating systems do not consider cooling costs
[Figure: two sockets, the hotter one with its fan at high speed, the other at moderate speed]
Assume a case of two sockets where one runs an intensive workload while the other executes a moderately active workload
The workload is balanced from the performance point of view
One fan runs at high speed while the other runs at moderate speed
Where does the cooling inefficiency come from?
fan power ∝ (fan speed)^3, a nonlinear relation
How and when should the workload be scheduled to minimize the cooling costs?
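The cubic relation is what makes balancing pay off; a worked example with illustrative normalized speeds:

```latex
P_{fan} \propto s^3.
\quad \text{One fan at } s = 1.0 \text{ and one at } s = 0.5:\;
1.0^3 + 0.5^3 = 1.125.
\quad \text{Both balanced at } s = 0.75:\;
2 \times 0.75^3 \approx 0.84,
\quad \text{about } 25\% \text{ less fan power for the same total airflow.}
```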
Handling cooling inefficiency
Migrate computations to minimize the cooling cost
Migration overhead is acceptable:
Heat sink temperature time constant (tens of seconds) >> migration latency (tens to hundreds of µs)
[Figure: migrating threads from the socket with its fan at high speed to the one at low speed]
Challenge: which threads to migrate?
A thermal and cooling model is needed to estimate the benefits of workload reassignment
Triggering workload rescheduling
[Figure: fan speed over time, with workload rescheduling points under reactive vs. predictive triggering]
Reactive approach:
Lowers cooling savings
Cannot minimize the noise level
Impacts fan stability
Predictive approach:
Improves energy savings
Lowers the noise level
Provides better stability
Challenge: design of an efficient, proactive, cooling-aware dynamic workload management technique
Socket level strategies: spreading
Fan speed can be reduced by creating a better temperature distribution:
Migrate some of the active threads from the sockets with high fan speed to sockets with lower speed
Swap some of the hot threads from sockets with high fan speed with colder threads from sockets with lower speed
[Figure: before: one fan at high speed, one at low speed; after: both at moderate speed]
Socket level strategies: consolidation
Concentrate the hot threads into fewer sockets while keeping the fan at the same speed:
Migrate hot threads to a socket whose fan is spinning higher than required
If fan speed_M ≥ fan speed_N, we can swap the hot thread from socket N with colder threads from socket M
[Figure: before: both fans at moderate speed; after: one at moderate speed, one at low speed]
P_W ≤ P_C + P_D
Multi-tier workload management
Multi-tier workload management to enhance the temperature distribution in the entire system:
VM management at the physical machine level
VP management at the CPU socket level
[Figure: VM migration between Server i (fan at high speed) and Server j, plus thread migration between sockets, moving the system toward moderate fan speeds everywhere]
Scheduling policy
Multi-tier scheduling algorithm:
VMs at the machine level
VPs at the socket level
Each tier evaluates spreading savings and consolidation savings: traverse the VMs/VPs, mark the ones for which savings exist, then schedule (see the sketch below)
Run periodically:
Period ~ minutes at the VM level
Period ~ seconds at the VP level
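A skeleton of how such a two-tier loop could be organized; the function names, period values, and savings estimators are placeholders, since the slide specifies only the evaluate/mark/schedule structure and the two periods:

```python
import time

def evaluate_spreading(units):
    """Traverse VMs/VPs and mark those whose estimated spreading savings
    are positive; a real system would query the thermal/cooling model."""
    return [u for u in units if u.get("spread_savings", 0) > 0]

def evaluate_consolidation(units):
    return [u for u in units if u.get("consolidate_savings", 0) > 0]

def scheduling_pass(units, level):
    for move in evaluate_spreading(units) + evaluate_consolidation(units):
        print(f"[{level}] schedule move: {move['name']}")  # placeholder

def run(vms, vps, vm_period_s=120.0, vp_period_s=5.0, duration_s=600.0):
    """Two-tier loop: VP (socket) tier every few seconds, VM (machine)
    tier every few minutes, as on the slide."""
    start = last_vm = time.monotonic()
    while time.monotonic() - start < duration_s:
        if time.monotonic() - last_vm >= vm_period_s:
            scheduling_pass(vms, "VM")       # machine-level tier
            last_vm = time.monotonic()
        scheduling_pass(vps, "VP")           # socket-level tier
        time.sleep(vp_period_s)
```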
Experimental setup
Platform: a cluster of 4 servers, each with 2 Xeon sockets
Simulation: actual power measurements of the jobs
HotSpot [Skadron 03] used for thermal/cooling simulations
K. Skadron, et al., Temperature-aware microarchitecture, ISCA 2003.
Policies:
We compare our results against state-of-the-art dynamic load balancing
Dynamic load balancing minimizes the differences in task queues across the various levels
We use benchmark combinations from the SPEC 2000 suite
We assume live migration for migrating the VMs across the PMs

VM assignment (PM1-PM4)                                                Ambient temp (°C)
1: PM1(3perl + 2bzip2), PM2 = PM1, PM3(3gcc + 2mcf), PM4 = PM3         41 each
2: PM1(3perl + 2bzip2), PM2 = PM1, PM3(3gcc + 2mcf), PM4 = PM3         43 each
3: PM1(4perl + 4bzip2), PM2 = PM1, PM3(4gcc + 4mcf), PM4 = PM3         43 each
4: PM1(4perl + 4bzip2), PM2 = PM1, PM3(4gcc + 4mcf), PM4 = PM3         45, 41, 38, 41
5: PM1(2perl + 4gcc + 2mcf), PM2(2perl + 4gcc), PM3 = PM1, PM4 = PM2   43 each
6: PM1(2perl + 6gcc), PM2 = PM3 = PM4 = PM1                            43 each
Energy savings
[Chart: cooling power savings (%) for configurations 1-6 above; y-axis 0-90%]
Average cooling energy savings of 72%
What next?
Energy management must take both computation and cooling costs into account
What about priority/QoS?
The current approach maximizes overall throughput/speedup
What if someone pays more or has higher priority?
Closed-loop design for performance monitoring