Virtual Melting Temperature: Managing Server Load to Minimize Cooling
Overhead with Phase Change Materials
Matt Skach (1), Manish Arora (2,3), Dean Tullsen (3), Lingjia Tang (1), Jason Mars (1)
(1) University of Michigan -- (2) Advanced Micro Devices, Inc. -- (3) UC San Diego
ISCA ‘18
Datacenters
Facebook Ireland Datacenter
Facebook datacenter
Huge warehouses full of servers that host the internet and the cloud
Datacenter Cooling
● Heat must be removed to prevent:
  ○ Overheating
  ○ Thermal downclocking
  ○ Component failure
http://www.asetek.com/media/1031/rackcdu_d2c_datacenter.jpg
Global Energy Consumption (CIA World Factbook)
Rank  Region          Electricity Consumption (TWh/year)
1     China           6,100
2     United States   4,100
3     European Union  3,100
4     India           1,300
5     Russia          1,000
6     Japan           980
7     Canada          640
Datacenter Energy Consumption (Avgerinou, 2017)
Rank  Region                             Electricity Consumption (TWh/year)
1     China                              6,100
2     United States                      4,100
3     European Union                     3,100
-     Datacenters (global, est.)         1,600
4     India                              1,300
5     Russia                             1,000
6     Japan                              980
-     Datacenter Cooling (global, est.)  650
7     Canada                             640
Datacenter Cooling
● Datacenter cooling is very expensive
  ○ Infrastructure can cost tens of millions of dollars for large datacenters (Kontorinis, 2014)
  ○ Generally, more power-efficient systems are more expensive up front
Open Compute cooling system
Datacenter Workloads
● Diurnal load is problematic
  ○ Work is uneven
  ○ Work is distributed
  ○ Heat is produced when work is done
Google Search: US Load
Datacenter Cooling
● Build a big cooling system for peak load
  ○ Underutilized most of the time
Expensive: 100% coverage, low utilization
Datacenter Cooling ctd.
● Build a big cooling system for peak load
  ○ Underutilized most of the time
Expensive: 100% coverage, low utilization
Best: 50% coverage, maximum utilization
Thermal Time Shifting (TTS) [ISCA ‘15]
[Figure: cooling load over time of day (3am, 7am, 7pm, 12am), coupled vs. decoupled. Store heat to flatten the peak; release heat during off hours.]
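The idea of storing heat at the peak and releasing it off-peak can be illustrated with a toy model. This is a minimal sketch, not the authors' simulator: it assumes an ideal thermal store of fixed capacity, a fixed cooling capacity, and a made-up hourly heat trace. The name `shift_cooling_load` and all numbers are hypothetical.

```python
def shift_cooling_load(heat, cooling_cap, storage_cap):
    """Toy thermal time shifting.

    Heat above cooling_cap is absorbed by an ideal store (up to storage_cap);
    stored heat is released whenever the cooling system has spare capacity.
    Returns the load seen by the cooling system at each time step.
    """
    stored = 0.0
    load = []
    for h in heat:
        if h > cooling_cap:
            # Peak: absorb as much excess heat as the store can still hold.
            absorbed = min(h - cooling_cap, storage_cap - stored)
            stored += absorbed
            load.append(h - absorbed)
        else:
            # Off-peak: release stored heat into the spare cooling capacity.
            released = min(cooling_cap - h, stored)
            stored -= released
            load.append(h + released)
    return load


# Hypothetical diurnal heat trace (arbitrary energy units per time step).
heat = [2, 2, 3, 6, 8, 8, 6, 3, 2, 2, 2, 2]
load = shift_cooling_load(heat, cooling_cap=5, storage_cap=10)
print("peak without storage:", max(heat))
print("peak with storage:   ", max(load))
```

In this sketch the total heat removed is unchanged; only the time at which the cooling system sees it moves, which is what lets the cooling system be sized below the raw peak.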
Cooling Load
● Metric of heat that must be removed
● Datacenter is primarily concerned with IT & support equipment
http://www.slideshare.net/spsu/12-cooling-load-calculations
A Phase Change Material (PCM)
● Store energy in a solid->liquid phase change
● Commercial paraffin wax offers the best properties of currently available PCMs (Skach, 2015)
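A rough sense of how much heat a block of wax can absorb while melting comes from the latent heat relation Q = m * L. The sketch below uses assumed values only: a latent heat of about 200 kJ/kg is a typical order of magnitude for paraffin, and the wax mass and absorbed power are hypothetical, not taken from the slides.

```python
def latent_heat_kj(mass_kg, latent_heat_kj_per_kg):
    """Heat absorbed by fully melting the wax (Q = m * L), in kJ."""
    return mass_kg * latent_heat_kj_per_kg


def melt_time_hours(mass_kg, latent_heat_kj_per_kg, absorbed_power_w):
    """Hours to fully melt the wax if it absorbs heat at a constant rate."""
    energy_j = latent_heat_kj(mass_kg, latent_heat_kj_per_kg) * 1000.0
    return energy_j / absorbed_power_w / 3600.0


# Assumed values for illustration only (not from the slides).
WAX_MASS_KG = 1.0
PARAFFIN_LATENT_HEAT = 200.0  # kJ/kg, typical order of magnitude for paraffin
ABSORBED_POWER_W = 20.0       # heat flowing into the wax

print(latent_heat_kj(WAX_MASS_KG, PARAFFIN_LATENT_HEAT), "kJ stored")
print(round(melt_time_hours(WAX_MASS_KG, PARAFFIN_LATENT_HEAT, ABSORBED_POWER_W), 2), "hours to melt")
```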
The problem with passive TTS
Thermal Time Shifting:
● Paraffin has a limited range of melting temperatures
● Melting temperature cannot be changed
● Power and temperature profiles vary over lifetime of servers
Wikimedia Commons
Virtual Melting Temperature
● Datacenters need more flexibility
● Create a “virtual” melting temperature separate from the actual melting temperature
Microsoft, Wikimedia Commons
Test Infrastructure
● 2U High Throughput Server
● 2-day Google Workload trace divided between 5 datacenter workloads
Test Methodology
● 5 common datacenter workloads
  1. Web Search
  2. Data Caching
  3. Video Encoding
  4. Virus Scan
  5. Clustering
● Consider a datacenter where all are colocated
  ○ Contention mitigation techniques applied (e.g., Bubble Up (Mars, 2011) and Protean Code (Laurenzano, 2014))
Baseline: Load Balancing Schedulers
● Round Robin and Coolest First
● Problem: Average cluster temperature is too low to melt wax
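The two baseline policies can be sketched as placement functions. This is an illustrative sketch under simplifying assumptions, not the schedulers used in the evaluation: each job is assumed to raise a server's temperature by a fixed hypothetical amount, and the function names are made up.

```python
from itertools import count


def round_robin(num_servers):
    """Return a placement function that cycles through servers in order."""
    counter = count()

    def place(temps):
        # Temperatures are ignored; servers are chosen in rotation.
        return next(counter) % num_servers

    return place


def coolest_first(temps):
    """Place the job on the server with the lowest current temperature."""
    return min(range(len(temps)), key=lambda i: temps[i])


temps = [30.0, 30.0, 30.0, 30.0]  # hypothetical server temperatures (deg C)
JOB_HEAT = 2.0                    # hypothetical temperature rise per job
for _ in range(8):
    temps[coolest_first(temps)] += JOB_HEAT
print(temps)
```

Both policies spread heat evenly across the cluster, so in this toy model every server ends at the same moderate temperature, which mirrors the problem stated above: no server gets hot enough to melt wax.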
Thermal Aware VMT
● Categorize jobs based upon thermal characteristics
  ○ Binary classification: Would they melt significant wax in isolation?
Thermal Aware VMT
● Grouping Value (GV): Controllable ratio of group size
  ○ Proportional to hot group size
● Locate ‘hot jobs’ together in ‘hot group’ to melt wax
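The grouping idea can be sketched as a placement rule. This is a minimal sketch under stated assumptions, not the authors' scheduler: the slides do not give the mapping from GV to a server count, so the hot group size is passed in directly as a hypothetical parameter, hot jobs are assumed to go only to the hot group, and the least-loaded server within a group is chosen.

```python
def vmt_ta_place(job_is_hot, load, hot_group_size):
    """Pick a server index for a job.

    job_is_hot: binary thermal class of the job (True = hot)
    load: current number of jobs on each server
    hot_group_size: servers [0, hot_group_size) form the hot group (assumed layout)
    """
    if job_is_hot:
        candidates = range(0, hot_group_size)
    else:
        candidates = range(hot_group_size, len(load))
    # Within the chosen group, use the least-loaded server.
    return min(candidates, key=lambda i: load[i])


load = [0] * 6
jobs = [True, False, True, True, False, True]  # hypothetical hot/cold job stream
for hot in jobs:
    load[vmt_ta_place(hot, load, hot_group_size=2)] += 1
print(load)
```

Concentrating the hot jobs on a small group raises that group's temperature above the cluster average, which is what allows wax to melt there even when the average would be too low.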
Thermal Aware VMT Results
● Hot Group sized to melt wax during peak hours
Thermal Aware VMT Results
● Balance between melting wax too soon and not melting enough wax
GV=24: Hot group is too big
GV=22: Hot group is just right
GV=20: Hot Group is too small
Wax Aware VMT
● Begin with same setup as VMT-TA
● When wax in hot group is fully melted, expand hot group
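The expansion rule can be sketched as a small check run by the scheduler. This is a sketch under assumptions the slides do not specify: a per-server estimate of the melted wax fraction is assumed to be available, "fully melted" is taken to mean every server in the hot group, and the group grows by a hypothetical fixed step.

```python
def expand_hot_group(hot_group_size, melted_fraction, num_servers, step=1):
    """Grow the hot group once all of its wax is fully melted.

    melted_fraction: per-server fraction of wax melted, in [0, 1]
    Returns the (possibly larger) hot group size.
    """
    hot = melted_fraction[:hot_group_size]
    if hot and all(f >= 1.0 for f in hot):
        # No storage capacity left in the hot group: recruit more servers.
        return min(hot_group_size + step, num_servers)
    return hot_group_size


print(expand_hot_group(2, [1.0, 1.0, 0.2, 0.0], num_servers=4))  # group grows
print(expand_hot_group(2, [1.0, 0.7, 0.2, 0.0], num_servers=4))  # group unchanged
```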
Wax Aware VMT Results
● Hot Group slightly too small: automatically expands during peak load
Wax Aware VMT Results
● Wax expansion preserves significant cooling load reduction
GV=24: Hot group is too big
GV=22: Hot group is just right
GV=20: Hot Group is too small
VMT-TA vs. VMT-WA
● Both work well at ideal GV
● VMT-WA offers much more flexibility for unpredictable load
Smaller Hot Group
Bigger Hot Group
Summary
● VMT stores thermal energy when passive TTS alone cannot
  ○ Reduces maximum cooling load of a diurnal workload
  ○ Configurable for varying datacenter power and load levels
● VMT-enabled thermal energy storage can:
  ○ Reduce cooling system size by 12%
  ○ Or allow up to 14% more servers under the same cooling budget
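The 12% and 14% figures are roughly consistent with each other under a simple scaling argument. The sketch below is a back-of-the-envelope check, not the authors' derivation: it assumes peak cooling load scales linearly with the number of servers, so a 12% lower peak per server leaves room for 1 / (1 - 0.12) times as many servers under the same budget.

```python
def extra_servers_fraction(peak_reduction):
    """Extra servers that fit under a fixed cooling budget when each
    server's peak cooling load drops by peak_reduction.

    Assumes peak cooling load scales linearly with server count.
    """
    return 1.0 / (1.0 - peak_reduction) - 1.0


print(round(extra_servers_fraction(0.12) * 100, 1), "% more servers")
```

Under this assumption the result is about 13.6%, in line with the "up to 14%" stated above.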
Thank you!
Questions?