resource management: assessing the warm water facilities
TRANSCRIPT
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Resource Management: Assessing the Warm Water Facilities and Systems Supporting ORNL’s SummitJim Rogers
Director, Computing and FacilitiesNational Center for Computational SciencesOak Ridge National Laboratory
22
ORNL Leadership Class Systems 2004 - 2018
2012Cray XK7
Titan
27PF
18.5TF
25 TF
54 TF
62 TF
263 TF
1 PF
2.5PF
2004Cray X1E Phoenix
2005Cray XT3
Jaguar
2006Cray XT3
Jaguar
2007Cray XT4
Jaguar
2008Cray XT4
Jaguar
2008Cray XT5
Jaguar
2009Cray XT5
Jaguar
From 2004 – 2018, HPC systems relied on chiller-based cooling (5.5°C / 42°F
supply) with annualized PUEs to ~1.4
33
ORNL’s Transition to Warmer Facility Supply Temperatures
~1.5EF
200PF
27PF
2012Cray XK7
Titan
2021/2022HPE/Cray Frontier
2018/2019IBM
Summit
Titan: Refrigerant-based per-rack cooling with direct rejection of heat to cold 5.5°C water• Below dewpoint• 100% use of chillers
Summit: A combination of direct on-package cooling and RDHX with 21°C / 70°Fsupply is > 95% room-neutral.• Above dewpoint• Contribution by chillers
~20% of the hours of the year
Frontier: Custom packaging is >95% room-neutral with a 32°C / 90°F supply.• ~100% Evaporative Cooling,
with supplemental HVAC for parasitic loads
4
Motivation for “Warmer” Cooling Solutions Serving HPC Centers
• Reduced Cost, Both CAPEX and OPEX– Reduce or eliminate the need for traditional
chillers. • No chillers, no ozone-depleting refrigerants (GREEN)
– Oak Ridge calculates an annualized PUE for air--cooled devices of no better than 1.4 (ASHRAE Zone 4A – Mixed Humid)
Oak Ridge, TN
– HPC power budgets continue to grow – Summit has a design point for >12MW (HPC-only). Minimizing PUE/ITUE is critical to the budget.
• Easier, more reliable design– Design is reduced to pumps, evaporative cooling, heat exchangers.– Traditional chilled water may not be necessary at all (NREL, NCAR, et al)
55
OLCF Facilities Supporting Summit• Titan – 9MW @ heavy
load• Sitting on 250
pounds/ft2 raised floor• Uses 42F water and
special CDUs (XDPs)
• Summit – 256 compute cabinets on-slab
• 100% room-neutral design uses RDHX
• 20MW warm-water cooling plant using centralized CDU/secondary loop
66
• Summit– Demand: 3.3 (idle)-11.5MW;
• Secondary Loop– Supply 3300GPM (12,500
liters/min) @ 21°C; Return @ 29-33°C
– CPUs and GPUs use cold plates– DIMMs and parasitic loads use
RDHX– Storage and Network use RDHX
21°C/20MW/7700 ton Facility System Design System
77
Medium Water Temperature Cooling DesignPrimary Loop uses Evaporative Cooling Towers (~80% of the hours of the year)
When the MTW RETURN is above the 21C setpoint, use a second set of Trim HX (with 5.5C on the other side) to drive MTW to the 21C setpoint.
The need for the trim-loop is about 20% of the hours in the year, and can ramp 0-100% to meet the setpoint back to Summit
A New Challenge – Titan and rotated off-line, so there is no significant load on the Chillers[1-5]. NOAA moved to the warm water loop (~500GPM demand)
Existing chilled water cooling loop
New primarycooling loop
88
99
10
Benefits of ORNL’s Warm-Water (21°C) Mechanical Plant
Warm Water• Total Capacity to manage 20MW of IT
load
• Warm Water allows annualized PUE of 1.1– For each ~$1M cost per MW-year for
consumption on Summit;– A corresponding ~$100k cost per MW-year
for waste heat management
• Titan was $6-7M/year plus $1.8-$2M to remove the waste heat
• Summit will have a similar “power bill”, but closer to $500k annually for waste heat.
Integrated System/Facility Operation• Integration with the PLC allows us to tune
water pressure and flow– Better delta(t); less pumping energy
• Integration with IBM’s OpenBMC provides information necessary to protect ~37K CPUs and GPUs from inadequate flow across the cold plates
• Integration with the scheduler allows us to correlate power and temperature data with individual applications
11
Summit’s Power Demand, Aug 2018– Aug 2019
Early Access, Gordon Bell, Acceptance Testing
Transition to Operations
Full Production
12
Detailed Analysis of Summit’s Annual PUE
1.00
1.05
1.10
1.15
1.20
1.25
1.30
0
2000
4000
6000
8000
10000
12000
14000
2018
/08/
0120
18/0
8/11
2018
/08/
2220
18/0
9/02
2018
/09/
1220
18/0
9/23
2018
/10/
0420
18/1
0/14
2018
/10/
2520
18/1
1/05
2018
/11/
1620
18/1
1/26
2018
/12/
0720
18/1
2/18
2018
/12/
2820
19/0
1/08
2019
/01/
1920
19/0
1/30
2019
/02/
0920
19/0
2/20
2019
/03/
0320
19/0
3/13
2019
/03/
2420
19/0
4/04
2019
/04/
1520
19/0
4/25
2019
/05/
0620
19/0
5/17
2019
/05/
2720
19/0
6/07
2019
/06/
1820
19/0
6/28
2019
/07/
0920
19/0
7/20
2019
/07/
3120
19/0
8/10
2019
/08/
21
PUE
Cool
ing
(kW
cm)
Date
Cooling Source August 1 2018-August 31 2019 With 1 Week PUE
CHW CTW 1 Week PUE (kW)
Seasonal Transition (Summer to Fall)
Winter/Spring: 100% Evaporative Cooling
Seasonal Transition (to Summer)
1313
Summit’s RM Challenge – Data Volume
13
TOTAL ~470,000 metrics per second available for real-time analytics
• Data Streams Include:
• IBM OpenBMC framework (99 metrics/node/second x 4608 nodes = ~460,000/sec) - OOB
• IBM LSF jobs data for running applications (~10sec update interval)
• NOAA weather/wet bulb for Oak Ridge area (continuous, external)
• Programmable Logic Circuit (PLC) water flow (continuous, protected as part of BAS)
14
Sample from Grafana Dashboard (live version shown)
15
MTW Performance
Summit (System) Impact:• No substantive impact to CPU or GPU
operating conditions (junction temperatures)
• One degree increase in room temperature (worst case about 72.5° F)
Year over Year, 2018 to 2019:• IT Demand (kW-h) is up 19%• CHW use is 39% LESS than in 2018• The one-degree adjustment of supply
temperature provided a savings of $40,000
2018:Supply 70°F & ~4500GPM 2019:Supply 71°F & ~3300GPM
Energy used by Chilled Water
Total Energy, Mechanical Plant
PUE Improves
Summit IT Load (kW-h) (Cumulative) Increases
Wet Bulb
1616
Summit Power/Jobs Every Second
Click toPlayMovie
1717
Summit Temperature/Jobs Every Second
Click toPlayMovie
1818