GCC Cooling Problems and Recommendations – Computing Sector
On hot days, the cooling at GCC is inadequate to operate the computing equipment at the capacity for which the rooms were designed.
Jan 6, 2012 GCC Cooling 2
• The problem with the GCC condensers is that they stop working when their ambient input temperature is above 115 °F.
• The combination of hot weather and the enclosed area between the GCC building and the berm creates a heat pit on the condenser pad, causing this input-temperature limit to be exceeded and the condensers to become ineffective.
[Photo: condenser pad between the GCC building and the berm]
• A FESS-commissioned engineering study recommends constructing a raised platform for the condensers in order to reduce the formation of the heat pit and the short-circuiting of hot exhaust back to the input. [Details on all options later in these slides]
– We are asking for GPP funds to proceed with fixing the GCC cooling problem.
– Total amount requested = $1,316K
– Time for completion = 7 months, including contingency
GCC Loads and Design Capacity
[Chart: measured loads vs. design capacity for each room, GCC-B and GCC-C]
• Significant redeployment of computing equipment occurred in Summer 2011, and ~30% of all scientific computing equipment had to be turned off during the two very hot summer periods.
• More equipment is being added to the GCC rooms.
• The extra load from GCC-C equipment increases the input temperature to the GCC-B condensers – quite worrisome.
• We are not near GCC room design capacity.
• Without any remediation, we can operate GCC-B/C at the 2010 levels. We had several hot days in 2010 when some nodes tripped off due to temperature effects.
 – Measured max capacity = ~865 kW (48% of design)
• This is ~60% of the expected 2012 summer load, meaning ~40% of the GCC equipment must be turned off on hot days.
• With external cooling similar to last summer, our expected cooling capacity is ~1140 kW (63% of design).
 – Our expected 2012 load of ~1375 kW still exceeds this, and we will need to turn off ~20% of the GCC computing equipment.
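The percentages above can be cross-checked with a bit of arithmetic (all kW figures come from these slides; the ~1800 kW design total is inferred from 865 kW being 48% of design):

```python
# Back-of-envelope check of the GCC capacity figures quoted above.
# kW values are from the slides; the design total is inferred from 865 kW = 48%.

design_total_kw = 865 / 0.48          # ~1800 kW design capacity
load_2012_kw = 1375                   # expected 2012 summer load
cap_no_remediation_kw = 865           # measured 2010 max cooling capacity
cap_external_cooling_kw = 1140        # with rented external cooling

# Fraction of equipment that must be shut off on hot days
off_no_remediation = 1 - cap_no_remediation_kw / load_2012_kw
off_with_external = 1 - cap_external_cooling_kw / load_2012_kw

print(f"Design total: ~{design_total_kw:.0f} kW")
print(f"Shut off without remediation: ~{off_no_remediation:.0%}")   # ~37%, slides round to ~40%
print(f"Shut off with external cooling: ~{off_with_external:.0%}")  # ~17%, slides round to ~20%
```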
$160K/summer for cooling
Conditions at the Grid Computing Center
• Outages and reduced capacity at GCC because the input temperature was too high for the condensers to work
• Guaranteed to have cooling problems at GCC whenever there is hot weather – scientific analysis is affected
• Very close to failure at max load, even with external cooling
Fraction of Computing Resources per Room

User                               GCC-A   GCC-B   GCC-C
CMS                                  -      82%     18%
CDF                                 10%     90%      -
DØ                                  16%     84%      -
Lattice QCD                          -       -     100%
Intensity Frontier, Accelerator,
  Theory, OSG                        -     100%      -

• The GCC facility is a vital component for all experiments at FNAL, and its proper operation at design capacity is needed to support scientific output.
• We need a cooling solution that solves the problem, not one that just gets us by.
• The long-term viability/capacity of GCC is important because we need to use all our rooms (FCC/GCC/LCC) to the maximum extent possible.
Importance of Computing in FY12
• 2012 will be a big year for the Tevatron program.
 – The CDF/DØ experiments will work hard to complete a full suite of analyses with the full data set and whatever improved-sensitivity features they can muster this year.
 – It is critical to be successful this year: the manpower is moving on, and most of what the experiments don't accomplish in 2012 will never get done. No one is reloading students or postdocs, and the LHC is working well.
Importance of Computing in FY12
• CDF and DØ are targeting two major conferences:
 – ICHEP (Melbourne, Australia; starts in early July)
 – Higgs Hunting Workshop (France, late July)
 – These are the conferences targeted for "final final" results.
• Unlike all prior years, this year is not entirely conference driven. CDF and DØ will also be pushing to publish all analyses in 2012.
 – Normally the 6-8 weeks before a major conference are the critical period, but this year ALL weeks are important. The experiments can't afford to lose any significant portion of this summer, or they risk enough brain drain that they won't complete specific analyses.
Importance of Computing in FY12
• There are MOUs between FNAL and CERN for CMS production computing availability.
 – The requirement is 98% availability during beam time.
 – Losing the Fermilab Tier-1 and the US CMS analysis facilities to cooling problems during the 2012 LHC run could put CMS data production and US CMS analysis efforts at risk.
 – CMS data production scenarios under discussion for the 2012 high-pileup run conditions rely on the availability of Fermilab computing resources. The CMS data coordinators would have to be informed of potential problems and would stop scheduling work at FNAL if cooling outages are anticipated.
 – The CMS primary data samples we care about would not be stored at FNAL, and the U.S. sites would have to transfer data from other sites to continue performing their analyses.
Importance of Computing in FY12
• In 2012, the LQCD computing equipment in GCC represents 90% of the TFlop capacity of the $19M DOE Office of Science LQCD project.
 – Last summer's outage represented a loss of 0.6 TF-yr, or about 2.5% of the 2011 total.
 – The estimated loss this upcoming summer is ~2%/week.
 – A cooling loss would impact physics production for the major conference, Lattice '12 (June 24-30).
 – Outages would also likely mean that some physics projects allocated time by the USQCD collaboration would not finish.
Impact of Outages
• An unscheduled outage of a day has a residual effect of 3-4 days.
• One day of downtime during a critical two-week analysis is devastating, and can cause major delays to, or de-scoping of, conference presentations and publications.
• The effect of last year's cooling shutdown lingered throughout the fall, as normal maintenance scheduled for the summer was delayed until the fall, including acceptance testing for newly purchased equipment. This is a planning and resource-scheduling nightmare.
• High temperatures and power cycling are not good for computing equipment and can lead to early mortality incidents.
• The human-resource load, both in CD and at the experiments, was large: monitoring the situation, rescheduling, and prioritizing jobs based on predicted weather conditions.
GCC Cooling Study
• FESS contracted CMT Engineering to determine the most effective way to modify the cooling at GCC to alleviate outages while the computing rooms are operated at their designed capacity.
 – The final report was delivered Dec 22, 2011 (DocDB 4587).
• The report stated that the temporary equipment rented for additional cooling during the hot days:
 – does not address the root cause of the problem
 – is not a permanent solution
 – is only partially effective in keeping the cooling systems operating
Available Options
• The CMT report also included a recommendation to construct a cold-aisle containment system (retractable roofs and modular doors) at a total cost of $171K.
• Recommendation: $1,316K – remove berm + raised platform for the condensers + cold-aisle containment
Option 1
Option 1 = Remove the berm
 – More air would be able to circulate around the units, but it is unclear whether enough air would flow to solve the problem.
 – 4 months to complete, including contingency
 – $92K
 – Engineers estimate the risk of failure at ~35%
Option 2
Option 2 = Remove the berm and stagger the condensers
 – Staggering the condensers increases airflow and decreases the chance of recirculating hot air.
 – Requires an extended concrete pad and new refrigerant piping.
 – Move 5 units at a time; no downtime on "cool" days.
 – 5 months to complete, including contingency
 – $403K
 – Engineers estimate the risk of failure at ~25%
Option 3
Option 3 = Remove the berm and replace the 95 °F condensers with staggered 105 °F condensers
 – Units rated for higher ambient temperatures could provide some additional reliability on hot days.
 – The higher-rated models are larger and require an enlarged equipment pad.
 – Install in staggered mode as in Option 2.
 – Units could be replaced one at a time – no downtime.
 – 7 months to complete, including contingency
 – $1,107K
 – Engineers estimate the risk of failure at ~15%
Option 4
Option 4 = Install a raised platform and relocate the condensers to a height above the building and berm
 – A new open-grate support structure would be installed directly over the existing equipment pad, similar to the existing GCC-A platform.
 – Greatly improves air circulation.
 – Removing the berm is not required, but would greatly facilitate construction (fork lifts instead of cranes).
 – Relocate the existing condensers one at a time to the platform.
 – New refrigerant piping required.
 – 7 months to complete, including contingency
 – $1,053K
 – Engineers estimate the risk of failure at ~0%
Option 5
Option 5 = Install a new chilled-water cooling system
 – 12 months to complete, including contingency
 – $7,802K
 – Engineers estimate the risk of failure at ~0%
Phased Deployment
Could the GCC cooling upgrade be staged over several years?
– According to FESS engineering, yes, this is possible, but there would be extra costs due to the additional engineering work needed to split the project into multiple pieces and the multiple yearly contracts required to execute the staged deployment.
– It is a question of cost versus the risk the lab is willing to accept.
Summary

Option  Description                                 Schedule    Risk of Failure   Total Cost
1       Remove berm                                 4 months    35-40%            $92K
2       Remove berm and stagger condensers          5 months    25-30%            $403K
3       Replace condensers with 105 °F models       7 months    15-25%            $1,107K
4       Move condensers to raised platform          7 months    0-10%             $1,053K
5       Install new chilled-water cooling system    12 months   0-5%              $7,802K
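One rough way to compare the options is to fold the engineers' failure-risk estimates into an expected cost. The sketch below is an illustration only, not part of the CMT report: it uses the midpoints of the quoted risk ranges and assumes a failed fix means continuing to rent ~$160K/summer of temporary cooling over a hypothetical five-summer horizon.

```python
# Illustrative risk-weighted comparison of the five options.
# Costs ($K) and risk ranges are from the summary table; the midpoint risks,
# the ~$160K/summer fallback cost, and the 5-summer horizon are assumptions.

options = {
    1: ("Remove berm",                                92, 0.375),
    2: ("Remove berm and stagger condensers",        403, 0.275),
    3: ("Replace condensers with 105 F models",     1107, 0.20),
    4: ("Move condensers to raised platform",       1053, 0.05),
    5: ("Install new chilled-water cooling system", 7802, 0.025),
}

TEMP_COOLING_K_PER_SUMMER = 160   # cost of temporary cooling if the fix fails
HORIZON_SUMMERS = 5               # hypothetical planning horizon

for n, (name, cost_k, p_fail) in options.items():
    expected_k = cost_k + p_fail * TEMP_COOLING_K_PER_SUMMER * HORIZON_SUMMERS
    print(f"Option {n} ({name}): expected cost ~${expected_k:,.0f}K")
```

Under these (assumed) parameters, Option 4 remains far cheaper than Option 5 even after weighting by risk, which is consistent with the recommendation.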
• The schedule shows that we need to act now to minimize cooling outages at GCC this summer!
• Since schedule contingency is already included, we believe we still have an excellent chance of completing the work before the hot days.
• The cost of temporary cooling at GCC is ~$160K/summer.
• Recommendation to resolve the GCC cooling problem: $1,316K – remove the berm (now) + raised platform for the condensers + cold-aisle containment
Temporary Measures
• 2010: Soaker water hoses were deployed under the condensers to cool the concrete pad and leverage evaporative cooling.
• October 2010: 80 condenser duct chimneys were added.
• 2011: Measures included soaker hoses, supplemental cooling, and other small operational improvements.
 – However, these actions did not prevent load-shed incidents, and they will not provide sufficient heat rejection for the ultimate designed power density of 10.8 kW per rack, which gives a total computing capacity of 900 kW per room.
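The design figures above are mutually consistent, as a quick check shows (kW values are from these slides; the per-room rack count and the two-room total are inferred):

```python
# Consistency check on the GCC design numbers quoted above.
# kW figures are from the slides; rack count and room total are inferred.

kw_per_rack = 10.8     # ultimate designed power density per rack
kw_per_room = 900      # total computing capacity per room

racks_per_room = kw_per_room / kw_per_rack   # ~83 racks per room
total_design_kw = 2 * kw_per_room            # GCC-B + GCC-C = 1800 kW
fraction = 865 / total_design_kw             # matches "865 kW = 48% of design"

print(f"Racks per room: ~{racks_per_room:.0f}")
print(f"865 kW is {fraction:.0%} of the {total_design_kw} kW two-room design total")
```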
[Photos: condensers in their original state, and with duct chimneys added to the top]