
GCC Cooling Problems and Recommendations

Computing Sector

On hot days, the cooling at GCC is inadequate to operate the computing equipment at the capacity for which the rooms were designed.

Jan 6, 2012 GCC Cooling 2

• The GCC condensers stop working when their ambient input temperature rises above 115 °F.

• The combination of hot weather and the enclosed area between the GCC building and the berm generates a heat pit on the condenser pad, which pushes the input temperature past this limit and renders the condensers ineffective.
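The failure mode above amounts to a simple threshold check. The sketch below illustrates it; the ~20 °F heat-pit rise is an illustrative assumption, not a measured value from the slides:

```python
# Condenser cutoff check: the GCC units stop working when the air at
# their intake exceeds 115 °F (from the slide above).
CONDENSER_LIMIT_F = 115.0

def condenser_input_temp(ambient_f, heat_pit_rise_f):
    """Estimated air temperature at the condenser intake:
    outdoor ambient plus the rise from the enclosed heat pit."""
    return ambient_f + heat_pit_rise_f

# Illustrative numbers: a 98 °F day plus an ASSUMED ~20 °F rise from the
# enclosed berm area pushes the intake past the 115 °F limit.
intake = condenser_input_temp(98.0, 20.0)
print(f"Intake {intake:.0f} F vs limit {CONDENSER_LIMIT_F:.0f} F: "
      f"{'FAIL' if intake > CONDENSER_LIMIT_F else 'OK'}")
```

This is why merely hot weather is not the whole story: the berm enclosure adds the extra rise that takes an otherwise survivable ambient temperature over the cutoff.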

[Photo: condensers on the pad between the GCC building and the berm]

A FESS-commissioned engineering study recommends constructing a raised platform for the condensers to reduce the formation of the heat pit and the short-circuiting of hot exhaust back to the input. [Details on all options in later slides]

– We are asking for GPP funds to proceed with fixing the GCC cooling problem.
– Total amount requested = $1,316K
– Time for completion = 7 months, including contingency


GCC Loads and Design Capacity

[Chart: measured loads and design capacity for each room, GCC-B and GCC-C]

• Significant redeployment of computing equipment occurred in Summer 2011, and ~30% of all scientific computing equipment had to be turned off during the two very hot summer periods.
• More equipment is being added to the GCC rooms.
• Extra load from GCC-C equipment increases the input temperature to the GCC-B condensers – quite worrisome.
• Not near GCC room design capacity.


• Without any remediation, we can operate GCC-B/C at 2010 levels. We had several hot days in 2010 when some nodes tripped off due to temperature effects.
– Measured max capacity = ~865 kW (48% of design)
– This is only ~60% of the expected 2012 summer load, meaning ~40% of the GCC equipment must be turned off on hot days.

• With external cooling similar to last summer's, our expected cooling capacity is ~1140 kW (63% of design).
– This still falls short of our expected 2012 load of ~1375 kW, and we will need to turn off ~20% of the GCC computing equipment.
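The capacity percentages above can be cross-checked with a small calculation. The 1800 kW total design figure below is inferred from the quoted percentages (865/0.48 ≈ 1140/0.63 ≈ 1800 kW), not stated on the slide:

```python
# Cross-check of the capacity arithmetic on this slide.
design_kw = 1800             # inferred: 865/0.48 ~ 1140/0.63 ~ 1800
load_2012_kw = 1375          # expected 2012 summer load

no_remediation_kw = 865      # measured max capacity, no remediation
external_cooling_kw = 1140   # with rented external cooling

# Fractions of design capacity, and the load that must be shed on hot
# days (these roughly match the ~40% and ~20% quoted in the slides).
print(f"No remediation: {no_remediation_kw / design_kw:.0%} of design, "
      f"shed {1 - no_remediation_kw / load_2012_kw:.0%} of 2012 load")
print(f"External cooling: {external_cooling_kw / design_kw:.0%} of design, "
      f"shed {1 - external_cooling_kw / load_2012_kw:.0%} of 2012 load")
```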

Temporary cooling costs ~$160K/summer.

Conditions At Grid Computing Center


• Outages and reduced capacity at GCC because the input temperature was too high for the condensers to work.
• Cooling problems at GCC are guaranteed whenever there is hot weather – scientific analysis is affected.
• Very close to failure at max load, even with external cooling.


Fraction of Computing Resources per Room

User                                           GCC-A   GCC-B   GCC-C
CMS                                              -      82%     18%
CDF                                             10%     90%      -
DØ                                              16%     84%      -
Lattice QCD                                      -       -      100%
Intensity Frontier, Accelerator, Theory, OSG     -     100%      -

• The GCC facility is a vital component for all experiments at FNAL, and its proper operation at design capacities is needed to support scientific output.
• We need a cooling solution that solves the problem, not one that just gets us by.
• The long-term viability and capacity of GCC are important because we need to use all our rooms (FCC/GCC/LCC) to the maximum extent possible.


Importance of Computing in FY12

• 2012 will be a big year for the Tevatron program.
– The CDF/DØ experiments will work hard to complete a full suite of analyses with the full data set and whatever improved-sensitivity features they can muster this year.
– It is critical to be successful this year: the manpower is moving on, and most of what the experiments don't accomplish in 2012 will never get done. No one is reloading students or postdocs, and the LHC is working well.


Importance of Computing in FY12

• CDF and DØ are targeting two major conferences:
– ICHEP (Melbourne, Australia, starting in early July)
– Higgs Hunting Workshop (France, late July)
– These are the conferences targeted for "final final" results.

• Unlike all prior years, this year is not entirely conference driven: CDF and DØ will also be pushing to publish all analyses in 2012.
– Normally the 6-8 weeks before a major conference are critical, but this year ALL weeks are important. The experiments can't afford to lose any significant portion of this summer, or they risk enough brain drain that they won't complete specific analyses.


Importance of Computing in FY12

• There are MOUs between FNAL and CERN for CMS production computing availability.
– The requirement is 98% availability during beam time.
– Losing the Fermilab Tier-1 and the US CMS analysis facilities to cooling problems during the 2012 LHC run could put CMS data production and US CMS analysis efforts at risk.
– CMS data production scenarios under discussion for the 2012 high-pileup run conditions rely on the availability of Fermilab computing resources. The CMS data coordinators would have to be informed of potential problems and would stop scheduling work at FNAL if cooling outages are anticipated.
– The CMS primary data samples we care about would not be stored at FNAL, and the U.S. sites would have to transfer data from other sites to continue their analyses.


Importance of Computing in FY12

• In 2012, the LQCD computing equipment in GCC represents 90% of the TFlop capacity for the $19M DOE Office of Science LQCD project.
– Last summer's outage represented a loss of 0.6 TF-yrs, or about 2.5% of the 2011 total.
– The estimated loss this upcoming summer is ~2%/week.
– A cooling loss would impact physics production for the major conference, Lattice '12 (June 24-30).
– Outages would also likely mean that some physics projects allocated time by the USQCD collaboration would not finish.
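The loss figures above are mutually consistent, which a quick calculation shows. The implied 2011 total and the example outage length below are derived or assumed, not stated on the slide:

```python
# Consistency check on the LQCD loss figures quoted above.
loss_tf_yrs = 0.6            # last summer's outage
loss_fraction = 0.025        # "about 2.5% of the 2011 total"

# Implied total 2011 LQCD output (derived, not stated in the slides).
total_2011 = loss_tf_yrs / loss_fraction
print(f"Implied 2011 LQCD output: {total_2011:.0f} TF-yrs")

# At ~2%/week, an N-week summer outage loses ~2N% of a year's output.
weeks_out = 4                # assumed outage length, for illustration
print(f"A {weeks_out}-week outage: ~{weeks_out * 2}% of the year lost")
```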


Impact of Outages

• An unscheduled outage of a day has a residual effect of 3-4 days.
• One day of downtime during a critical two-week analysis is devastating and can cause major delays to, or de-scoping of, conference presentations and publications.
• The effect of last year's cooling shutdown lingered throughout the fall, as normal maintenance scheduled for the summer was delayed until the fall. This included acceptance testing for newly purchased equipment – a planning and resource-scheduling nightmare.
• High temperatures and power cycling are not good for computing equipment and can lead to early mortality incidents.
• The human resource load, both in CD and at the experiments, was large: monitoring the situation, rescheduling, and prioritizing jobs based on predicted weather conditions.


GCC Cooling Study

• FESS contracted CMT Engineering to determine the most effective way to modify the cooling at GCC to alleviate outages while the computing rooms operate at their design capacity.
– The final report was delivered Dec 22, 2011 (DocDB 4587).

• The report stated that the temporary equipment rented for additional cooling during hot days
– does not address the root cause of the problem,
– is not a permanent solution, and
– is only partially effective in keeping the cooling systems operating.


Available Options


• The CMT report also recommended constructing a cold-aisle containment system (retractable roofs and modular doors) at a total cost of $171K.

• Recommendation: $1,316K – remove the berm + raised platform for the condensers + cold-aisle containment.


Option 1

Option 1 = Remove the berm
– More air would circulate around the units, but it is unclear whether enough air would flow to solve the problem.
– 4 months to complete, including contingency
– $92K
– Engineers estimate the risk of failure at ~35%.


Option 2

Option 2 = Remove the berm and stagger the condensers
– Staggering the condensers increases airflow and decreases the possibility of recirculating hot air.
– Requires an extended concrete pad and new refrigerant piping.
– Move 5 units at a time; no downtime on "cool" days.
– 5 months to complete, including contingency
– $403K
– Engineers estimate the risk of failure at ~25%.


Option 3

Option 3 = Remove the berm and replace the 95 °F condensers with staggered 105 °F condensers
– Units rated at higher ambient temperatures could provide some additional reliability on hot days.
– The higher-rated models are larger and require an enlarged equipment pad.
– Install in staggered mode, as in Option 2.
– Units could be replaced one at a time – no downtime.
– 7 months to complete, including contingency
– $1,107K
– Engineers estimate the risk of failure at ~15%.


Option 4

Option 4 = Install a raised platform and relocate the condensers to a height above the building and berm
– A new open-grate support structure would be installed directly over the existing equipment pad, similar to the existing GCC-A platform.
– Greatly improves air circulation.
– Removing the berm is not required, but would greatly facilitate construction (fork lifts instead of cranes).
– Relocate the existing condensers one at a time to the platform.
– New refrigerant piping is required.
– 7 months to complete, including contingency
– $1,053K
– Engineers estimate the risk of failure at ~0%.


Option 5

Option 5 = Install a new chilled-water cooling system
– 12 months to complete, including contingency
– $7,802K
– Engineers estimate the risk of failure at ~0%.


Phased Deployment

• Could the GCC cooling upgrade be staged over several years?
– According to FESS engineering, this is possible, but there would of course be extra costs due to the additional engineering work to split the project into multiple pieces and the multiple yearly contracts needed to execute the staged deployment.
– It is a question of cost versus the risk the lab is willing to accept.


Summary

Option                                          Schedule    Risk of Failure   Total Cost
1  Remove berm                                  4 months    35-40%            $92K
2  Remove berm and stagger condensers           5 months    25-30%            $403K
3  Replace condensers with 105 °F models        7 months    15-25%            $1,107K
4  Move condensers to raised platform           7 months    0-10%             $1,053K
5  Install new chilled-water cooling system     12 months   0-5%              $7,802K
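One way to read the options table is as a filter on acceptable risk followed by a cost comparison. The sketch below is illustrative only; the 10% risk ceiling is an assumption, not a threshold from the CMT report:

```python
# Options from the summary table: (name, schedule_months, risk_range, cost_k).
options = [
    ("Remove berm",                           4, (0.35, 0.40),   92),
    ("Remove berm and stagger condensers",    5, (0.25, 0.30),  403),
    ("Replace condensers with 105 F models",  7, (0.15, 0.25), 1107),
    ("Move condensers to raised platform",    7, (0.00, 0.10), 1053),
    ("Install new chilled-water system",     12, (0.00, 0.05), 7802),
]

MAX_RISK = 0.10  # ASSUMED acceptable worst-case failure risk

# Keep only options whose worst-case risk is within the ceiling,
# then pick the cheapest survivor.
viable = [o for o in options if o[2][1] <= MAX_RISK]
best = min(viable, key=lambda o: o[3])
print(f"{best[0]}: ${best[3]}K, {best[1]} months")
```

Under that reading, the raised platform (Option 4) wins on both cost and schedule among the low-risk options, consistent with the recommendation in these slides.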


• The schedule shows that we need to act now to minimize cooling outages at GCC this summer.
• Since schedule contingency is already included, we believe we still have an excellent chance of completing the work before the hot days.
• The cost of temporary cooling at GCC is ~$160K/summer.
• Recommendation to resolve the GCC cooling problem: $1,316K – remove the berm (now) + raised platform for the condensers + cold-aisle containment.


• Backup slides


Details on Cooling Outages

[Charts: GCC-B and GCC-C loads during the cooling outages]

Temporary Measures

• 2010: Soaker water hoses were deployed under the condensers to cool the concrete pad and leverage evaporative cooling.
• October 2010: 80 condenser duct chimneys were added.
• 2011: Measures included soaker hoses, supplemental cooling, and other small operational improvements.
– However, these actions did not prevent load-shed incidents and will not provide sufficient heat rejection for the ultimate designed power density of 10.8 kW per rack, which gives a total computing capacity of 900 kW per room.
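The design figures above imply roughly 83 racks per room at full density. This consistency check derives that rack count, which is not stated in the slides:

```python
# Consistency check on the design power-density figures above.
kw_per_rack = 10.8       # designed power density per rack
room_capacity_kw = 900.0  # total computing capacity per room

# Implied rack count per room (derived, not stated in the slides).
racks_per_room = room_capacity_kw / kw_per_rack
print(f"~{racks_per_room:.0f} racks per room at full design density")

# Two rooms (GCC-B and GCC-C) give the ~1800 kW combined design
# capacity implied by the percentages quoted earlier in the deck.
print(f"Combined design capacity: {2 * room_capacity_kw:.0f} kW")
```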

[Photos: original state, and duct chimneys added to the tops of the condensers]