RAL Tier1: 2001 to 2011
James Thorne
GridPP 19
30th August 2007
30/08/2007 [email protected]
Result of GridPP3 for Tier1
• Good result:
– Effort increases from 16.5 to 20.4 FTE
– £6.8M hardware budget (cf. £2.3M in GridPP2)
• Extra fault management/hardware staff as size of farm increases
• A good result but team remains thinly stretched; hardware is just sufficient to meet experiments’ requirements.
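As a rough worked check of the figures above, the relative changes from GridPP2 to GridPP3 can be computed directly. This is only a sketch; the function name is illustrative and does not come from the slides.

```python
def relative_change(old: float, new: float) -> float:
    """Return the fractional change from old to new (0.25 means +25%)."""
    return (new - old) / old


# Figures quoted on the slide.
effort_change = relative_change(16.5, 20.4)  # staff effort in FTE
budget_ratio = 6.8 / 2.3                     # hardware budget in GBP millions

print(f"Effort: +{effort_change:.1%}")
print(f"Hardware budget: {budget_ratio:.1f}x the GridPP2 figure")
```

Effort grows by roughly a quarter while the hardware budget roughly triples, which fits the remark that the team remains thinly stretched.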
Planned Tier1 Storage Capacity (TiB)
[Chart: storage capacity in TiB (axis 0 to 20000), tape and disk, for April 2008, 2009, 2010 and 2011]
Planned Tier1 CPU Capacity (KSI2K)
[Chart: CPU capacity in KSI2K (axis 0 to 16000), for April 2008, 2009, 2010 and 2011]
Estimated Rack Count
[Chart: number of racks (axis 0 to 120), disk and CPU, 2006 to 2011]
Estimated number of Disk Servers
[Chart: number of disk servers (axis 0 to 500), 2006 to 2011]
Estimated number of Spinning Drives
[Chart: number of drives (axis 0 to 12000), 2006 to 2011]
Approximate Hardware Value Allocated to Experiments in 2008
• ALICE: 4%
• ATLAS: 53%
• BaBar: 3%
• CMS: 31%
• LHCb: 8%
• Other: 1%
Hardware
• CPU
• Disk
• Tape
• Further procurements in FY08, FY09 and FY10
New Machine Room
• Order placed and contractor has started work
• 800 m² can accommodate 300 racks + 5 robots
• 2.3 MW power/cooling capacity (some UPS)
• Office accommodation for all E-Science staff
• Scheduled to be available for September 2008
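The machine room figures above imply rough per-rack budgets. The sketch below assumes, purely for illustration, that the full 2.3 MW is shared evenly across all 300 racks; the slide does not say how the capacity splits between IT load and cooling, so the power figure is an upper bound.

```python
# Figures quoted on the slide.
FLOOR_AREA_M2 = 800
RACKS = 300
POWER_MW = 2.3

# Assumption: the whole 2.3 MW is spread evenly over all 300 racks.
# If the figure also covers cooling plant, the per-rack IT load is lower.
kw_per_rack = POWER_MW * 1000 / RACKS

# Gross area only: the 5 tape robots and the aisles share the same floor.
m2_per_rack = FLOOR_AREA_M2 / RACKS

print(f"Upper bound on power per rack: {kw_per_rack:.1f} kW")
print(f"Gross floor area per rack: {m2_per_rack:.2f} m^2")
```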
Staffing
• Lex Holt left Tier1
• James Adams is moving from hardware support to Fabric Team system admin
• Plan to recruit:
– Replacement hardware repair position
– Two experiment support posts: one ATLAS, one CMS
– Raja Nandakumar as honorary team member from LHCb
– Will also shortly commence GridPP3 recruitment
CASTOR
• Operational issues mentioned at GridPP 18 were the tip of the iceberg, and the CASTOR 2.1.2 service was found to be inoperable.
• Massive amount of re-engineering carried out since March, with much effort from the CASTOR team.
– Huge progress
– Areas of concern
• We are optimistic that CASTOR will be a success
SL4
• 20% of batch farm now running SL4
• Negotiating with LHC experiments to agree the move of their capacity from SL3 to SL4.
• Once LHC migration is completed, remaining capacity will follow within a few weeks.
• Depends on the experiments, but should expect termination of SL3 service in September
Reliability
• March: invested a lot of effort without much gain
• Continuing to prioritise reliability, and making progress
• Recently exceeded target; now must maintain it
• Start “Sysadmin On Duty” in September
• Start on-call later this year
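The slides do not define the availability and reliability metrics being tracked. As a minimal sketch, assuming the common WLCG-style convention in which reliability excludes scheduled downtime from the measurement period, the two can be computed as follows. The function names and the example month are hypothetical, and this does not attempt to reproduce any difference between older and newer reliability definitions.

```python
def availability(up_hours: float, total_hours: float) -> float:
    """Fraction of the whole period in which the site was available."""
    return up_hours / total_hours


def reliability(up_hours: float, total_hours: float,
                scheduled_down_hours: float) -> float:
    """Fraction of the period, excluding scheduled downtime, in which the site was available."""
    return up_hours / (total_hours - scheduled_down_hours)


# Hypothetical 30-day month: 24 h scheduled downtime, 36 h unscheduled.
total, scheduled, unscheduled = 720.0, 24.0, 36.0
up = total - scheduled - unscheduled

site_availability = availability(up, total)
site_reliability = reliability(up, total, scheduled)
print(f"availability={site_availability:.3f} reliability={site_reliability:.3f}")
```

Under this convention, announced maintenance lowers availability but not reliability, so reliability is never below availability for the same period.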
RAL-LCG2 Availability/Reliability
[Chart: percentage scale 0% to 120%; series: Available, Old Reliability, New Reliability, Target, Average, Best 8]
CPU Efficiencies
• CPU efficiency much improved
• August fall still being investigated
• March minimum when CASTOR was broken
[Chart: CPU efficiencies]
Termination of GridPP use of ADS Service
• GridPP funding and use of the legacy Atlas Datastore (ADS) service is scheduled to end at the end of March 2008.
• RAL will continue to operate the ADS service, and experiments are free to purchase capacity directly from the ADS Team.
dCache Closure
• dCache still supported and working
• We will give 6 months' notice before terminating dCache service
• No notice of termination yet
• Aiming to end service by end of GridPP2 (March 2008). Also cannot terminate ADS service until dCache ceases.
Grid Only
• Move to Grid-only access postponed until December 2007
• No new local accounts
• In January 2008:
– Batch job submission through RB/CE only (no qsub, some exceptions)
– No local login to UIs (some exceptions)
– AFS Service will end
Conclusions
• Positioning ourselves for LHC production.
• A lot of good progress with CASTOR and expect to meet the needs of the ATLAS M4 run and CMS’s CSA07.
• Reliability has finally improved.