CHEP – Mumbai, February 2006
State of Readiness of LHC Computing Infrastructure
Jamie Shiers, CERN
Introduction
Some attempts to define what “readiness” could mean
How we (will) actually measure it…
Where we stand today
What we have left to do – or can do in the time remaining…
Timeline to First Data
Related Talks
Summary & Conclusions
What are the requirements?
Since the last CHEP, we have seen:
The LHC Computing Model documents and Technical Design Reports; The associated LCG Technical Design Report; The finalisation of the LCG Memorandum of Understanding (MoU)
Together, these define not only the functionality required (Use Cases), but also the requirements in terms of Computing, Storage (disk & tape) and Network
But not necessarily in a site-accessible format…
We also have close-to-agreement on the Services that must be run at each participating site
Tier0, Tier1, Tier2, VO-variations (few) and specific requirements
We also have close-to-agreement on the roll-out of Service upgrades to address critical missing functionality
We have an on-going programme to ensure that the service delivered meets the requirements, including the essential validation by the experiments themselves
How do we measure success?
By measuring the service we deliver against the MoU targets
Data transfer rates; Service availability and time to resolve problems; Resources provisioned across the sites as well as measured usage…
By the “challenge” established at CHEP 2004:
[ The service ] “should not limit ability of physicist to exploit performance of detectors nor LHC’s physics potential”
“…whilst being stable, reliable and easy to use”
Preferably both…
Equally important is our state of readiness for startup / commissioning, that we know will be anything but steady state
[ Oh yes, and that favourite metric I’ve been saving… ]
LHC Startup
Startup schedule expected to be confirmed around March 2006
Working hypothesis remains ‘Summer 2007’
Lower than design luminosity & energy expected initially
But triggers will be opened so that data rate = nominal
Machine efficiency still an open question – look at previous machines???
Current targets:
Pilot production services from June 2006
Full production services from October 2006
Ramp-up in capacity & throughput to TWICE NOMINAL by April 2007
LHC Commissioning
Expect to be characterised by:
Poorly understood detectors, calibration, software, triggers etc.
Most likely no AOD or TAG from first pass – but ESD will be larger?
The pressure will be on to produce some results as soon as possible!
There will not be sufficient resources at CERN to handle the load
We need a fully functional distributed system, aka Grid
There are many Use Cases we have not yet clearly identified
Nor indeed tested – this remains to be done in the coming 9 months!
LCG Service Hierarchy
Tier-0 – the accelerator centre
Data acquisition & initial processing
Long-term data curation
Distribution of data to the Tier-1 centres

Tier-1 centres:
Canada – TRIUMF (Vancouver)
France – IN2P3 (Lyon)
Germany – Forschungszentrum Karlsruhe
Italy – CNAF (Bologna)
Netherlands – NIKHEF (Amsterdam)
Nordic countries – distributed Tier-1
Spain – PIC (Barcelona)
Taiwan – Academia Sinica (Taipei)
UK – CLRC (Didcot)
US – FermiLab (Illinois), Brookhaven (NY)

Tier-1 – “online” to the data acquisition process; high availability
Managed Mass Storage – grid-enabled data service
Data-intensive analysis
National, regional support
Continual reprocessing activity (or is that continuous?)

Tier-2 – ~100 centres in ~40 countries
Simulation
End-user analysis – batch and interactive
Les Robertson
The Dashboard
Sounds like a conventional problem for a ‘dashboard’
But there is not one single viewpoint…
Funding agency – how well are the resources provided being used?
VO manager – how well is my production proceeding?
Site administrator – are my services up and running? MoU targets?
Operations team – are there any alarms?
LHCC referee – how is the overall preparation progressing? Areas of concern?
…
Nevertheless, much of the information that would need to be collected is common…
So separate the collection from presentation (views…)
As well as the discussion on metrics…
The Requirements
Resource requirements, e.g. ramp-up in TierN CPU, disk, tape and network
Look at the Computing TDRs; look at the resources pledged by the sites (MoU etc.); look at the plans submitted by the sites regarding acquisition, installation and commissioning; measure what is currently (and historically) available; signal anomalies.
Functional requirements, in terms of services and service levels, including operations, problem resolution and support
Implicit / explicit requirements in Computing Models; agreements from Baseline Services Working Group and Task Forces; Service Level definitions in MoU; measure what is currently (and historically) delivered; signal anomalies.
Data transfer rates – the TierX↔TierY matrix
Understand Use Cases; measure…
And test extensively, both ‘dteam’ and other VOs!
Resource Deployment and Usage
ATLAS Resource Ramp-Up Needs
[Charts: resource requirements for 2008 and ramp-up over 2007–2012 – Total Disk (TB), Total Tape (TB) and Total CPU (kSI2k) – shown separately for the Tier-0, the CERN Analysis Facility, the Tier-1s and the Tier-2s]
Site Planning Coordination
Site plans coordinated by LCG Planning Officer, Alberto Aimar
Plans are now collected in a standard format, updated quarterly
These allow tracking of progress towards agreed targets
Capacity ramp-up to MoU deliverables; Installation and testing of key services; Preparation for milestones, such as LCG Service Challenges…
Measured Delivered Capacity
Various accounting summaries:
LHC View – data aggregation across countries:
http://goc.grid-support.ac.uk/gridsite/accounting/tree/treeview.php
EGEE View – data aggregation across EGEE ROCs:
http://www2.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
GridPP View – accounting summaries for GridPP Tier-2s:
http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.php
Reaching the MoU Service Targets
These define the (high level) services that must be provided by the different Tiers
They also define average availability targets and intervention / resolution times for downtime & degradation
These differ from TierN to TierN+1 (less stringent as N increases) but refer to the ‘compound services’, such as “acceptance of raw data from the Tier0 during accelerator operation”
Thus they depend on the availability of specific components – managed storage, reliable file transfer service, database services, …
Can only be addressed through a combination of:
Appropriate hardware, middleware and procedures;
Careful planning & preparation;
Well-understood operational & support procedures & staffing.
Service Monitoring - Introduction
• Service Availability Monitoring Environment (SAME) – uniform platform for monitoring all core services, based on SFT experience
• Two main end users (and use cases):
– project management: overall metrics
– operators: alarms, detailed info for debugging, problem tracking
• A lot of work already done:
– SFT and GStat are monitoring CEs and Site-BDIIs
– Data schema (R-GMA) established
– Basic displays in place (SFT report, CIC-on-duty dashboard, GStat) and can be reused
Service Level Definitions

Class | Description | Downtime | Reduced | Degraded | Availability
C | Critical | 1 hour | 1 hour | 4 hours | 99%
H | High | 4 hours | 6 hours | 6 hours | 99%
M | Medium | 6 hours | 6 hours | 12 hours | 99%
L | Low | 12 hours | 24 hours | 48 hours | 98%
U | Unmanaged | None | None | None | None
Tier0 services: C/H; Tier1 services: H/M; Tier2 services: M/L

Service | Maximum delay in responding to operational problems (service interruption / degradation by more than 50% / degradation by more than 20%) | Average availability measured on an annual basis (during accelerator operation / at all other times)

Acceptance of data from the Tier-0 Centre during accelerator operation: 12 hours / 12 hours / 24 hours; 99% / n/a
Networking service to the Tier-0 Centre during accelerator operation: 12 hours / 24 hours / 48 hours; 98% / n/a
Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres, outside accelerator operation: 24 hours / 48 hours / 48 hours; n/a / 98%
All other services – prime service hours[1]: 2 hours / 2 hours / 4 hours; 98% / 98%
All other services – outside prime service hours: 24 hours / 48 hours / 48 hours; 97% / 97%
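To make the availability column concrete, it helps to translate the annual targets into tolerable downtime. A minimal sketch of that arithmetic (the class figures come from the table above; the code itself is purely illustrative and not part of any WLCG tooling):

```python
# Translate MoU availability targets into tolerable downtime per year.
# Class names and figures are from the service level table above.

HOURS_PER_YEAR = 365 * 24  # 8760

service_classes = {
    # class: (description, max delay in responding, availability target)
    "C": ("Critical", "1 hour", 0.99),
    "H": ("High", "4 hours", 0.99),
    "M": ("Medium", "6 hours", 0.99),
    "L": ("Low", "12 hours", 0.98),
}

for cls, (desc, response, availability) in service_classes.items():
    allowed_downtime = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{cls} ({desc}): respond within {response}, "
          f"at most {allowed_downtime:.0f} h unavailable per year")
```

At 99% availability a service may be unavailable for no more than ~88 hours per year; at 98%, ~175 hours.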
Service Functionality
https://twiki.cern.ch/twiki/bin/view/LCG/Planning
SC4 SERVICES - Planning
Legend:
Feature available with the next deployed release of LCG
Feature that will be deployed in 2006 as available
Feature not available
References on: https://uimon.cern.ch/twiki/bin/view/LCG/SummaryOpenIssuesTF
ID | References / Notes / Comments | Hyperlinks | Dependent Milestones

gLite RUNTIME ENVIRONMENT
ALL 8.a: ongoing effort with Dirk and Andrea

AUTHORIZATION AND AUTHENTICATION
VOMS v1.6.15 available features:
1.b VOMS 1.6.15: LFC 1.4.3 yes; DPM 1.4.3 no (1.5.0 yes)
1.c VOMS 1.6.15: user alias retrievable from the VOMS server with a provided script; not available in the user proxy
1.a VOMS 1.6.15: high availability, but no off-site mirroring; memory leaks in server
myproxy v0.6.1:
1.d LFC 1.4.3 not needed; DPM 1.4.3 no; FTS 1.5.0 no; RB gLite 3.0 OK
1.d Provided by the gLite 3.0 WMS; no services besides the WMS are interfaced with it
1.e Not available

INFORMATION SYSTEM
2.a Not available
R. Bailey, Chamonix XV, January 2006
Breakdown of a normal year
~140–160 days for physics per year
Not forgetting ion and TOTEM operation
Leaves ~100–120 days for proton luminosity running? Efficiency for physics 50%?
~50 days ≈ 1200 h ≈ 4 × 10⁶ s of proton luminosity running / year
– From Chamonix XIV –
Service upgrade slots?
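A quick back-of-the-envelope check of those figures (our arithmetic, not from the Chamonix slides):

```python
# Rough check of the 'normal year' arithmetic quoted above.
days_for_physics = 100          # lower end of the 100-120 day estimate
efficiency = 0.5                # assumed 50% efficiency for physics
seconds_per_day = 86_400

luminosity_days = days_for_physics * efficiency          # ~50 days
luminosity_seconds = luminosity_days * seconds_per_day   # ~4.3e6 s

print(f"{luminosity_days:.0f} days ~ {luminosity_days * 24:.0f} h "
      f"~ {luminosity_seconds:.1e} s of proton luminosity running per year")
```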
Site & User Support
Ready to move to single entry point now
Target is to replace all interim mailing lists prior to SC4 Service Phase
i.e. by end-May for the 1st June start
Send mail to [email protected] | [email protected]
Also portal at www.ggus.org
PPS & WLCG Operations
• Production-like operation procedures and tools need to be introduced in PPS
– Must re-use as much as possible from the production service
– This has already started (SFT, site registration) but we need to finish this very quickly – end of February?
• PPS operations must be taken over by COD
– Target proposed at the last “COD meeting” was end March 2006
• This is a natural step also for “WLCG production operations”
• And is consistent with the SC4 schedule
– Production Services from beginning of June 2006
Summary of Tier0/1/2 Roles
Tier0 (CERN): safe keeping of RAW data (first copy); first pass reconstruction, distribution of RAW data and reconstruction output to Tier1; reprocessing of data during LHC down-times;
Tier1: safe keeping of a proportional share of RAW and reconstructed data; large scale reprocessing and safe keeping of corresponding output; distribution of data products to Tier2s and safe keeping of a share of simulated data produced at these Tier2s;
Tier2: Handling analysis requirements and proportional share of simulated event production and reconstruction.
N.B. there are differences in roles by experiment – essential to test using the complete production chain of each!
Centre | ALICE | ATLAS | CMS | LHCb | Rate into T1 (pp) MB/s
ASGC, Taipei | – | ✓ | ✓ | – | 100
CNAF, Italy | ✓ | ✓ | ✓ | ✓ | 200
PIC, Spain | – | ✓ | ✓ | ✓ | 100
IN2P3, Lyon | ✓ | ✓ | ✓ | ✓ | 200
GridKA, Germany | ✓ | ✓ | ✓ | ✓ | 200
RAL, UK | – | ✓ | ✓ | ✓ | 150
BNL, USA | – | ✓ | – | – | 200
FNAL, USA | – | – | ✓ | – | 200
TRIUMF, Canada | – | ✓ | – | – | 50
NIKHEF/SARA, NL | ✓ | ✓ | – | ✓ | 150
Nordic Data Grid Facility | ✓ | ✓ | – | – | 50
Totals | | | | | 1,600
Sustained Average Data Rates to Tier1 Sites (To Tape)
Need additional capacity to recover from inevitable interruptions…
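A rough illustration of why that headroom matters, with hypothetical numbers (the 200 MB/s site and 12-hour outage are assumptions for the example, not MoU figures):

```python
# Illustrative only: catch-up rate needed after an interruption,
# assuming a hypothetical Tier1 nominally receiving 200 MB/s.
nominal_rate_mb_s = 200
outage_hours = 12
catchup_hours = 24            # time allowed to clear the backlog

backlog_tb = nominal_rate_mb_s * outage_hours * 3600 / 1e6   # ~8.6 TB
extra_rate = backlog_tb * 1e6 / (catchup_hours * 3600)       # MB/s on top of nominal

print(f"backlog: {backlog_tb:.1f} TB; "
      f"rate needed during catch-up: {nominal_rate_mb_s + extra_rate:.0f} MB/s")
# backlog: 8.6 TB; rate needed during catch-up: 300 MB/s
```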
LCG OPN Status
Based on expected data rates during pp and AA running, 10Gbit/s networks are required between the Tier0 and all Tier1s
Inter-Tier1 traffic (reprocessing and other Use Cases) was one of the key topics discussed at the SC4 workshop this weekend, together with TierX↔TierY needs for analysis data, calibration activities and other studies
A number of sites already have their 10Gbit/s links in operation
The remainder are expected during the course of the year
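As a sanity check (our arithmetic, using the nominal rates from the table above): even the largest nominal Tier0→Tier1 rate of 200 MB/s is only ~1.6 Gbit/s, so a 10 Gbit/s link leaves headroom for twice-nominal running, catch-up after interruptions and inter-Tier1 traffic.

```python
# Convert nominal Tier0->Tier1 disk rates into link utilisation,
# assuming a 10 Gbit/s OPN link per Tier1 (as stated above).
link_gbit_s = 10.0

for site, mb_s in [("CNAF", 200), ("RAL", 150), ("ASGC", 100), ("TRIUMF", 50)]:
    gbit_s = mb_s * 8 / 1000          # MB/s -> Gbit/s
    print(f"{site}: {mb_s} MB/s = {gbit_s:.1f} Gbit/s "
          f"({100 * gbit_s / link_gbit_s:.0f}% of a 10 Gbit/s link)")
```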
Service Challenge Throughput Tests
Currently focussing on Tier0→Tier1 transfers with modest Tier2→Tier1 upload (simulated data)
Recently achieved target of 1GB/s out of CERN with rates into Tier1s at or close to nominal rates
Still much work to do!
We still do not have the stability required / desired…
The daily average needs to meet / exceed targets
We need to handle this without “heroic efforts” at all times of day / night!
We need to sustain this over many (100) days
We need to test recovery from problems (individual sites – also Tier0)
We need these rates to tape at Tier1s (currently disk)
Agree on milestones for TierX↔TierY transfers & demonstrate readiness
Achieved (Nominal) pp data rates
Centre | ALICE | ATLAS | CMS | LHCb | Achieved (nominal) disk–disk SRM rate into T1 (pp), MB/s
ASGC, Taipei | – | ✓ | ✓ | – | 80 (100) – have hit 140
CNAF, Italy | ✓ | ✓ | ✓ | ✓ | 200
PIC, Spain | – | ✓ | ✓ | ✓ | >30 (100) – network constraints
IN2P3, Lyon | ✓ | ✓ | ✓ | ✓ | 200
GridKA, Germany | ✓ | ✓ | ✓ | ✓ | 200
RAL, UK | – | ✓ | ✓ | ✓ | 200 (150)
BNL, USA | – | ✓ | – | – | 150 (200)
FNAL, USA | – | – | ✓ | – | >200 (200)
TRIUMF, Canada | – | ✓ | – | – | 140 (50)
SARA, NL | ✓ | ✓ | – | ✓ | 250 (150)
Nordic Data Grid Facility | ✓ | ✓ | – | – | 150 (50)
Meeting or exceeding nominal rate (disk–disk)
Met target rate for SC3 (disk & tape) re-run
Missing: rock-solid stability at nominal tape rates
SC4 T0→T1 throughput goals: nominal rates to disk (April) and tape (July)
To come: SRM copy support in FTS; CASTOR2 at remote sites; SLC4 at CERN; network upgrades etc.
CMS Tier1 – Tier1 Transfers
In the CMS computing model the Tier-1 to Tier-1 transfers are reasonably small.
The Tier-1 centres are used for re-reconstruction of events, so reconstructed events from some samples, and analysis objects from all samples, are replicated between Tier-1 centres.
Goal for Tier-1 to Tier-1 transfers:
FNAL → one Tier-1: 1 TB per day – February 2006
FNAL → two Tier-1s: 1 TB per day each – March 2006
FNAL → six Tier-1 centres: 1 TB per day each – July 2006
FNAL → one Tier-1: 4 TB per day – July 2006
FNAL → two Tier-1s: 4 TB per day each – November 2006
– Ian Fisk (N.B. 1 day = 86,400 s ≈ 10⁵ s)
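The “1 day ≈ 10⁵ s” hint makes it easy to turn these goals into sustained rates. A quick conversion (our arithmetic, not from the slide):

```python
# Convert the CMS Tier1-Tier1 goals from TB/day into sustained MB/s.
SECONDS_PER_DAY = 86_400  # ~1e5

for tb_per_day in (1, 4):
    mb_s = tb_per_day * 1e6 / SECONDS_PER_DAY
    print(f"{tb_per_day} TB/day ~ {mb_s:.0f} MB/s sustained")
# 1 TB/day ~ 12 MB/s sustained
# 4 TB/day ~ 46 MB/s sustained
```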
ATLAS – 2 copies of ESD?
SC4 milestones (2)
Tier-1 to Tier-2 Transfers (target rate 300–500 Mb/s)
• Sustained transfer of 1 TB data to 20% of sites by end December
• Sustained transfer of 1 TB data from 20% of sites by end December
• Sustained transfer of 1 TB data to 50% of sites by end January
• Sustained transfer of 1 TB data from 50% of sites by end January
• Peak rate tests undertaken for the two largest Tier-2 sites in each Tier-2 by end February
• Sustained individual transfers (>1 TB continuous) to all sites completed by mid-March
• Sustained individual transfers (>1 TB continuous) from all sites completed by mid-March
• Peak rate tests undertaken for all sites by end March
• Aggregate Tier-2 to Tier-1 tests completed at target rate (rate TBC) by end March

Tier-2 Transfers (target rate 100 Mb/s)
• Sustained transfer of 1 TB data between the largest site in each Tier-2 and that of another Tier-2 by end February
• Peak rate tests undertaken for 50% of sites in each Tier-2 by end February
June 12-14 2006 “Tier2” Workshop
Focus on analysis Use Cases and Tier2s in particular
List of Tier2s reasonably well established
Try to attract as many as possible!
Some 20+ already active – target of 40 by September 2006!
Still many to bring up to speed – re-use experience of existing sites!
Important to understand key data flows:
How experiments will decide which data goes where
Where does a Tier2 archive its MC data?
Where does it download the relevant analysis data?
The models have evolved significantly over the past year!
Two-three day workshop followed by 1-2 days of tutorials
Bringing remaining sites into play: identifying remaining Use Cases
Summary of Key Issues
There are clearly many areas where a great deal still remains to be done, including:
Getting stable, reliable data transfers up to full rates
Identifying and testing all other data transfer needs
Understanding experiments’ data placement policy
Bringing services up to required level – functionality, availability (operations, support, upgrade schedule, …)
Delivery and commissioning of needed resources
Enabling remaining sites to participate rapidly and effectively
Accurate and concise monitoring, reporting and accounting
Documentation, training, information dissemination…
And Those Other Use Cases?
1. A small 1 TB dataset transported at “highest priority” to a Tier1, a Tier2, or even a user group where CPU resources are available. I would give it 3 Gbps so I can support two of them at once (the maximum in the presence of other flows, with some headroom). So this takes 45 minutes.
2. 10 TB to be moved from one Tier1 to another, or to a large Tier2. It takes 450 minutes, as above, so only ~two per day can be supported per 10G link.
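Spelling out the arithmetic behind those two estimates (decimal units, 1 TB = 8 × 10¹² bits):

```python
# Verify the transfer-time estimates above: time = volume / rate.
def transfer_minutes(terabytes: float, gbit_s: float) -> float:
    bits = terabytes * 1e12 * 8          # 1 TB = 8e12 bits (decimal units)
    return bits / (gbit_s * 1e9) / 60    # seconds -> minutes

print(f"1 TB at 3 Gbit/s: {transfer_minutes(1, 3):.0f} min")    # ~44 min
print(f"10 TB at 3 Gbit/s: {transfer_minutes(10, 3):.0f} min")  # ~444 min
# ...consistent with the ~45 and ~450 minute figures quoted above.
```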
Timeline - 2006
January – SC3 disk–disk repeat: nominal rates capped at 150 MB/s; SRM 2.1 delivered (?)
February – CHEP workshop: T1–T1 Use Cases; SC3 disk–tape repeat (50 MB/s, 5 drives)
March – Detailed plan for SC4 service agreed (M/W + DM service enhancements)
April – SC4 disk–disk (nominal) and disk–tape (reduced) throughput tests
May – Deployment of new M/W and DM services across sites; extensive testing
June – SC4 production: tests by experiments of ‘T1 Use Cases’; ‘Tier2 workshop’ – identification of key Use Cases and milestones for T2s
July – Tape throughput tests at full nominal rates!
August – T2 milestones; debugging of tape results if needed
September – LHCC review; rerun of tape tests if required?
October – WLCG Service officially opened; capacity continues to build up
November – 1st WLCG ‘conference’; all sites have network / tape h/w in production (?)
December – ‘Final’ service / middleware review leading to early-2007 upgrades for LHC data taking??

O/S upgrade? (SLC4) – sometime before April 2007!
The Dashboard Again…
(Some) Related Talks
The LHC Computing Grid Service (plenary)
BNL Wide Area Data Transfer for RHIC and ATLAS: Experience and Plans
CMS experience in LCG SC3
The LCG Service Challenges - Results from the Throughput Tests and Service Deployment
Global Grid User Support: the model and experience in the Worldwide LHC Computing Grid
The gLite File Transfer Service: Middleware Lessons Learned from the Service Challenges
Summary
In the 3 key areas addressed by the WLCG MoU:
Data transfer rates; Service availability and time to resolve problems; Resources provisioned.
we have made good – sometimes excellent - progress over the last year.
There still remains a huge amount to do, but we have a clear plan of how to address these issues.
Need to be pragmatic, focussed and work together on our common goals.