TRANSCRIPT
The WLCG Service Starts Here…
SC4 Production == WLCG Pilot
---Jamie Shiers
IT-GD Group Meeting, July 7th 2006
My (CERN) Background
Started at CERN in CO group back in 1984
Started at CERN as student in 1978…
Since then, we’ve had the following major accelerator startups:
pp collider at CERN; LEP; FNAL collider runs I & II; SLC at SLAC; (others too…)
Enjoy the calm, relaxing environment you currently have…
(The quiet before the storm…)
The Worldwide LHC Computing Grid
Purpose: Develop, build and maintain a distributed computing environment for the storage and analysis of data from the four LHC experiments
Ensure the computing service … and common application libraries and tools
Phase I – 2002-05 – Development & planning
Phase II – 2006-2008 – Deployment & commissioning of the initial services
The solution!
July 2006 WLCG Service Challenges: Overview and Outlook
Overview
• SC4 Phases:
– Throughput Phase (April)
• May was reserved for gLite 3.0 upgrades
– Service Phase (June – September inclusive)
– Experiment production activities / requirements
• WLCG Production Service
– In principle October on…
– ATLAS CSC / CMS CSA06 start early / mid September
• Some comments on Tier2 workshop
– Much more complete review at Wednesday’s GDB
The Evolution of Databases in HEP
CHEP 92 – the Birth of OO in HEP?
Wide-ranging discussions on the future of s/w development in HEP
A number of proposals presented, leading to (DRDC/LCRB/LCB):
RD41 – MOOSE [ Kors Bos ] – the applicability of OO to offline particle physics code
RD44 – GEANT4 [ Simone Giani ] – produce a global object-oriented analysis and design of an improved GEANT simulation toolkit for HEP
RD45 – A Persistent Object Manager for HEP [ JDS ] (and later also LHC++ (subsequently ANAPHE)) [ JDS ]
ROOT [ René ]
Started working on LHC Computing full-time!
LCG Service Deadlines
[Timeline: 2006 – cosmics; 2007 – first physics; 2008 – full physics run]
Pilot Service – stable service from 1 June 06, i.e. we have already taken off!
LCG Service in operation – 1 Oct 06; over the following six months, ramp up to full operational capacity & performance
LCG service commissioned – 1 Apr 07, ~6 months prior to first collisions
Updated LHC schedule coming…
The LHC Machine
• Some clear indications regarding LHC startup schedule and operation are now available
– Press release issued two weeks ago
• Comparing our (SC) actual status with ‘the plan’, we are arguably one year late!
– Some sites cheerfully claim two…
– We were supposed to test all offline Use Cases of the experiments during the SC3 production phase (Sep 2005)
• We still have an awful lot of work to do
Not the time to relax!
Press Release - Extract
CERN confirms LHC start-up for 2007
• Geneva, 23 June 2006. First collisions in the … LHC … in November 2007 said … Lyn Evans at the 137th meeting of the CERN Council ...
• A two-month run in 2007, with beams colliding at an energy of 0.9 TeV, will allow the LHC accelerator and detector teams to run in their equipment ready for a full 14 TeV energy run starting in Spring 2008
– Service Challenge ’07?
• The schedule announced today ensures the fastest route to a high-energy physics run with substantial quantities of data in 2008, while optimising the commissioning schedules for both the accelerator and the detectors that will study its particle collisions. It foresees closing the LHC’s 27 km ring in August 2007 for equipment commissioning. Two months of running, starting in November 2007, will allow the accelerator and detector teams to test their equipment with low-energy beams. After a winter shutdown in which commissioning will continue without beam, the high-energy run will begin. Data collection will continue until a pre-determined amount of data has been accumulated, allowing the experimental collaborations to announce their first results.
LHC Commissioning
Expect to be characterised by:
Poorly understood detectors, calibration, software, triggers etc.
Lower than design luminosity & energy (~injection energy)
Most likely no AOD or TAG from first pass – but ESD will be larger?
Possible large impact on Tier2s – RAW and ESD samples to Tier2s?
The pressure will be on to produce some results as soon as possible!
There will not be sufficient resources at CERN to handle the load
We need a fully functional distributed system - ENTER THE GRID
There are many Use Cases we did not yet clearly identify
Nor indeed test --- this remains to be done in the coming months!
R. Bailey, Chamonix XV, January 2006
Breakdown of a normal year
~140-160 days for physics per year (not forgetting ion and TOTEM operation)
Leaves ~100-120 days for proton luminosity running; efficiency for physics ~50%?
~50 days ~ 1200 h ~ 4 × 10^6 s of proton luminosity running per year
- From Chamonix XIV -
Service upgrade slots?
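The slide's arithmetic can be checked in a few lines, taking the lower end of the ~100-120 day estimate together with the quoted ~50% physics efficiency:

```python
# Cross-check of the "breakdown of a normal year" arithmetic from Chamonix.
days_for_luminosity = 100        # lower end of the ~100-120 day estimate
physics_efficiency = 0.5         # "efficiency for physics 50%?" from the slide

effective_days = days_for_luminosity * physics_efficiency
hours = effective_days * 24
seconds = hours * 3600

print(effective_days)  # 50.0 days
print(hours)           # 1200.0 hours
print(seconds)         # 4320000.0 s, i.e. ~4 x 10^6 s of luminosity running
```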
P. Sphicas – LHC experiments’ software
Startup physics (ALICE)
Multiplicity paper outline:
• Introduction
• Detector system – Pixel (& TPC)
• Analysis method
• Presentation of data – dN/dη and multiplicity distribution (s dependence)
• Theoretical interpretation – ln²(s) scaling?, saturation, multi-parton interactions…
• Summary
pT paper outline:
• Introduction
• Detector system – TPC, ITS
• Analysis method
• Presentation of data – pT spectra and pT-multiplicity correlation
• Theoretical interpretation – soft vs hard, mini-jet production…
• Summary
Can publish two papers 1-2 weeks after LHC startup
LCG Service Model
Tier0 – the accelerator centre (that’s us): data acquisition & initial processing; long-term data curation; distribution of data to the Tier1s
This is where FTS comes in…
Tier1s:
Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Didcot); US – FermiLab (Illinois), Brookhaven (NY)
Tier1 – “online” to the data acquisition process; high availability; managed mass storage – grid-enabled data service; data-intensive analysis; national, regional support; continual reprocessing activity (or is that continuous?)
Tier2 – ~100 centres in ~40 countries: simulation; end-user analysis – batch and interactive
Les Robertson
Summary of Computing Resource Requirements
All experiments – 2008. From LCG TDR – June 2005

                       CERN   All Tier-1s   All Tier-2s   Total
CPU (MSPECint2000s)     25        56            61         142
Disk (PetaBytes)         7        31            19          57
Tape (PetaBytes)        18        35            –           53

Shares: CPU – CERN 18%, Tier-1s 39%, Tier-2s 43%. Disk – CERN 12%, Tier-1s 55%, Tier-2s 33%. Tape – CERN 34%, Tier-1s 66%.
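As a sanity check, the tier shares shown in the accompanying pie charts can be re-derived from the TDR 2008 totals; they agree with the quoted percentages to within a percent:

```python
# Re-derive each tier's share of the 2008 resource totals (LCG TDR, June 2005).
resources = {
    "CPU (MSPECint2000s)": {"CERN": 25, "Tier-1s": 56, "Tier-2s": 61},
    "Disk (PB)":           {"CERN": 7,  "Tier-1s": 31, "Tier-2s": 19},
    "Tape (PB)":           {"CERN": 18, "Tier-1s": 35},
}

shares = {}
for resource, by_tier in resources.items():
    total = sum(by_tier.values())
    shares[resource] = {tier: round(100 * v / total) for tier, v in by_tier.items()}

print(shares["CPU (MSPECint2000s)"])  # {'CERN': 18, 'Tier-1s': 39, 'Tier-2s': 43}
print(shares["Tape (PB)"])            # {'CERN': 34, 'Tier-1s': 66}
```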
Nominal Tier0 – Tier1 Data Rates (pp)
(ATLAS column: share, with nominal MB/s in parentheses)

Tier1 Centre                 ALICE   ATLAS         CMS      LHCb    Target (MB/s)
IN2P3, Lyon                   9%     13% (90)      10%      27%     200
GridKA, Germany              20%     10% (75)       8%      10%     200
CNAF, Italy                   7%      7% (60)      13%      11%     200
FNAL, USA                     -       -            28%       -      200
BNL, USA                      -      22% (200)      -        -      200
RAL, UK                       -       7% (60)       3%      15%     150
NIKHEF, NL                   (3%)    13% (90)       -       23%     150
ASGC, Taipei                  -       8% (60)      10%       -      100
PIC, Spain                    -       4% (5) (50)   6% (5)   6.5%   100
Nordic Data Grid Facility     -       6% (50)       -        -       50
TRIUMF, Canada                -       4% (50)       -        -       50
TOTAL                                                               1.6 GB/s
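A quick cross-check that the per-site targets sum to the quoted 1.6 GB/s total:

```python
# Sum the nominal Tier0 -> Tier1 per-site targets (MB/s) from the table above.
targets_mb_s = {
    "IN2P3": 200, "GridKA": 200, "CNAF": 200, "FNAL": 200, "BNL": 200,
    "RAL": 150, "NIKHEF": 150, "ASGC": 100, "PIC": 100,
    "NDGF": 50, "TRIUMF": 50,
}

total_mb_s = sum(targets_mb_s.values())
print(total_mb_s)         # 1600 MB/s
print(total_mb_s / 1000)  # 1.6 GB/s
```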
SC4 T0-T1: Results
Target: sustained disk – disk transfers at 1.6 GB/s out of CERN at full nominal rates for ~10 days
Result: just managed this rate on Good Sunday (1 day out of 10)
Easter weekend; target 10-day period
Easter Sunday: > 1.6 GB/s including DESY
GridView reports 1614.5 MB/s as the daily average
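To put the GridView figure in volume terms, a small helper (the function name and the decimal MB-to-TB conversion are my own, not from the talk) converts a sustained rate into data moved per day:

```python
# Convert a sustained transfer rate (MB/s) into daily data volume (TB/day).
SECONDS_PER_DAY = 86400

def daily_volume_tb(rate_mb_s: float) -> float:
    """Data volume in TB moved in one day at a sustained rate, in MB/s (decimal TB)."""
    return rate_mb_s * SECONDS_PER_DAY / 1e6

# The Easter Sunday daily average quoted above:
print(round(daily_volume_tb(1614.5), 1))  # 139.5 TB out of CERN in one day
```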
Service Challenges - Reminder
Purpose: understand what it takes to operate a real grid service – run for weeks/months at a time (not just limited to experiment Data Challenges)
Trigger and verify Tier-1 & large Tier-2 planning and deployment – tested with realistic usage patterns
Get the essential grid services ramped up to target levels of reliability, availability, scalability, end-to-end performance
Four progressive steps from October 2004 through September 2006:
End 2004 – SC1 – data transfer to subset of Tier-1s
Spring 2005 – SC2 – include mass storage, all Tier-1s, some Tier-2s
2nd half 2005 – SC3 – Tier-1s, >20 Tier-2s – first set of baseline services
Jun-Sep 2006 – SC4 – pilot service
Autumn 2006 – LHC service in continuous operation – ready for data taking in 2007
SC4 – Executive Summary
We have shown that we can drive transfers at full nominal rates to:
most sites simultaneously; all sites in groups (modulo network constraints – PIC); at the target nominal rate of 1.6 GB/s expected in pp running
In addition, several sites exceeded the disk – tape transfer targets
There is no reason to believe that we cannot drive all sites at or above nominal rates for sustained periods.
But
There are still major operational issues to resolve – and, most importantly, a full end-to-end demo under realistic conditions
Experiment Plans for SC4
• All 4 LHC experiments will run major production exercises during WLCG pilot / SC4 Service Phase
• These will test all aspects of the respective Computing Models plus stress Site Readiness to run (collectively) full production services
• These plans have been assembled from the material presented at the Mumbai workshop, with follow-up by Harry Renshall with each experiment, together with input from Bernd Panzer (T0) and the Pre-production team, and summarised on the SC4 planning page.
• We have also held a number of meetings with representatives from all experiments to confirm that we have all the necessary input (all activities: PPS, SC, Tier0, …) and to spot possible clashes in schedules and / or resource requirements. (See “LCG Resource Scheduling Meetings” under LCG Service Coordination Meetings).
• The conclusions of these meetings have been presented to the weekly operations meetings and the WLCG Management Board in written form (documents, presentations)
– See the SC4 Combined Action List for more information…
Summary of Experiment Plans
All experiments will carry out major validations of both their offline software and the service infrastructure during the next 6 months
There are significant concerns about the state of readiness (of everything…) – not to mention manpower at ~all sites and in the experiments
I personally am considerably worried – seemingly simple issues, such as setting up LFC/FTS services, publishing SRM end-points etc., have taken O(1 year) to be resolved (across all sites)
and that is without even mentioning basic operational procedures (some big improvements here recently…)
And all this despite heroic efforts across the board
But – oh dear – your planet has just been blown up by the Vogons
[ So long and thanks for all the fish ]
Mini Computer
Microcomputer
Cluster
Mainframe
ATLAS SC plans/requirements
• Running now till 7 July to demonstrate the complete ATLAS DAQ and first-pass processing, with distribution of raw and processed data to Tier 1 sites at the full nominal rates. Will also include data flow to some Tier2 sites and full usage of the ATLAS Distributed Data Management system, DQ2. Raw data to go to tape, processed data to disk only. Sites to delete from disk and tape.
• After the summer, investigate scenarios of recovery from failing Tier 1 sites and deploy cleanup of pools at Tier 0.
• Later, test distributed production, analysis and reprocessing.
• DQ2 has a central role with respect to ATLAS Grid tools
– ATLAS will install local DQ2 catalogues and services at Tier 1 centres
– ATLAS will define a region of a Tier 1 and well-network-connected sites that will depend on the Tier 1 DQ2 catalogue
– Expect such (volunteer) Tier 2s to join SC when T0/T1 runs stably
– ATLAS will delete DQ2 catalogue entries
• Require VO box per Tier 0 and Tier 1 – done
• Require LFC server per Tier 1 – done, must be monitored
• Require FTS server and validated channels per Tier 0 and Tier 1 – close
• Require ‘durable’ MSS disk area at Tier 1 – few sites have it. To be followed up by ATLAS and the SC team.
• ATLAS would like their T1 sites to attend (via VRVS) their weekly (Wed at 14.00) SC review meeting during this running phase. No commitments were made.
ALICE SC Plans
• Validation of the LCG/gLite workload management services: ongoing
– Stability of the services is fundamental for the entire duration of the exercise
• Validation of the data transfer and storage services
– 2nd phase: end July/August, T0 to T1 (recyclable tape) at 300 MB/sec
– The stability and support of the services have to be assured during and beyond these throughput tests
• Validation of the ALICE distributed reconstruction and calibration model: August/September, reconstruction at Tier 1
• Integration of all Grid resources within one single – interfaces to different Grids (LCG, OSG, NDGF) will be done by ALICE
• End-user data analysis: September/October
CMS SC Plans/Requirements
• In September/October run CSA06, a 50 million event exercise to test the workflow and dataflow associated with the data handling and data access model of CMS
• Now till end June
– Continue to try to improve file transfer efficiency. Low rates and many errors now.
– Attempt to hit 25k batch jobs per day and increase the number and reliability of sites, aiming to obtain 90% efficiency for job completion
• In July
– Demonstrate the CMS analysis submitter in bulk mode with the gLite RB
• In July and August
– 25M events per month with the production systems
– Second half of July: participate in multi-experiment FTS Tier-0 to Tier-1 transfers at 150 MB/sec out of CERN
– Continue through August with transfers
• Requirements:
– Improve Tier-1 to Tier-2 transfers and the reliability of the FTS channels. CMS are exercising the channels available to them, but there are still issues with site preparation and reliability – the majority of sites are responsive, but there is a lot of work for this summer
– Require deployment of the LCG-3D infrastructure
– From late June, deploy Frontier for SQUID caches
– All participating sites should be able to complete the CMS workflow and metrics (as defined in the CSA06 documentation)
LHCb SC Plans/Requirements
• Will start the DC06 challenge at the beginning of July using LCG production services and run till end August:
– Distribution of raw data from CERN to Tier 1s at 23 MB/sec
– Reconstruction/stripping at Tier 0 and Tier 1
– DST distribution to CERN and Tier 1s
– Job prioritisation will be dealt with by LHCb, but it is important jobs are not delayed by other VO activities
• Preproduction for this is ongoing, with 125 TB of MC data at CERN
• Production will go on throughout the year for an LHCb physics book due in 2007
• Require SRM 1.1 based SEs, separated for disk and MSS, at all Tier 1s as agreed in Mumbai, and FTS channels for all CERN-T1s
– Data access directly from SE to ROOT/POOL (not just GridFTP/srmcp). For NIKHEF/SARA (firewall issue) this could perhaps be done via GFAL.
• Require VO boxes at Tier 1 – so far at CERN, IN2P3, PIC and RAL. Need CNAF, NIKHEF and GridKa
• Require central LFC catalogue at CERN and read-only copy at certain T1s (currently being set up at CNAF)
• DC06-2 in Oct/Nov requires T1s to run COOL and 3D database services
Experiment Summary
• All experiments will be ramping up their activity between now and first collisions
• The period of ‘one experiment having priority’ – as was done in SC3 and for ATLAS until this weekend – is over
It is full, concurrent production from now on!
Workshop Feedback
• >160 people registered and (a few more) participated!
– This is very large for a workshop – about the same as Mumbai
• Some comments related directly to this (~40 replies received so far)
• Requests for more:
– Tutorials, particularly “hands-on”
– Direct Tier2 involvement
– Feedback sessions, planning concrete actions etc.
• Active help from Tier2s in preparing / defining future events would be much appreciated
– Please, not just the usual suspects…
• See also Duncan Rand’s talk to GridPP16 – some slides included below
Tutorial Rating – 10 = best
[Bar chart of individual tutorial ratings, one bar per reply]
Workshop Rating
[Bar chart of individual workshop ratings, one bar per reply]
Workshop Comments
• Many positive comments on all sessions of the workshop and tutorials
• Possibility to discuss with other sites and the developers also much appreciated
• Sessions which some liked least others liked most!
• I hope that the people who didn’t reply also feel the same!
“Very very inspiring” “Hope to do it again soon”
“Tutorials were very useful”
“The organisation was excellent”
“Discussions were very enlightening”
“Information collected together in one place”
Workshop Summary
• Workshops have been well attended and received
– Feedback will help guide future events
• Need to improve on Tier1+Tier2 involvement
– Preparing agendas / chairing sessions / giving talks etc.
• Strong demand for more tutorials
– Hands-on where possible / appropriate
• Thanks to everyone for their contribution to both workshop and tutorials!
HEPiX Rome 05apr06
LCG
The Service Challenge programme this year must show that we can run reliable services
Grid reliability is the product of many components – middleware, grid operations, computer centres, …
Target for September: 90% site availability, 90% user job success
Requires a major effort by everyone to monitor, measure, debug
First data will arrive next year – NOT an option to get things going later
Too modest?
Too ambitious?
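The reliability-as-a-product point above can be sketched numerically; the component list and the ~98-99% figures below are illustrative assumptions, not quoted targets, but they show why a 90% end-to-end goal demands much better than 90% from each layer:

```python
# End-to-end grid reliability as the product of its components' reliabilities.
# Component names and values are illustrative assumptions.
components = {
    "middleware": 0.98,
    "grid operations": 0.98,
    "computer centre": 0.98,
    "storage": 0.98,
    "network": 0.99,
}

end_to_end = 1.0
for reliability in components.values():
    end_to_end *= reliability

print(round(end_to_end, 3))  # 0.913: five ~98-99% components barely clear 90%
```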