Jamie Shiers, February 2006
Assembled from SC4 Workshop presentations + Les’ plenary talk at CHEP
The Worldwide LHC Computing Grid Service
Experiment Plans for SC4
Introduction
Global goals and timelines for SC4
Experiment plans for pre-SC4, SC4 and post-SC4 production
Medium term outline for WLCG services
The focus of Service Challenge 4 is to demonstrate a basic but reliable service that can be scaled up, by April 2007, to the capacity and performance needed for the first beams.
Development of new functionality and services must continue, but we must be careful that this does not interfere with the main priority for this year:
reliable operation of the baseline services
LCG Service Deadlines
– Pilot services: stable service from 1 June 2006
– LHC service in operation: 1 October 2006, with ramp-up to full operational capacity & performance over the following six months
– LHC service commissioned: 1 April 2007
– (Timeline: cosmics and first physics in 2007; full physics run in 2008)
Service Challenge 4
SC4 – the Pilot LHC Service from June 2006
Full demonstration of experiment production: DAQ → Tier-0 → Tier-1
– data recording, calibration, reconstruction
Full offline chain: Tier-1 ↔ Tier-2 data exchange
– simulation, batch and end-user analysis
Service metrics → MoU service levels
Extension to most Tier-2 sites
Functionality: modest evolution from current services
Focus on reliability and performance
ALICE Data Challenges 2006
• Last chance to show that things are working together (i.e. to test our computing model)
• whatever does not work here is unlikely to work when real data arrive
– so we had better plan it well and do it well
ALICE Data Challenges 2006
• Three main objectives
– Computing data challenge
• Final version of rootifier / recorder
• Online data monitoring
– Physics data challenge
• Simulation of signal events: 10^6 Pb-Pb, 10^8 p-p
• Final version of reconstruction
• Data analysis
– PROOF data challenge
• Preparation of the fast reconstruction / analysis framework
Main points
• Data flow
• Realistic system stress test
• Network stress test
• SC4 schedule
• Analysis activity
Data Flow
• Not very fancy… always the same
• Distributed simulation production
– Here we stress-test the system with the number of jobs in parallel
• Data back to CERN
• First reconstruction at CERN
– RAW/ESD scheduled “push-out” – here we do the network test
• Distributed reconstruction
– Here we stress-test the I/O subsystem
• Distributed (batch) analysis
– “And here comes the proof of the pudding” - FCA
SC3 → SC4 Schedule
• February 2006
– Rerun of SC3 disk–disk transfers (max 150 MB/s for 7 days)
– Transfers with FTD, either triggered via AliEn jobs or scheduled
– T0 → T1 (CCIN2P3, CNAF, GridKa, RAL)
• March 2006
– T0–T1 “loop-back” tests at 2 × nominal rate (CERN)
– Run bulk production @ T1, T2 (simulation + reconstruction jobs) and send data back to CERN
– (We get ready with PROOF@CAF)
• April 2006
– T0–T1 disk–disk (nominal rates), disk–tape (50-75 MB/s)
– First push-out (T0 → T1) of simulated data, reconstruction at T1
– (First tests with PROOF@CAF)
• July 2006
– T0–T1 disk–tape (nominal rates)
– T1–T1, T1–T2, T2–T1 and other rates TBD according to CTDRs
– Second chance to push out the data
– Reconstruction at CERN and remote centres
• September 2006
– Scheduled analysis challenge
– Unscheduled challenge (target T2s?)
SC4 Rates - Scheduled Analysis
• Users
– Order of 10 at the beginning of SC4
• Input
– 1.2M Pb-Pb events, 100M p-p events; ESD stored at T1s
• Job rate
– Can be tuned according to the availability of resources
• Queries to metadata catalogue
– Time/query to be evaluated (does not involve LCG services)
• Job splitting (see the sketch after this list)
– Can be done by AliEn according to the query result (destination set for each job)
– CPU availability is an issue (sub-jobs should not wait too long for delayed execution)
– Result merging can be done by a separate job
• Network
– Not an issue
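To illustrate the splitting step above: AliEn splits a job according to where the catalogue query says the input files live, so each sub-job runs next to its data, and a separate job merges the outputs. The Python sketch below is a minimal mock of that idea; the names and data structures are hypothetical, not the AliEn API.

```python
# Minimal mock of query-based job splitting: group the files returned by a
# metadata-catalogue query by the storage element (SE) that hosts them, emit
# one sub-job per destination, and add a final merge job. Hypothetical names.
from collections import defaultdict

def split_jobs(query_result):
    """query_result: list of (lfn, storage_element) pairs from the catalogue."""
    by_destination = defaultdict(list)
    for lfn, se in query_result:
        by_destination[se].append(lfn)
    # One sub-job per destination SE, so each job reads from local storage.
    sub_jobs = [{"destination": se, "inputs": lfns}
                for se, lfns in by_destination.items()]
    # Result merging is done by a separate job fed by the sub-job outputs.
    merge_job = {"inputs": [f"output_{job['destination']}.root" for job in sub_jobs]}
    return sub_jobs, merge_job

subs, merge = split_jobs([("lfn:/alice/esd/001.root", "CERN::CASTOR"),
                          ("lfn:/alice/esd/002.root", "CCIN2P3::SE")])
```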
SC4 Rates - Scheduled Analysis
• Some (preliminary) numbers, based on 20-minute jobs:

Centre  | ESD volume in local storage [TB] | Available CPUs | Analysis jobs/day | Passes over the ESD sample/day | Data I/O per CPU/day [GB] | Aggregated I/O from local storage [GB/s]
CERN    | 26   | 1000 | 72,000 | 16 | 430 | 4.8
CCIN2P3 | 10.5 | 220  | 16,000 | 9  | 430 | 2.7
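Most of the table follows from the 20-minute job length. A back-of-envelope check, assuming each job reads roughly 6 GB of ESD (the value implied by the quoted 430 GB of data I/O per CPU per day); it reproduces the CERN row to within the rounding of the slide.

```python
# Back-of-envelope check of the analysis-rate table above.
# Assumption (not on the slide): each 20-minute job reads ~6 GB of ESD.
def analysis_rates(cpus, esd_volume_tb, job_minutes=20, io_per_job_gb=6.0):
    jobs_per_day = cpus * 24 * 60 // job_minutes               # CERN: 72,000
    io_per_cpu_day_gb = io_per_job_gb * 24 * 60 / job_minutes  # ~430 GB
    passes_per_day = jobs_per_day * io_per_job_gb / (esd_volume_tb * 1000)
    aggregated_gb_s = cpus * io_per_cpu_day_gb / 86400         # CERN: ~5 GB/s
    return jobs_per_day, io_per_cpu_day_gb, passes_per_day, aggregated_gb_s

print(analysis_rates(cpus=1000, esd_volume_tb=26))   # (72000, 432.0, ~16.6, ~5.0)
```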
SC4 Rates - Unscheduled Analysis
• To be defined
ATLAS SC4 Plans (Dario Barberis, WLCG SC4 Workshop, Mumbai, 12 February 2006)
ATLAS SC4 Tests
Complete Tier-0 test
– Internal data transfer from “Event Filter” farm to Castor disk pool, Castor tape, CPU farm
– Calibration loop and handling of conditions data, including distribution of conditions data to Tier-1s (and Tier-2s)
– Transfer of RAW, ESD, AOD and TAG data to Tier-1s
– Transfer of AOD and TAG data to Tier-2s
– Data and dataset registration in DB (add meta-data information to meta-data DB)
Distributed production
– Full simulation chain run at Tier-2s (and Tier-1s)
– Data distribution to Tier-1s, other Tier-2s and CAF
– Reprocessing raw data at Tier-1s; data distribution to other Tier-1s, Tier-2s and CAF
Distributed analysis
– “Random” job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
– Tests of performance of job submission, distribution and output retrieval
ATLAS SC4 Plans (1): Tier-0 data flow tests
Phase 0: 3-4 weeks in March-April for internal Tier-0 tests
– Explore limitations of current setup
– Run real algorithmic code
– Establish infrastructure for calib/align loop and conditions DB access
– Study models for event streaming and file merging
– Get input from SFO simulator placed at Point 1 (ATLAS pit)
– Implement system monitoring infrastructure
Phase 1: last 3 weeks of June, with data distribution to Tier-1s
– Run integrated data flow tests using the SC4 infrastructure for data distribution
– Send AODs to (at least) a few Tier-2s
– Automatic operation for O(1 week)
– First version of shifter’s interface tools
– Treatment of error conditions
Phase 2: 3-4 weeks in September-October
– Extend data distribution to all (most) Tier-2s
– Use 3D tools to distribute calibration data
The ATLAS TDAQ Large Scale Test in October-November prevents further Tier-0 tests in 2006… but is not incompatible with other distributed operations.
ATLAS SC4 Plans (2)
ATLAS CSC includes continuous distributed simulation productions:
– We will continue running distributed simulation productions all the time, using all Grid computing resources available to ATLAS
– The aim is to produce ~2M fully simulated (and reconstructed) events/week from April onwards, both for physics users and to build the datasets for later tests
– We can currently manage ~1M events/week; ramping up gradually
SC4 distributed reprocessing tests:
– Test of the computing model using the SC4 data management infrastructure
– Needs file transfer capabilities between Tier-1s and back to the CERN CAF
– Also distribution of conditions data to Tier-1s (3D)
– Storage management is also an issue
– Could use 3 weeks in July and 3 weeks in October
SC4 distributed simulation intensive tests:
– Once reprocessing tests are OK, we can use the same infrastructure to implement our computing model for simulation productions, as they would use the same setup both from our ProdSys and the SC4 side
– First separately, then concurrently
ATLAS SC4 Plans (3)
Distributed analysis tests:
– “Random” job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
– Generate groups of jobs and simulate analysis job submission by users at home sites
– Direct jobs needing only AODs as input to Tier-2s; direct jobs needing ESDs or RAW as input to Tier-1s (see the sketch after this list)
– Make preferential use of ESD and RAW samples available on disk at Tier-2s
– Tests of performance of job submission, distribution and output retrieval
– Test job priority and site policy schemes for many user groups and roles
– Distributed data and dataset discovery and access through metadata, tags, data catalogues
– Need the same SC4 infrastructure as distributed productions
– Storage of job outputs for private or group-level analysis may be an issue
Tests can be run during Q3-Q4 2006:
– First a couple of weeks in July-August (after distributed production tests)
– Then another longer period of 3-4 weeks in November
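The routing rules above are simple enough to state in a few lines of code. A minimal sketch, with placeholder site names and a hypothetical broker function (none of this is the ATLAS ProdSys/DA API):

```python
# Illustrative routing policy for distributed analysis jobs:
# AOD-only jobs go to Tier-2s; ESD/RAW jobs go to Tier-1s, preferring
# Tier-2s that already hold the sample on disk. Site names are placeholders.
TIER1S = ["RAL", "CCIN2P3", "CNAF"]
TIER2S = ["Milano", "Glasgow", "DESY"]

def route_job(input_format, disk_replicas):
    """disk_replicas: sites holding the input sample on disk."""
    if input_format == "AOD":
        return TIER2S                      # any Tier-2 can serve AOD jobs
    # ESD or RAW: prefer Tier-2s with a disk replica, else fall back to Tier-1s.
    t2_with_data = [site for site in TIER2S if site in disk_replicas]
    return t2_with_data or TIER1S

print(route_job("AOD", []))                # all Tier-2s are candidates
print(route_job("ESD", ["Glasgow"]))       # ['Glasgow']
```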
Overview of requirements for SC4
– SRM (“baseline version”) on all storage
– VO Box per Tier-1 and at Tier-0
– LFC server per Tier-1 and at Tier-0
– FTS server per Tier-1 and at Tier-0
– Disk-only area on all tape systems: preferably separate SRM entry points for “disk” and “tape” SEs; otherwise a directory set as permanent (“durable”?) on disk (non-migratable)
– Disk space is managed by DQ2
– Counts as online (“disk”) data in the ATLAS Computing Model
– Ability to install ATLAS FTS VO agents on the Tier-1 and Tier-0 VO Boxes (see next slides)
– Single entry point for FTS, with multiple channels/servers
– Ability to deploy DQ2 services on the VO Box, as during SC3
– No new requirements on the Tier-2s besides an SRM SE
Movement use cases for SC4
– EF → Tier-0 migratable area
– Tier-0 migratable area → Tier-1 disk
– Tier-0 migratable area → Tier-0 tape
– Tier-1 disk → same Tier-1 tape
– Tier-1 disk → any other Tier-1 disk
– Tier-1 disk → related Tier-2 disk (next slides for details)
– Tier-2 disk → related Tier-1 disk (next slides for details)
Not done:
– Processing directly from tape (not in the ATLAS Computing Model)
– Automated multi-hop (no ‘complex’ data routing)
– Built-in support for end-user analysis
Goal is to exercise current middleware and understand its limitations (metrics).
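Every use case above is a direct point-to-point movement, which matches the channel model FTS uses: one managed channel per source/destination pair, and no multi-hop routing. A minimal sketch of that model, with placeholder site names, channel parameters and a hypothetical submit function (not the FTS API):

```python
# Sketch of the point-to-point channel model: a transfer is accepted only if
# a channel exists for exactly that (source, destination) pair; a "multi-hop"
# movement would need two separate transfers. All values are placeholders.
CHANNELS = {
    ("CERN-T0", "RAL-T1"):    {"streams": 10},   # T0 -> T1 disk
    ("RAL-T1", "CNAF-T1"):    {"streams": 5},    # T1 -> other T1 disk
    ("RAL-T1", "Glasgow-T2"): {"streams": 5},    # T1 -> related T2 disk
    ("Glasgow-T2", "RAL-T1"): {"streams": 5},    # T2 -> related T1 disk
}

def submit_transfer(src_site, dst_site, surl_pairs):
    if (src_site, dst_site) not in CHANNELS:
        raise ValueError(f"no channel {src_site} -> {dst_site}: no multi-hop")
    return {"channel": (src_site, dst_site), "files": surl_pairs}

job = submit_transfer("CERN-T0", "RAL-T1",
                      [("srm://cern/castor/raw.001", "srm://ral/dcache/raw.001")])
```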
ATLAS SC4 Requirement (new!)
A small testbed with (part of) CERN, a few Tier-1s and a few Tier-2s to test our distributed systems (ProdSys, DDM, DA) prior to deployment:
– It would allow testing new middleware features without disturbing other operations
– We could also properly tune the operations on our side
The aim is to arrive at the agreed scheduled time slots with an already-tested system and really use the available time for relevant scaling tests.
This setup would not interfere with concurrent large-scale tests or data transfers run by other experiments.
A first instance of such a system would be useful already now! April-May looks like a realistic request.
Summary of requests
March-April (pre-SC4): 3-4 weeks for internal Tier-0 tests (Phase 0)
April-May (pre-SC4): tests of distributed operations on a “small” testbed
Last 3 weeks of June: Tier-0 test (Phase 1) with data distribution to Tier-1s
3 weeks in July: distributed processing tests (Part 1)
2 weeks in July-August: distributed analysis tests (Part 1)
3-4 weeks in September-October: Tier-0 test (Phase 2) with data to Tier-2s
3 weeks in October: distributed processing tests (Part 2)
3-4 weeks in November: distributed analysis tests (Part 2)
LHCb DC06
“Test of LHCb Computing Model using LCG Production Services”
• Distribution of RAW data
• Reconstruction + stripping
• DST redistribution
• User analysis
• MC production
• Use of conditions DB (alignment + calibration)
SC4 Aim for LHCb
• Test the data-processing part of the Computing Model
• Use 200M MC RAW events:
– Distribute
– Reconstruct
– Strip and re-distribute
• Simultaneous activities:
– MC production
– User analysis
Preparation for SC4
• Event generation, detector simulation & digitization
• 100M B-physics + 100M minimum-bias events:
– 3.7 MSI2k·month required (~2-3 months)
– 125 TB on MSS at Tier-0 (keep MC truth)
• Timing:
– Start productions mid-March
– Full capacity by end of March
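Two numbers worth reading off this request, as a quick sanity check (assuming the 125 TB covers all 200M events, MC truth included):

```python
# Back-of-envelope implications of the LHCb pre-SC4 production request.
events = 200e6            # 100M B-physics + 100M minimum bias
volume_tb = 125           # on MSS at Tier-0
cpu_msi2k_month = 3.7     # total CPU required
duration_months = 2.5     # "~2-3 months"

event_size_mb = volume_tb * 1e6 / events              # ~0.63 MB/event
sustained_msi2k = cpu_msi2k_month / duration_months   # ~1.5 MSI2k continuously
print(f"{event_size_mb:.2f} MB/event, {sustained_msi2k:.1f} MSI2k sustained")
```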
LHCb SC4 (I)
• Timing:
– Start June
– Duration 2 months
• Distribution of RAW data:
– Tier-0 MSS SRM → Tier-1 MSS SRMs
– 2 TB/day out of CERN
– 125 TB on MSS @ Tier-1s
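A quick consistency check of these figures (assuming the full two-month window is used):

```python
# RAW distribution: 2 TB/day out of CERN over ~2 months.
days = 2 * 30
rate_tb_day = 2.0
total_tb = rate_tb_day * days             # ~120 TB, consistent with 125 TB on MSS
rate_mb_s = rate_tb_day * 1e6 / 86400     # ~23 MB/s sustained out of CERN
print(total_tb, round(rate_mb_s, 1))
```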
LHCb SC4 (II)
• Reconstruction/stripping (see the sketch below):
– 270 kSI2k·month
– 60 TB on MSS @ Tier-1s (full DST)
– 1k jobs/day (following the data); job duration 2 hours
• 90% of jobs (reconstruction): input 3.6 GB, output 2 GB
• 10% of jobs (stripping): input 20 GB, output 0.5 GB
• DST distribution:
– 2.2 TB on disk per Tier-1 + CAF (selected DST+RAW)
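The job mix above implies the following daily volumes and CPU occupancy (a sketch; “following the data” means jobs run where the RAW already sits):

```python
# Daily I/O and CPU implied by 1k two-hour jobs/day with the stated mix.
jobs_per_day = 1000
rec   = {"frac": 0.9, "in_gb": 3.6,  "out_gb": 2.0}   # reconstruction
strip = {"frac": 0.1, "in_gb": 20.0, "out_gb": 0.5}   # stripping

read_tb_day  = jobs_per_day * (rec["frac"] * rec["in_gb"] +
                               strip["frac"] * strip["in_gb"]) / 1000   # ~5.2 TB/day
write_tb_day = jobs_per_day * (rec["frac"] * rec["out_gb"] +
                               strip["frac"] * strip["out_gb"]) / 1000  # ~1.85 TB/day
busy_cpus = jobs_per_day * 2 / 24   # 2-hour jobs -> ~83 CPUs busy, summed over Tier-1s
print(read_tb_day, write_tb_day, round(busy_cpus))
```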
DIRAC Tools & LCG
• DIRAC Transfer Agent @ Tier-0 + Tier-1s
– FTS + SRM
• DIRAC production tools
– Production Manager console
– Transformation Agents
• DIRAC WMS
– LFC + RB + CE
• Applications:
– GFAL: POSIX I/O via LFN
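“POSIX I/O via LFN” means the application opens a logical file name and the layer underneath resolves it through the catalogue to a physical replica, after which reads look like ordinary POSIX reads. GFAL itself is a C library whose calls mirror POSIX (gfal_open, gfal_read, gfal_close); the Python sketch below only illustrates the resolution idea, with in-memory stand-ins so it runs:

```python
import io

# Stand-in "catalogue": logical file name -> file contents. A real setup
# would resolve the LFN via the LFC to a SURL/TURL and open that replica.
_REPLICAS = {"lfn:/grid/lhcb/dst/00001.dst": b"...DST payload bytes..."}

def open_lfn(lfn):
    # Resolve the logical name, then hand back an ordinary file-like object;
    # from here on the application does plain POSIX-style I/O.
    return io.BytesIO(_REPLICAS[lfn])

with open_lfn("lfn:/grid/lhcb/dst/00001.dst") as f:
    header = f.read(16)
```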
To be Tested after SC4
• Data management:
– SRM v2
– GridFTP 2
– FPS
• Workload management:
– gLite RB?
– gLite CE?
• VOMS:
– Integration with middleware
• Applications:
– xrootd
Monthly Summary (I)
February
  ALICE: data transfers T0 → T1 (CCIN2P3, CNAF, GridKa, RAL)
  ATLAS, CMS, LHCb: (nothing scheduled)
March
  ALICE: bulk production at T1/T2; data back to T0
  ATLAS: 3-4 weeks Mar/Apr T0 tests
  CMS: PhEDEx integration with FTS
  LHCb: start generation of 100M B-physics + 100M min-bias events (2-3 months; 125 TB on MSS at Tier-0)
April
  ALICE: first push-out of simulated data; reconstruction at T1s
  ATLAS: see above
  CMS: 10 TB to tape at T1s at 150 MB/s
  LHCb: see above
  dTeam: T0-T1 at nominal rates (disk); 50-75 MB/s (tape)
Extensive testing on the PPS by all VOs
Monthly Summary (II)
May
  ALICE, ATLAS, CMS, LHCb: (nothing scheduled)
June
  ATLAS: Tier-0 test (Phase 1) with data distribution to Tier-1s (3 weeks)
  CMS: 2-week re-run of SC3 goals (beginning of month)
  LHCb: reconstruction/stripping: 2 TB/day out of CERN; 125 TB on MSS @ Tier-1s
July
  ALICE: reconstruction at CERN and remote centres
  CMS: bulk simulation (2 months)
  LHCb: see above
  dTeam: T0-T1 at full nominal rates (to tape)
Deployment of gLite 3.0 at major sites for SC4 production
Monthly Summary (III)
August
  CMS: bulk simulation continues
  LHCb: analysis on data from June/July… until spring 2007 or so
September
  ALICE: scheduled + unscheduled (T2s?) analysis challenges
  LHCb: see above
WLCG - Medium Term Evolution
[Timeline diagram: alongside SC4, 3D distributed database services move through development and test; SRM 2 test and deployment (plan being elaborated; October?); additional planned functionality to be agreed & completed in the next few months, then tested and deployed, subject to progress & experience; new functionality follows evaluation & development cycles; possible components for later years remain open (??).]
So What Happens at the end of SC4?
Well prior to October we need to have all structures and procedures in place…
… to run, and evolve, a production service for the long term
This includes all aspects – monitoring, automatic problem detection, resolution, reporting, escalation, {site, user} support, accounting, review, planning for new productions, service upgrades …
Precisely because things will evolve, we should avoid over-specification…
Summary
Two grid infrastructures are now in operation, on which we are able to build computing services for LHC
Reliability and performance have improved significantly over the past year
The focus of Service Challenge 4 is to demonstrate a basic but reliable service that can be scaled up, by April 2007, to the capacity and performance needed for the first beams.
Development of new functionality and services must continue, but we must be careful that this does not interfere with the main priority for this year:
reliable operation of the baseline services