ATLAS Computing
XXI International Symposium on Nuclear Electronics & Computing
Varna, Bulgaria, 10-17 September 2007
Alexandre Vaniachine
Invited talk for ATLAS Collaboration
Outline
ATLAS Computing Model
Distributed computing facilities:
– Tier-0 plus Grids: EGEE, OSG, and NDGF
Components:
– Production system (jobs)
– Data management (files)
– Databases to keep track of jobs and files, etc.
Status:
– Transition from development to operations
• Commissioning at the pit and Tier-0 with cosmics data
• Commissioning with simulation data by grid operations teams
• Distributed analysis of these data by physicists
– M4 cosmics run in August validated all these separate operations
Credits
I wish to thank the Symposium organizers for their invitation and for their hospitality
This overview is biased by my own opinions
– The CHEP07 conference last week simplified my task
I wish to thank my ATLAS collaborators for their contributions
– I have added references to their CHEP07 contributions
All ATLAS CHEP07 contributions will be published as ATLAS Computing Notes and will serve as a foundation for a collaboration paper on ATLAS Computing, to be prepared towards the end of 2007
– which will provide updates for the ATLAS Computing TDR:
http://atlas-proj-computing-tdr.web.cern.ch/atlas-proj-computing-tdr/Html/Computing-TDR.htm
ATLAS Multi-Grid Infrastructure
ATLAS computing operates uniformly on three Grids with different interfaces
Focus is shifting to physics analysis performance
– testing, integration, validation, deployment, documentation
• then operations!
40+ sites Worldwide
Farbin [id 83]
ATLAS Computing Model: Roles of Distributed Tiers
• Reprocessing of full data a few months after data taking, as soon as improved calibration and alignment constants are available
• Managed tape access: RAW, ESD
• Disk access: AOD, fraction of ESD
Ongoing Transition from Development to Operations
To facilitate operational efficiency, ATLAS computing separates development activities from operations
No more “clever developments” – pragmatic addition of vital residual services to support operations:
– such as monitoring tools (critical to operations)
ATLAS CHEP07 contributions covered in detail both development reports and operational experience:
Domain      Development      Operations experience    Monitoring tools
Databases   Viegas           Vaniachine               Andreeva
Tier-0      Nairz/Goossens   Bos, Rocha
DDM         Lassnig          Klimentov
ProdSys:
  EGEE      Retico           Espinal, Kennedy
  OSG       Maeno            Smirnov
  NDGF      Grønager         Klejst
Analysis    Farbin           Liko, Elmsheuser
Keeping Track of Jobs and Files Needs Databases
To achieve robust operations on the grids, ATLAS splits the data processing of petabytes of event data into smaller units – jobs and files
File – unit for data management in grid computing
– managed by the ATLAS Distributed Data Management (DDM) system
– ATLAS files are grouped in Datasets: DDM Central Catalogs
– AMI: ATLAS Metadata Interface DB – “Where is my dataset?”
Job – unit for data processing workflow management in grid computing
– managed by the ATLAS Production System
– ATLAS job configuration and job completion are stored in prodDB
– ATLAS jobs are grouped in Tasks: AKTR DB for physics tasks
Database Operations will be covered in a separate talk later in this session
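To make these relationships concrete, the minimal Python sketch below models files grouped into datasets and jobs grouped into tasks, with a prodDB-like record of job configuration and completion. The class and field names are purely illustrative assumptions, not the actual prodDB/DDM/AKTR schema.

    # Illustrative sketch only: names and fields are assumptions,
    # not the actual ATLAS prodDB/DDM/AKTR schema.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class File:
        guid: str            # unique file identifier
        lfn: str             # logical file name
        size_bytes: int

    @dataclass
    class Dataset:           # unit of data management (DDM-like)
        name: str
        files: List[File] = field(default_factory=list)

    @dataclass
    class Job:               # unit of workflow management (Production System-like)
        job_id: int
        transformation: str  # e.g. simulation, reconstruction
        inputs: List[str] = field(default_factory=list)   # input LFNs
        status: str = "defined"                           # defined / running / done / failed

    @dataclass
    class Task:              # physics task grouping many jobs (AKTR-like)
        task_id: int
        jobs: List[Job] = field(default_factory=list)

    # prodDB-like bookkeeping: job configuration and completion, keyed by job_id
    prod_db: Dict[int, Job] = {}

    def submit(task: Task, job: Job) -> None:
        """Record the job in its task and in the production database."""
        task.jobs.append(job)
        prod_db[job.job_id] = job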
ATLAS Multi-Grid Operations Architecture and Results
Leveraging the underlying database infrastructure, the ATLAS Production System and ATLAS DDM successfully manage the simulation workflow on three production grids: EGEE, OSG and NDGF
Statistics of success – from the ATLAS production database
Leveraging Growing Resources on Three Grids
Latest snapshot of ATLAS resources [Yuri Smirnov, CHEP talk 184]
Grid infrastructure operated by         Resources available for ATLAS
                                        Tier-1s   CPU (count)   Disk (PB)
EGEE: Enabling Grids for E-SciencE      8         3500          0.4
OSG: Open Science Grid                  1         2500          0.4
NDGF: Nordic Data Grid Facility         1         500           0.06
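Summing the snapshot gives the aggregate scale across the three grids; a trivial sketch using only the numbers from the table above:

    # Resources available for ATLAS per grid (from the snapshot above)
    resources = {
        "EGEE": {"tier1s": 8, "cpu": 3500, "disk_pb": 0.4},
        "OSG":  {"tier1s": 1, "cpu": 2500, "disk_pb": 0.4},
        "NDGF": {"tier1s": 1, "cpu": 500,  "disk_pb": 0.06},
    }

    total_tier1s = sum(r["tier1s"] for r in resources.values())   # 10 Tier-1 centers
    total_cpu    = sum(r["cpu"] for r in resources.values())      # 6500 CPUs
    total_disk   = sum(r["disk_pb"] for r in resources.values())  # ~0.86 PB
    print(total_tier1s, total_cpu, total_disk)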
CERN to Tier-1 Transfer Rates
ATLAS has the largest nominal CERN to Tier-1 transfer rate
– Tests this spring reached ~75% of the nominal target
Successful use of all ten ATLAS Tier-1 centers:
Validating the Computing Model with Realistic Data Rates
M3 Cosmics Run (mid-July)
– Cosmics produced about 100 TB in 2 weeks
– Stressed offline by running at 4 times the nominal rate
• LAr 32 samples test
M4 Cosmics Run: August 23 – September 3
– Metrics for success
• full-rate Tier-0 processing
• data exported to 5 of 10 Tier-1s and stored
• for 2 of 5 Tier-1s, exports to at least two Tier-2s
• quasi real-time analysis in at least one Tier-2
• reprocessing in September in at least one Tier-1
M5 Cosmics Run scheduled for October 16-23
M6 Cosmics Run will run from end of December until real data
– Incremental goals, reprocessing between runs
– Will run close to nominal rate
– Maybe ~420 TB by start of run, plus Monte Carlo
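For scale, a back-of-the-envelope check of the sustained average throughput implied by the M3 figures above, using only the numbers quoted on this slide:

    # Average throughput implied by the M3 cosmics run figures quoted above:
    # about 100 TB accumulated in 2 weeks.
    volume_tb = 100
    seconds = 2 * 7 * 24 * 3600                  # two weeks in seconds
    avg_rate_mb_s = volume_tb * 1e6 / seconds    # 1 TB = 1e6 MB
    print(f"average rate ~ {avg_rate_mb_s:.0f} MB/s")   # ~83 MB/s sustained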
Validating Computing Model with Realistic Data Rates
M4 Cosmics Run: August 23 – September 3
Raw data distribution in real time from online to Tier-0 and to all Tier-1s
– The full chain worked with all ten Tier-1s at the target rate
[Plot: throughput in MB/s from T0 to all T1’s – rates ramped up every day, approaching the expected maximum rate on the last day of the run]
Real-time M4 Data Analysis
[Event displays: tracks in the muon chambers and in the TRT]
Analysis done simultaneously in European and US T1/2 sites
An Important Milestone
Metrics for success
– Full-rate T0 processing: OK
– Data exported to 5 / 10 T1’s and stored: OK, and did more!
– For 2 / 5 T1’s, exports to at least 2 T2’s: OK
– Quasi real-time analysis in at least 1 T2: OK, and did more!
– Reprocessing in Sept. in at least 1 T1: in preparation
Last week ATLAS demonstrated for the first time mastery of the whole data chain: from the measurement of a real cosmic-ray muon in the detector to almost real-time analysis at sites in Europe and the US, with all steps in between
ATLAS Event Data Model
Data Type                   Size (kB)   Accessibility
RDO: Raw Data Objects       1600        Tier-0 and Tier-1s
  – Have to be recorded on permanent storage
ESD: Event Summary Data     500         Tier-0 and Tier-1s
  – Output of the reconstruction
  – Often large; difficult to analyze the full set
AOD: Analysis Object Data   100         Tier-0; Tier-1/2 – at least one complete copy per cloud; Tier-3 – subset
  – Quantities in particular relevant for physics analysis
  – Main input for analysis, distributed to many sites
DPD: Derived Physics Data   10          Tier-3s (e.g. your laptop)
  – Sometimes called N-tuples
TAG: event-level metadata   1           All Tiers
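To put the per-event sizes in context, a short calculation of the implied raw-data rate and daily volume; the 200 Hz nominal trigger rate used here is an assumption added for illustration, not a number from this table:

    # Per-event sizes in kB from the Event Data Model table above
    sizes_kb = {"RDO": 1600, "ESD": 500, "AOD": 100, "DPD": 10, "TAG": 1}

    trigger_rate_hz = 200            # assumed nominal ATLAS trigger rate (not from the slide)
    raw_mb_per_s = sizes_kb["RDO"] * trigger_rate_hz / 1000.0
    print(f"nominal RAW rate   ~ {raw_mb_per_s:.0f} MB/s")               # ~320 MB/s
    print(f"RAW volume per day ~ {raw_mb_per_s * 86400 / 1e6:.0f} TB")   # ~28 TB/day

    # Derived formats are dramatically smaller than RAW:
    for name, kb in sizes_kb.items():
        print(f"{name}: {kb / sizes_kb['RDO']:.1%} of RAW size")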
ATLAS Full Dress Rehearsal
Simulated events injected in the TDAQ
– Realistic physics mix in bytestream format incl. luminosity blocks
– Real data file and dataset sizes, trigger tables, data streaming
Tier-0/Tier-1 data quality, express line, calibration running
– Use of Conditions DB
Tier-0 reconstruction: ESD, AOD, TAG, DPD
– Data exports to Tier-1s and Tier-2s
Remote analysis
– at the Tier-1s:
• Reprocessing from RAW: ESD, AOD, DPD, TAG
• Remake AOD from ESD
• Group-based analysis: DPD
– at the Tier-2s and Tier-3s:
• Root-based analysis
• Trigger-aware analysis with Conditions and Trigger DB
• No MC truth, user analysis
• MC/Reco production in parallel
FDR Schedule
Round 1
1. Data streaming tests: DONE
2. Sept/Oct 07: Data preparation STARTS SOON
3. End Oct 07: Tier-0 operations tests
4. Nov 07 – Feb 08: Reprocess at Tier-1, make group DPDs
Round 2 (ASSUMING NEW G4)
1. Dec 07 – Jan 08: New data production for final round
2. Feb 08: Data prep for final round using
3. Mar 08: Reco final round (ASSUMING SRM v2.2)
4. Apr 08: DPD production at Tier-1s
5. Apr 08: More simulated data production in preparation for first data
6. May 08: final FDR
First-pass production should be validated by year-end
Reprocessing will be validated months later
Analysis roles will be validated
Ramping Up Computing Resources for LHC Data Taking
Change of LHC schedule makes little change to the resource profile
– Recall the early data is for calibration and commissioning
• This is needed either from collisions or cosmics
New T1 Evolution
                    2007    2008    2009    2010    2011    2012
Total Disk (TB)     2090    10725   20100   39529   56232   72935
Total Tape (TB)     1246    8067    15499   29423   45831   64722
Total CPU (kSI2k)   3173    18124   28426   49576   70726   91877

New T2 Evolution
                    2007    2008    2009    2010    2011    2012
Disk (TB)           1340    8602    13889   22909   31869   40828
CPU (kSI2k)         2336    17507   26973   51557   69141   86725

New CAF Evolution
                    2007    2008    2009    2010    2011    2012
Total Disk (TB)     219     1146    1780    3128    4811    6311
Total Tape (TB)     44      371     645     1043    1375    1706
Total CPU (kSI2k)   821     2081    2562    4664    6598    8533

New T0 Evolution
                    2007    2008    2009    2010    2011    2012
Total Disk (TB)     63      152     265     472     472     472
Total Tape (TB)     400     2449    5562    11894   18225   24557
Total CPU (kSI2k)   1910    3706    4059    6106    6106    6106
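As a rough illustration of the ramp-up implied by the Tier-1 table above, the growth factors from 2008 to 2012 can be computed directly from those numbers:

    # Tier-1 resource evolution taken from the table above (rounded values)
    t1 = {
        "disk_tb":   {2008: 10725, 2012: 72935},
        "tape_tb":   {2008: 8067,  2012: 64722},
        "cpu_ksi2k": {2008: 18124, 2012: 91877},
    }

    for resource, values in t1.items():
        factor = values[2012] / values[2008]
        print(f"{resource}: x{factor:.1f} growth from 2008 to 2012")
    # disk ~x6.8, tape ~x8.0, cpu ~x5.1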
ATLAS Analysis Model
Basic principle: Smaller data can be read faster
– Skimming - Keep interesting events
– Thinning - Keep interesting objects in events
– Slimming - Keep interesting info in objects
– Reduction - Build higher-level data
Derived Physics Data
– Share the schema with objects in the AOD/ESD
– Can be analyzed interactively
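To make the reduction terminology concrete, here is a toy Python sketch of skimming, thinning, slimming and reduction on a list of event records; the event content and cuts are invented for the example, not the real AOD/DPD schema:

    # Toy illustration of the reduction steps above; event content and cuts are
    # invented for the example, not the real ATLAS AOD/DPD schema.
    events = [
        {"run": 1, "event": i,
         "electrons": [{"pt": 5.0 + 3 * i, "eta": 0.1 * i, "charge": -1}],
         "jets": [{"pt": 20.0 + j, "eta": 0.2 * j, "nconst": 7} for j in range(4)]}
        for i in range(10)
    ]

    # Skimming: keep interesting events (e.g. at least one electron with pT > 20)
    skimmed = [ev for ev in events
               if any(el["pt"] > 20 for el in ev["electrons"])]

    # Thinning: keep interesting objects in events (e.g. only jets with pT > 22)
    thinned = [{**ev, "jets": [j for j in ev["jets"] if j["pt"] > 22]}
               for ev in skimmed]

    # Slimming: keep interesting info in objects (e.g. drop jet constituent counts)
    slimmed = [{**ev, "jets": [{"pt": j["pt"], "eta": j["eta"]} for j in ev["jets"]]}
               for ev in thinned]

    # Reduction: build higher-level derived quantities (a DPD-like summary)
    dpd = [{"run": ev["run"], "event": ev["event"],
            "lead_ele_pt": max(el["pt"] for el in ev["electrons"]),
            "njets": len(ev["jets"])}
           for ev in slimmed]
    print(len(events), "->", len(dpd), "events;", dpd[0])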
Farbin id 83
Analysis: Grid Tools and Experiences
On the EGEE and NDGF infrastructures ATLAS uses direct submission to the middleware via GANGA
– EGEE: LCG RB and gLite WMS
– NDGF: ARC middleware
On OSG, the PANDA system is used
– Pilot-based system
– Also available at some EGEE sites
Many users have been exposed to the grid
– Work is getting done
A simple user interface is essential to simplify usage
– But experts are required to understand the problems
Sometimes users have the impression that they are debugging the grid
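As a rough illustration of the pilot-based model mentioned above (not the actual PANDA implementation or protocol), a pilot can be thought of as a lightweight agent that lands on a worker node, pulls a payload from a central server, runs it, and reports back:

    # Schematic pilot-job loop; the server endpoint, payload format and status
    # values are illustrative assumptions, not PANDA's real protocol.
    import subprocess
    import sys
    import time

    def fetch_job_from_server():
        """Ask the central task queue for a payload (stubbed; a real pilot would contact the production server)."""
        return {"job_id": 42, "command": [sys.executable, "-c", "print('simulated payload')"]}

    def report_status(job_id, status):
        """Send heartbeat / final status back to the central server (stubbed here)."""
        print(f"job {job_id}: {status}")

    def pilot_loop(max_jobs=3, idle_sleep=5):
        """Pilot lands on a worker node, then pulls and executes payloads."""
        for _ in range(max_jobs):
            job = fetch_job_from_server()
            if job is None:
                time.sleep(idle_sleep)      # nothing to do; wait and ask again
                continue
            report_status(job["job_id"], "running")
            result = subprocess.run(job["command"])
            report_status(job["job_id"], "finished" if result.returncode == 0 else "failed")

    if __name__ == "__main__":
        pilot_loop()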
Conclusions
ATLAS computing is addressing unprecedented challenges
– we are in the final stages of mastering how to handle those challenges
The ATLAS experiment has mastered a complex multi-grid computing infrastructure at a scale close to that expected for running conditions
– Resource utilization for simulated event production
– Transfers from CERN
A coordinated shift from development to operations/services is happening in the final year of preparation
An increase in scale is expected in the facility infrastructure, along with the corresponding ability to use the new capacities effectively
User analysis activities are ramping up