ATLAS Computing
XXI International Symposium on Nuclear Electronics & Computing
Varna, Bulgaria, 10-17 September 2007
Alexandre Vaniachine
Invited talk for ATLAS Collaboration
Outline
ATLAS Computing Model
Distributed computing facilities:
– Tier-0 plus Grids: EGEE, OSG, and NDGF
Components:
– Production system (jobs)
– Data management (files)
– Databases to keep track of jobs and files, etc.
Status:
– Transition from development to operations
• Commissioning at the pit and Tier-0 with cosmics data
• Commissioning with simulation data by grid operations teams
• Distributed analysis of these data by physicists
– M4 cosmics run in August validated all these separate operations
Credits
I wish to thank the Symposium organizers for their invitation and for their hospitality
This overview is biased by my own opinions
– The CHEP07 conference last week simplified my task
I wish to thank my ATLAS collaborators for their contributions
– I have added references to their CHEP07 contributions
All ATLAS CHEP07 contributions will be published as ATLAS Computing Notes and will serve as a foundation for a collaboration paper on ATLAS Computing, to be prepared towards the end of 2007
– which will provide updates for the ATLAS Computing TDR:
http://atlas-proj-computing-tdr.web.cern.ch/atlas-proj-computing-tdr/Html/Computing-TDR.htm
ATLAS Multi-Grid Infrastructure
ATLAS computing operates uniformly on three Grids with different interfaces
Focus is shifting to physics analysis performance
– testing, integration, validation, deployment, documentation
• then operations!
40+ sites Worldwide
Farbin [id 83]
ATLAS Computing Model: Roles of Distributed Tiers
• Reprocessing of full data a few months after data taking, as soon as improved calibration and alignment constants are available
• Managed tape access: RAW, ESD
• Disk access: AOD, fraction of ESD
Ongoing Transition from Development to Operations
To facilitate operational efficiency, ATLAS computing separates development activities from operations
No more “clever developments” – pragmatic addition of vital residual services to support operations:
– such as monitoring tools (critical to operations)
ATLAS CHEP07 contributions covered in detail both development reports and operational experience:
Domain      Development      Operations experience    Monitoring tools
Databases   Viegas           Vaniachine               Andreeva
Tier-0      Nairz/Goossens   Bos, Rocha
DDM         Lassnig          Klimentov
ProdSys:
  EGEE      Retico           Espinal, Kennedy
  OSG       Maeno            Smirnov
  NDGF      Grønager         Klejst
Analysis    Farbin           Liko, Elmsheuser
Keeping Track of Jobs and Files Needs Databases
To achieve robust operations on the grids, ATLAS splits the data processing of petabytes of event data into smaller units – jobs and files
File – unit for data management in grid computing
– managed by the ATLAS Distributed Data Management (DDM) system
– ATLAS files are grouped in Datasets: DDM Central Catalogs
– AMI: ATLAS Metadata Interface DB – “Where is my dataset?”
Job – unit for data processing workflow management in grid computing
– managed by the ATLAS Production System
– ATLAS job configuration and job completion are stored in prodDB
– ATLAS jobs are grouped in Tasks: AKTR DB for physics tasks
Database Operations will be covered in a separate talk later in this session
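To make these relationships concrete, the minimal Python sketch below models files grouped into datasets and jobs grouped into tasks, with a prodDB-like record of job configuration and completion. The class and field names are purely illustrative assumptions, not the actual prodDB/DDM/AKTR schema.

    # Illustrative sketch only: names and fields are assumptions,
    # not the actual ATLAS prodDB/DDM/AKTR schema.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class File:
        guid: str            # unique file identifier
        lfn: str             # logical file name
        size_bytes: int

    @dataclass
    class Dataset:           # unit of data management (DDM-like)
        name: str
        files: List[File] = field(default_factory=list)

    @dataclass
    class Job:               # unit of workflow management (Production System-like)
        job_id: int
        transformation: str  # e.g. simulation, reconstruction
        inputs: List[str] = field(default_factory=list)   # input LFNs
        status: str = "defined"                           # defined / running / done / failed

    @dataclass
    class Task:              # physics task grouping many jobs (AKTR-like)
        task_id: int
        jobs: List[Job] = field(default_factory=list)

    # prodDB-like bookkeeping: job configuration and completion, keyed by job_id
    prod_db: Dict[int, Job] = {}

    def submit(task: Task, job: Job) -> None:
        """Record the job in its task and in the production database."""
        task.jobs.append(job)
        prod_db[job.job_id] = job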
ATLAS Multi-Grid Operations Architecture and Results
Leveraging the underlying database infrastructure, the ATLAS Production System and ATLAS DDM successfully manage the simulation workflow on three production grids: EGEE, OSG and NDGF
Statistics of success – from the ATLAS production database
Leveraging Growing Resources on Three Grids
Latest snapshot of ATLAS resources [Yuri Smirnov, CHEP talk 184]
Grid infrastructure operated by         Resources available for ATLAS
                                        Tier-1s   CPU (count)   Disk (PB)
EGEE: Enabling Grids for E-SciencE      8         3500          0.4
OSG: Open Science Grid                  1         2500          0.4
NDGF: Nordic Data Grid Facility         1         500           0.06
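Summing the snapshot gives the aggregate scale across the three grids; a trivial sketch using only the numbers from the table above:

    # Resources available for ATLAS per grid (from the snapshot above)
    resources = {
        "EGEE": {"tier1s": 8, "cpu": 3500, "disk_pb": 0.4},
        "OSG":  {"tier1s": 1, "cpu": 2500, "disk_pb": 0.4},
        "NDGF": {"tier1s": 1, "cpu": 500,  "disk_pb": 0.06},
    }

    total_tier1s = sum(r["tier1s"] for r in resources.values())   # 10 Tier-1 centers
    total_cpu    = sum(r["cpu"] for r in resources.values())      # 6500 CPUs
    total_disk   = sum(r["disk_pb"] for r in resources.values())  # ~0.86 PB
    print(total_tier1s, total_cpu, total_disk)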
CERN to Tier-1 Transfer Rates
ATLAS has the largest nominal CERN to Tier-1 transfer rate
– Tests this spring reached ~75% of the nominal target
Successful use of all ten ATLAS Tier-1 centers:
Validating the Computing Model with Realistic Data Rates
M3 Cosmics Run (mid-July)
– Cosmics produced about 100 TB in 2 weeks
– Stressed offline by running at 4 times the nominal rate
• LAr 32 samples test
M4 Cosmics Run: August 23 – September 3
– Metrics for success
• full-rate Tier-0 processing
• data exported to 5 of 10 Tier-1s and stored
• for 2 of 5 Tier-1s, exports to at least two Tier-2s
• quasi real-time analysis in at least one Tier-2
• reprocessing in September in at least one Tier-1
M5 Cosmics Run scheduled for October 16-23
M6 Cosmics Run will run from end of December until real data
– Incremental goals, reprocessing between runs
– Will run close to nominal rate
– Maybe ~420 TB by start of run, plus Monte Carlo
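For scale, a back-of-the-envelope check of the sustained average throughput implied by the M3 figures above, using only the numbers quoted on this slide:

    # Average throughput implied by the M3 cosmics run figures quoted above:
    # about 100 TB accumulated in 2 weeks.
    volume_tb = 100
    seconds = 2 * 7 * 24 * 3600                  # two weeks in seconds
    avg_rate_mb_s = volume_tb * 1e6 / seconds    # 1 TB = 1e6 MB
    print(f"average rate ~ {avg_rate_mb_s:.0f} MB/s")   # ~83 MB/s sustained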
Validating Computing Model with Realistic Data Rates
M4 Cosmics Run: August 23 – September 3
Raw data distribution in real time from online to Tier-0 and to all Tier-1s
– The full chain worked with all ten Tier-1s at the target rate
[Plot: throughput in MB/s from T0 to all T1’s – rates ramped up every day, approaching the expected maximum rate on the last day of the run]
Real-time M4 Data Analysis
[Event displays: tracks in the muon chambers and in the TRT]
Analysis done simultaneously in European and US T1/2 sites
An Important Milestone
Metrics for success
– Full-rate T0 processing: OK
– Data exported to 5 / 10 T1’s and stored: OK, and did more!
– For 2 / 5 T1’s, exports to at least 2 T2’s: OK
– Quasi real-time analysis in at least 1 T2: OK, and did more!
– Reprocessing in Sept. in at least 1 T1: in preparation
Last week ATLAS demonstrated for the first time mastery of the whole data chain: from the measurement of a real cosmic-ray muon in the detector to almost real-time analysis at sites in Europe and the US, with all steps in between
ATLAS Event Data Model
Data Type                   Size (kB)   Accessibility
RDO: Raw Data Objects       1600        Tier-0 and Tier-1s
  – Have to be recorded on permanent storage
ESD: Event Summary Data     500         Tier-0 and Tier-1s
  – Output of the reconstruction
  – Often large; difficult to analyze the full set
AOD: Analysis Object Data   100         Tier-0; Tier-1/2 – at least one complete copy per cloud; Tier-3 – subset
  – Quantities in particular relevant for physics analysis
  – Main input for analysis, distributed to many sites
DPD: Derived Physics Data   10          Tier-3s (e.g. your laptop)
  – Sometimes called N-tuples
TAG: event-level metadata   1           All Tiers
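To put the per-event sizes in context, a short calculation of the implied raw-data rate and daily volume; the 200 Hz nominal trigger rate used here is an assumption added for illustration, not a number from this table:

    # Per-event sizes in kB from the Event Data Model table above
    sizes_kb = {"RDO": 1600, "ESD": 500, "AOD": 100, "DPD": 10, "TAG": 1}

    trigger_rate_hz = 200            # assumed nominal ATLAS trigger rate (not from the slide)
    raw_mb_per_s = sizes_kb["RDO"] * trigger_rate_hz / 1000.0
    print(f"nominal RAW rate   ~ {raw_mb_per_s:.0f} MB/s")               # ~320 MB/s
    print(f"RAW volume per day ~ {raw_mb_per_s * 86400 / 1e6:.0f} TB")   # ~28 TB/day

    # Derived formats are dramatically smaller than RAW:
    for name, kb in sizes_kb.items():
        print(f"{name}: {kb / sizes_kb['RDO']:.1%} of RAW size")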
ATLAS Full Dress Rehearsal
Simulated events injected in the TDAQ
– Realistic physics mix in bytestream format incl. luminosity blocks
– Real data file and dataset sizes, trigger tables, data streaming
Tier-0/Tier-1 data quality, express line, calibration running
– Use of Conditions DB
Tier-0 reconstruction: ESD, AOD, TAG, DPD
– Data exports to Tier-1s and Tier-2s
Remote analysis
– at the Tier-1s:
• Reprocessing from RAW: ESD, AOD, DPD, TAG
• Remake AOD from ESD
• Group-based analysis: DPD
– at the Tier-2s and Tier-3s:
• Root-based analysis
• Trigger-aware analysis with Conditions and Trigger DB
• No MC truth, user analysis
• MC/Reco production in parallel
FDR Schedule
Round 1
1. Data streaming tests: DONE
2. Sept/Oct 07: Data preparation STARTS SOON
3. End Oct 07: Tier-0 operations tests
4. Nov 07 – Feb 08: Reprocess at Tier-1, make group DPDs
Round 2 (ASSUMING NEW G4)
1. Dec 07 – Jan 08: New data production for final round
2. Feb 08: Data prep for final round using
3. Mar 08: Reco final round (ASSUMING SRM v2.2)
4. Apr 08: DPD production at Tier-1s
5. Apr 08: More simulated data production in preparation for first data
6. May 08: final FDR
First-pass production should be validated by year-end
Reprocessing will be validated months later
Analysis roles will be validated
Ramping Up Computing Resources for LHC Data Taking
Change of LHC schedule makes little change to the resource profile
– Recall the early data is for calibration and commissioning
• This is needed either from collisions or cosmics
New T1 Evolution
                    2007    2008    2009    2010    2011    2012
Total Disk (TB)     2090    10725   20100   39529   56232   72935
Total Tape (TB)     1246    8067    15499   29423   45831   64722
Total CPU (kSI2k)   3173    18124   28426   49576   70726   91877

New T2 Evolution
                    2007    2008    2009    2010    2011    2012
Disk (TB)           1340    8602    13889   22909   31869   40828
CPU (kSI2k)         2336    17507   26973   51557   69141   86725

New CAF Evolution
                    2007    2008    2009    2010    2011    2012
Total Disk (TB)     219     1146    1780    3128    4811    6311
Total Tape (TB)     44      371     645     1043    1375    1706
Total CPU (kSI2k)   821     2081    2562    4664    6598    8533

New T0 Evolution
                    2007    2008    2009    2010    2011    2012
Total Disk (TB)     63      152     265     472     472     472
Total Tape (TB)     400     2449    5562    11894   18225   24557
Total CPU (kSI2k)   1910    3706    4059    6106    6106    6106
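As a rough illustration of the ramp-up implied by the Tier-1 table above, the growth factors from 2008 to 2012 can be computed directly from those numbers:

    # Tier-1 resource evolution taken from the table above (rounded values)
    t1 = {
        "disk_tb":   {2008: 10725, 2012: 72935},
        "tape_tb":   {2008: 8067,  2012: 64722},
        "cpu_ksi2k": {2008: 18124, 2012: 91877},
    }

    for resource, values in t1.items():
        factor = values[2012] / values[2008]
        print(f"{resource}: x{factor:.1f} growth from 2008 to 2012")
    # disk ~x6.8, tape ~x8.0, cpu ~x5.1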
ATLAS Analysis Model
Basic principle: Smaller data can be read faster
– Skimming - Keep interesting events
– Thinning - Keep interesting objects in events
– Slimming - Keep interesting info in objects
– Reduction - Build higher-level data
Derived Physics Data
– Share the schema with objects in the AOD/ESD
– Can be analyzed interactively
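To make the reduction terminology concrete, here is a toy Python sketch of skimming, thinning, slimming and reduction on a list of event records; the event content and cuts are invented for the example, not the real AOD/DPD schema:

    # Toy illustration of the reduction steps above; event content and cuts are
    # invented for the example, not the real ATLAS AOD/DPD schema.
    events = [
        {"run": 1, "event": i,
         "electrons": [{"pt": 5.0 + 3 * i, "eta": 0.1 * i, "charge": -1}],
         "jets": [{"pt": 20.0 + j, "eta": 0.2 * j, "nconst": 7} for j in range(4)]}
        for i in range(10)
    ]

    # Skimming: keep interesting events (e.g. at least one electron with pT > 20)
    skimmed = [ev for ev in events
               if any(el["pt"] > 20 for el in ev["electrons"])]

    # Thinning: keep interesting objects in events (e.g. only jets with pT > 22)
    thinned = [{**ev, "jets": [j for j in ev["jets"] if j["pt"] > 22]}
               for ev in skimmed]

    # Slimming: keep interesting info in objects (e.g. drop jet constituent counts)
    slimmed = [{**ev, "jets": [{"pt": j["pt"], "eta": j["eta"]} for j in ev["jets"]]}
               for ev in thinned]

    # Reduction: build higher-level derived quantities (a DPD-like summary)
    dpd = [{"run": ev["run"], "event": ev["event"],
            "lead_ele_pt": max(el["pt"] for el in ev["electrons"]),
            "njets": len(ev["jets"])}
           for ev in slimmed]
    print(len(events), "->", len(dpd), "events;", dpd[0])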
Farbin id 83
Analysis: Grid Tools and Experiences
On the EGEE and NDGF infrastructures ATLAS uses direct submission to the middleware via GANGA
– EGEE: LCG RB and gLite WMS
– NDGF: ARC middleware
On OSG, the PANDA system is used
– Pilot-based system
– Also available at some EGEE sites
Many users have been exposed to the grid
– Work is getting done
A simple user interface is essential to simplify usage
– But experts are required to understand the problems
Sometimes users have the impression that they are debugging the grid
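As a rough illustration of the pilot-based model mentioned above (not the actual PANDA implementation or protocol), a pilot can be thought of as a lightweight agent that lands on a worker node, pulls a payload from a central server, runs it, and reports back:

    # Schematic pilot-job loop; the server endpoint, payload format and status
    # values are illustrative assumptions, not PANDA's real protocol.
    import subprocess
    import sys
    import time

    def fetch_job_from_server():
        """Ask the central task queue for a payload (stubbed; a real pilot would contact the production server)."""
        return {"job_id": 42, "command": [sys.executable, "-c", "print('simulated payload')"]}

    def report_status(job_id, status):
        """Send heartbeat / final status back to the central server (stubbed here)."""
        print(f"job {job_id}: {status}")

    def pilot_loop(max_jobs=3, idle_sleep=5):
        """Pilot lands on a worker node, then pulls and executes payloads."""
        for _ in range(max_jobs):
            job = fetch_job_from_server()
            if job is None:
                time.sleep(idle_sleep)      # nothing to do; wait and ask again
                continue
            report_status(job["job_id"], "running")
            result = subprocess.run(job["command"])
            report_status(job["job_id"], "finished" if result.returncode == 0 else "failed")

    if __name__ == "__main__":
        pilot_loop()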
Conclusions
ATLAS computing is addressing unprecedented challenges
– we are in the final stages of mastering how to handle those challenges
The ATLAS experiment has mastered a complex multi-grid computing infrastructure at a scale close to that expected for running conditions
– Resource utilization for simulated event production
– Transfers from CERN
A coordinated shift from development to operations/services is happening in the final year of preparation
An increase in scale is expected in the facility infrastructure, along with the corresponding ability to use the new capacities effectively
User analysis activities are ramping up