ATL-SOFT-SLIDE-2014-361 – 02/07/2014
Dario Barberis: ATLAS Computing Challenges
ICHEP 2014 – 3 July 2014
ATLAS Computing Challenges before the next LHC run
Dario Barberis (Genoa University/INFN)
On behalf of the ATLAS Collaboration
Outline
● Run2 is more than Run1 at higher energies
● Technology improvements in software
■ Simulation
■ Reconstruction
■ Infrastructure
■ Data and analysis models
● Major rework of main distributed computing tools
■ Distributed Data Management: DQ2 -> Rucio, data federation
■ Workload management: ProdSys2+PanDA
■ EventIndex, Event Service
■ Opportunistic resources: HLT farm, SuperComputers, Boinc
● Data Challenge DC14
Run1 Software & Computing
● Software based on the Gaudi/Athena framework
■ Designed and developed in 2000-2005
■ Improved continuously
■ Now based on Python (steering) and C++ code
● Computing model derived from MONARC (1999)
■ Hierarchical data distribution and access
■ Assumed network bandwidth is the limitation
■ "Jobs go to the data"
● It worked! Even beyond expectations:
■ Over 150k concurrent jobs running
■ 350M jobs completed in 2013
■ Not only production and simulation but also analysis (>50% of the jobs)
■ 1.2 EB of data read in by ATLAS grid jobs in 2013
! 82% by analysis jobs
● Lots of physics results!
[Plots: running jobs on ATLAS Grid sites in 2013 (peak above 150k), and data volume processed in 2013, broken down into Analysis, MC Reconstruction, Group Production, MC Simulation and Others]
Challenges of Run2
● Grid resources are limited by funding
■ Pileup drives resource needs
■ Full simulation is costly
● Physics requires an increased trigger rate
■ Run-2 data-taking rate: 1 kHz
● Technologies are evolving fast
■ Software needs to follow
● Today's infrastructure:
■ x86-based, 2-3 GB per core, commodity CPU servers
■ Applications running in parallel on separate cores
[Diagram: hierarchical T0 -> T1 -> T2 -> T3 tier topology]
● Network bandwidth is the fastest-growing resource
■ Data transfer to remote jobs is less of a problem
! Flexible data placement
! Data-popularity-driven replication
! Remote I/O
! Storage federations
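The popularity-driven replication idea above can be sketched as a simple policy function. This is a minimal sketch for illustration only: the function name, thresholds and 90-day window are invented, not the actual ATLAS policy.

```python
def target_replicas(accesses_last_90d, min_replicas=1, max_replicas=5):
    """Map a dataset's recent access count to a desired replica count.

    A cold dataset keeps a single custodial copy; popular datasets earn
    extra replicas up to a cap. All numbers are made up for illustration.
    """
    if accesses_last_90d == 0:
        return min_replicas                  # cold data: one custodial copy
    extra = accesses_last_90d // 10          # one extra replica per 10 accesses
    return min(min_replicas + 1 + extra, max_replicas)
```

A policy of this shape lets frequently read datasets gain replicas near the jobs that use them, while unused data is thinned out to save disk.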
Integrated Simulation Framework
● ISF allows for the following:
■ Mixing of fast and full simulation within a physics event
■ Uniform truth handling by all simulation flavours
■ Consistent detector boundaries for all simulation flavours
■ Complex schemes for routing particles (killing, changing simulation flavours)
● Improved geometry description thanks to hard work by the detector groups
● Potential improvements in lateral hadronic shower shapes from the new Geant4 version 9.6p2
● ISF will give more flexibility for fast simulation in Run2
● Simulation (and digitisation + reconstruction) for the DC14 Run1 pre-production samples is ready now; production will launch soon
● MC15 is expected to be ready 2-3 months before the 2015 summer conferences
● Fast simulation is an important part of MC15 for Run2
Optimisation of CPU usage
● Pile-up drives resource needs
■ Tracking is the resource driver in reconstruction
● Work on technology to improve the current algorithms:
■ Modified track seeding to exploit the 4th pixel layer
■ Eigen instead of CLHEP: faster vector and matrix algebra
■ Use of vectorised trigonometric functions
■ F90 to C++ for the B-field (speed improvement in Geant4 as well)
■ Simplified data model design, less OO (which was the "thing to do" 10 years ago)
● Results visible in a massive reduction of the reconstruction time per event
[Plot: reconstruction time per event, Run1 vs now, for t-tbar simulation at 14 TeV, 25 ns bunches, <pileup> = 40]
New Analysis Data Model
[Pie chart: xAOD event sizes [%] by domain: Calo, InDet, Jet/MET, Btag, Egamma, Muon, Tau, Pflow, trigger]
● The goal: create a data format that is produced directly by reconstruction and can be conveniently used in analysis
■ Eliminates the need for ROOT n-tuple formats (D3PD) that were almost a full copy of the old AOD (Analysis Object Data) information
● Combines the best features of the AOD and D3PD files:
■ Readable in a "basic way" even from vanilla ROOT, and in a fully functional way after loading just a small number of libraries
■ Provides the same flexibility for "slimming" that the D3PDs were capable of:
! Select which properties of objects to save into a given file
! "Decorate" objects at the analysis stage with additional information
● Developed the infrastructure to perform on the "primary xAODs" all the operations users were doing in their analyses starting from the "primary D3PDs"
● Will use an analysis-motivated optimisation of I/O settings:
■ Primary xAODs are meant mainly for analysis from Athena: good performance for reading a large part of the event data for every event in the file
■ Derived xAODs are meant mainly for analysis from ROOT: good performance for reading a small number of variables for a lot of events
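The slimming and decoration ideas can be illustrated with a toy object model. This is a sketch, not the xAOD EDM: the class, the property names and the `myBDTScore` decoration are all invented.

```python
class ToyXAODObject:
    """Stand-in for an xAOD object: core properties set by reconstruction,
    plus 'decorations' attached later at the analysis stage."""

    def __init__(self, **props):
        self.props = dict(props)       # fixed content from reconstruction
        self.decorations = {}          # user-added, analysis-level content

    def decorate(self, name, value):
        self.decorations[name] = value


def slim(obj, keep):
    """Write out only the selected core properties, plus all decorations."""
    out = {k: v for k, v in obj.props.items() if k in keep}
    out.update(obj.decorations)
    return out


el = ToyXAODObject(pt=42.0, eta=1.1, phi=0.3, cluster_e=44.5)
el.decorate("myBDTScore", 0.87)        # analysis-stage decoration
slimmed = slim(el, keep={"pt", "eta"})
# slimmed keeps pt, eta and the decoration, but drops phi and cluster_e
```

The point is that the persistent content of an object is configurable per output file, while users can still attach and save their own variables without changing the reconstruction format.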
New Analysis Flow Model
● The Run2 analysis model centralises the intermediate-level formats ("derivations") and data-handling tools, which in Run1 were handled by users
■ Non-optimal, especially for cross-team analyses
● Provides analysis-ready, ROOT-readable reduced data formats ("DxAOD")
● Each derivation is defined by a single set of Athena jobOptions, defined by physics and/or performance groups
● A key part of the derivation framework is the concept of train production, where a single job produces a number of independent output formats from a single input file
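The train production concept can be sketched as a single pass over the input that feeds several independent "carriages". The selection functions and DxAOD names below are invented for the sketch.

```python
def derivation_train(events, derivations):
    """One pass over the input; every 'carriage' applies its own
    selection and collects its own output list."""
    outputs = {name: [] for name in derivations}
    for evt in events:                         # the input is read only once
        for name, select in derivations.items():
            if select(evt):
                outputs[name].append(evt)
    return outputs


events = [{"n_jets": 3, "n_muons": 1},
          {"n_jets": 5, "n_muons": 0},
          {"n_jets": 6, "n_muons": 2}]
out = derivation_train(events, {
    "DxAOD_A": lambda e: e["n_jets"] >= 4,     # selections are illustrative
    "DxAOD_B": lambda e: e["n_muons"] >= 1,
})
```

The design saves I/O: N derivations cost one read of the input instead of N separate skimming jobs each re-reading the same file.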
AthenaMP and multi-core jobs
● AthenaMP is an event-parallel version of Athena:
■ It runs serially through the configuration and initialisation phases of the job
■ It then uses a Unix fork to create a task farm of sub-processes, the workers
! which actually process all the events assigned to the given job
■ A separate process merges the output files produced by each independent worker
■ Used in 2014 to run the new G4 simulation
■ Under test for reconstruction
● Several multi-core queues are enabled at ATLAS Grid sites
■ Good CPU and memory usage efficiency
● The infrastructure is also usable on Cloud and HPC facilities
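The fork-then-merge pattern described above can be sketched in plain Python. This is a stand-in, not Athena code: the worker file naming and the "event processing" (writing numbers to a file) are invented for the sketch.

```python
import os

def run_task_farm(events, n_workers=2):
    """Mimic the AthenaMP pattern: serial initialisation, fork() a farm
    of workers that each process their share of events into their own
    output file, then a separate merge step combines the outputs."""
    # --- serial configuration/initialisation phase happens here ---
    pids = []
    for w in range(n_workers):
        pid = os.fork()
        if pid == 0:                           # child: one worker
            with open(f"worker_{w}.out", "w") as f:
                for e in events[w::n_workers]: # this worker's event share
                    f.write(f"{e}\n")          # stand-in for event processing
            os._exit(0)
        pids.append(pid)
    for pid in pids:                           # parent waits for the farm
        os.waitpid(pid, 0)
    merged = []                                # separate merge step
    for w in range(n_workers):
        with open(f"worker_{w}.out") as f:
            merged += [int(line) for line in f]
        os.remove(f"worker_{w}.out")
    return sorted(merged)
```

Forking after initialisation lets the workers share the memory pages of the configured framework via copy-on-write, which is what makes the memory usage per core acceptable.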
[Plot: 8-core jobs running in March-May 2014, up to ~5k]
Rucio: a new DDM implementation
● The current Distributed Data Management (DDM) implementation (DQ2) successfully served ATLAS data during Run-1:
■ 160 PB
■ 640M files
■ 130 grid sites
■ 1000s of users
● But DQ2 will not scale for Run-2:
■ Heavy operational burden
■ Difficult to add new features and technologies
■ Many lessons learned during Run-1
● Rucio is the new DDM implementation for LHC Run2:
■ Better handling of users, groups and activities
■ Data discovery based on name and metadata
■ No dependence on an external file catalog (deterministic relation LFN -> PFN)
■ Supports multiple data management protocols in addition to SRM, e.g. WebDAV, xrootd, S3, POSIX, GridFTP
■ Smarter and more automated data placement tools (rules, subscriptions)
● More information in the dedicated talk tomorrow (C. Serfon)
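The deterministic LFN -> PFN relation can be illustrated with a hash-based directory layout in the spirit of Rucio's approach. The storage prefix and the exact path convention below are assumptions made for the sketch, not the authoritative Rucio algorithm.

```python
import hashlib

def lfn_to_pfn(scope, name, prefix="root://storage.example//rucio"):
    """Derive the physical path purely from scope and file name, so no
    external file catalog lookup is ever needed: any client can compute
    the same path independently. Prefix and layout are illustrative."""
    h = hashlib.md5(f"{scope}:{name}".encode()).hexdigest()
    # two hash-derived directory levels spread files evenly on disk
    return f"{prefix}/{scope}/{h[0:2]}/{h[2:4]}/{name}"
```

Because the mapping is a pure function of (scope, name), dropping the central file catalog removes both a scaling bottleneck and a single point of failure.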
Xrootd Data Federation
● The data federation is needed to allow remote access to data in case of:
■ Unavailability of a given file in the local storage element
■ Sparse access to single events
● FAX is the ATLAS implementation of an xrootd-based data federation
■ 2 top-level redirectors: EU and US
! Topology as in the diagram
■ Coverage so far: 56% of sites and 85% of data
● Failover works stably:
■ Tested that all the sites deliver data efficiently
■ Test tasks are submitted to sites that don't have the data, so that FAX is invoked
■ Very satisfied with the error rate: 0.3% of jobs fail due to FAX issues
● Overflow works; it needs larger usage and tuning of the weights
● Helps make xAOD access efficient over the WAN
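The failover behaviour can be sketched as follows. The redirector hostnames are placeholders, and the `local_open`/`remote_open` callables stand in for real xrootd opens; this is the control flow, not the FAX client.

```python
def open_with_failover(lfn, local_open, remote_open,
                       redirectors=("root://fax-eu.example",
                                    "root://fax-us.example")):
    """Sketch of federated read failover: try the local storage element
    first; if that fails, ask each top-level redirector in turn to
    locate a remote replica."""
    try:
        return local_open(lfn)               # normal case: local replica
    except IOError:
        pass                                 # missing or unreachable locally
    for redirector in redirectors:
        try:
            return remote_open(f"{redirector}/{lfn}")
        except IOError:
            continue                         # try the next redirector
    raise IOError(f"no replica of {lfn} reachable")
```

The same logic serves both use cases on the slide: transparent recovery from a missing local file, and deliberate overflow of jobs to sites that do not hold the data.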
ProdSys2+PanDA
● We developed and are now commissioning the new distributed production framework, ProdSys2
● Core components:
■ Request I/F: allows production managers to define a request
■ DEfT: translates the user request into task definitions
■ JEDI: generates the job definitions
■ PanDA: executes the jobs on the distributed infrastructure
● JEDI+PanDA also provides the new framework for distributed analysis
● More information in the dedicated talk tomorrow (K. De)
EventIndex & Event Service
● The EventIndex is a catalogue of all ATLAS events
■ In any state of processing or format
■ Contains simple event info, plus references to files and internal pointers to events
● Implemented as 3 major components:
■ Data collection and transfer system
■ Core storage in Hadoop technology
■ Web server for data access
● More information in the dedicated talk tomorrow (A. Fernandez)
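A minimal sketch of what such a catalogue provides: a lookup from event identity to file reference and internal pointer. The class and field names are invented; the real system keeps this in Hadoop behind a web server, not an in-memory dict.

```python
class ToyEventIndex:
    """Stand-in for the EventIndex: maps (run, event) to the GUID of a
    file containing that event plus an internal pointer within it."""

    def __init__(self):
        self._index = {}

    def add(self, run, event, guid, pointer):
        """Record where one processed copy of an event lives."""
        self._index[(run, event)] = {"guid": guid, "pointer": pointer}

    def lookup(self, run, event):
        """Return the file reference for an event, or None if unknown."""
        return self._index.get((run, event))
```

A catalogue of this shape is what makes single-event retrieval possible: given a run/event number, a client can find the exact file and position without scanning datasets.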
● The Event Service is a novel way to distribute payload to workers in different computing environments
■ Clouds, HPCs, ATLAS@Home…
● Uses AthenaMP, remote I/O (FAX) and the EventIndex, together with JEDI+PanDA
■ Makes efficient use of opportunistic computing resources
● Now under commissioning:
■ All the bits and pieces are there
■ Tests are ongoing
Resource diversification: HLT farm
● Sim@P1: simulation jobs run on virtual machines in the HLT nodes
● Implementation based on OpenStack and CernVM
● ~20 minutes are needed to launch VMs for the entire HLT farm
● The VMs host up to 20k Condor job slots that are served by PanDA
● Condor job slots are automatically discovered; no manual action is needed to fill P1 job slots
● Condor works well with opportunistic computing
● VMs can be killed within 10 minutes if a return to HLT operations is needed
● Killed jobs are retried elsewhere by the PanDA infrastructure
● Disk I/O and memory considerations limit Sim@P1 to MC generation of hits (Geant4)
■ This may no longer be true with an updated network

[Plot: Sim@P1 production at CERN, July '13 – June '14, up to ~30k running jobs]
Resource diversification: HPC
● HPC vs Grid:
■ HPC: High Performance Computing
! Leadership Computing Facilities
! Massively Parallel Processors
! Clustered machines
■ Grid: High Throughput Computing
! Homogeneous processors, Linux OS
! Variety of technologies and middleware
! ATLAS is extremely successful using the Grid
● ATLAS has members with access to HPC machines
■ Already successfully used in ATLAS as part of NorduGrid and in the US and Germany
● HPC nodes have little outside connectivity and no local installation possibilities
■ Need a non-invasive interface like the ARC-CE (or similar) and a way to connect to CVMFS (for software) and Frontier (for database access) through Squids
● Many elements of the HEP software stack (Geant4, ROOT, Alpgen, Sherpa…) have been made to run on many different HPCs
● There is strong interest in and support for the ATLAS HPC activity. We have been awarded in excess of 63 million CPU-hours over the next 12 months.
■ This is ~6% of our Grid use and half of our event generation budget
● All the approaches will present a common interface to PanDA, even if, for reasons of architecture or policy, the back ends look very different
[Diagram: ARC-CE use with HPC]
Resource diversification: ATLAS@Home
● Volunteer computing using BOINC
■ SETI@Home, Einstein@Home, LHC@Home… now ATLAS@Home!
● We set up a test server with an ARC-CE and a BOINC server running the ATLAS@Home app
■ BOINC PanDA queue with very-low-priority MC simulation jobs
■ 10 events/job, up to 1000 running jobs now
■ Average: 6000 events/day (0.2% of the total Grid)
● Near-future plans:
■ Merge with the existing CERN-IT infrastructure
■ Outreach campaign: get volunteers
■ Potential solution for Tier-3/institute desktop clusters
● To participate:
■ Install the BOINC client and VirtualBox
■ Register for ATLAS@Home and connect the client
■ The client downloads the CernVM image (once, ~500 MB) and the input files (~50 MB per job) and runs the job in VirtualBox

[Diagram: ATLAS@Home architecture: PanDA Server -> ARC Control Tower -> ARC CE (session directory, BOINC LRMS plugin, proxy cert, DB) -> BOINC server -> volunteer PC running the BOINC client with a VM and a shared directory; access to Grid catalogs and storage]
[Plot: ATLAS@Home running jobs, June '14, up to ~1k]
Test of new features: DC14
● The overall goal of the data challenge is to get ATLAS ready for Run2 physics.
● To achieve this we need to:
■ Commission the Integrated Simulation Framework (ISF) in the context of physics analyses
■ Run a large-scale test of the updated reconstruction algorithms
■ Run a large-scale test of the upgraded distributed computing environment
■ Test the Run2 analysis model
■ Gain experience with the Run2 analysis framework
● This programme is broken down into technical components:
■ Partial reprocessing of Run1 data (for the analysis challenge)
■ Production of new MC events with 2015 geometry and run conditions
■ Reconstruction and distribution of the produced data, including cosmics from "M" runs
■ Data analysis challenge
● The bulk of this programme runs in the second half of 2014 – now!

[Figure: draft DC14 schedule]
Outlook
● At the end of 2012 we defined an ambitious plan for improvements to the software and computing infrastructure and tools
● It's all coming together about now:
■ New simulation framework, improved reconstruction algorithms, faster tools
■ New workload and data management systems
■ New operation model for analysis and for distributed computing
● Data Challenge DC14 is testing all components of this improved system
● We'll be ready to take new LHC data in 2015