ATL-SOFT-SLIDE-2014-361 – 02/07/2014
Dario Barberis: ATLAS Computing Challenges
ICHEP 2014 – 3 July 2014
ATLAS Computing Challenges before the next LHC run
Dario Barberis (Genoa University/INFN)
On behalf of the ATLAS Collaboration
Outline
● Run2 is more than Run1 at higher energies
● Technology improvements in software
■ Simulation
■ Reconstruction
■ Infrastructure
■ Data and analysis models
● Major rework of main distributed computing tools
■ Distributed Data Management: DQ2 -> Rucio, data federation
■ Workload management: ProdSys2+PanDA
■ EventIndex, Event Service
■ Opportunistic resources: HLT farm, SuperComputers, Boinc
● Data Challenge DC14
Run1 Software & Computing
● Software based on the Gaudi/Athena framework
■ Designed and developed in 2000-2005
■ Improved continuously
■ Now based on Python (steering) and C++ code
● Computing model derived from MONARC (1999)
■ Hierarchical data distribution and access
■ Assumed network bandwidth is the limitation
■ "Jobs go to the data"
● It worked! Even beyond expectations:
■ Over 150k concurrent jobs running
■ 350M jobs completed in 2013
■ Not only production and simulation but also analysis (>50% of the jobs)
■ 1.2 EB of data read in by ATLAS grid jobs in 2013
! 82% by analysis jobs
● Lots of physics results!
[Plots: running jobs on ATLAS Grid sites in 2013 (peak above 150k), and data volume processed in 2013, broken down into Analysis, MC Reconstruction, Group Production, MC Simulation and Others]
Challenges of Run2
● Grid resources are limited by funding
■ Pileup drives resource needs
■ Full simulation is costly
● Physics requires an increased trigger rate
■ Run-2 data-taking rate: 1 kHz
● Technologies are evolving fast
■ Software needs to follow
● Today's infrastructure:
■ x86-based, 2-3 GB per core, commodity CPU servers
■ Applications running in parallel on separate cores
[Diagram: hierarchical T0 -> T1 -> T2 -> T3 tier topology]
● Network bandwidth is the fastest-growing resource
■ Data transfer to remote jobs is less of a problem
! Flexible data placement
! Data-popularity-driven replication
! Remote I/O
! Storage federations
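The popularity-driven replication idea above can be sketched as a simple policy function. This is a minimal sketch for illustration only: the function name, thresholds and 90-day window are invented, not the actual ATLAS policy.

```python
def target_replicas(accesses_last_90d, min_replicas=1, max_replicas=5):
    """Map a dataset's recent access count to a desired replica count.

    A cold dataset keeps a single custodial copy; popular datasets earn
    extra replicas up to a cap. All numbers are made up for illustration.
    """
    if accesses_last_90d == 0:
        return min_replicas                  # cold data: one custodial copy
    extra = accesses_last_90d // 10          # one extra replica per 10 accesses
    return min(min_replicas + 1 + extra, max_replicas)
```

A policy of this shape lets frequently read datasets gain replicas near the jobs that use them, while unused data is thinned out to save disk.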
Integrated Simulation Framework
● ISF allows for the following:
■ Mixing of fast and full simulation within a physics event
■ Uniform truth handling by all simulation flavours
■ Consistent detector boundaries for all simulation flavours
■ Complex schemes for routing particles (killing, changing simulation flavours)
● Improved geometry description thanks to hard work by the detector groups
● Potential improvements in lateral hadronic shower shapes from the new Geant4 version 9.6p2
● ISF will give more flexibility for fast simulation in Run2
● Simulation (and digitisation + reconstruction) for the DC14 Run1 pre-production samples is ready now; production will launch soon
● MC15 is expected to be ready 2-3 months before the 2015 summer conferences
● Fast simulation is an important part of MC15 for Run2
Optimisation of CPU usage
● Pile-up drives resource needs
■ Tracking is the resource driver in reconstruction
● Work on technology to improve the current algorithms:
■ Modified track seeding to exploit the 4th pixel layer
■ Eigen instead of CLHEP: faster vector and matrix algebra
■ Use of vectorised trigonometric functions
■ F90 to C++ for the B-field (speed improvement in Geant4 as well)
■ Simplified data model design, less OO (which was the "thing to do" 10 years ago)
● Results visible in a massive reduction of the reconstruction time per event
[Plot: reconstruction time per event, Run1 vs now, for t-tbar simulation at 14 TeV, 25 ns bunches, <pileup> = 40]
New Analysis Data Model
[Pie chart: xAOD event sizes [%] by domain: Calo, InDet, Jet/MET, Btag, Egamma, Muon, Tau, Pflow, trigger]
● The goal: create a data format that is produced directly by reconstruction and can be conveniently used in analysis
■ Eliminates the need for ROOT n-tuple formats (D3PD) that were almost a full copy of the old AOD (Analysis Object Data) information
● Combines the best features of the AOD and D3PD files:
■ Readable in a "basic way" even from vanilla ROOT, and in a fully functional way after loading just a small number of libraries
■ Provides the same flexibility for "slimming" that the D3PDs were capable of:
! Select which properties of objects to save into a given file
! "Decorate" objects at the analysis stage with additional information
● Developed the infrastructure to perform on the "primary xAODs" all the operations users were doing in their analyses starting from the "primary D3PDs"
● Will use an analysis-motivated optimisation of I/O settings:
■ Primary xAODs are meant mainly for analysis from Athena: good performance for reading a large part of the event data for every event in the file
■ Derived xAODs are meant mainly for analysis from ROOT: good performance for reading a small number of variables for a lot of events
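The slimming and decoration ideas can be illustrated with a toy object model. This is a sketch, not the xAOD EDM: the class, the property names and the `myBDTScore` decoration are all invented.

```python
class ToyXAODObject:
    """Stand-in for an xAOD object: core properties set by reconstruction,
    plus 'decorations' attached later at the analysis stage."""

    def __init__(self, **props):
        self.props = dict(props)       # fixed content from reconstruction
        self.decorations = {}          # user-added, analysis-level content

    def decorate(self, name, value):
        self.decorations[name] = value


def slim(obj, keep):
    """Write out only the selected core properties, plus all decorations."""
    out = {k: v for k, v in obj.props.items() if k in keep}
    out.update(obj.decorations)
    return out


el = ToyXAODObject(pt=42.0, eta=1.1, phi=0.3, cluster_e=44.5)
el.decorate("myBDTScore", 0.87)        # analysis-stage decoration
slimmed = slim(el, keep={"pt", "eta"})
# slimmed keeps pt, eta and the decoration, but drops phi and cluster_e
```

The point is that the persistent content of an object is configurable per output file, while users can still attach and save their own variables without changing the reconstruction format.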
New Analysis Flow Model
● The Run2 analysis model centralises the intermediate-level formats ("derivations") and data-handling tools, which in Run1 were handled by users
■ Non-optimal, especially for cross-team analyses
● Provides analysis-ready, ROOT-readable reduced data formats ("DxAOD")
● Each derivation is defined by a single set of Athena jobOptions, defined by physics and/or performance groups
● A key part of the derivation framework is the concept of train production, where a single job produces a number of independent output formats from a single input file
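The train production concept can be sketched as a single pass over the input that feeds several independent "carriages". The selection functions and DxAOD names below are invented for the sketch.

```python
def derivation_train(events, derivations):
    """One pass over the input; every 'carriage' applies its own
    selection and collects its own output list."""
    outputs = {name: [] for name in derivations}
    for evt in events:                         # the input is read only once
        for name, select in derivations.items():
            if select(evt):
                outputs[name].append(evt)
    return outputs


events = [{"n_jets": 3, "n_muons": 1},
          {"n_jets": 5, "n_muons": 0},
          {"n_jets": 6, "n_muons": 2}]
out = derivation_train(events, {
    "DxAOD_A": lambda e: e["n_jets"] >= 4,     # selections are illustrative
    "DxAOD_B": lambda e: e["n_muons"] >= 1,
})
```

The design saves I/O: N derivations cost one read of the input instead of N separate skimming jobs each re-reading the same file.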
AthenaMP and multi-core jobs
● AthenaMP is an event-parallel version of Athena:
■ It runs serially through the configuration and initialisation phases of the job
■ It then uses a Unix fork to create a task farm of sub-processes, the workers
! which actually process all the events assigned to the given job
■ A separate process merges the output files produced by each independent worker
■ Used in 2014 to run the new G4 simulation
■ Under test for reconstruction
● Several multi-core queues are enabled at ATLAS Grid sites
■ Good CPU and memory usage efficiency
● The infrastructure is also usable on Cloud and HPC facilities
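The fork-then-merge pattern described above can be sketched in plain Python. This is a stand-in, not Athena code: the worker file naming and the "event processing" (writing numbers to a file) are invented for the sketch.

```python
import os

def run_task_farm(events, n_workers=2):
    """Mimic the AthenaMP pattern: serial initialisation, fork() a farm
    of workers that each process their share of events into their own
    output file, then a separate merge step combines the outputs."""
    # --- serial configuration/initialisation phase happens here ---
    pids = []
    for w in range(n_workers):
        pid = os.fork()
        if pid == 0:                           # child: one worker
            with open(f"worker_{w}.out", "w") as f:
                for e in events[w::n_workers]: # this worker's event share
                    f.write(f"{e}\n")          # stand-in for event processing
            os._exit(0)
        pids.append(pid)
    for pid in pids:                           # parent waits for the farm
        os.waitpid(pid, 0)
    merged = []                                # separate merge step
    for w in range(n_workers):
        with open(f"worker_{w}.out") as f:
            merged += [int(line) for line in f]
        os.remove(f"worker_{w}.out")
    return sorted(merged)
```

Forking after initialisation lets the workers share the memory pages of the configured framework via copy-on-write, which is what makes the memory usage per core acceptable.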
[Plot: 8-core jobs running in March-May 2014, up to ~5k]
Rucio: a new DDM implementation
● The current Distributed Data Management (DDM) implementation (DQ2) successfully served ATLAS data during Run-1:
■ 160 PB
■ 640M files
■ 130 grid sites
■ 1000s of users
● But DQ2 will not scale for Run-2:
■ Heavy operational burden
■ Difficult to add new features and technologies
■ Many lessons learned during Run-1
● Rucio is the new DDM implementation for LHC Run2:
■ Better handling of users, groups and activities
■ Data discovery based on name and metadata
■ No dependence on an external file catalog (deterministic relation LFN -> PFN)
■ Supports multiple data management protocols in addition to SRM, e.g. WebDAV, xrootd, S3, POSIX, GridFTP
■ Smarter and more automated data placement tools (rules, subscriptions)
● More information in the dedicated talk tomorrow (C. Serfon)
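The deterministic LFN -> PFN relation can be illustrated with a hash-based directory layout in the spirit of Rucio's approach. The storage prefix and the exact path convention below are assumptions made for the sketch, not the authoritative Rucio algorithm.

```python
import hashlib

def lfn_to_pfn(scope, name, prefix="root://storage.example//rucio"):
    """Derive the physical path purely from scope and file name, so no
    external file catalog lookup is ever needed: any client can compute
    the same path independently. Prefix and layout are illustrative."""
    h = hashlib.md5(f"{scope}:{name}".encode()).hexdigest()
    # two hash-derived directory levels spread files evenly on disk
    return f"{prefix}/{scope}/{h[0:2]}/{h[2:4]}/{name}"
```

Because the mapping is a pure function of (scope, name), dropping the central file catalog removes both a scaling bottleneck and a single point of failure.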
Xrootd Data Federation
● The data federation is needed to allow remote access to data in case of:
■ Unavailability of a given file in the local storage element
■ Sparse access to single events
● FAX is the ATLAS implementation of an xrootd-based data federation
■ 2 top-level redirectors: EU and US
! Topology as in the diagram
■ Coverage so far: 56% of sites and 85% of data
● Failover works stably:
■ Tested that all the sites deliver data efficiently
■ Test tasks are submitted to sites that don't have the data, so that FAX is invoked
■ Very satisfied with the error rate: 0.3% of jobs fail due to FAX issues
● Overflow works; it needs larger usage and tuning of the weights
● Helps make xAOD access efficient over the WAN
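The failover behaviour can be sketched as follows. The redirector hostnames are placeholders, and the `local_open`/`remote_open` callables stand in for real xrootd opens; this is the control flow, not the FAX client.

```python
def open_with_failover(lfn, local_open, remote_open,
                       redirectors=("root://fax-eu.example",
                                    "root://fax-us.example")):
    """Sketch of federated read failover: try the local storage element
    first; if that fails, ask each top-level redirector in turn to
    locate a remote replica."""
    try:
        return local_open(lfn)               # normal case: local replica
    except IOError:
        pass                                 # missing or unreachable locally
    for redirector in redirectors:
        try:
            return remote_open(f"{redirector}/{lfn}")
        except IOError:
            continue                         # try the next redirector
    raise IOError(f"no replica of {lfn} reachable")
```

The same logic serves both use cases on the slide: transparent recovery from a missing local file, and deliberate overflow of jobs to sites that do not hold the data.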
ProdSys2+PanDA
● We developed and are now commissioning the new distributed production framework, ProdSys2
● Core components:
■ Request I/F: allows production managers to define a request
■ DEfT: translates the user request into task definitions
■ JEDI: generates the job definitions
■ PanDA: executes the jobs on the distributed infrastructure
● JEDI+PanDA also provides the new framework for distributed analysis
● More information in the dedicated talk tomorrow (K. De)
EventIndex & Event Service
● The EventIndex is a catalogue of all ATLAS events
■ In any state of processing or format
■ Contains simple event info, plus references to files and internal pointers to events
● Implemented as 3 major components:
■ Data collection and transfer system
■ Core storage in Hadoop technology
■ Web server for data access
● More information in the dedicated talk tomorrow (A. Fernandez)
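A minimal sketch of what such a catalogue provides: a lookup from event identity to file reference and internal pointer. The class and field names are invented; the real system keeps this in Hadoop behind a web server, not an in-memory dict.

```python
class ToyEventIndex:
    """Stand-in for the EventIndex: maps (run, event) to the GUID of a
    file containing that event plus an internal pointer within it."""

    def __init__(self):
        self._index = {}

    def add(self, run, event, guid, pointer):
        """Record where one processed copy of an event lives."""
        self._index[(run, event)] = {"guid": guid, "pointer": pointer}

    def lookup(self, run, event):
        """Return the file reference for an event, or None if unknown."""
        return self._index.get((run, event))
```

A catalogue of this shape is what makes single-event retrieval possible: given a run/event number, a client can find the exact file and position without scanning datasets.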
● The Event Service is a novel way to distribute payload to workers in different computing environments
■ Clouds, HPCs, ATLAS@Home…
● Uses AthenaMP, remote I/O (FAX) and the EventIndex, together with JEDI+PanDA
■ Makes efficient use of opportunistic computing resources
● Now under commissioning:
■ All the bits and pieces are there
■ Tests are ongoing
Resource diversification: HLT farm
● Sim@P1: simulation jobs run on virtual machines in the HLT nodes
● Implementation based on OpenStack and CernVM
● ~20 minutes are needed to launch VMs for the entire HLT farm
● The VMs host up to 20k Condor job slots that are served by PanDA
● Condor job slots are automatically discovered; no manual action is needed to fill P1 job slots
● Condor works well with opportunistic computing
● VMs can be killed within 10 minutes if a return to HLT operations is needed
● Killed jobs are retried elsewhere by the PanDA infrastructure
● Disk I/O and memory considerations limit Sim@P1 to MC generation of hits (Geant4)
■ This may no longer be true with an updated network

[Plot: Sim@P1 production at CERN, July '13 – June '14, up to ~30k running jobs]
Resource diversification: HPC
● HPC vs Grid:
■ HPC: High Performance Computing
! Leadership Computing Facilities
! Massively Parallel Processors
! Clustered machines
■ Grid: High Throughput Computing
! Homogeneous processors, Linux OS
! Variety of technologies and middleware
! ATLAS is extremely successful using the Grid
● ATLAS has members with access to HPC machines
■ Already successfully used in ATLAS as part of NorduGrid and in the US and Germany
● HPC nodes have little outside connectivity and no local installation possibilities
■ Need a non-invasive interface like the ARC-CE (or similar) and a way to connect to CVMFS (for software) and Frontier (for database access) through Squids
● Many elements of the HEP software stack (Geant4, ROOT, Alpgen, Sherpa…) have been made to run on many different HPCs
● There is strong interest in and support for the ATLAS HPC activity. We have been awarded in excess of 63 million CPU-hours over the next 12 months.
■ This is ~6% of our Grid use and half of our event generation budget
● All the approaches will present a common interface to PanDA, even if, for reasons of architecture or policy, the back ends look very different
[Diagram: ARC-CE use with HPC]
Resource diversification: ATLAS@Home
● Volunteer computing using BOINC
■ SETI@Home, Einstein@Home, LHC@Home… now ATLAS@Home!
● We set up a test server with an ARC-CE and a BOINC server running the ATLAS@Home app
■ BOINC PanDA queue with very-low-priority MC simulation jobs
■ 10 events/job, up to 1000 running jobs now
■ Average: 6000 events/day (0.2% of the total Grid)
● Near-future plans:
■ Merge with the existing CERN-IT infrastructure
■ Outreach campaign: get volunteers
■ Potential solution for Tier-3/institute desktop clusters
● To participate:
■ Install the BOINC client and VirtualBox
■ Register for ATLAS@Home and connect the client
■ The client downloads the CernVM image (once, ~500 MB) and the input files (~50 MB per job) and runs the job in VirtualBox

[Diagram: ATLAS@Home architecture: PanDA Server -> ARC Control Tower -> ARC CE (session directory, BOINC LRMS plugin, proxy cert, DB) -> BOINC server -> volunteer PC running the BOINC client with a VM and a shared directory; access to Grid catalogs and storage]
[Plot: ATLAS@Home running jobs, June '14, up to ~1k]
Test of new features: DC14
● The overall goal of the data challenge is to get ATLAS ready for Run2 physics.
● To achieve this we need to:
■ Commission the Integrated Simulation Framework (ISF) in the context of physics analyses
■ Run a large-scale test of the updated reconstruction algorithms
■ Run a large-scale test of the upgraded distributed computing environment
■ Test the Run2 analysis model
■ Gain experience with the Run2 analysis framework
● This programme is broken down into technical components:
■ Partial reprocessing of Run1 data (for the analysis challenge)
■ Production of new MC events with 2015 geometry and run conditions
■ Reconstruction and distribution of the produced data, including cosmics from "M" runs
■ Data analysis challenge
● The bulk of this programme runs in the second half of 2014 – now!

[Figure: draft DC14 schedule]
Outlook
● At the end of 2012 we defined an ambitious plan for improvements to the software and computing infrastructure and tools
● It's all coming together about now:
■ New simulation framework, improved reconstruction algorithms, faster tools
■ New workload and data management systems
■ New operation model for analysis and for distributed computing
● Data Challenge DC14 is testing all components of this improved system
● We'll be ready to take new LHC data in 2015