
Page 1: ATLAS Distributed Computing

ATLAS Distributed Computing


Kors Bos, Annecy, 18 May 2009

Page 2: ATLAS Distributed Computing

ATLAS Workflows

[Workflow diagram: the Tier-0 (CASTOR, CAF) runs prompt reconstruction, calibration & alignment and express-stream analysis; the Tier-1s run RAW re-processing and HITS reconstruction; the Tier-2s (several per cloud) run simulation and analysis.]


Page 3: ATLAS Distributed Computing

At the Tier-0

RAW, data from the detector: 1.6 MB/event
ESD, Event Summary Data: 1.0 MB/event
AOD, Analysis Object Data: 0.2 MB/event
DPD, Derived Physics Data: 0.2 MB/event
TAG, data tag: 0.01 MB/event
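A minimal sketch, assuming the nominal 200 Hz trigger rate quoted on the RAW-merging slide, of the steady output rates these per-event sizes imply:

```python
# Steady Tier-0 output rates implied by the per-event sizes above,
# assuming a constant 200 Hz trigger rate (figure from the RAW-merging
# slide; continuous running at that rate is an assumption).
EVENT_SIZE_MB = {
    "RAW": 1.6,   # MB per event
    "ESD": 1.0,
    "AOD": 0.2,
    "DPD": 0.2,
    "TAG": 0.01,
}
TRIGGER_RATE_HZ = 200

for fmt, size_mb in EVENT_SIZE_MB.items():
    rate_mb_s = size_mb * TRIGGER_RATE_HZ
    per_day_tb = rate_mb_s * 86400 / 1e6
    print(f"{fmt}: {rate_mb_s:6.1f} MB/s  (~{per_day_tb:4.1f} TB/day)")
```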

Page 4: ATLAS Distributed Computing

Reality is more complicated


Page 5: ATLAS Distributed Computing

From the detector

Data Streams

Physics streams
• egamma
• muon
• Jet
• Etmiss
• tau
• Bphys
• minBias

Calibration streams
• Inner Detector Calibration Stream
  – Contains only partial events
• Muon Calibration Stream
  – Contains only partial events
  – Analyzed outside CERN
• Express line
  – Full events, 10% of the data

Runs and RAW Merging
• A start/stop falls between two luminosity blocks; a luminosity block is ~30 seconds and gives one file per stream
• All files of a run go into one dataset
• 200 Hz for 30 s is 6000 events, but split between ~10 streams
• Streams are unequal and some create too-small files
• Small RAW files are merged into 2 GB files
• Only merged files are written to tape and exported
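The arithmetic behind the merging step, as a sketch using the numbers above; the equal split across streams is assumed only to make the point, since the slide itself stresses that streams are unequal:

```python
# Why small RAW files appear and why they are merged, using the numbers
# above: 200 Hz, ~30 s luminosity blocks, ~10 streams, 1.6 MB/event RAW
# and 2 GB merged files. An equal split across streams is assumed only
# for illustration.
RATE_HZ = 200
LB_SECONDS = 30
N_STREAMS = 10
RAW_MB_PER_EVENT = 1.6

events_per_lb = RATE_HZ * LB_SECONDS                 # ~6000 events per block
per_stream_mb = events_per_lb / N_STREAMS * RAW_MB_PER_EVENT
print(f"events per luminosity block:      {events_per_lb}")
print(f"per-stream file if split equally: ~{per_stream_mb:.0f} MB")

# A stream taking only a few percent of the rate produces files of a few
# tens of MB, hence the merging into ~2 GB files before tape and export.
small_stream_mb = 0.02 * events_per_lb * RAW_MB_PER_EVENT
print(f"file from a 2% stream:            ~{small_stream_mb:.0f} MB")
```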


Page 6: ATLAS Distributed Computing

Calibration and Alignment Facility CAF

Per run:
• Express line used for real-time processing
  – Initial calibration used
  – Verified by DQ shifters
• Calibration data processed in the CAF
  – Initial calibrations used
  – New calibrations go into the offline DB
• Express line processed again
  – New calibrations used
  – Verified by DQ shifters
  – If necessary, fixes applied
• Express line processed again if necessary
  – Buffer for several days of data
• Reconstruction of all data triggered
  – Results archived on tape, and
  – Made available at CERN, and
  – Replicated to other clouds
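A minimal sketch of this per-run loop; all function names and the run number are hypothetical placeholders, not ATLAS software:

```python
# Sketch of the per-run express-stream calibration loop described above.
# Every name here is a hypothetical placeholder, not ATLAS software.

def process_express(run, calib):
    """Stand-in for reconstructing the express stream of one run."""
    print(f"run {run}: express pass with '{calib}' calibration")

def dq_ok(calib):
    """Stand-in for the data-quality shifters' verdict."""
    return calib == "new"

def calibrate_and_reconstruct(run):
    calib = "initial"
    process_express(run, calib)      # real-time pass, initial constants
    if not dq_ok(calib):
        # calibration data processed in the CAF, new constants derived
        calib = "new"
        print(f"run {run}: new constants written to the offline DB")
        process_express(run, calib)  # express stream processed again
        assert dq_ok(calib)          # verified by DQ shifters; fixes and a
                                     # further express pass if necessary
    print(f"run {run}: bulk reconstruction triggered; results archived to "
          "tape, available at CERN and replicated to the other clouds")

calibrate_and_reconstruct(12345)
```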


Page 7: ATLAS Distributed Computing

ATLAS Clouds

Cloud          Tier-1     Share [%]   Tier-2s [#]
Asia-Pacific   ASGC           5            0
US             BNL           25            6
Italy          CNAF           5            4
German         FZK           10           11
French         CCIN2P3       15           11
Nordic         NDGF           5            2
Iberian        PIC            5            5
UK             RAL           10           13
Dutch          SARA          15            9
Canadian       TRIUMF         5            4
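The shares sum to 100%. As an illustration (not on the slide), apportioning a nominal 320 MB/s RAW export rate, i.e. 200 Hz x 1.6 MB/event from the earlier slides, by these shares gives:

```python
# Tier-1 shares from the table above; check that they sum to 100% and
# apportion a nominal RAW export rate by share. The 320 MB/s figure is
# 200 Hz x 1.6 MB/event from the earlier slides, assumed steady.
SHARES = {  # Tier-1: share in %
    "ASGC": 5, "BNL": 25, "CNAF": 5, "FZK": 10, "CCIN2P3": 15,
    "NDGF": 5, "PIC": 5, "RAL": 10, "SARA": 15, "TRIUMF": 5,
}
assert sum(SHARES.values()) == 100

RAW_EXPORT_MB_S = 320
for site, share in sorted(SHARES.items(), key=lambda kv: -kv[1]):
    print(f"{site:8s} {share:3d}%  ~{RAW_EXPORT_MB_S * share / 100:5.1f} MB/s of RAW")
```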

Page 8: ATLAS Distributed Computing

French Tier-2

Page 9: ATLAS Distributed Computing

Activity areas

1. Detector data distribution
2. Detector data re-processing (in the Tier-1s)
3. MC simulation production (in the Tier-2s)
4. User analysis (in the Tier-2s)


Page 10: ATLAS Distributed Computing

STEP09


A functional and performance test for all 4 experiments simultaneously

ATLAS

Page 11: ATLAS Distributed Computing

What we would like to test

• Full computing model
• Tape writing and reading simultaneously in the Tier-1s and the Tier-0
• Processing priorities and shares in the Tier-1s and Tier-2s
• Monitoring of all those activities
• Simultaneously with the other experiments (to test shares)
• All at nominal rates for 2 weeks: June 1-14
• Full shift schedule in place, as for cosmics data taking
• As little disruption as possible for detector commissioning


Page 12: ATLAS Distributed Computing

Activity areas

1. Detector data distribution
2. Detector data re-processing (in the Tier-1s)
3. MC simulation production (in the Tier-2s)
4. User analysis (in the Tier-2s)


Page 13: ATLAS Distributed Computing

Detector Data Distribution

Page 14: ATLAS Distributed Computing

The Common Computing Readiness Challenge of last year

[Plot: T0 -> T1s throughput in MB/s. Subscriptions were injected every 4 hours and immediately honored; a 12 h backlog was fully recovered in 30 minutes; all experiments were in the game.]

Page 15: ATLAS Distributed Computing

Tier-0 -> Tier-1 rates and volumes

Page 16: ATLAS Distributed Computing

Activity areas

1. Detector data distribution
2. Detector data re-processing (in the Tier-1s)
3. MC simulation production (in the Tier-2s)
4. User analysis (in the Tier-2s)


Page 17: ATLAS Distributed Computing

2. Detector data re-processing (in the Tier-1s)

• Each Tier-1 is responsible for re-processing its share
• Pre-stage RAW data back from tape to disk
• Re-run reconstruction (on average ~30 s per event)
• Output ESD, AOD and DPD archived to tape
• Copy AOD and DPD to all other 9 Tier-1s
• Distribute AOD and DPD over the Tier-2s of 'this' cloud
• Copy ESD to 1 other (sister) Tier-1
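A per-event bookkeeping sketch for these steps, using the nominal per-event sizes from the Tier-0 slide (real-data values, not the smaller cosmics ESD mentioned later):

```python
# Per re-processed event, using ESD 1.0 MB, AOD 0.2 MB, DPD 0.2 MB from
# the Tier-0 slide and the replication policy listed above. A sketch,
# not an official ATLAS calculation.
ESD_MB, AOD_MB, DPD_MB = 1.0, 0.2, 0.2
N_OTHER_TIER1 = 9   # AOD and DPD copied to all other 9 Tier-1s
N_SISTER = 1        # ESD copied to 1 other (sister) Tier-1

archived_mb = ESD_MB + AOD_MB + DPD_MB
exported_mb = N_OTHER_TIER1 * (AOD_MB + DPD_MB) + N_SISTER * ESD_MB

print(f"archived to local tape per event:    {archived_mb:.1f} MB")
print(f"exported to other Tier-1s per event: {exported_mb:.1f} MB")
# The distribution to the Tier-2s of the cloud comes on top of this.
```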


Page 18: ATLAS Distributed Computing

Re-processing work flow

Here mAOD and mDPD mean merged AOD and DPD files.

Page 19: ATLAS Distributed Computing

Spring09 re-Processing Campaign

• Total input data (RAW)
  – 138 runs, 852 containers, 334,191 files, 520 TB
  – https://twiki.cern.ch/twiki/pub/Atlas/DataPreparationReprocessing/reproMarch09_inputnew.txt
• Total output data (ESD, AOD, DPD, TAG, NTUP, etc.)
  – 12,339 containers, 1,847,149 files, 133 TB
  – Compare with last time's 116.8 TB; the increase is due to extra runs, DPD formats, etc.

Page 20: ATLAS Distributed Computing

Simplified re-processing for STEP09

• The Spring09 campaign is too complicated
• Simplify by just running RAW -> ESD
  – Using Jumbo tasks
• RAW staged from tape
• ESD archived back onto tape
  – Volume is smaller than with real data
• Increase the Data Distribution functional test (FT)
  – To match the missing AOD/DPD traffic

Page 21: ATLAS Distributed Computing

Re-Processing targets

• Re-processing at 5x the rate of nominal data taking
• Be aware: ESD is much smaller for cosmics than for collision data
  – ESD file size 140 MB instead of 1 GB
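A back-of-the-envelope estimate of what the 5x target implies in CPU, combining it with the ~30 s/event reconstruction time and the 200 Hz nominal rate from earlier slides; staging, merging and job inefficiencies are ignored, so treat it as an order of magnitude only:

```python
# CPU needed to re-process at 5x the nominal data-taking rate, using
# 200 Hz and ~30 s/event from earlier slides. Order of magnitude only:
# tape staging, merging and job inefficiencies are ignored.
NOMINAL_RATE_HZ = 200
SPEEDUP = 5
SECONDS_PER_EVENT = 30

busy_cores = NOMINAL_RATE_HZ * SPEEDUP * SECONDS_PER_EVENT
print(f"cores busy across all Tier-1s: ~{busy_cores}")
print(f"cores for a 15% Tier-1 (e.g. CCIN2P3): ~{busy_cores * 0.15:.0f}")
```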

Page 22: ATLAS Distributed Computing

Tier-1 -> Tier-1 Volumes and Rates

• Re-processed data is distributed like the original data from the Tier-0
  – ESD to 1 partner Tier-1
  – AOD and DPD to all other 9 Tier-1s (and CERN)
• ... and further to the Tier-2s
• AOD and DPD load simulated through the DDM FT

Page 23: ATLAS Distributed Computing

Tier-1 -> Tier-2 Volumes and rates

• The Computing Model foresaw 1 copy of AOD+DPD per cloud
• Tier-2 sites vary hugely in size and many clouds export more than 1 copy

Page 24: ATLAS Distributed Computing

Activity areas

1. Detector data distribution
2. Detector data re-processing (in the Tier-1s)
3. MC simulation production (in the Tier-2s)
4. User analysis (in the Tier-2s)


Page 25: ATLAS Distributed Computing

G4 Monte Carlo Simulation Production

EVNT = 0.02 MB/event
HITS = 2.0 MB/event
RDO = 2.0 MB/event
ESD = 1.0 MB/event
AOD = 0.2 MB/event
TAG = 0.01 MB/event

G4 simulation takes ~1000 s/event; digi+reco takes ~20-40 s/event
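What these timings imply for Tier-2 throughput; the 1000-core farm is only an illustrative unit, not a statement about any site:

```python
# Simulation throughput per 1000 cores, using ~1000 s/event for G4 and
# 2.0 MB/event for HITS from this slide. The 1000-core unit is purely
# illustrative.
G4_S_PER_EVENT = 1000
DIGI_RECO_S_PER_EVENT = 30   # middle of the 20-40 s range
HITS_MB_PER_EVENT = 2.0
CORES = 1000

events_per_day = CORES * 86400 / G4_S_PER_EVENT
hits_tb_per_day = events_per_day * HITS_MB_PER_EVENT / 1e6
print(f"G4 on {CORES} cores: ~{events_per_day:.0f} events/day, "
      f"~{hits_tb_per_day:.2f} TB/day of HITS uploaded to the Tier-1")
print(f"digi+reco of the same events needs only "
      f"~{events_per_day * DIGI_RECO_S_PER_EVENT / 86400:.0f} cores")
```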

Page 26: ATLAS Distributed Computing

MC Simulation Production statistics
• Only limited by requests and disk space

Page 27: ATLAS Distributed Computing

G4 Simulation Volumes
• MC09 should have started during STEP09
• Exclusively run on Tier-2 resources
  – The rate will be lower because of the other activities
  – Small HITS files produced in the Tier-2s are uploaded to the Tier-1
  – Merged into Jumbo HITS and written to tape in the Tier-1
• Merged MC08 data from tape will be used for reconstruction
  – AOD (and some ESD) written back to tape and distributed like data

Page 28: ATLAS Distributed Computing

Activity areas

1. Detector data distribution
2. Detector data re-processing (in the Tier-1s)
3. MC simulation production (in the Tier-2s)
4. User analysis (in the Tier-2s)


Page 29: ATLAS Distributed Computing

4. User analysis

• Mainly done in the Tier-2s
• 50% of the capacity should be reserved for user analysis
  – We already see at least 30% analysis activity at some sites
• In addition, some Tier-1 sites have analysis facilities
  – Must make sure this does not disrupt scheduled Tier-1 activities
• We will also use the HammerCloud analysis test framework
  – Contains 4 different AOD analyses
  – Can generate a constant flow of jobs
  – Uses both WMS and PanDA back-ends in EGEE
• Tier-2s should install the following shares:


Page 30: ATLAS Distributed Computing

Putting it all together: Tier-1 Volumes and Rates for STEP09

• For CCIN2P3:
  – ~10 TB for MCDISK, ~200 TB for DATADISK and ~55 TB on tape
  – ~166 MB/s data in and ~265 MB/s data out (!?)
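A rough consistency check (not on the slide): the volumes these rates integrate to over the two-week STEP09 window of June 1-14:

```python
# Volume the quoted CCIN2P3 rates integrate to if sustained over the
# full two-week STEP09 window (June 1-14). A consistency check only.
IN_MB_S, OUT_MB_S = 166, 265
SECONDS = 14 * 86400

print(f"imported over 14 days: ~{IN_MB_S * SECONDS / 1e6:.0f} TB")
print(f"exported over 14 days: ~{OUT_MB_S * SECONDS / 1e6:.0f} TB")
# For comparison, the DATADISK figure quoted above is ~200 TB.
```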

Page 31: ATLAS Distributed Computing

Tape Usage

For CCIN2P3
• Reading: 143 MB/s
  – RAW for re-processing
  – Jumbo HITS for reconstruction
• Writing: 44 MB/s
  – RAW from the Tier-0
  – (Merged) HITS from the Tier-2s
  – Output from re-processing (ESD, AOD, DPD, ...)
  – Output from reconstruction (RDO, AOD, ...)
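A very rough translation of these rates into tape drives; the per-drive throughput and efficiency factor are assumptions for illustration, since the slides do not specify the CCIN2P3 tape hardware:

```python
# Tape drives implied by the rates above. DRIVE_MB_S and EFFICIENCY are
# assumed figures; the slides say nothing about the actual hardware.
import math

READ_MB_S, WRITE_MB_S = 143, 44
DRIVE_MB_S = 80      # assumed sustained rate per drive
EFFICIENCY = 0.5     # assumed overhead for mounts and positioning

drives = math.ceil((READ_MB_S + WRITE_MB_S) / (DRIVE_MB_S * EFFICIENCY))
print(f"on the order of {drives} drives kept busy for ATLAS")
```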

Page 32: ATLAS Distributed Computing

What we ask of you ...

• That the sites verify the numbers and the dates
  – We know that CCIN2P3 cannot do automatic pre-staging
  – Is sufficient cooling foreseen for the beginning of June?
  – What is the capacity of the Tier-2s in the French cloud?
• That there is a person (or persons) keeping watch
  – Overloading of systems, slowdowns, errors, ...
  – Saturation of bandwidth
  – At least 1 person per site, Tier-1 and Tier-2
  – We would like to collect names
• Help with gathering information for the final report
  – Post mortem July 9-10

Page 33: ATLAS Distributed Computing

What we offer you

• A twiki with detailed information
• The ATLAS meeting (also by phone) at 09:00
• The WLCG meeting (also by phone) at 15:00
• The operations meeting (also by phone) on Thursdays at 15:30
• The virtual Skype meeting (24/7)
• Several mailing lists
• Private email addresses
• Phone numbers
• Our goodwill

Page 34: ATLAS Distributed Computing

The End


Page 35: ATLAS Distributed Computing

[Diagram: "Data Handling and Computation for Physics Analysis": detector -> event filter (selection & reconstruction) -> raw data -> reconstruction -> event summary data; event reprocessing and event simulation feed the same chain; batch and interactive physics analysis use the processed data and the analysis objects (extracted by physics topic); the steps are mapped onto Tier-0, Tier-1 and Tier-2.]

Page 36: ATLAS Distributed Computing

Storage Areas @CERN

[Diagram: CASTOR disk pools and space tokens at CERN: t0atlas and t0merge feeding the Tier-0 CPUs and tape; atlcal receiving the calibration data and express stream for the CAF CPUs; atldata and atlprod holding detector data, re-processing data and MC data; group and scratch areas and afs for physics-group and end-user analysis (managers space versus users space); AOD and DPD replicas; 'default' and 'User' space tokens; exports to the Tier-1s.]

Page 37: ATLAS Distributed Computing

T1 Space Token Summary

Page 38: ATLAS Distributed Computing

T2 Space Token Summary