TRANSCRIPT
Claudio Grandi INFN Bologna ACAT'03 - KEK 3-Dec-2003
CMS Distributed Data Analysis Challenges
Claudio Grandi, on behalf of the CMS Collaboration
Outline
CMS Computing Environment
CMS Computing Milestones
OCTOPUS: CMS Production System
2002 Data productions
2003 Pre-Challenge production (PCP03)
PCP03 on grid
2004 Data Challenge (DC04)
Summary
CMS Computing Environment
CMS computing context
• LHC will produce 40 million bunch crossings per second in the CMS detector (1000 TB/s)
• The on-line system will reduce the rate to 100 events per second (100 MB/s raw data)
  – Level-1 trigger: hardware
  – High-level trigger: on-line farm
• Raw data (1 MB/evt) will be:
  – archived on persistent storage (~1 PB/year)
  – reconstructed to DST (~0.5 MB/evt) and AOD (~20 kB/evt)
• Reconstructed data (and part of the raw data) will be:
  – distributed to computing centers of collaborating institutes
  – analyzed by physicists at their own institutes
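The rates quoted above are self-consistent, and checking them is simple arithmetic. A quick sketch (all figures come from the slide; the ~25 MB per bunch crossing is inferred from the quoted 1000 TB/s at 40 MHz):

```python
# Back-of-envelope check of the CMS data rates quoted above.

bunch_crossings_per_s = 40e6   # 40 MHz LHC crossing rate
raw_event_size_mb = 1.0        # ~1 MB per raw event
hlt_output_rate_hz = 100       # rate after Level-1 + High Level Trigger

# Detector output before any trigger: ~25 MB/crossing gives ~1000 TB/s
detector_rate_tb_s = bunch_crossings_per_s * 25 / 1e6
print(f"pre-trigger rate: ~{detector_rate_tb_s:.0f} TB/s")

# After the trigger chain: 100 events/s x 1 MB = 100 MB/s to storage
storage_rate_mb_s = hlt_output_rate_hz * raw_event_size_mb
print(f"raw data to storage: {storage_rate_mb_s:.0f} MB/s")

# Over a nominal 10^7 s running year: ~1 PB of raw data
raw_pb_per_year = storage_rate_mb_s * 1e7 / 1e9
print(f"raw data per year: ~{raw_pb_per_year:.0f} PB")
```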
CMS Data Production at LHC
[Diagram: the Level-1 Trigger reduces 40 MHz (1000 TB/s) to 75 kHz (50 GB/s); the High Level Trigger reduces this to 100 Hz (100 MB/s) for Data Recording & Offline Analysis]
CMS Distributed Computing Model
[Diagram: the Online System at the Experiment (~PB/s off the detector) feeds the CERN Center (Tier 0+1; PBs of disk, tape robot) at ~100-1500 MB/s; Tier-1 centers (FNAL, IN2P3, INFN, RAL) connect at 2.5-10 Gbps; Tier-2 centers at ~2.5-10 Gbps; institute Tier-3 sites and Tier-4 workstations at 0.1 to 10 Gbps, with a physics data cache]
CMS software for Data Simulation
• Event Generation
  – Pythia and other generators
    • Generally Fortran programs. Produce N-tuple files (HEPEVT format)
• Detector simulation
  – CMSIM (uses GEANT-3)
    • Fortran program. Produces formatted Zebra (FZ) files from N-tuples
  – OSCAR (uses GEANT-4 and the CMS COBRA framework)
    • C++ program. Produces POOL files (hits) from N-tuples
• Digitization (DAQ simulation)
  – ORCA (uses the CMS COBRA framework)
    • C++ program. Produces POOL files (digis) from hits POOL files or FZ files
• Trigger simulation
  – ORCA
    • Reads digis POOL files
    • Normally run as part of the reconstruction phase
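The two simulation paths above (the Fortran/GEANT-3 chain via CMSIM vs the C++/GEANT-4 chain via OSCAR) can be modelled as a small data structure. The tool and format names come from the slides; the `Stage` class and the chain-selection logic are purely illustrative:

```python
# Illustrative model of the two CMS simulation paths described above.
from dataclasses import dataclass

@dataclass
class Stage:
    tool: str
    reads: str
    writes: str

def simulation_chain(use_geant4: bool):
    """Return the generation -> simulation -> digitization stages."""
    gen = Stage("Pythia", "-", "HEPEVT N-tuples")
    if use_geant4:
        sim = Stage("OSCAR (GEANT-4/COBRA)", "HEPEVT N-tuples", "POOL hits")
    else:
        sim = Stage("CMSIM (GEANT-3)", "HEPEVT N-tuples", "Zebra FZ hits")
    # ORCA digitization consumes whatever the simulation step wrote
    digi = Stage("ORCA/COBRA digitization", sim.writes, "POOL digis")
    return [gen, sim, digi]

for s in simulation_chain(use_geant4=True):
    print(f"{s.tool}: {s.reads} -> {s.writes}")
```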
CMS software for Data Analysis
• Reconstruction
  – ORCA
    • Produces POOL files (DST and AOD) from hits or digis POOL files
• Analysis
  – ORCA
    • Reads POOL files in (hits, digis,) DST and AOD formats
  – IGUANA (uses ORCA and OSCAR as back-end)
    • Visualization program (event display, statistical analysis)
CMS software: ORCA & C.
[Diagram: Data Simulation and Data Analysis chains. Pythia and other generators produce HEPEVT N-tuples; CMSIM (GEANT3) produces Zebra files with hits, converted by the ORCA/COBRA hit formatter, while OSCAR/COBRA (GEANT4) writes directly to the hits database (POOL); ORCA/COBRA digitization, merging signal and pile-up, fills the digis database (POOL); ORCA reconstruction or user analysis produces a POOL database, N-tuples or Root files; IGUANA provides interactive analysis]
CMS Computing Milestones
CMS computing milestones
• DAQ TDR (Technical Design Report)
  – Spring-2002 Data Production
• Software Baselining
• Computing & Core Software TDR
  – 2003 Data Production (PCP04)
  – 2004 Data Challenge (DC04)
• Physics TDR
  – 2004/05 Data Production (DC05)
  – Data Analysis for Physics TDR
• "Readiness Review"
  – 2005 Data Production (PCP06)
  – 2006 Data Challenge (DC06)
• Commissioning
[Chart: CMS computing capacity in kSI95·months (log scale, 1 to 100,000), CERN and offsite, 2002-2009; average slope = x2.5/year; milestones marked: DAQ TDR, DC04/Physics TDR, DC05/LCG TDR, DC06/Readiness, LHC 2E33, LHC 1E34]

Size of CMS Data Challenges:
• 1999: 1 TB – 1 month – 1 person
• 2000-2001: 27 TB – 12 months – 30 persons
• 2002: 20 TB – 2 months – 30 persons
• 2003: 175 TB – 6 months – <30 persons
World-wide Distributed Productions
[Map legend: CMS Production Regional Centre / CMS Distributed Production Regional Centre]
CMS Computing Challenges
• CMS computing challenges include:
  – production of simulated data for studies on:
    • detector design
    • trigger and DAQ design and validation
    • physics system setup
  – definition and set-up of the analysis infrastructure
  – definition of the computing infrastructure
  – validation of the computing model
    • distributed system
    • increasing size and complexity
    • tied to other CMS activities
  – providing computing support for all CMS activities
OCTOPUS: CMS Production System
[Diagram: OCTOPUS Data Production System. A physics group asks for a new dataset in RefDB; the Production Manager defines assignments; a Site Manager starts an assignment via McRunjob + plug-ins (CMSProd). Jobs run either through shell scripts on a local batch manager and computer farm, through the LCG scheduler (JDL) with RLS/POOL, or through DAGMan (MOP) and the Chimera Virtual Data Catalogue with a planner (DPE). Dataset metadata live in RefDB, job metadata in the BOSS DB; data or info is pushed, info is pulled; data-level and job-level queries are supported]
Remote connections to databases
[Diagram: on the Worker Node, a job wrapper (job instrumentation) wraps the user job; a journal writer and remote updater handle job input/output and journal-catalog entries, either via a direct connection from the WN to the metadata DBs or through an asynchronous updater on the User Interface]
• The metadata DBs are RLS/POOL, RefDB and the BOSS DB
Job production
• MCRunJob
  – Modular: plug-ins exist for:
    • reading from RefDB
    • reading from a simple GUI
    • submitting to a local resource manager
    • submitting to DAGMan/Condor-G (MOP)
    • submitting to the EDG/LCG scheduler
    • producing derivations in the Chimera Virtual Data Catalogue
  – Runs on the user's (e.g. site manager's) host
  – Also defines the sandboxes needed by the job
  – If needed, the specific submission plug-in takes care of:
    • preparing the XML POOL catalogue with input-file information
    • moving the sandbox files to the worker nodes
• CMSProd
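MCRunJob's modular design, with one plug-in per submission back-end, can be sketched as a small registry. The back-end names mirror the list above; the registry API and the exact commands handed off are invented here for illustration (though `condor_submit_dag` and `edg-job-submit` are the real Condor-G/EDG entry points):

```python
# Minimal sketch of a plug-in registry in the spirit of MCRunJob.
SUBMITTERS = {}

def register(name):
    """Decorator that records a submission plug-in under a backend name."""
    def deco(fn):
        SUBMITTERS[name] = fn
        return fn
    return deco

@register("local")
def submit_local(job):
    return f"bsub {job}"                 # hand off to a local batch manager

@register("dagman")
def submit_dagman(job):
    return f"condor_submit_dag {job}"    # DAGMan/Condor-G via MOP

@register("edg")
def submit_edg(job):
    return f"edg-job-submit {job}.jdl"   # EDG/LCG resource broker

def submit(backend, job):
    """Dispatch a prepared job to the chosen back-end plug-in."""
    return SUBMITTERS[backend](job)

print(submit("edg", "cmsim_run42"))
```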
Job Metadata management
• Job parameters that represent the job's running status are stored in a dedicated database:
  – when did the job start?
  – is it finished?
  but also:
  – how many events did it produce so far?
• BOSS is a CMS-developed system that does this, extracting the information from the job's standard input/output/error streams
  – The remote updater is based on MySQL
  – Remote updaters are now being developed based on:
    • R-GMA (still has scalability problems)
    • Clarens (just started)
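The core BOSS idea, scanning the job's output stream for known patterns and turning matches into metadata updates, can be shown in a few lines. The regexes and metadata keys below are invented for illustration, not BOSS's actual filters:

```python
# Toy version of stream-based job-metadata extraction, BOSS-style.
import re

PATTERNS = {
    "events_done": re.compile(r"processed (\d+) events"),
    "finished":    re.compile(r"job (finished|completed)"),
}

def update_metadata(line, meta):
    """Scan one stdout line and update the job-metadata dict in place."""
    m = PATTERNS["events_done"].search(line)
    if m:
        meta["events_done"] = int(m.group(1))
    if PATTERNS["finished"].search(line):
        meta["finished"] = True

meta = {"events_done": 0, "finished": False}
for line in ["processed 100 events", "processed 500 events", "job finished"]:
    update_metadata(line, meta)
print(meta)   # {'events_done': 500, 'finished': True}
```

In the real system the extracted values are shipped to a MySQL database by the remote updater rather than printed.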
Dataset Metadata management
• Dataset metadata are stored in the RefDB:
  – which (logical) files is the dataset made of?
  but also:
  – what were the input parameters to the simulation program?
  – how many events have been produced so far?
• Information may be updated in the RefDB in many ways:
  – manual Site Manager operation
  – automatic e-mail from the job
  – remote updaters based on R-GMA and Clarens (similar to those developed for BOSS) will be developed
• Mapping of logical names to physical file names will be done on the grid by RLS/POOL
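That last point, mapping logical file names (LFNs) to physical replicas (PFNs), is the job of the RLS/POOL catalogue. A dictionary-based stand-in shows the lookup involved; the file names and sites here are made up:

```python
# Toy replica catalogue: one logical name, several physical replicas.
replica_catalog = {
    "lfn:higgs_digis_001": [
        "srm://castor.cern.ch/cms/higgs_digis_001",
        "srm://gridka.de/cms/higgs_digis_001",
    ],
}

def resolve(lfn, preferred_site=None):
    """Return a physical replica for an LFN, preferring a given site."""
    pfns = replica_catalog[lfn]
    if preferred_site:
        for pfn in pfns:
            if preferred_site in pfn:
                return pfn
    return pfns[0]   # fall back to the first registered replica

print(resolve("lfn:higgs_digis_001", preferred_site="gridka"))
```

A job running at a site resolves the LFNs in its input sandbox this way, so the same job description works wherever the data happen to be replicated.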
2002 Data Productions
2002 production statistics
• Used Objectivity/DB for persistency
• 11 Regional Centers, more than 20 sites, about 30 site managers
• Spring 2002 Data production
  – Generation and detector simulation: 6 million events in 150 physics channels
  – Digitization: >13 million events with different configurations (luminosity)
  – about 200 KSI2000 months
  – more than 20 TB of digitized data
• Fall 2002 Data production
  – 10 million events, full chain (small output)
  – about 300 KSI2000 months
  – also productions on grid!
Spring 2002 production history
[Plot: CMSIM events produced per month, peaking at 1.5 million events per month]
Fall 2002 CMS grid productions
• CMS/EDG Stress Test on the EDG testbed & CMS sites
  – Top-down approach: more functionality but less robust, large manpower needed
  – 260,000 events in 3 weeks
• USCMS IGT Production in the US
  – Bottom-up approach: less functionality but more stable, little manpower needed
  – 1.2 million events in 2 months
2003 Pre-Challenge Production
PCP04 production statistics
• Started in July. Supposed to end by Christmas.
• Generation and simulation:
  – 48 million events with CMSIM
    • 50-150 KSI2K s/event, 2000 KSI2K months
    • ~1 MB/event, 50 TB
    • hit-formatting in progress; the POOL format reduces the size by a factor of 2!
  – 6 million events with OSCAR
    • 100-200 KSI2K s/event, 350 KSI2K months (in progress)
• Digitization just starting
  – need to digitize ~70 million events. Not all in time for DC04!
  – Estimated:
    • ~30-40 KSI2K s/event, ~950 KSI2K months
    • ~1.5 MB/event, 100 TB
• Data movement to CERN
  – ~1 TB/day for 2 months
PCP 2003 production history
[Plot: CMSIM events produced per month, peaking at 13 million events per month]
PCP04 on grid
US DPE production system
• Running on Grid2003
  – ~2000 CPUs
  – Based on VDT 1.1.11
  – EDG VOMS for authentication
  – GLUE Schema for MDS Information Providers
  – MonALISA for monitoring
  – MOP for production control
    • DAGMan and Condor-G for specification and submission
    • a Condor-based match-making process selects resources
[Diagram: US DPE Production on Grid2003 (the MOP system). On the Master Site, MCRunJob and mop_submitter drive DAGMan/Condor-G; remote sites 1..N each run a batch queue, with GridFTP used for data transfer between the master and remote sites]
Performance of US DPE (USMOP Regional Center)
• 7.7 Mevts Pythia: ~30,000 jobs of ~1.5 min each, ~0.7 KSI2000 months
• 2.3 Mevts CMSIM: ~9,000 jobs of ~10 hours each, ~90 KSI2000 months
• ~3.5 TB of data
• Now running OSCAR productions
[Plot: CMSIM production history]
CMS/LCG-0 testbed
• CMS/LCG-0 is a CMS-wide testbed based on the LCG pilot distribution (LCG-0), owned by CMS
  – joint CMS – DataTAG-WP4 – LCG-EIS effort
  – started in June 2003
  – components from VDT 1.1.6 and EDG 1.4.X (LCG pilot)
  – components from DataTAG (GLUE schemas and info providers)
  – Virtual Organization management: VOMS
  – RLS in place of the replica catalogue (uses rlscms by CERN/IT)
  – monitoring: GridICE by DataTAG
  – tests with R-GMA (as BOSS transport layer for specific tests)
  – no direct MSS access (bridge to SRB at CERN)
• About 170 CPUs, 4 TB of disk
  – Bari, Bologna, Bristol, Brunel, CERN, CNAF, Ecole Polytechnique, Imperial College, ISLAMABAD-NCP, Legnaro, Milano, NCU-Taiwan, Padova, U. Iowa
• Allowed CMS software integration to proceed while LCG-1 was not yet out
CMS/LCG-0 Production system
[Diagram: OCTOPUS (McRunjob + ImpalaLite, BOSS DB, RefDB) installed on the User Interface; jobs are submitted via JDL to the Grid (LCG) scheduler, which consults the Grid Information System (MDS) and RLS; CMS software is installed on the Computing Elements as RPMs; each site runs a CE with worker nodes (WN) and an SE; dataset metadata live in RefDB, job metadata in the BOSS DB; data or info is pushed, info is pulled]
CMS/LCG-0 performance (CMS-LCG Regional Center, based on CMS/LCG-0)
• 0.5 Mevts of "heavy" Pythia: ~2000 jobs of ~8 hours each, ~10 KSI2000 months
• 1.5 Mevts CMSIM: ~6000 jobs of ~10 hours each, ~55 KSI2000 months
• ~2.5 TB of data
• Inefficiency estimation:
  – 5% to 10% due to sites' misconfiguration and local failures
  – 0% to 20% due to RLS unavailability
  – a few errors in execution of the job wrapper
  – overall inefficiency: 5% to 30%
[Plot: Pythia + CMSIM production history]
Now used as a play-ground for CMS grid-tools development
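The quoted overall range (5% to 30%) is roughly what one gets by combining the two dominant per-source ranges. If the failure modes are assumed independent (an assumption; the slide does not state the model), the combined inefficiency is 1 - ∏(1 - f_i):

```python
# Combining independent failure fractions into an overall inefficiency.

def combined(*fail_fractions):
    """Overall failure fraction for independent failure modes."""
    ok = 1.0
    for f in fail_fractions:
        ok *= 1.0 - f
    return 1.0 - ok

best  = combined(0.05, 0.00)   # site failures only, RLS fully up
worst = combined(0.10, 0.20)   # both sources at their quoted maximum
print(f"best case:  {best:.0%}")   # 5%
print(f"worst case: {worst:.0%}")  # 28%, close to the quoted 30%
```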
Data Challenge 2004 (DC04)
2004 Data Challenge
• Test the CMS computing system at a rate corresponding to 5% of the full LHC luminosity
  – corresponds to 25% of the LHC start-up luminosity
  – for one month (February or March 2004)
  – 25 Hz data-taking rate at a luminosity of 0.2 × 10^34 cm^-2 s^-1
  – 50 million events (completely simulated up to digis during PCP03) used as input
• Main tasks
  – reconstruction at the Tier-0 (CERN) at 25 Hz (~40 MB/s)
  – distribution of DST to Tier-1 centers (~5 sites)
  – re-calibration at selected Tier-1 centers
  – physics-group analysis at the Tier-1 centers
  – user analysis from the Tier-2 centers
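The DC04 targets follow from the rate and event size, as a quick sanity check shows (figures from the slides; the 20 h/day duty factor is the one quoted for Tier-0 operation):

```python
# Sanity check of the DC04 targets.

rate_hz = 25
event_mb = 1.5                 # raw (digi) event size
print(f"input rate: {rate_hz * event_mb:.1f} MB/s")   # ~40 MB/s as quoted

# 25 Hz with 20 h/day operation over a 30-day challenge:
events = rate_hz * 20 * 3600 * 30
print(f"events in one month: {events / 1e6:.0f}M")    # ~54M, matching the 50M input sample
```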
[Diagram: DC04 data flow. PCP supplies 50M events (75 TB, 1 TB/day for 2 months) to the CERN tape archive. DC04 T0 challenge: a fake DAQ at CERN feeds a CERN disk pool (~40 TB, ~20 days of data) at 25 Hz, 1.5 MB/evt, 40 MB/s, 3.2 TB/day; first-pass reconstruction (with an optional HLT filter) uses the MASTER Conditions DB and writes 25 Hz of 1 MB/evt raw and 0.5 MB/evt reco DST to archive storage (CERN tape archive with disk cache). DC04 Calibration challenge: calibration samples and calibration jobs at a Tier-1 feed results back to the Conditions DB. DC04 Analysis challenge: event streams (e.g. Higgs DST, SUSY background, a Higgs background study requesting new events), TAG/AOD (20 kB/evt) replicas and Conditions DB replicas are distributed via an event server to Tier-1 and Tier-2 centers]
Tier-0 challenge
• Data-serving pool to serve digitized events at 25 Hz to the computing farm with 20/24-hour operation
  – 40 MB/s
  – adequate buffer space (at least 1/4 of the digi sample in the disk buffer)
  – pre-staging software; file locking while in use, buffer cleaning and restocking as files are processed
• Computing farm: approximately 400 CPUs
  – jobs running 20/24 hours; 500 events/job, 3 hours/job
  – files in the buffer locked until successful job completion
  – no dead-time can be introduced to the DAQ; latencies must be no more than of order 6-8 hours
• CERN MSS: ~50 MB/s archiving rate
  – archive ~1.5 MB × 25 Hz raw data (digis)
  – archive ~0.5 MB × 25 Hz reconstructed events (DST)
• File catalog: POOL/RLS
  – secure and complete catalog of all data inputs/products
  – accessible and/or replicable to the other computing centers
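The ~50 MB/s MSS archiving rate above follows directly from the two event sizes and the 25 Hz rate:

```python
# The CERN MSS archiving rate, from event sizes and the 25 Hz rate.

rate_hz = 25
raw_mb_s = 1.5 * rate_hz   # digis
dst_mb_s = 0.5 * rate_hz   # reconstructed DST
print(f"raw: {raw_mb_s:.1f} MB/s, DST: {dst_mb_s:.1f} MB/s, "
      f"total: {raw_mb_s + dst_mb_s:.1f} MB/s")   # ~50 MB/s as quoted
```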
Data distribution challenge
• Replication of the DST and part of the raw data to one or more Tier-1 centers
  – possibly using the LCG replication tools
  – some event duplication is foreseen
  – at CERN, ~3 GB/s traffic without inefficiencies (about 1/5 of that at a Tier-1)
• Tier-0 catalog accessible by all sites
• Replication of calibration samples (DST/raw) to selected Tier-1 centers
• Transparent access of jobs at the Tier-1 sites to the local data, whether in MSS or on disk buffer
• Replication of any Physics-Group (PG) data produced at the Tier-1 sites to the other Tier-1 sites and interested Tier-2 sites
• Monitoring of data transfer activities
  – e.g. with MonALISA
Calibration challenge
• Selected sites will run calibration procedures
• Rapid distribution of the calibration samples (within hours at most) to the Tier-1 site, with automatically scheduled jobs to process the data as they arrive
• Publication of the results in an appropriate form that can be returned to the Tier-0 for incorporation in the calibration "database"
• Ability to switch the calibration "database" at the Tier-0 on the fly, and to track from the metadata which calibration table has been used
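That last requirement, switching the calibration "database" on the fly while keeping the metadata traceable, amounts to tagging every output record with the calibration version in force when it was produced. A toy sketch of the bookkeeping (the structures and names are invented here, not the CMS implementation):

```python
# Toy bookkeeping for on-the-fly calibration switching.

current_calibration = "calib_v1"

def reconstruct(event_id):
    # tag every output record with the calibration in force at run time
    return {"event": event_id, "calib": current_calibration}

out = [reconstruct(1), reconstruct(2)]
current_calibration = "calib_v2"   # on-the-fly switch at the Tier-0
out.append(reconstruct(3))

print({rec["event"]: rec["calib"] for rec in out})
# {1: 'calib_v1', 2: 'calib_v1', 3: 'calib_v2'}
```

With the tag in the metadata, any event can later be traced back to the calibration table used to reconstruct it, even across a mid-run switch.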
Tier-1 analysis challenge
• All data distributed from the Tier-0 safely inserted into local storage
• Management and publication of a local catalog indicating the status of locally resident data
  – define tools and procedures to synchronize a variety of catalogs with the CERN RLS catalog (EDG-RLS, Globus-RLS, SRB-MCat, ...)
  – Tier-1 catalog accessible to at least the "associated" Tier-2 centers
• Operation of the physics-group (PG) productions on the imported data
  – "production-like" activity
• Local computing facilities made available to Tier-2 users
  – possibly via the LCG job submission system
• Export of the PG data to requesting sites (Tier-0, -1 or -2)
• Registration of the locally produced data in the Tier-0 catalog to make them available to at least selected sites
  – possibly via the LCG replication tools
Tier-2 analysis challenge
• Point of access to computing resources for physicists
• Pulling of data from peered Tier-1 sites as defined by the local Tier-2 activities
• Analysis on the local PG data produces plots and/or summary tables
• Analysis on distributed PG data or DST, available at least at the reference Tier-1 and "associated" Tier-2 centers
  – results are made available to selected remote users, possibly via the LCG data replication tools
• Private analysis on distributed PG data or DST is outside the DC04 scope but will be kept as a low-priority milestone
  – use of a Resource Broker and Replica Location Service to gain access to appropriate resources without knowing where the input data are
  – distribution of user code to the executing machines
  – user-friendly interface to prepare, submit and monitor jobs and to retrieve results
Summary of DC04 scale
• Tier-0
  – Reconstruction and DST production at CERN
    • 75 TB input data
    • 180 KSI2K months = 400 CPUs @ 24-hour operation (@ 500 SI2K/CPU)
    • 25 TB output data
    • 1-2 TB/day data distribution from CERN to the sum of the T1 centers
• Tier-1
  – Assume all (except CERN) "CMS" Tier-1s participate
    • CNAF, FNAL, Lyon, Karlsruhe, RAL
  – Share the T0 output DST between them (~5-10 TB each)
    • 200 GB/day transfer from CERN (per T1)
  – Perform scheduled analysis-group "production"
    • ~100 KSI2K months total = ~50 CPUs per T1 (24 hrs/30 days)
• Tier-2
  – Assume about 5-8 T2 centers
    • may be more...
  – Store some of the PG data at each T2 (500 GB-1 TB)
  – Estimate 20 CPUs at each center for 1 month
Summary
• Computing is a CMS-wide activity
  – 18 regional centers, ~50 sites
• Committed to supporting other CMS activities
  – support analysis for DAQ, Trigger and Physics studies
• Increasing in size and complexity
  – 1 TB in 1 month at 1 site in 1999
  – 170 TB in 6 months at 50 sites today
  – ready for full LHC size in 2007
• Exploiting new technologies
  – grid paradigm adopted by CMS
  – close collaboration with LCG and EU and US grid projects
  – grid tools assuming more and more importance in CMS