TRANSCRIPT
Claudio Grandi INFN Bologna ACAT'03 - KEK 3-Dec-2003
CMS Distributed Data Analysis Challenges
Claudio Grandi, on behalf of the CMS Collaboration
Outline
CMS Computing Environment
CMS Computing Milestones
OCTOPUS: CMS Production System
2002 Data productions
2003 Pre-Challenge production (PCP03)
PCP03 on grid
2004 Data Challenge (DC04)
Summary
CMS Computing Environment
CMS computing context
• LHC will produce 40 million bunch crossings per second in the CMS detector (1000 TB/s)
• The on-line system will reduce the rate to 100 events per second (100 MB/s raw data)
  – Level-1 trigger: hardware
  – High-level trigger: on-line farm
• Raw data (1 MB/evt) will be:
  – archived on persistent storage (~1 PB/year)
  – reconstructed to DST (~0.5 MB/evt) and AOD (~20 kB/evt)
• Reconstructed data (and part of the raw data) will be:
  – distributed to computing centers of collaborating institutes
  – analyzed by physicists at their own institutes
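The rates quoted above are self-consistent, and checking them is simple arithmetic. A quick sketch (all figures come from the slide; the ~25 MB per bunch crossing is inferred from the quoted 1000 TB/s at 40 MHz):

```python
# Back-of-envelope check of the CMS data rates quoted above.

bunch_crossings_per_s = 40e6   # 40 MHz LHC crossing rate
raw_event_size_mb = 1.0        # ~1 MB per raw event
hlt_output_rate_hz = 100       # rate after Level-1 + High Level Trigger

# Detector output before any trigger: ~25 MB/crossing gives ~1000 TB/s
detector_rate_tb_s = bunch_crossings_per_s * 25 / 1e6
print(f"pre-trigger rate: ~{detector_rate_tb_s:.0f} TB/s")

# After the trigger chain: 100 events/s x 1 MB = 100 MB/s to storage
storage_rate_mb_s = hlt_output_rate_hz * raw_event_size_mb
print(f"raw data to storage: {storage_rate_mb_s:.0f} MB/s")

# Over a nominal 10^7 s running year: ~1 PB of raw data
raw_pb_per_year = storage_rate_mb_s * 1e7 / 1e9
print(f"raw data per year: ~{raw_pb_per_year:.0f} PB")
```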
CMS Data Production at LHC
[Diagram: the Level-1 Trigger reduces 40 MHz (1000 TB/s) to 75 kHz (50 GB/s); the High Level Trigger reduces this to 100 Hz (100 MB/s) for Data Recording & Offline Analysis]
CMS Distributed Computing Model
[Diagram: the Online System at the Experiment (~PB/s off the detector) feeds the CERN Center (Tier 0+1; PBs of disk, tape robot) at ~100-1500 MB/s; Tier-1 centers (FNAL, IN2P3, INFN, RAL) connect at 2.5-10 Gbps; Tier-2 centers at ~2.5-10 Gbps; institute Tier-3 sites and Tier-4 workstations at 0.1 to 10 Gbps, with a physics data cache]
CMS software for Data Simulation
• Event Generation
  – Pythia and other generators
    • Generally Fortran programs. Produce N-tuple files (HEPEVT format)
• Detector simulation
  – CMSIM (uses GEANT-3)
    • Fortran program. Produces formatted Zebra (FZ) files from N-tuples
  – OSCAR (uses GEANT-4 and the CMS COBRA framework)
    • C++ program. Produces POOL files (hits) from N-tuples
• Digitization (DAQ simulation)
  – ORCA (uses the CMS COBRA framework)
    • C++ program. Produces POOL files (digis) from hits POOL files or FZ files
• Trigger simulation
  – ORCA
    • Reads digis POOL files
    • Normally run as part of the reconstruction phase
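The two simulation paths above (the Fortran/GEANT-3 chain via CMSIM vs the C++/GEANT-4 chain via OSCAR) can be modelled as a small data structure. The tool and format names come from the slides; the `Stage` class and the chain-selection logic are purely illustrative:

```python
# Illustrative model of the two CMS simulation paths described above.
from dataclasses import dataclass

@dataclass
class Stage:
    tool: str
    reads: str
    writes: str

def simulation_chain(use_geant4: bool):
    """Return the generation -> simulation -> digitization stages."""
    gen = Stage("Pythia", "-", "HEPEVT N-tuples")
    if use_geant4:
        sim = Stage("OSCAR (GEANT-4/COBRA)", "HEPEVT N-tuples", "POOL hits")
    else:
        sim = Stage("CMSIM (GEANT-3)", "HEPEVT N-tuples", "Zebra FZ hits")
    # ORCA digitization consumes whatever the simulation step wrote
    digi = Stage("ORCA/COBRA digitization", sim.writes, "POOL digis")
    return [gen, sim, digi]

for s in simulation_chain(use_geant4=True):
    print(f"{s.tool}: {s.reads} -> {s.writes}")
```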
CMS software for Data Analysis
• Reconstruction
  – ORCA
    • Produces POOL files (DST and AOD) from hits or digis POOL files
• Analysis
  – ORCA
    • Reads POOL files in (hits, digis,) DST and AOD formats
  – IGUANA (uses ORCA and OSCAR as back-end)
    • Visualization program (event display, statistical analysis)
CMS software: ORCA & C.
[Diagram: Data Simulation and Data Analysis chains. Pythia and other generators produce HEPEVT N-tuples; CMSIM (GEANT3) produces Zebra files with hits, converted by the ORCA/COBRA hit formatter, while OSCAR/COBRA (GEANT4) writes directly to the hits database (POOL); ORCA/COBRA digitization, merging signal and pile-up, fills the digis database (POOL); ORCA reconstruction or user analysis produces a POOL database, N-tuples or Root files; IGUANA provides interactive analysis]
CMS Computing Milestones
CMS computing milestones
• DAQ TDR (Technical Design Report)
  – Spring-2002 Data Production
• Software Baselining
• Computing & Core Software TDR
  – 2003 Data Production (PCP04)
  – 2004 Data Challenge (DC04)
• Physics TDR
  – 2004/05 Data Production (DC05)
  – Data Analysis for Physics TDR
• "Readiness Review"
  – 2005 Data Production (PCP06)
  – 2006 Data Challenge (DC06)
• Commissioning
[Chart: CMS computing capacity in kSI95·months (log scale, 1 to 100,000), CERN and offsite, 2002-2009; average slope = x2.5/year; milestones marked: DAQ TDR, DC04/Physics TDR, DC05/LCG TDR, DC06/Readiness, LHC 2E33, LHC 1E34]

Size of CMS Data Challenges:
• 1999: 1 TB – 1 month – 1 person
• 2000-2001: 27 TB – 12 months – 30 persons
• 2002: 20 TB – 2 months – 30 persons
• 2003: 175 TB – 6 months – <30 persons
World-wide Distributed Productions
[Map legend: CMS Production Regional Centre / CMS Distributed Production Regional Centre]
CMS Computing Challenges
• CMS computing challenges include:
  – production of simulated data for studies on:
    • detector design
    • trigger and DAQ design and validation
    • physics system setup
  – definition and set-up of the analysis infrastructure
  – definition of the computing infrastructure
  – validation of the computing model
    • distributed system
    • increasing size and complexity
    • tied to other CMS activities
  – providing computing support for all CMS activities
OCTOPUS: CMS Production System
[Diagram: OCTOPUS Data Production System. A physics group asks for a new dataset in RefDB; the Production Manager defines assignments; a Site Manager starts an assignment via McRunjob + plug-ins (CMSProd). Jobs run either through shell scripts on a local batch manager and computer farm, through the LCG scheduler (JDL) with RLS/POOL, or through DAGMan (MOP) and the Chimera Virtual Data Catalogue with a planner (DPE). Dataset metadata live in RefDB, job metadata in the BOSS DB; data or info is pushed, info is pulled; data-level and job-level queries are supported]
Remote connections to databases
[Diagram: on the Worker Node, a job wrapper (job instrumentation) wraps the user job; a journal writer and remote updater handle job input/output and journal-catalog entries, either via a direct connection from the WN to the metadata DBs or through an asynchronous updater on the User Interface]
• The metadata DBs are RLS/POOL, RefDB and the BOSS DB
Job production
• MCRunJob
  – Modular: plug-ins exist for:
    • reading from RefDB
    • reading from a simple GUI
    • submitting to a local resource manager
    • submitting to DAGMan/Condor-G (MOP)
    • submitting to the EDG/LCG scheduler
    • producing derivations in the Chimera Virtual Data Catalogue
  – Runs on the user's (e.g. site manager's) host
  – Also defines the sandboxes needed by the job
  – If needed, the specific submission plug-in takes care of:
    • preparing the XML POOL catalogue with input-file information
    • moving the sandbox files to the worker nodes
• CMSProd
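MCRunJob's modular design, with one plug-in per submission back-end, can be sketched as a small registry. The back-end names mirror the list above; the registry API and the exact commands handed off are invented here for illustration (though `condor_submit_dag` and `edg-job-submit` are the real Condor-G/EDG entry points):

```python
# Minimal sketch of a plug-in registry in the spirit of MCRunJob.
SUBMITTERS = {}

def register(name):
    """Decorator that records a submission plug-in under a backend name."""
    def deco(fn):
        SUBMITTERS[name] = fn
        return fn
    return deco

@register("local")
def submit_local(job):
    return f"bsub {job}"                 # hand off to a local batch manager

@register("dagman")
def submit_dagman(job):
    return f"condor_submit_dag {job}"    # DAGMan/Condor-G via MOP

@register("edg")
def submit_edg(job):
    return f"edg-job-submit {job}.jdl"   # EDG/LCG resource broker

def submit(backend, job):
    """Dispatch a prepared job to the chosen back-end plug-in."""
    return SUBMITTERS[backend](job)

print(submit("edg", "cmsim_run42"))
```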
Job Metadata management
• Job parameters that represent the job's running status are stored in a dedicated database:
  – when did the job start?
  – is it finished?
  but also:
  – how many events did it produce so far?
• BOSS is a CMS-developed system that does this, extracting the information from the job's standard input/output/error streams
  – The remote updater is based on MySQL
  – Remote updaters are now being developed based on:
    • R-GMA (still has scalability problems)
    • Clarens (just started)
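The core BOSS idea, scanning the job's output stream for known patterns and turning matches into metadata updates, can be shown in a few lines. The regexes and metadata keys below are invented for illustration, not BOSS's actual filters:

```python
# Toy version of stream-based job-metadata extraction, BOSS-style.
import re

PATTERNS = {
    "events_done": re.compile(r"processed (\d+) events"),
    "finished":    re.compile(r"job (finished|completed)"),
}

def update_metadata(line, meta):
    """Scan one stdout line and update the job-metadata dict in place."""
    m = PATTERNS["events_done"].search(line)
    if m:
        meta["events_done"] = int(m.group(1))
    if PATTERNS["finished"].search(line):
        meta["finished"] = True

meta = {"events_done": 0, "finished": False}
for line in ["processed 100 events", "processed 500 events", "job finished"]:
    update_metadata(line, meta)
print(meta)   # {'events_done': 500, 'finished': True}
```

In the real system the extracted values are shipped to a MySQL database by the remote updater rather than printed.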
Dataset Metadata management
• Dataset metadata are stored in the RefDB:
  – which (logical) files is the dataset made of?
  but also:
  – what were the input parameters to the simulation program?
  – how many events have been produced so far?
• Information may be updated in the RefDB in many ways:
  – manual Site Manager operation
  – automatic e-mail from the job
  – remote updaters based on R-GMA and Clarens (similar to those developed for BOSS) will be developed
• Mapping of logical names to physical file names will be done on the grid by RLS/POOL
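That last point, mapping logical file names (LFNs) to physical replicas (PFNs), is the job of the RLS/POOL catalogue. A dictionary-based stand-in shows the lookup involved; the file names and sites here are made up:

```python
# Toy replica catalogue: one logical name, several physical replicas.
replica_catalog = {
    "lfn:higgs_digis_001": [
        "srm://castor.cern.ch/cms/higgs_digis_001",
        "srm://gridka.de/cms/higgs_digis_001",
    ],
}

def resolve(lfn, preferred_site=None):
    """Return a physical replica for an LFN, preferring a given site."""
    pfns = replica_catalog[lfn]
    if preferred_site:
        for pfn in pfns:
            if preferred_site in pfn:
                return pfn
    return pfns[0]   # fall back to the first registered replica

print(resolve("lfn:higgs_digis_001", preferred_site="gridka"))
```

A job running at a site resolves the LFNs in its input sandbox this way, so the same job description works wherever the data happen to be replicated.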
2002 Data Productions
2002 production statistics
• Used Objectivity/DB for persistency
• 11 Regional Centers, more than 20 sites, about 30 site managers
• Spring 2002 Data production
  – Generation and detector simulation: 6 million events in 150 physics channels
  – Digitization: >13 million events with different configurations (luminosity)
  – about 200 KSI2000 months
  – more than 20 TB of digitized data
• Fall 2002 Data production
  – 10 million events, full chain (small output)
  – about 300 KSI2000 months
  – also productions on grid!
Spring 2002 production history
[Plot: CMSIM events produced per month, peaking at 1.5 million events per month]
Fall 2002 CMS grid productions
• CMS/EDG Stress Test on the EDG testbed & CMS sites
  – Top-down approach: more functionality but less robust, large manpower needed
  – 260,000 events in 3 weeks
• USCMS IGT Production in the US
  – Bottom-up approach: less functionality but more stable, little manpower needed
  – 1.2 million events in 2 months
2003 Pre-Challenge Production
PCP04 production statistics
• Started in July. Supposed to end by Christmas.
• Generation and simulation:
  – 48 million events with CMSIM
    • 50-150 KSI2K s/event, 2000 KSI2K months
    • ~1 MB/event, 50 TB
    • hit-formatting in progress; the POOL format reduces the size by a factor of 2!
  – 6 million events with OSCAR
    • 100-200 KSI2K s/event, 350 KSI2K months (in progress)
• Digitization just starting
  – need to digitize ~70 million events. Not all in time for DC04!
  – Estimated:
    • ~30-40 KSI2K s/event, ~950 KSI2K months
    • ~1.5 MB/event, 100 TB
• Data movement to CERN
  – ~1 TB/day for 2 months
PCP 2003 production history
[Plot: CMSIM events produced per month, peaking at 13 million events per month]
PCP04 on grid
US DPE production system
• Running on Grid2003
  – ~2000 CPUs
  – Based on VDT 1.1.11
  – EDG VOMS for authentication
  – GLUE Schema for MDS Information Providers
  – MonALISA for monitoring
  – MOP for production control
    • DAGMan and Condor-G for specification and submission
    • a Condor-based match-making process selects resources
[Diagram: US DPE Production on Grid2003 (the MOP system). On the Master Site, MCRunJob and mop_submitter drive DAGMan/Condor-G; remote sites 1..N each run a batch queue, with GridFTP used for data transfer between the master and remote sites]
Performance of US DPE (USMOP Regional Center)
• 7.7 Mevts Pythia: ~30,000 jobs of ~1.5 min each, ~0.7 KSI2000 months
• 2.3 Mevts CMSIM: ~9,000 jobs of ~10 hours each, ~90 KSI2000 months
• ~3.5 TB of data
• Now running OSCAR productions
[Plot: CMSIM production history]
CMS/LCG-0 testbed
• CMS/LCG-0 is a CMS-wide testbed based on the LCG pilot distribution (LCG-0), owned by CMS
  – joint CMS – DataTAG-WP4 – LCG-EIS effort
  – started in June 2003
  – components from VDT 1.1.6 and EDG 1.4.X (LCG pilot)
  – components from DataTAG (GLUE schemas and info providers)
  – Virtual Organization management: VOMS
  – RLS in place of the replica catalogue (uses rlscms by CERN/IT)
  – monitoring: GridICE by DataTAG
  – tests with R-GMA (as BOSS transport layer for specific tests)
  – no direct MSS access (bridge to SRB at CERN)
• About 170 CPUs, 4 TB of disk
  – Bari, Bologna, Bristol, Brunel, CERN, CNAF, Ecole Polytechnique, Imperial College, ISLAMABAD-NCP, Legnaro, Milano, NCU-Taiwan, Padova, U. Iowa
• Allowed CMS software integration to proceed while LCG-1 was not yet out
CMS/LCG-0 Production system
[Diagram: OCTOPUS (McRunjob + ImpalaLite, BOSS DB, RefDB) installed on the User Interface; jobs are submitted via JDL to the Grid (LCG) scheduler, which consults the Grid Information System (MDS) and RLS; CMS software is installed on the Computing Elements as RPMs; each site runs a CE with worker nodes (WN) and an SE; dataset metadata live in RefDB, job metadata in the BOSS DB; data or info is pushed, info is pulled]
CMS/LCG-0 performance (CMS-LCG Regional Center, based on CMS/LCG-0)
• 0.5 Mevts of "heavy" Pythia: ~2000 jobs of ~8 hours each, ~10 KSI2000 months
• 1.5 Mevts CMSIM: ~6000 jobs of ~10 hours each, ~55 KSI2000 months
• ~2.5 TB of data
• Inefficiency estimation:
  – 5% to 10% due to sites' misconfiguration and local failures
  – 0% to 20% due to RLS unavailability
  – a few errors in execution of the job wrapper
  – overall inefficiency: 5% to 30%
[Plot: Pythia + CMSIM production history]
Now used as a play-ground for CMS grid-tools development
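The quoted overall range (5% to 30%) is roughly what one gets by combining the two dominant per-source ranges. If the failure modes are assumed independent (an assumption; the slide does not state the model), the combined inefficiency is 1 - ∏(1 - f_i):

```python
# Combining independent failure fractions into an overall inefficiency.

def combined(*fail_fractions):
    """Overall failure fraction for independent failure modes."""
    ok = 1.0
    for f in fail_fractions:
        ok *= 1.0 - f
    return 1.0 - ok

best  = combined(0.05, 0.00)   # site failures only, RLS fully up
worst = combined(0.10, 0.20)   # both sources at their quoted maximum
print(f"best case:  {best:.0%}")   # 5%
print(f"worst case: {worst:.0%}")  # 28%, close to the quoted 30%
```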
Data Challenge 2004 (DC04)
2004 Data Challenge
• Test the CMS computing system at a rate corresponding to 5% of the full LHC luminosity
  – corresponds to 25% of the LHC start-up luminosity
  – for one month (February or March 2004)
  – 25 Hz data-taking rate at a luminosity of 0.2 × 10^34 cm^-2 s^-1
  – 50 million events (completely simulated up to digis during PCP03) used as input
• Main tasks
  – reconstruction at the Tier-0 (CERN) at 25 Hz (~40 MB/s)
  – distribution of DST to Tier-1 centers (~5 sites)
  – re-calibration at selected Tier-1 centers
  – physics-group analysis at the Tier-1 centers
  – user analysis from the Tier-2 centers
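The DC04 targets follow from the rate and event size, as a quick sanity check shows (figures from the slides; the 20 h/day duty factor is the one quoted for Tier-0 operation):

```python
# Sanity check of the DC04 targets.

rate_hz = 25
event_mb = 1.5                 # raw (digi) event size
print(f"input rate: {rate_hz * event_mb:.1f} MB/s")   # ~40 MB/s as quoted

# 25 Hz with 20 h/day operation over a 30-day challenge:
events = rate_hz * 20 * 3600 * 30
print(f"events in one month: {events / 1e6:.0f}M")    # ~54M, matching the 50M input sample
```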
[Diagram: DC04 data flow. PCP supplies 50M events (75 TB, 1 TB/day for 2 months) to the CERN tape archive. DC04 T0 challenge: a fake DAQ at CERN feeds a CERN disk pool (~40 TB, ~20 days of data) at 25 Hz, 1.5 MB/evt, 40 MB/s, 3.2 TB/day; first-pass reconstruction (with an optional HLT filter) uses the MASTER Conditions DB and writes 25 Hz of 1 MB/evt raw and 0.5 MB/evt reco DST to archive storage (CERN tape archive with disk cache). DC04 Calibration challenge: calibration samples and calibration jobs at a Tier-1 feed results back to the Conditions DB. DC04 Analysis challenge: event streams (e.g. Higgs DST, SUSY background, a Higgs background study requesting new events), TAG/AOD (20 kB/evt) replicas and Conditions DB replicas are distributed via an event server to Tier-1 and Tier-2 centers]
Tier-0 challenge
• Data-serving pool to serve digitized events at 25 Hz to the computing farm with 20/24-hour operation
  – 40 MB/s
  – adequate buffer space (at least 1/4 of the digi sample in the disk buffer)
  – pre-staging software; file locking while in use, buffer cleaning and restocking as files are processed
• Computing farm: approximately 400 CPUs
  – jobs running 20/24 hours; 500 events/job, 3 hours/job
  – files in the buffer locked until successful job completion
  – no dead-time can be introduced to the DAQ; latencies must be no more than of order 6-8 hours
• CERN MSS: ~50 MB/s archiving rate
  – archive ~1.5 MB × 25 Hz raw data (digis)
  – archive ~0.5 MB × 25 Hz reconstructed events (DST)
• File catalog: POOL/RLS
  – secure and complete catalog of all data inputs/products
  – accessible and/or replicable to the other computing centers
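The ~50 MB/s MSS archiving rate above follows directly from the two event sizes and the 25 Hz rate:

```python
# The CERN MSS archiving rate, from event sizes and the 25 Hz rate.

rate_hz = 25
raw_mb_s = 1.5 * rate_hz   # digis
dst_mb_s = 0.5 * rate_hz   # reconstructed DST
print(f"raw: {raw_mb_s:.1f} MB/s, DST: {dst_mb_s:.1f} MB/s, "
      f"total: {raw_mb_s + dst_mb_s:.1f} MB/s")   # ~50 MB/s as quoted
```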
Data distribution challenge
• Replication of the DST and part of the raw data to one or more Tier-1 centers
  – possibly using the LCG replication tools
  – some event duplication is foreseen
  – at CERN, ~3 GB/s traffic without inefficiencies (about 1/5 of that at a Tier-1)
• Tier-0 catalog accessible by all sites
• Replication of calibration samples (DST/raw) to selected Tier-1 centers
• Transparent access of jobs at the Tier-1 sites to the local data, whether in MSS or on disk buffer
• Replication of any Physics-Group (PG) data produced at the Tier-1 sites to the other Tier-1 sites and interested Tier-2 sites
• Monitoring of data transfer activities
  – e.g. with MonALISA
Calibration challenge
• Selected sites will run calibration procedures
• Rapid distribution of the calibration samples (within hours at most) to the Tier-1 site, with automatically scheduled jobs to process the data as they arrive
• Publication of the results in an appropriate form that can be returned to the Tier-0 for incorporation in the calibration "database"
• Ability to switch the calibration "database" at the Tier-0 on the fly, and to track from the metadata which calibration table has been used
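That last requirement, switching the calibration "database" on the fly while keeping the metadata traceable, amounts to tagging every output record with the calibration version in force when it was produced. A toy sketch of the bookkeeping (the structures and names are invented here, not the CMS implementation):

```python
# Toy bookkeeping for on-the-fly calibration switching.

current_calibration = "calib_v1"

def reconstruct(event_id):
    # tag every output record with the calibration in force at run time
    return {"event": event_id, "calib": current_calibration}

out = [reconstruct(1), reconstruct(2)]
current_calibration = "calib_v2"   # on-the-fly switch at the Tier-0
out.append(reconstruct(3))

print({rec["event"]: rec["calib"] for rec in out})
# {1: 'calib_v1', 2: 'calib_v1', 3: 'calib_v2'}
```

With the tag in the metadata, any event can later be traced back to the calibration table used to reconstruct it, even across a mid-run switch.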
Tier-1 analysis challenge
• All data distributed from the Tier-0 safely inserted into local storage
• Management and publication of a local catalog indicating the status of locally resident data
  – define tools and procedures to synchronize a variety of catalogs with the CERN RLS catalog (EDG-RLS, Globus-RLS, SRB-MCat, ...)
  – Tier-1 catalog accessible to at least the "associated" Tier-2 centers
• Operation of the physics-group (PG) productions on the imported data
  – "production-like" activity
• Local computing facilities made available to Tier-2 users
  – possibly via the LCG job submission system
• Export of the PG data to requesting sites (Tier-0, -1 or -2)
• Registration of the locally produced data in the Tier-0 catalog to make them available to at least selected sites
  – possibly via the LCG replication tools
Tier-2 analysis challenge
• Point of access to computing resources for physicists
• Pulling of data from peered Tier-1 sites as defined by the local Tier-2 activities
• Analysis on the local PG data produces plots and/or summary tables
• Analysis on distributed PG data or DST, available at least at the reference Tier-1 and "associated" Tier-2 centers
  – results are made available to selected remote users, possibly via the LCG data replication tools
• Private analysis on distributed PG data or DST is outside the DC04 scope but will be kept as a low-priority milestone
  – use of a Resource Broker and Replica Location Service to gain access to appropriate resources without knowing where the input data are
  – distribution of user code to the executing machines
  – user-friendly interface to prepare, submit and monitor jobs and to retrieve results
Summary of DC04 scale
• Tier-0
  – Reconstruction and DST production at CERN
    • 75 TB input data
    • 180 KSI2K months = 400 CPUs @ 24-hour operation (@ 500 SI2K/CPU)
    • 25 TB output data
    • 1-2 TB/day data distribution from CERN to the sum of the T1 centers
• Tier-1
  – Assume all (except CERN) "CMS" Tier-1s participate
    • CNAF, FNAL, Lyon, Karlsruhe, RAL
  – Share the T0 output DST between them (~5-10 TB each)
    • 200 GB/day transfer from CERN (per T1)
  – Perform scheduled analysis-group "production"
    • ~100 KSI2K months total = ~50 CPUs per T1 (24 hrs/30 days)
• Tier-2
  – Assume about 5-8 T2 centers
    • may be more...
  – Store some of the PG data at each T2 (500 GB-1 TB)
  – Estimate 20 CPUs at each center for 1 month
Summary
• Computing is a CMS-wide activity
  – 18 regional centers, ~50 sites
• Committed to supporting other CMS activities
  – support analysis for DAQ, Trigger and Physics studies
• Increasing in size and complexity
  – 1 TB in 1 month at 1 site in 1999
  – 170 TB in 6 months at 50 sites today
  – ready for full LHC size in 2007
• Exploiting new technologies
  – grid paradigm adopted by CMS
  – close collaboration with LCG and EU and US grid projects
  – grid tools assuming more and more importance in CMS