Distributed Monte Carlo Production for
Joel Snow
Langston University
DOE Review March 2011
Outline
● Introduction
● FNAL SAM
● SAMGrid
● Interoperability with OSG and LCG
● Production System
● Production Results
● LUHEP Computing
● Summary
Introduction
● Covers my tenure as MC production coordinator
● Simulation data (MC) crucial to physics analysis
● Tevatron luminosity and hence raw data volume is at record levels
● Challenge for analysts and production
● Personnel & computing resources migrating to LHC experiments
● DZero strategy
– Increase automation
– Leverage resources and support
Evolution
● Mature experiment, but nimble
– history of adopting innovative technologies
    ● distributed data handling - SAM
    ● early adopter of the grid for production - SAMGrid
– significant investment in these technologies
● Grid technology allows opportunistic usage
– DZero can mix “traditional” dedicated and opportunistic resources
● Grid interoperability
– Leverages resources and support, reduces personnel needs per CPU hour
Sequential data Access via Metadata
● Fermilab system first used by DZero
● SAM distributed data handling system predates grid
● Set of servers working together to store and retrieve files and metadata
● Permanent storage and local disk caches
● Database tracks location, metadata of files, job processing history
● Delivers files to jobs (using GridFTP over WAN), provides job submission capabilities
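As a rough illustration of the bookkeeping listed above, the sketch below models a catalog that records file metadata and replica locations and answers metadata queries. It is a toy in-memory model and not the SAM interface: the class `ToyCatalog`, its methods, the metadata keys, and the file and location names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Metadata and known replica locations for one file (hypothetical schema)."""
    name: str
    metadata: dict
    locations: set = field(default_factory=set)


class ToyCatalog:
    """Toy stand-in for a metadata database; not the real SAM interface."""

    def __init__(self):
        self._files = {}

    def declare(self, name, **metadata):
        # Register a file together with its descriptive metadata.
        self._files[name] = FileRecord(name, dict(metadata))

    def add_location(self, name, location):
        # Record that a replica exists in permanent storage or a disk cache.
        self._files[name].locations.add(location)

    def query(self, **criteria):
        # Return the names of files whose metadata matches every criterion.
        return sorted(
            rec.name
            for rec in self._files.values()
            if all(rec.metadata.get(k) == v for k, v in criteria.items())
        )

    def locate(self, name):
        # Return all known replica locations of a file.
        return sorted(self._files[name].locations)


catalog = ToyCatalog()
catalog.declare("mc_0001.root", data_tier="simulated", request_id=101)
catalog.declare("mc_0002.root", data_tier="simulated", request_id=102)
catalog.add_location("mc_0001.root", "tape:fnal")
catalog.add_location("mc_0001.root", "cache:site-a")

matching = catalog.query(request_id=101)
replicas = catalog.locate("mc_0001.root")
```

The point of the sketch is only that jobs ask for data by metadata rather than by path, and the catalog resolves that to whichever replica is available.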
SAMGrid
● Fermilab developed grid first used by DZero for global MC production in 2004
● SAMGrid = SAM + Job and Information Management (JIM) components
● Provides the user with transparent remote job submission, data processing and status monitoring.
● VDT based (Globus + Condor)
● Logically consists of
– Multiple execution sites
– Resource selector
– Multiple Job Submission (Scheduler) sites
– Multiple Clients (User Interface) to Submission site.
SAMGrid Interoperability
● As Open Science Grid (OSG) and LHC Computing Grid (LCG) became operational it was desirable to leverage these resources for DZero
● FNAL and DZero developed and deployed SAMGrid interoperability with both LCG and OSG resources
● Execution site acts as a Forwarding node
– packages SAMGrid jobs for OSG/LCG job submission via Condor-G
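The slides do not show how the forwarding node packages a job, so the following is only a sketch of the general shape of a Condor-G (HTCondor grid universe) submit description, generated from Python. The gatekeeper host, job manager, script name, and arguments are hypothetical placeholders; only the submit commands themselves (`universe`, `grid_resource`, `executable`, `arguments`, `output`, `error`, `log`, `queue`) are standard HTCondor.

```python
def condor_g_submit_description(gatekeeper, jobmanager, executable, arguments=""):
    """Build a minimal grid-universe submit description as text.

    "gt2" selects a pre-web-services Globus GRAM gatekeeper as the remote resource.
    """
    lines = [
        "universe = grid",
        f"grid_resource = gt2 {gatekeeper}/jobmanager-{jobmanager}",
        f"executable = {executable}",
        f"arguments = {arguments}",
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


submit_text = condor_g_submit_description(
    gatekeeper="ce.example.org",   # hypothetical OSG/LCG compute element
    jobmanager="pbs",              # hypothetical local batch system
    executable="run_mc_job.sh",    # hypothetical wrapper script
    arguments="--request 101",     # hypothetical arguments
)
```

Text of this form could be written to a file and handed to `condor_submit`; what the real forwarding node adds on top (SAM file delivery, monitoring hooks) is outside this sketch.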
Consolidation, Automation, Exploitation
● SAMGrid sites require operational manpower and expert support
● People power and FNAL support migrating to LHC experiments
● Increase automation - Automc
● Reduce number of SAMGrid sites, increase use of OSG and LCG
– comes with support ✔
– provides opportunistic job slots ✔
Production System
MC production gets work from the SAM Request System
Physics groups' MC requests are parametrized and prioritized as a Python object
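To make the idea of a parametrized, prioritized request concrete, here is a minimal sketch of a request object and a priority-ordered queue. It is not the SAM Request System's actual schema: the class names, field names, the "lower number is served first" convention, and the example values are assumptions made for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class MCRequest:
    """Hypothetical MC request; only priority and sequence take part in ordering."""
    priority: int                  # assumed convention: lower number is served first
    sequence: int                  # arrival order, breaks ties between equal priorities
    request_id: int = field(compare=False)
    physics_group: str = field(compare=False)
    events: int = field(compare=False)
    params: dict = field(compare=False, default_factory=dict)


class RequestQueue:
    """Serve requests by priority, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def add(self, request_id, physics_group, events, priority, **params):
        request = MCRequest(priority, next(self._counter), request_id,
                            physics_group, events, params)
        heapq.heappush(self._heap, request)
        return request

    def next_request(self):
        # Return the highest-priority pending request, or None if the queue is empty.
        return heapq.heappop(self._heap) if self._heap else None


queue = RequestQueue()
queue.add(101, "group-a", 500_000, priority=2, process="example-process")
queue.add(102, "group-b", 250_000, priority=1)
queue.add(103, "group-c", 100_000, priority=2)

served = [queue.next_request().request_id for _ in range(3)]
```

Here request 102 is served first because of its priority, and 101 precedes 103 because it arrived earlier.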
Automatic Monte Carlo Request Processing
● Developed Automc System - in use at FNAL
– Handles official DZero MC production at all but 2 sites
● From approved request to final data storage
● Easy to use - minimizes manpower needs
● Site independent
– deploy for any grid site (SAMGrid, OSG, LCG)
– capable of managing many sites
● Handle recovery of common failures
● Integrates with existing MC request priority protocol
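The slides do not say how Automc recovers from failures. One common pattern, shown here purely as an assumed sketch, is to retry a submission when the failure is of a known recoverable kind and to stop otherwise. The failure labels, the `submit` callable, and the retry limit are all hypothetical.

```python
RECOVERABLE = {"transfer_timeout", "worker_lost"}  # hypothetical failure labels


def run_with_recovery(submit, max_attempts=3):
    """Call submit() until it succeeds, retrying only recoverable failures.

    submit() is assumed to return None on success or a failure label string.
    Returns (succeeded, history of outcomes).
    """
    history = []
    for _ in range(max_attempts):
        failure = submit()
        history.append(failure)
        if failure is None:
            return True, history
        if failure not in RECOVERABLE:
            break  # unknown failure: leave it for a human operator
    return False, history


# Simulated job that fails once with a recoverable error, then succeeds.
outcomes = iter(["transfer_timeout", None])
ok, history = run_with_recovery(lambda: next(outcomes))
```

Separating recoverable from unrecoverable failures is what keeps the manpower cost low: only the second kind needs attention from a person.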
AutoMC Monitoring
Running at FNAL & managing production at 39 sites
http://www-d0.fnal.gov/computing/mcprod/dajd/dajd_status.html
Production System Resources
● MC production uses a variety of dedicated and opportunistic resources on 4 continents
– Non-grid site at ccin2p3 Lyon (FR) - very productive, flexible
– Native Samgrid sites: FZU (CZ), GridKa (DE), LUHEP (US), USTC (CN)
– LCG resources: CE's, SE's, and Samgrid-LCG infrastructure in FR, UK, NL
– OSG resources: CE's, SE's, and Samgrid-OSG infrastructure in US
MC Production Results
Looking back at the last 30 days
Averaging 5.8M events per day and totaling 172.8M events in 30 days
MC Production Results
Looking back at the last year
Averaging 49M events per week and totaling 2.6B events in a year
[Plot labels: cumulative since September 2005; 2010/02/14 - 2011/02/14]
MC Production Results
Looking back at the last year by production segment
52 week averages per week (2010/02/14 - 2011/02/14)
Non-grid: 19.8M, OSG: 11.4M, Samgrid: 12.6M, LCG: 4.9M
MC Production Results
Looking back at the last year by production segment
[Plot label: cumulative since September, 2005]
52 week totals (2010/02/14 - 2011/02/14)
Non-grid: 1041M, OSG: 596M, Samgrid: 658M, LCG: 257M
Production Last Year By Segment: Non-grid 40.8%, OSG 23.3%, Samgrid 25.8%, LCG 10.1%
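The segment shares and the weekly average quoted for the last year can be cross-checked from the four 52-week totals on this slide. The short calculation below uses only those totals; since they are rounded to the nearest million events, the recomputed shares agree with the quoted percentages to within about 0.1 percentage point.

```python
# 52-week totals in millions of events (2010/02/14 - 2011/02/14), from the slide.
totals_millions = {"Non-grid": 1041, "OSG": 596, "Samgrid": 658, "LCG": 257}

grand_total = sum(totals_millions.values())  # 2552 million events, about 2.6B

# Share of each production segment in percent.
shares = {name: 100 * value / grand_total for name, value in totals_millions.items()}

# Average production per week over the 52 weeks, in millions of events.
weekly_average = grand_total / 52

for name, share in shares.items():
    print(f"{name:9s} {share:5.1f}%")
print(f"weekly average: {weekly_average:.1f}M events")
```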
MC Production Geographic Distribution
Events Last Year (2010/02/14 - 2011/02/14):
Europe      1925M   75.4%
N. America   574M   22.5%
Asia          29M    1.1%
S. America    24M    0.9%
MC Production Results
Looking back at the last 5.5 years
Averaging 19.2M events per week and totaling 5.45B events
[Plot labels: cumulative since September 2005; 2005/09/05 - 2011/02/14]
MC Production Results
Looking back at the last 5.5 years by production segment
5.5 year averages per week (2005/09/05 - 2011/02/14)
Non-grid: 8.0M, OSG: 4.8M, Samgrid: 5.3M, LCG: 1.1M
MC Production Results
Looking back at the last 5.5 years by production segment
[Plot label: cumulative since September, 2005]
5.5 year totals (2005/09/05 - 2011/02/14)
Non-grid: 2.26B, OSG: 1.37B, Samgrid: 1.51B, LCG: 306M
Production Last 5.5 Years By Segment: Non-grid 41.5%, OSG 25.2%, Samgrid 27.7%, LCG 5.6%
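As a consistency check on the 5.5 year figures, the segment totals on this slide can be summed and divided by the number of weeks in the reporting period. The calculation below uses only the dates and totals quoted on the slide; because the totals are rounded, results are approximate.

```python
from datetime import date

# Reporting period quoted on the slide.
start, end = date(2005, 9, 5), date(2011, 2, 14)
weeks = (end - start).days / 7  # 1988 days, i.e. 284 weeks

# 5.5 year totals in billions of events, from the slide.
totals_billions = {"Non-grid": 2.26, "OSG": 1.37, "Samgrid": 1.51, "LCG": 0.306}

grand_total = sum(totals_billions.values())            # about 5.45 billion events
weekly_average_millions = 1000 * grand_total / weeks   # about 19.2 million per week

# Share of each production segment in percent.
shares = {name: 100 * value / grand_total for name, value in totals_billions.items()}

print(f"weeks in period: {weeks:.0f}")
print(f"total: {grand_total:.2f}B events, {weekly_average_millions:.1f}M per week")
```

The recomputed weekly average matches the sum of the per-segment weekly averages (8.0M + 4.8M + 5.3M + 1.1M = 19.2M).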
Production Results Last 7 Years
DZero MC Production in Millions of Events per year ending 12/26

Year   Total    Non-Grid   SAMGrid   OSG     LCG
2010   2388.5   1011.2     614.8     539.2   223.3
2009   1122.6    540.3     217.9     364.2     0.3
2008    794.8    315.6     213.6     259.7     5.8
2007    398.2    109.1     158.1      96.5    34.4
2006    348.0    144.4     195.5       0.5     7.6
2005     98.1     68.6      29.5       0.0     0.0
2004     42.4     41.8       0.6       0.0     0.0

[Chart: DZero MC Production in Millions of Events by year, 2004-2010; series: Non-Grid, SAMGrid, OSG, LCG]
Production Results Last 7 Years
DZero MC Production in Terabytes of Data per year ending 12/26

Year   Total   Non-Grid   SAMGrid   OSG    LCG
2010   221.0   83.3       61.8      53.7   22.3
2009    95.3   42.7       19.8      32.8    0.0
2008    67.8   26.9       18.4      22.0    0.5
2007    31.6    7.3       13.2       8.2    2.9
2006    23.0    9.4       13.1       0.0    0.5
2005     6.0    4.1        1.9       0.0    0.0
2004     1.9    1.9        0.0       0.0    0.0

[Chart: DZero MC Production in Terabytes of Data by year, 2004-2010; series: Non-Grid, SAMGrid, OSG, LCG]
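Two quantities that the yearly totals imply but the slides do not state are the year-over-year growth in events and the average data volume per event. The sketch below derives both from the "Total" columns of the two tables; the per-event size assumes 1 TB = 10^12 bytes, which is an assumption and not something the slides specify.

```python
# Yearly totals from the two tables (years ending 12/26).
events_millions = {2004: 42.4, 2005: 98.1, 2006: 348.0, 2007: 398.2,
                   2008: 794.8, 2009: 1122.6, 2010: 2388.5}
data_terabytes = {2004: 1.9, 2005: 6.0, 2006: 23.0, 2007: 31.6,
                  2008: 67.8, 2009: 95.3, 2010: 221.0}

years = sorted(events_millions)

# Ratio of each year's event count to the previous year's.
growth = {year: events_millions[year] / events_millions[year - 1] for year in years[1:]}

# Average size per event in kilobytes, assuming 1 TB = 1e12 bytes and 1 kB = 1e3 bytes.
kb_per_event = {
    year: data_terabytes[year] * 1e12 / (events_millions[year] * 1e6) / 1e3
    for year in years
}

for year in years[1:]:
    print(f"{year}: x{growth[year]:.2f} events vs previous year, "
          f"{kb_per_event[year]:.1f} kB/event")
```

On these numbers, event production roughly doubled from 2009 to 2010, and the average stored size per event in 2010 comes out at a little over 90 kB.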
OU DZero MC Production
2005/09/05 - 2011/02/14
OUHEP produced 306 M events and 28.4 TB data
Last year OUHEP produced 139 M events and 14.0 TB data
[Plot labels: cumulative since Sept. 2005; 2010/02/14 - 2011/02/14]
LU DZero MC Production
2005/09/05 - 2011/02/14
LUHEP produced 15.5 M events and 1.36 TB data
Last year LUHEP produced 4.6 M events and 450 GB data
[Plot labels: cumulative since Sept. 2005; 2010/02/14 - 2011/02/14]
LUHEP Computing
● 2 grid-enabled clusters, both producing DØ MC
● Old Samgrid cluster - 12 job slots
● New OSG cluster - 12 job slots with small associated SE used as DØ cache
Condor Q's at LUHEP
[Plots: Condor queues for the SAMGrid and OSG clusters, last year]
Summary
● DZero's early deployment of grid technology and automation has dramatically increased MC production
– First deployment of the SAM distributed data handling system
– Early SAMGrid deployment
– Use of OSG and LCG resources through interoperability with SAMGrid
– First opportunistic usage of OSG Storage Elements
– Automated MC production system
● Anticipate adequate MC through the last analysis