SAM-Grid Status
http://d0db.fnal.gov/sam
Core SAM development
SAM-Grid architecture
Progress
Future work
Core SAM development
SAM is a production system
•300 active users
•60,000 file replicas
•5,000 files/day cache turnover (1 TB)
A fine-tuning example: The Friday afternoon opportunity
Many users submit several projects for the weekend
Station has a project limit, O(Ncpus)
Queue projects, but then how do we keep the required data in cache?
Parallelisation & re-education: N processes per project, not N projects with 1 process each
Physicists always cheat…SAM helps
Multi-Process projects
Project Manager
Together, processes see each file once.
Process is simple:
Asks:
“Give me a file”
Responds to:
“Here's the path”
“Hang on”
“None left”
[Diagram labels: Processes, SAM]
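The request/response exchange above can be sketched in a few lines of Python. This is an illustration only, not SAM's implementation: the class `ProjectManager`, its methods `give_me_a_file` and `stage`, and the helper `run_project` are hypothetical names, and threads stand in for the consumer processes.

```python
import threading
import time
from collections import deque


class ProjectManager:
    """Hands each file of a project to exactly one consumer (hypothetical sketch)."""

    def __init__(self, files, cached=()):
        self._pending = deque(files)   # files not yet handed to any process
        self._cached = set(cached)     # files currently available in the cache
        self._lock = threading.Lock()

    def stage(self, path):
        """Mark a file as delivered to the cache (stands in for SAM data delivery)."""
        with self._lock:
            self._cached.add(path)

    def give_me_a_file(self):
        """Answer one request: ('path', p), ('hang on', None) or ('none left', None)."""
        with self._lock:
            if not self._pending:
                return ("none left", None)
            for path in self._pending:
                if path in self._cached:
                    self._pending.remove(path)  # safe: we return immediately
                    return ("path", path)
            return ("hang on", None)


def worker(manager, seen, poll=0.01):
    """A simple consumer: keep asking until the manager says 'none left'."""
    while True:
        reply, path = manager.give_me_a_file()
        if reply == "none left":
            return
        if reply == "hang on":
            time.sleep(poll)           # data not in cache yet, ask again later
            continue
        seen.append(path)              # "process" the file


def run_project(files, n_processes):
    """Run N consumers against one project; only the first file starts in cache."""
    manager = ProjectManager(files, cached=files[:1])
    seen = []
    threads = [threading.Thread(target=worker, args=(manager, seen))
               for _ in range(n_processes)]
    for t in threads:
        t.start()
    for path in files[1:]:
        manager.stage(path)            # files arrive in cache while consumers run
    for t in threads:
        t.join()
    return seen
```

Because a single manager owns the pending list, N consumers share one pass over the dataset instead of N projects each pulling the same files through the cache.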
[Diagram: SAM-Grid architecture in three layers: Job Definition and Management; Monitoring and Information; Data Handling. Labels: Job Client, Request Broker, Job Scheduler, Compute Element Resource, Site Gatekeeper, Batch System, Job Status Updates, Logging and Bookkeeping, Info Processor and Converter, Resource Info, Replica Catalog, DH Resource Management, Data Delivery and Caching, Condor MMS, Condor-G, GRAM, Grid sensors, Grid RC, MDS-2, Condor Class Ads, GSI, AAA, SAM. Legend: Principal Component, Service, Implementation or Library, Information.]
SAM-Grid Architecture
• Job Definition and Management: based on the Match Making Service of Condor® through collaboration with the University of Wisconsin CS Group
• Monitoring and Information Services: provides a view of the status and history of the system, as well as the information relevant for job and data management
• Data Handling: the existing SAM system, developed at Fermilab to accommodate high volume data management, plays a principal role in providing Data Handling services to the Job Management infrastructure
[Diagram: Job Definition and Management layer. Labels: Job Client, Request Broker, Job Scheduler, Compute Element Resource, Site Gatekeeper, Batch System, Job Status Updates, Condor MMS, Condor-G, GRAM, Grid sensors.]
Job Management
• Globus GRAM for inter-operability
• CondorG for remote submission
• Condor MMS for resource brokerage
• Condor is Resource Broker
Collaboration with Condor group
• Condor members at weekly SAM-Grid meetings
• CVS branch of v6_3_2 with our requested functionality
• Ability to choose globus-scheduler
• External function calls allowed in MMS – can query SAM Db
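The idea of calling an external function during matchmaking can be illustrated as follows. This is plain Python, not ClassAd syntax or any Condor API: `REPLICAS`, `files_at_site` and `match` are hypothetical names, and the in-memory dictionary stands in for a query against the SAM database.

```python
# Hypothetical replica catalogue: file name -> sites holding a copy.
# In the real system this information would come from the SAM database.
REPLICAS = {
    "f1": {"fnal", "imperial"},
    "f2": {"imperial"},
    "f3": {"imperial", "wuppertal"},
}


def files_at_site(site, dataset, catalogue=REPLICAS):
    """'External function': count the dataset files already replicated at a site."""
    return sum(1 for name in dataset if site in catalogue.get(name, ()))


def match(job, resources):
    """Return the resource meeting the job's requirements with the highest rank."""
    candidates = [r for r in resources if job["requirements"](r)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


dataset = ["f1", "f2", "f3"]
job = {
    # Requirements filter resources; rank orders the ones that pass.
    "requirements": lambda r: r["free_cpus"] > 0,
    "rank": lambda r: files_at_site(r["site"], dataset),
}
resources = [
    {"site": "fnal", "free_cpus": 0},
    {"site": "imperial", "free_cpus": 4},
    {"site": "wuppertal", "free_cpus": 2},
]
best = match(job, resources)
```

Here the rank depends on data placement rather than only on static resource attributes, which is what the external call makes possible.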
[Diagram: Monitoring and Information layer. Labels: Grid RC, Logging and Bookkeeping, Info Processor and Converter, Replica Catalog, Resource Info, GSI, AAA, MDS-2, Condor Class Ads.]
Monitoring & Information
Package of information providers to interrogate:
•SAM Station: project progress, disk caches
•Replica Catalogue: file location, size
•Batch Systems: free CPUs
•Resources: OS, code releases present, memory, disk space,…
[Diagram: Data Handling layer. Labels: SAM, DH Resource Management, Data Delivery and Caching.]
Data Handling
•Existing SAM system
•Added GridFTP as a transfer protocol (also kerb-rcp, bbftp available)
•Use server certificates issued by FNAL Kerberized CA
•Delegation of user proxy not (yet) done (accounting, security)
•Server runs as unprivileged user
•Report bug, receive patch. Apply. Re-build on Linux and Ultrix, repackage,… i.e. very poor support.
•Globus bundles packaged as upd products
•During testing, rediscovered a globus-url-copy bug STILL in the downloadable Globus release! Repeat the above procedure? No, take the EDG special globus-url-copy binary.
Future Work
•nth order brokering
•0th order: Submit to the site where most data is replicated. Trivial with Condor additions.
•1st order: Sense grid connectivity using WP7 tools as a plugin to Condor
•Inter-site parallelisation: Split datasets, move jobs to data
•Dynamic station installation
•To use non-dedicated resources and clean up afterwards
•upd has almost no dependencies on native packages
•Auto-tailoring forced by CDF makes this possible
•Further MC production/SAM integration.
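The "split datasets, move jobs to data" item can be sketched as a partition of a dataset by replica location. The greedy balancing rule and the name `split_by_location` are assumptions made for illustration; the slides do not say how the split would be done.

```python
def split_by_location(replicas):
    """Assign each file to one site that holds a replica of it.

    replicas: dict mapping file name -> set of sites holding a copy.
    Returns a dict mapping site -> list of files for that site's sub-job.
    Greedy rule (an assumption): place the least-replicated files first,
    each on the candidate site with the fewest files assigned so far.
    """
    assignment = {}
    for name in sorted(replicas, key=lambda f: (len(replicas[f]), f)):
        sites = replicas[name]
        if not sites:
            raise ValueError(f"no replica of {name} at any site")
        target = min(sorted(sites), key=lambda s: len(assignment.get(s, [])))
        assignment.setdefault(target, []).append(name)
    return assignment


example = {
    "a": {"x"},
    "b": {"x", "y"},
    "c": {"x", "y"},
    "d": {"y"},
}
sub_jobs = split_by_location(example)
```

Each sub-job then runs at a site that already holds all of its input files, so no file crosses the wide-area network.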
0th order brokering
File Count: 99
Average File Size: 674153
Total File Size: 66741199
Total Event Count: 214914
4 known domains and 3 stations
At wuppertal:
  4719Mb (7%) from fnal.gov at 0.5Mb/s
  48539Mb (73%) from ic.ac.uk at 2.0Mb/s
  13483Mb (20%) from pnfs at 0.5Mb/s
  Transfer time = 18.0hrs, plus 2 tape mounts
At imperial-test:
  4719Mb (7%) from fnal.gov at 0.5Mb/s
  48539Mb (73%) from ic.ac.uk at 10.0Mb/s
  13483Mb (20%) from pnfs at 0.5Mb/s
  Transfer time = 11.8hrs, plus 2 tape mounts
At central-analysis:
  51909Mb (78%) from pnfs at 10.0Mb/s
  14831Mb (22%) from fnal.gov at 100.0Mb/s
  Transfer time = 1.5hrs, plus 2 tape mounts
…but no free cpu!
enstore tape
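A rough way to compare stations from the figures above is to sum size divided by rate over the sources, assuming transfers run one after another. This is a sketch, not the estimator behind the slide: the simple sum lands close to, but not exactly on, the quoted transfer times, so the original calculation presumably differs in detail. The names `transfer_hours` and `SITES` are made up for the example.

```python
def transfer_hours(sources):
    """Estimate transfer time in hours for (size_Mb, rate_Mb_per_s) pairs.

    Assumes sequential transfers and ignores tape mount latency.
    """
    return sum(size / rate for size, rate in sources) / 3600.0


# Volumes and rates copied from the brokering output above.
SITES = {
    "wuppertal": [(4719, 0.5), (48539, 2.0), (13483, 0.5)],
    "imperial-test": [(4719, 0.5), (48539, 10.0), (13483, 0.5)],
    "central-analysis": [(51909, 10.0), (14831, 100.0)],
}

estimates = {site: transfer_hours(src) for site, src in SITES.items()}
best = min(estimates, key=estimates.get)
```

The ranking only covers data movement; as the slide notes, the station with the shortest transfer time may still have no free CPU, so a broker has to combine this estimate with batch-system information.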
Conclusions
• SAM production system
•Heavy and increasing D0 use. Fine tuning.
•CDF deployment – no show stoppers
•SAM-Grid taking shape
• Monitoring & Information prototype available
• GridFTP pre-deployment tests. System failed me.
• Remote job submission works. CondorG enhancements allow site matching in MMS by query of SAM replica catalogue.
•Outreach: SAM offers a unique, working example of a PP grid
• already some interest in PP data access patterns.
• expect more interest in real data handling & optimisation.
The wise learn from other people's mistakes