SAM-Grid Status http://d0db.fnal.gov/sa Core SAM development SAM-Grid architecture Progress Future work

Upload: rebecca-hopkins

Post on 28-Mar-2015


Page 1

SAM-Grid Status

http://d0db.fnal.gov/sam

Core SAM development

SAM-Grid architecture

Progress

Future work

Page 2

Core SAM development

http://d0db.fnal.gov/sam

SAM is a production system

300 active users

60,000 file replicas

5,000 files/day cache turnover (1TB)

A fine-tuning example: The Friday afternoon opportunity

Many users submit several projects for the weekend

Station has a project limit, O(N_cpu)

Queue projects, but then how to keep the required data in cache?

Parallelisation & re-education: N processes per project, not N projects with 1 process each

Page 3

Physicists always cheat…SAM helps

Page 4

Multi-Process projects

Project Manager

Together, processes see each file once.

Process is simple:

Asks:

“Give me a file”

Responds to:

“Here's the path”

“Hang on”

“None left”

Processes
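The request/response exchange on this slide can be sketched as follows. This is a minimal illustration, not SAM's actual interface: the class `ProjectManager`, its methods, and the use of threads in place of separate processes are all assumptions made for the example. It shows N workers pulling from one project so that, together, they see each file exactly once.

```python
import threading
from collections import deque


class ProjectManager:
    """Hypothetical stand-in for the project manager on the slide."""

    def __init__(self, staged, pending=()):
        self._staged = deque(staged)    # files already in the cache
        self._pending = deque(pending)  # files still being delivered
        self._lock = threading.Lock()

    def deliver(self, n=1):
        """Simulate up to n pending files arriving in the cache."""
        with self._lock:
            for _ in range(min(n, len(self._pending))):
                self._staged.append(self._pending.popleft())

    def give_me_a_file(self):
        """Answer a "Give me a file" request with one of three replies."""
        with self._lock:
            if self._staged:
                return ("here's the path", self._staged.popleft())
            if self._pending:
                return ("hang on", None)
            return ("none left", None)


def run_process(manager, seen):
    """A simple consumer: ask, then react to the three possible replies."""
    while True:
        reply, path = manager.give_me_a_file()
        if reply == "here's the path":
            seen.append(path)      # "analyse" the file
        elif reply == "hang on":
            manager.deliver()      # in this sketch, waiting triggers delivery
        else:                      # "none left"
            return


def run_project(n_processes, staged, pending=()):
    """Run one project with n_processes consumers (threads here)."""
    manager = ProjectManager(staged, pending)
    seen_per_process = [[] for _ in range(n_processes)]
    threads = [
        threading.Thread(target=run_process, args=(manager, seen))
        for seen in seen_per_process
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_per_process
```

Because every hand-out happens under one lock, no file can be given to two consumers, which is the property the slide summarises as "Together, processes see each file once."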

Page 5

[Diagram: SAM-Grid architecture in three layers. Job Definition and Management: Job Client, Request Broker, Job Scheduler, Compute Element Resource, Job Status Updates, Batch System, Site Gatekeeper; labels: Condor MMS, Condor-G, GRAM, Grid sensors. Monitoring and Information: Logging and Bookkeeping, Info Processor and Converter, Replica Catalog, Resource Info, AAA; labels: Grid RC, MDS-2, Condor Class Ads, GSI. Data Handling: DH Resource Management, Data Delivery and Caching; label: SAM. Legend: Principal Component, Service, Implementation or Library, Information.]

SAM-Grid Architecture

• Job Definition and Management: Based on the Match Making Service of Condor® through collaboration with the University of Wisconsin CS Group

• Monitoring and Information Services: Provides a view of the status and history of the system, as well as the information relevant for job and data management

• Data Handling: The existing SAM system, developed at Fermilab to accommodate high volume data management, plays a principal role in providing Data Handling services to the Job Management infrastructure

Page 6

[Diagram: Job Definition and Management layer. Components: Job Client, Request Broker, Job Scheduler, Compute Element Resource, Job Status Updates, Batch System, Site Gatekeeper; labels: Condor MMS, Condor-G, GRAM, Grid sensors.]

Job Management

• Globus GRAM for inter-operability

• CondorG for remote submission

• Condor MMS for resource brokerage

• Condor is Resource Broker

Collaboration with Condor group

• Condor members at weekly SAM-Grid meetings

• CVS branch of v6_3_2 with our requested functionality

• Ability to choose globus-scheduler

• External function calls allowed in MMS – can query SAM Db

Page 7

[Diagram: Monitoring and Information layer. Components: Logging and Bookkeeping, Info Processor and Converter, Replica Catalog, Resource Info, AAA; labels: Grid RC, MDS-2, Condor Class Ads, GSI.]

Monitoring & Information

Package of information providers to interrogate:

•SAM Station: project progress, disk caches

•Replica Catalogue: file location, size

•Batch Systems: free CPUs

•Resources: OS, code releases present, memory, disk space,…

Page 8

Monitoring & Information

Page 9

Monitoring & Information

Page 10

Monitoring & Information

Page 11

Monitoring & Information

Page 12

[Diagram: Data Handling layer. Components: DH Resource Management, Data Delivery and Caching; label: SAM.]

Data Handling

•Existing SAM system

•Added gridftp as a transfer protocol (also kerb-rcp, bbftp available)

•Use server certificates issued by FNAL Kerberized CA

•Delegation of user proxy not (yet) done (accounting, security)

•Server runs as unprivileged user

•Report bug, receive patch. Apply. Re-build on Linux and Ultrix, repackage,… i.e. very poor support.

•Globus bundles packaged as upd products

•During testing - re-discovered globus-url-copy bug STILL in downloadable globus release! Repeat above procedure? No, take EDG special globus-url-copy binary.

Page 13

Future Work

•nth order brokering

•0th order: Submit to site where most data replicated. Trivial with condor additions.

•1st order: Sense grid connectivity using WP7 tools as plugin to condor

•Inter-site parallelisation: Split datasets, move jobs to data

•Dynamic station installation

•To use non-dedicated resources and clean-up afterwards

•upd has almost no dependencies on native packages

•Auto-tailoring forced by CDF makes this possible

•Further MC production/SAM integration.
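The 0th-order brokering rule above ("submit to site where most data replicated") can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `zeroth_order_site` and the dictionary shapes for the replica catalogue and file sizes are hypothetical, and the slides do not say how the real Condor-based match is written.

```python
from collections import defaultdict


def zeroth_order_site(dataset, replicas, sizes):
    """Pick the site that already holds the most data for a dataset.

    dataset  : iterable of file names (hypothetical input format)
    replicas : dict mapping file name -> set of sites holding a replica
    sizes    : dict mapping file name -> file size
    Returns (best_site, per_site_totals); best_site is None if no
    file of the dataset is replicated anywhere.
    """
    held = defaultdict(int)
    for name in dataset:
        for site in replicas.get(name, ()):
            held[site] += sizes[name]
    if not held:
        return None, {}
    # sorted() makes the choice deterministic when two sites tie.
    best = max(sorted(held), key=held.get)
    return best, dict(held)
```

For example, with three files of sizes 10, 20 and 50, where the largest is replicated only at a site called `imperial`, the function returns `imperial` even though another site holds more files by count.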

Page 14

0th order brokering

File Count: 99
Average File Size: 674153
Total File Size: 66741199
Total Event Count: 214914

4 known domains and 3 stations

At wuppertal:
     4719 Mb ( 7%) from fnal.gov at   0.5 Mb/s
    48539 Mb (73%) from ic.ac.uk at   2.0 Mb/s
    13483 Mb (20%) from pnfs     at   0.5 Mb/s
    Transfer time = 18.0 hrs, plus 2 tape mounts.

At imperial-test:
     4719 Mb ( 7%) from fnal.gov at   0.5 Mb/s
    48539 Mb (73%) from ic.ac.uk at  10.0 Mb/s
    13483 Mb (20%) from pnfs     at   0.5 Mb/s
    Transfer time = 11.8 hrs, plus 2 tape mounts.

At central-analysis:
    51909 Mb (78%) from pnfs     at  10.0 Mb/s
    14831 Mb (22%) from fnal.gov at 100.0 Mb/s
    Transfer time = 1.5 hrs, plus 2 tape mounts.
    …but no free cpu!

enstore tape
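The comparison on this slide can be approximated with a short calculation. The sketch below assumes the transfers run one after another, so the estimate is the sum of size divided by rate; that is an assumption, not something the slide states, and it gives roughly 16.9, 11.5 and 1.5 hours, close to but not identical with the quoted 18.0, 11.8 and 1.5. The function names and the free-CPU counts for wuppertal and imperial-test are hypothetical; only "no free cpu" at central-analysis comes from the slide.

```python
# Per-station transfer plans copied from the slide:
# (source, size in Mb, rate in Mb/s).
PLANS = {
    "wuppertal": [
        ("fnal.gov", 4719, 0.5),
        ("ic.ac.uk", 48539, 2.0),
        ("pnfs", 13483, 0.5),
    ],
    "imperial-test": [
        ("fnal.gov", 4719, 0.5),
        ("ic.ac.uk", 48539, 10.0),
        ("pnfs", 13483, 0.5),
    ],
    "central-analysis": [
        ("pnfs", 51909, 10.0),
        ("fnal.gov", 14831, 100.0),
    ],
}

# Assumed values: the slide only says central-analysis has no free CPU.
FREE_CPUS = {"wuppertal": 1, "imperial-test": 1, "central-analysis": 0}


def transfer_hours(plan):
    """Estimate hours to fetch everything, assuming sequential transfers."""
    seconds = sum(size / rate for _source, size, rate in plan)
    return seconds / 3600.0


def pick_station(plans, free_cpus):
    """Return the station with a free CPU and the shortest estimate."""
    candidates = [s for s in plans if free_cpus.get(s, 0) > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda s: transfer_hours(plans[s]))


if __name__ == "__main__":
    for station, plan in PLANS.items():
        print(f"{station}: {transfer_hours(plan):.1f} hrs")
    print("chosen:", pick_station(PLANS, FREE_CPUS))
```

The point of the slide survives in the sketch: the station with the fastest data access is not chosen when it has no free CPU, so the broker has to weigh data location against compute availability.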

Page 15

Conclusions

• SAM production system

• Heavy and increasing D0 use. Fine tuning.

• CDF deployment: no show stoppers

• SAM-Grid taking shape

• Monitoring & Information prototype available

• GridFTP pre-deployment tests. System failed me.

• Remote job submission works. Condor-G enhancements allow site matching in MMS by query of the SAM replica catalogue.

• Outreach: SAM offers a unique, working example of a PP grid

• already some interest in PP data access patterns.

• expect more interest in real data handling & optimisation.

The wise learn from other people's mistakes