The Sam-Grid project Gabriele Garzoglio ODS, Computing Division, Fermilab PPDG, DOE SciDAC ACAT 2002, Moscow, Russia June 26, 2002

Post on 21-Dec-2015


The Sam-Grid project

Gabriele Garzoglio

ODS, Computing Division, Fermilab

PPDG, DOE SciDAC

ACAT 2002, Moscow, Russia

June 26, 2002

Page 2

Outline

• The SAM-Grid Project

• The SAM & JIM Architecture

– SAM: the Data Handling System

– JIM: the Job Management Infrastructure

– JIM: the Information and Monitoring System

• The Current Grid Infrastructure

• Milestones of the Deliverables

• Conclusions

Page 3

The scope of the project

• Enable fully distributed computing for DZero and CDF by enhancing the distributed data handling system of the experiments (SAM), incorporating standard Grid tools and protocols, and developing new solutions for Grid computing, in a secure and accountable environment.

• The SAM ‘grid-ification’ is funded by PPDG and GridPP: we are working with both computer scientists, like the Condor Team, and physicists, like those at UTA and Imperial College.

• We are collaborating with other groups working on Grid technologies as well (EDG and DataTAG among them).

• There is warm cooperation between the Fermilab CD Departments and the Project (e.g. ISD for the SAM/DCache integration).

• We promote interoperability and code reuse (via modularization and standardization).

• CDF and DZero are running now! Short-term deliverables are due at the end of the Summer; long-term ones in 2 years.

Page 4

Why a Job and Data Handling infrastructure?

• Increases the productivity of physics results

• A high level of transparency to the user: maximize time spent by the physicist doing physics

• Enable worldwide analysis of the data

• Efficient utilization of the resources: disks, mass storage systems, processing nodes, network…

• Automatic bookkeeping: reproducibility + accountability

• Extensibility to new standardized services and protocols via modularization and “plug-in” mechanisms

Page 5

Outline

• The SAM-Grid Project

• The SAM & JIM Architecture

– SAM: the Data Handling System

– JIM: the Job Management Infrastructure

– JIM: the Information and Monitoring System

• The Current Grid Infrastructure

• Milestones of the Deliverables

• Conclusions

Page 6

High Level Components

• Information and Monitoring

• Data Handling

• Job Management

Page 7

The Data Handling: SAM

[Diagram: the three high-level components (Information and Monitoring, Data Handling, Job Management), with Data Handling expanded into DH Resource Management and Data Delivery and Caching, and SAM shown beneath them. Legend: Principal Component, Service, Implementation or Library, Information.]

Page 8

History

• SAM is Sequential data Access via Meta-data

• Joint project between D0 and Computing Division, started in 1997 to meet the Run II data handling needs.

• SAM is integrated into DZero at all levels.

• SAM is in commissioning phase for CDF.

• http://d0db.fnal.gov/sam

• http://runIIcomputing.fnal.gov

Page 9

SAM as a Distributed System

[Diagram: Database Server(s) (Central Database), Name Server, Global Resource Manager(s), Log server, Station 1 … Station n Servers, and Mass Storage System(s). Components are marked as Shared Globally, Local To Site, or Shared Locally. Arrows indicate control and data flow.]

• A Station is a collection of resources controlled by the SAM system.

• SAM services can be accessed to monitor the status of the system.

• The central Database Server has proven to be robust and reliable.

Page 10

Components of a SAM Station

• SAM is a distributed data movement and management service: data replication is achieved by the use of disk caches during file routing.

• SAM is a fully functional meta-data catalog.

[Diagram: a SAM Station, with Station & Cache Manager, File Storage Server, File Stager(s), Project Managers, Producers/Consumers, eworkers, File Storage Clients, Cache Disk and Temp Disk, connected on either side to an MSS or other Station. Arrows distinguish data flow from control.]

Page 11

Accessibility of the Fabric via SAM Services

[Diagram: MSS 1 and MSS 2; Local Station 1 (Cache 1, Cache 2), Local Station 2 (Cache 1) and a Remote Station (Cache 1, Cache 2), linked through SAM services.]

• A station can access a remote resource via the services offered by other connected stations.

• Service connectivity does not in general correspond to network connectivity.

• Requests are routed from the originator to the destination.

• File caching during routing leads to file replication.

More in Igor Terekhov’s talk: “Meta-Computing at DØ”
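The routing behaviour described on this slide can be illustrated with a small sketch. It is not SAM code: the `Station` class, its `fetch` method and the file name are hypothetical, and a mass storage system is modelled simply as one more node that already holds the file. The only point shown is that every station a file passes through keeps a copy in its cache, so routing produces replicas as a side effect.

```python
from dataclasses import dataclass, field


@dataclass
class Station:
    """Hypothetical stand-in for a station: a name, a disk cache, and neighbours."""
    name: str
    cache: set = field(default_factory=set)
    neighbours: list = field(default_factory=list)

    def fetch(self, filename, visited=None):
        """Return the list of nodes the file travelled through, or None if not found."""
        visited = visited if visited is not None else set()
        visited.add(self.name)
        if filename in self.cache:
            return [self.name]
        for other in self.neighbours:
            if other.name in visited:
                continue  # avoid routing loops
            path = other.fetch(filename, visited)
            if path is not None:
                self.cache.add(filename)  # caching during routing replicates the file
                return path + [self.name]
        return None


# Illustrative topology: the local station only knows an intermediate station,
# which in turn knows the node that holds the file.
mss = Station("mss_1", cache={"raw_001.dat"})
middle = Station("station_2", neighbours=[mss])
local = Station("station_1", neighbours=[middle])

route = local.fetch("raw_001.dat")
```

After the call, `route` lists the nodes from the source to the requester, and both `station_2` and `station_1` hold a replica of the file.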

Page 12

Current Developments of SAM

• Site Autonomy: the goal is enabling site installations of SAM and JIM to work even if disconnected from the network. The distribution of the Replica and Meta-data Catalogs is a prerequisite for this to happen.

• Opportunistic deployment: to let SAM and JIM operate at full efficiency in a dynamic environment like the Grid, automatic deployment of stations at resources that are momentarily available is an interesting path to investigate.

Page 13

The Job Management

[Diagram: the Job Management component, shown next to Data Handling (DH Resource Management, Data Delivery and Caching, SAM) and Information and Monitoring. Job Management elements: JH Client, Request Broker, Job Scheduler, Site Gatekeeper, Compute Element Resource, Batch System, Grid sensors and (All) Job Status Updates, together with Condor-G, Condor MMS and GRAM. Legend: Principal Component, Service, Implementation or Library, Information.]

Page 14

The Job Description Language

• User interface: the Job Description Language must be expressive enough to fully characterize the structure of the job (Monte Carlo and Analysis).

• We are collaborating with the University of Texas at Arlington to define the structure of a DZero (CDF) job.

[Diagram: Job Management components, as on Page 13]

Page 15

The Request Broker

• The Brokering Service is implemented using the Condor Match Making Service.

• The idea is to use a stable technology in a new way.

• Thanks to the collaboration with the Condor Team under PPDG, 2 features have been added to make this possible:

– Runtime selection of the remote execution site

– Execution of external code when negotiating the matches
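As an illustration of the two features above, here is a minimal sketch of two-sided matchmaking written in plain Python rather than in the real ClassAd language. Everything in it is an assumption made for the example: the attribute names, the pairing of sites with batch systems, and the `cached_fraction` function, which stands in for external code consulted during the match (for instance a query to the data handling system).

```python
def matches(job, resource):
    """Both sides must accept each other, as in matchmaking."""
    return job["requirements"](resource) and resource["requirements"](job)


def select_site(job, resources):
    """Pick, at run time, the acceptable resource that the job ranks highest."""
    candidates = [r for r in resources if matches(job, r)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


def cached_fraction(site, dataset):
    """Hypothetical external code: fraction of the dataset already cached at a site."""
    table = {("fnal", "dataset_a"): 0.9, ("uta", "dataset_a"): 0.2}
    return table.get((site, dataset), 0.0)


# Illustrative job description: it accepts only some batch systems and
# prefers the site that already holds most of its input data.
job = {
    "owner": "physicist",
    "dataset": "dataset_a",
    "requirements": lambda r: r["batch_system"] in {"pbs", "condor"},
    "rank": lambda r: cached_fraction(r["site"], "dataset_a"),
}

# Illustrative resource descriptions (made-up values).
resources = [
    {"site": "fnal", "batch_system": "pbs", "requirements": lambda j: True},
    {"site": "uta", "batch_system": "condor", "requirements": lambda j: True},
    {"site": "ic", "batch_system": "lsf", "requirements": lambda j: True},
]

chosen = select_site(job, resources)
```

With these made-up values the job is matched to the `fnal` entry: the `ic` entry fails the job's requirements, and `fnal` ranks higher than `uta` because the external function reports more cached data there.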

[Diagram: Job Management components, as on Page 13]

Page 16

The Job Submission Service

• The job submission service relies on standard Condor technologies.

• It implements a high level of robustness to service failures and loss of connectivity.

[Diagram: Job Management components, as on Page 13]

Page 17

The Job Submission Mechanism (I)

• Physical job dispatch is achieved via the GRAM protocol from the Globus Toolkit.

• When applicable, executables, configuration files, stdio and stderr are transported via GASS servers.

• Gatekeepers deployed at each site serve client requests for job submission.

[Diagram: Job Management components, as on Page 13]

Page 18

The Job Submission Mechanism (II)

• A Gatekeeper authenticates and authorizes the client via the Globus Security Infrastructure.

• After authentication and authorization (AA), the Gatekeeper spawns a Job Manager that submits the job to the local batch system, reports the status to the submission client (Condor-G), and cleans up after job termination.
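The sequence above (authorize the client, then hand the job to the local batch system on behalf of a local account) can be sketched as follows. This is a toy model, not Globus code: certificate verification is not modelled at all, the `Gatekeeper` class and `fake_batch` function are hypothetical, and the map-file line format (a quoted certificate subject followed by a local account name) is an assumption modelled on the usual grid map file convention.

```python
import shlex


def parse_map_file(text):
    """Parse lines of the assumed form: "<certificate subject>" <local account>."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        subject, account = shlex.split(line)
        mapping[subject] = account
    return mapping


class Gatekeeper:
    """Toy gatekeeper: authorizes a subject, then submits to a batch system."""

    def __init__(self, mapping, batch_submit):
        self.mapping = mapping
        self.batch_submit = batch_submit  # callable standing in for the local batch system

    def handle(self, subject, job):
        account = self.mapping.get(subject)
        if account is None:
            return {"status": "rejected", "reason": "subject not authorized"}
        # In the real system a separate Job Manager would be spawned here.
        local_id = self.batch_submit(account, job)
        return {"status": "submitted", "account": account, "local_id": local_id}


MAP_TEXT = '''
# illustrative entry, not a real user
"/O=Example/CN=Jane Physicist" d0user
'''

submitted = []


def fake_batch(account, job):
    """Hypothetical batch system: records the job and returns a local job id."""
    submitted.append((account, job))
    return len(submitted)


gatekeeper = Gatekeeper(parse_map_file(MAP_TEXT), fake_batch)
accepted = gatekeeper.handle("/O=Example/CN=Jane Physicist", "mc_job")
denied = gatekeeper.handle("/O=Example/CN=Unknown", "mc_job")
```

Status reporting back to the submission client and clean-up after termination are left out to keep the sketch short.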

[Diagram: Job Management components, as on Page 13]

Page 19

The Fabric (I)

• Among the batch systems currently supported by the Gatekeeper are LSF, PBS, Condor and FBS.

• In our architecture, Grid Sensors are deployed at the compute elements as well as at the local submission nodes.

• The Sensors report static and small-size dynamic states to the Information and Monitoring System.

[Diagram: Job Management components, as on Page 13]

Page 20

The Fabric (II)

• What attributes best describe resources is still a research topic. The choice of such a schema has implications for the semantics of the JDL.

• We are collaborating with DataTAG and EDG to find a common Glue Schema in order to enable interoperability of EU and US Grids.

[Diagram: Job Management components, as on Page 13]

Page 21

Information Flow

[Diagram: information flow between User Interface, Parser (JDL, ClassAd), Condor Schedd, Condor Collector, Condor Negotiator, External Code (Cin, Cout), Condor Grid Manager (Condor-G), GRAM, Gatekeeper, Batch System, Grid Sensors and Compute Resource at the Execution Site, and Information and Monitoring.]

Page 22

Monitoring and Information: the glue

[Diagram: the full architecture. Data Handling (DH Resource Management, Data Delivery and Caching, SAM) and Job Management (components as on Page 13) are joined by Monitoring and Information, which comprises Logging and Bookkeeping, Info Processor and Converter, Replica Catalog, Resource Info and AAA, together with GSI, MDS-2, Condor Class Ads and Grid RC. Legend: Principal Component, Service, Implementation or Library, Information.]

Page 23

Status Monitor

• Resource and Information Service implementations:

– Meta Directory Service from the Globus Toolkit (LDAP protocol)

– Condor Components (ClassAds)

• MDS automatically discards old information and pulls the new information from information providers.

• Well suited for the run-time monitoring of the system.

[Diagram: Monitoring and Information components (Logging and Bookkeeping, Info Processor and Converter, Replica Catalog, Resource Info, AAA; GSI, MDS-2, Condor Class Ads, Grid RC), between Data Handling and Job Management]
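The "discard old information, pull new information" behaviour can be pictured as a cache with a time-to-live in front of information providers. The sketch below is a generic illustration of that idea and does not use or reproduce the MDS interfaces; the class name, the key `node_1/free_cpus` and the 30-second lifetime are all made up for the example.

```python
import time


class SoftStateDirectory:
    """Toy directory: entries expire after ttl seconds and are re-pulled from providers."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.providers = {}  # key -> callable returning fresh information
        self.entries = {}    # key -> (value, time the value was pulled)

    def register(self, key, provider):
        self.providers[key] = provider

    def lookup(self, key):
        now = self.clock()
        cached = self.entries.get(key)
        if cached is not None and now - cached[1] < self.ttl:
            return cached[0]
        value = self.providers[key]()  # stale or missing: pull from the provider
        self.entries[key] = (value, now)
        return value


# A fake clock makes the expiry visible without sleeping.
fake_now = [0.0]
calls = []


def free_cpus():
    """Hypothetical information provider whose answer changes on every pull."""
    calls.append(1)
    return 10 - len(calls)


directory = SoftStateDirectory(ttl=30, clock=lambda: fake_now[0])
directory.register("node_1/free_cpus", free_cpus)

first = directory.lookup("node_1/free_cpus")   # pulled from the provider
fake_now[0] = 10.0
second = directory.lookup("node_1/free_cpus")  # still fresh, served from the cache
fake_now[0] = 45.0
third = directory.lookup("node_1/free_cpus")   # expired, pulled again
```

Because readers never see data older than the lifetime, a scheme of this kind suits run-time monitoring better than long-term bookkeeping.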

Page 24

Logging and Bookkeeping

• SAM provides a UDP-based message logger. Persistency is implemented via a pluggable back-end module.

• SAM servers already use the logger, which results in a valuable debugging tool. We are going to extend the use of this service to JIM.

• Messages will be stored in XML format.

[Diagram: Monitoring and Information components, as on Page 23]
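A minimal sketch of such a logger, using only the Python standard library, is given below. It is not the SAM logger: the XML layout of a message, the `ListBackend` plug-in and the `LogServer` class are assumptions chosen to show the three ingredients named above (UDP transport, a replaceable persistency back end, XML-formatted messages).

```python
import socket
import xml.etree.ElementTree as ET


def encode(source, level, text):
    """Serialise one log record as a small XML document (hypothetical layout)."""
    record = ET.Element("message", {"source": source, "level": level})
    record.text = text
    return ET.tostring(record, encoding="unicode").encode("utf-8")


class ListBackend:
    """Simplest possible persistency plug-in: keep records in memory."""

    def __init__(self):
        self.records = []

    def store(self, payload):
        self.records.append(payload.decode("utf-8"))


class LogServer:
    """Receives datagrams and passes them to whatever back end was plugged in."""

    def __init__(self, backend, host="127.0.0.1", port=0):
        self.backend = backend
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind((host, port))  # port 0 lets the OS pick a free port
        self.sock.settimeout(2.0)
        self.address = self.sock.getsockname()

    def handle_one(self):
        payload, _sender = self.sock.recvfrom(65535)
        self.backend.store(payload)

    def close(self):
        self.sock.close()


def send(address, payload):
    """Fire-and-forget client side: one datagram per message."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.sendto(payload, address)


backend = ListBackend()
server = LogServer(backend)
send(server.address, encode("station_1", "info", "file delivered"))
server.handle_one()
server.close()

stored = ET.fromstring(backend.records[0])
```

Swapping `ListBackend` for a class that writes to a file or a database would change the persistency without touching the sender or the server loop, which is the point of the plug-in design.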

Page 25

The Replica Catalog

• The Replica Catalog is currently implemented with SAM.

• We plan to migrate to the Grid Replica Catalog, in order to allow distribution of the service and a set of standardized interfaces to external services.

[Diagram: Monitoring and Information components, as on Page 23]

Page 26

Information Conversion and Accessibility

• A translation service is responsible for converting between the 3 protocols used, when needed: LDAP, ClassAd, XML.

• We are evaluating web portal frameworks to enable access to the system from the internet.

[Diagram: Monitoring and Information components, as on Page 23]
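To make the idea of translating between the three representations concrete, the sketch below renders one flat set of attributes in an LDIF-style form (as an LDAP directory would export it), a ClassAd-like form and XML. It is only an illustration: the function names, the attribute names and the distinguished name are invented, and the output formats are simplified approximations rather than complete implementations of any of the three.

```python
import xml.etree.ElementTree as ET


def to_classad(attrs):
    """Render attributes in a ClassAd-like 'name = value' style."""
    parts = []
    for name, value in attrs.items():
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        parts.append(f"{name} = {rendered}")
    return "[ " + "; ".join(parts) + " ]"


def to_ldif(dn, attrs):
    """Render attributes as LDIF-style 'name: value' lines under a distinguished name."""
    return "\n".join([f"dn: {dn}"] + [f"{k}: {v}" for k, v in attrs.items()])


def to_xml(tag, attrs):
    """Render attributes as child elements of a single XML element."""
    element = ET.Element(tag)
    for name, value in attrs.items():
        ET.SubElement(element, name).text = str(value)
    return ET.tostring(element, encoding="unicode")


# Illustrative resource description (made-up values).
resource = {"site": "fnal", "free_cpus": 12}

as_classad = to_classad(resource)
as_ldif = to_ldif("cn=node_1", resource)
as_xml = to_xml("resource", resource)
```

Keeping one neutral in-memory form (here a dictionary) and writing one renderer per protocol avoids having to write a converter for every pair of formats.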

Page 27

Site AAA

• The Job Management Infrastructure and the Monitoring and Information System are built on top of standard grid tools and adopt the GSI security mechanisms.

• The full integration of the Data Handling System with GSI is work in progress…

• Open issue: the management of the AA map files

[Diagram: Monitoring and Information components, as on Page 23]

Page 28

Outline

• The SAM-Grid Project

• The SAM & JIM Architecture

– SAM: the Data Handling System

– JIM: the Job Management Infrastructure

– JIM: the Information and Monitoring System

• The Current Grid Infrastructure

• Milestones of the Deliverables

• Conclusions

Page 29

The Current Grid Infrastructure

[Diagram: the current test bed across FNAL, IC and UTA. Nodes (Node_1 to Node_4) run GRAM in front of Fork, PBS, Condor and LSF; Condor-G appears at several nodes. Further labels in the figure: pcBS, client, Info.]

Page 30

Outline

• The SAM-Grid Project

• The SAM & JIM Architecture

– SAM: the Data Handling System

– JIM: the Job Management Infrastructure

– JIM: the Information and Monitoring System

• The Current Grid Infrastructure

• Milestones of the Deliverables

• Conclusions

Page 31

The Organization: a Collaborative Effort

• We hold weekly meetings to coordinate efforts on the DZero/CDF SAM Grid Project.

• Participants are from UK institutions, NIKHEF, INFN and US institutions.

• We discuss deliverables, design, implementation.

• The real pressure comes from the experiments that are taking data now!

Page 32

The Short Term Project Goals

• Deployment of JIM to enable execution of unstructured Monte Carlo jobs with basic brokering (end of Summer)

• Status Monitoring of unstructured jobs (end of Summer)

• Basic System Monitoring (end of Summer)

• Execution of unstructured SAM analysis jobs with basic brokering (end of the year)

Page 33

The 2yr-Term Project Goals

• Reliable Execution of structured, locally distributed Monte Carlo and SAM analysis jobs with basic brokering.

• Scheduling criteria for data-intensive jobs, full Job Handling – Data Handling interaction.

• Fully Distributed Monitoring and Information Services for Structured Jobs and Data Handling.

Page 34

The Milestones Dependencies

[Diagram: milestone dependency chart along a time axis labelled Now, 6 Mo and 9-19 Mo. Boxes in the chart:]

• Documents: Job Def Doc, Tech. Rev. doc., UC doc, Arch. Doc, JDL

• Studies: Study JDLs, Use Cases, Condor, GMA, MDS, GSI, SAM

• GSI in SAM; Condor in SAM; Basic SAM Res Info Service; SAM Grid-ready

• MDS TestBed; CondorG TestBed; Toy Grid with JSS, basic Monitoring; Prototype Grid with RB, JSS, GMA-based MIS

• Execute User-routed MC Jobs

• Execute unstructured MC and SAM analysis jobs with basic brokering; Execute unstructured SAM analysis jobs

• Status Mon-ing of unstructured jobs; Basic System Mon-ing

• Reliable Execution of structured, locally distributed MC and SAM analysis jobs with basic brokering

• Scheduling criteria for data-intensive jobs, JH-DH interaction design

• Monitoring of structured jobs; DH Mon-ing; JH, MIS fully distributed

Page 35

Conclusions

• SAM is the Data Handling System of the DZero experiment and is in the commissioning phase for CDF.

• The SAM-Grid project has the goal of integrating SAM with standard grid technologies to enable fully distributed computing for DZero and CDF.

• The Brokering service of the Grid Architecture of the project is based on the Condor Match Making Service.

• We are funded by PPDG and GridPP and we collaborate with Grid groups in the US and EU to best tailor and develop the technologies for the experiments.

• We are deploying a test bed in US and EU to develop and test SAM and JIM.

• The experiments are running now! Closest delivery milestones at the end of the Summer and at the end of the year.

• http://www-d0.fnal.gov/computing/grid/