Instrumentation of the SAM-Grid

Gabriele Garzoglio

CSC 426

Research Proposal

Overview

Characteristics of the High Energy Physics Community

• The SAM-Grid: enabling fully distributed analysis job processing

• The Proposed Instrumentation

Characteristics of the work in High Energy Physics

• High Energy Physics studies the fundamental interactions of Nature.

• A few laboratories around the world each provide unique facilities (accelerators) to study particular aspects of the field: the collaborations are geographically distributed.

• Experiments become more challenging and expensive every decade: the collaborations are large groups of people.

• The phenomena studied are statistical in nature and the events of interest are very rare: a large amount of data is needed.

The Fermi National Accelerator Laboratory

The Nature of the Data

An example: the D0 Experiment

• Detector Data

– 1,000,000 Channels

– Event size 250KB

– Event rate ~50 Hz

– On-line Data Rate 12 MBps

– Est. 2 year totals (incl. processing and analysis):

• 1 x 10^9 events

• ~0.5 PB

• Monte Carlo Data (simulations)

– 5 remote processing centers

– Estimate ~300 TB in 2 years.
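
The detector figures above can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 MB = 1000 KB, 1 TB = 10^9 KB), which the slide does not state; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the D0 detector figures quoted above.
# Assumption: decimal units (1 MB = 1000 KB, 1 TB = 1e9 KB).
EVENT_SIZE_KB = 250   # event size from the slide
EVENT_RATE_HZ = 50    # approximate event rate from the slide
TOTAL_EVENTS = 1e9    # estimated 2-year event total from the slide

# On-line data rate: size per event times events per second.
rate_mb_per_s = EVENT_SIZE_KB * EVENT_RATE_HZ / 1000

# Raw detector volume for the 2-year event total.
raw_tb = TOTAL_EVENTS * EVENT_SIZE_KB / 1e9

print(f"on-line rate ~ {rate_mb_per_s:.1f} MB/s")
print(f"raw volume   ~ {raw_tb:.0f} TB")
```

Under these assumptions the rate comes out at 12.5 MB/s, in line with the quoted 12 MBps, and the raw events account for about 250 TB of the ~0.5 PB total, which the slide says also includes processing and analysis.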

The D0 Collaboration

• ~500 Physicists

• 72 institutions

• 18 Countries

How can all of them work together ?

Using Large Distributed System Middleware:

the Grid

Overview

Characteristics of the High Energy Physics Community

The SAM-Grid: enabling fully distributed analysis job processing

• The Proposed Instrumentation

The SAM-Grid Project

• Mission: enable fully distributed computing for DZero and CDF

• Strategy: enhance the distributed data handling system of the experiments (SAM), incorporating standard Grid tools and protocols, and developing new solutions for Grid computing (JIM)

• Funds: the Particle Physics Data Grid (US) and GridPP (UK)

• People: Computer scientists and Physicists from Fermilab and the collaborating Universities

• History: SAM from 1997, JIM from end of 2001

• Schedule: CDF and DZero are running now! A prototype is running, scheduled for production in Spring 03; long-term deliverables in 2 yrs.

The Logistics

[Figure: SAM-Grid logistics diagram. Labels: Job, User Interface, Submission Client, Job Management, Queuing System, Resource Selector, Match Making Service, Information Collector, Grid Sensors, Computing Element, Storage Element, Data Handling System, Execution Site #1 … Execution Site #n.]

Overview

Characteristics of the High Energy Physics Community

The SAM-Grid: enabling fully distributed analysis job processing

The Proposed Instrumentation

Why is this useful ?

The SAM-Grid is a complex system: the instrumentation is of critical importance to

• Troubleshoot the system

– Production systems are maintained 24x7

– Ease user support

– Find anomalies/bugs

• Gather statistics

– User data access patterns

– Resource utilization

– Global parameter optimization

Why is this challenging ?

• The SAM-Grid is composed of hundreds of servers, widely geographically distributed: what is a suitable architecture ?

• Servers have very diverse functionalities: is it possible to enable some form of uniform data access ?

Current instrumentation…

• The SAM System uses a global log service: every SAM Server records free-format events/messages

• JIM V1 is under intense development: the current instrumentation is insufficient

…and its limitations

• The current log server is centralized: for the SAM system alone, it records 1 GB every few days. This does not scale.

• Message transport is UDP-based: this scales with the number of reporting servers, but data integrity is not guaranteed.

• The messages are not structured: data mining / presentation is non-trivial.
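
To make the UDP limitation concrete: because datagrams can be dropped or reordered without notice, a receiver can only detect loss if the senders give it something to check. One common technique is a per-server sequence number. The sketch below illustrates that idea only; it is not something the current SAM log service is stated to do, and `find_missing` and the sample arrivals are hypothetical.

```python
def find_missing(seq_numbers):
    """Return the sequence numbers absent from what the log server received.

    Only gaps between the lowest and highest number seen can be detected;
    messages lost before the first or after the last arrival stay invisible.
    """
    received = set(seq_numbers)
    if not received:
        return []
    return [n for n in range(min(received), max(received) + 1)
            if n not in received]

# Hypothetical arrivals from one server: 3 and 6 were dropped,
# and 5 arrived before 4.
arrived = [1, 2, 5, 4, 7]
print(find_missing(arrived))  # [3, 6]
```

This detects loss after the fact but does not recover the data; guaranteed delivery would need acknowledgements or a reliable transport.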

The direction 1

• The CODA distributed file system is a good example of a successful distributed architecture for instrumentation.

[Figure: instrumentation architecture diagram. Components: Client, Server, Data Collector, Data Log, Reaper, Database, Off-Line Analyses.]
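
The components named in the figure suggest a staged pipeline. The sketch below is one possible reading, not the CODA implementation: it assumes that clients and servers report to the data collector, that the collector appends to a data log, and that the reaper later moves logged records into a database for off-line analysis. That data flow is inferred from the order of the labels only, and every function, table, and field name here is hypothetical.

```python
import json
import sqlite3

def collect(records, log):
    """Data Collector: append each incoming record to the data log as a JSON line."""
    for rec in records:
        log.append(json.dumps(rec))

def reap(log, db):
    """Reaper: move every logged line into the database, then empty the log."""
    rows = [json.loads(line) for line in log]
    db.executemany(
        "INSERT INTO events (source, message) VALUES (?, ?)",
        [(r["source"], r["message"]) for r in rows],
    )
    db.commit()
    log.clear()
    return len(rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (source TEXT, message TEXT)")
data_log = []  # stands in for an on-disk data log

collect([{"source": "client-1", "message": "open"},
         {"source": "server-1", "message": "read"}], data_log)
moved = reap(data_log, db)

# Off-line analysis: count events per source.
counts = dict(db.execute("SELECT source, COUNT(*) FROM events GROUP BY source"))
print(moved, counts)
```

Separating collection from database loading means the reporting path only has to append to a log, while the slower database insert runs on its own schedule.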

The direction 2

• The structure of the message should include:

– the name of the client/server

– the types of the client/server: various groupings may be meaningful, e.g. logistical, functional, logical, etc.

– the location of the client/server

– a global time stamp

– an id code, related to the severity of the message
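
The fields listed above could be captured in a small structured record. This is a minimal sketch under stated assumptions: the class name `LogMessage`, the field names, the JSON encoding, and the extra free-form `text` field are illustrative choices, not part of the proposal.

```python
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class LogMessage:
    name: str          # name of the client/server
    types: dict        # groupings, e.g. {"functional": "data-handling"}
    location: str      # location of the client/server
    severity_id: int   # id code related to the severity of the message
    text: str = ""     # free-form payload (an assumption, not in the list)
    # Global time stamp, here taken as seconds since the Unix epoch.
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize to a JSON object with a stable key order."""
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(raw: str) -> "LogMessage":
        """Rebuild a message from its JSON form."""
        return LogMessage(**json.loads(raw))

msg = LogMessage(name="station-a", types={"functional": "data-handling"},
                 location="site-1", severity_id=3, text="cache miss")
restored = LogMessage.from_json(msg.to_json())
print(restored == msg)
```

With fixed field names, a query such as "all messages above a given severity from one location" becomes a filter on fields rather than a text search over free-format lines.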

Rough time estimate

• 1 FTE month to design the architecture + the message structure

• 1 FTE month to implement basic messaging

• 1 FTE month to study initial results

• 1 FTE month to feed changes back into the message structure and implementation