Paul Graham
Software Architect, EPCC
p.graham@epcc.ed.ac.uk
+44 131 650 4992

PCP – The Probes Coordination Protocol
A secure, robust framework for scheduling and coordinating regular tasks across multiple sites

AHM 2008 2

Overview

• Background

• Motivation

• The Probes Coordination Protocol

• New implementation

• PCP implementation features

• Summary

Background

• Work has spanned three projects
  – European Data Grid (EDG) 2001-2004
  – Enabling Grids for eScience (EGEE/EGEE-II) 2004-2008
  – Joint Information Systems Committee (JISC) NPM 2008-2009
• Network performance measurements
  – The collection of monitoring data in a Grid environment
  – Grid users want to know the expected performance of their network-based application
  – e2emonit, gridmon

Motivation

• Issues for collecting monitoring data
  – Different measurement types
    – End to end
    – Backbone
  – Different tools
  – Different formats
  – Heterogeneous environments
    – Grid!
    – Many administrative domains
    – Different user groups

The problem - sites

• Deployment of monitoring tools is not so easy
  – There has to be a clear benefit to the site before they install tools
  – This benefit is not obvious until after an incident has occurred, by which time it is too late…
  – Firewall changes may be difficult
    – Technically or politically
  – Tools need to be trivial to install and robust when running
    – Sys-admins very busy
  – Need to carefully consider scheduling for end-to-end tests
    – Overlapping measurements
    – Network overload

The problem - users

• Users need to be able to start, stop and adjust the measurements
  – Potentially on remote administrative domains
• Traditionally system administrators manually set up, start and stop cron jobs for the tools
  – This caused various problems for scalability, coordination and basic practicalities

Solution: The Probes Coordination Protocol

• Developed to reduce the management overhead of running active measurement probes
• Token-based mechanism to co-ordinate periodic execution of monitoring tasks
  – But has other applications
• Initially developed as part of EDG (Robert Harakaly et al.)
  – Prototype implementation in C: usable but lacking some features
• Re-engineered and extended by EPCC to address these issues

PCP Operation

• Client/Server model

• Based on a system of tokens passed between sites

• Client submits tokens to a site

• Server acts upon the arrival of a token
  – Registers and monitors job tokens
  – Performs function defined by an admin token

• Sites are grouped into cliques

PCP Token

• Trigger for activity at a site

• Job token
  – Name – an identifier
  – Delay – time to wait before executing the job for the first time
  – Period – how often the command is run (the interval between runs)
  – Command – indicator of which command to run at the sites
  – Member(s) – sites in the clique to run the command
• Admin token
  – List – for retrieving data about the activities currently registered at a site
  – Kill – destroys the named clique activity
  – Clear – removes (i.e. deregisters) all the activities from a site
  – Update – modifies the named clique activity with the new token message (enables changes to values such as the period)
  – Exit – stops the PCP server at the given site
• Can also include security information
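
The two kinds of token listed above can be modelled as simple records. The following is a minimal Python sketch for illustration only, not the PCP implementation (which the slides describe as pure Java). The names `JobToken`, `AdminToken` and `AdminAction` are assumptions chosen to mirror the slide, and seconds are assumed as the unit for delay and period.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AdminAction(Enum):
    """Admin functions listed on the slide (names are illustrative)."""
    LIST = "list"      # report the activities registered at a site
    KILL = "kill"      # destroy the named clique activity
    CLEAR = "clear"    # deregister all activities at a site
    UPDATE = "update"  # replace the named activity with a new token
    EXIT = "exit"      # stop the PCP server at the site


@dataclass
class JobToken:
    name: str      # identifier of the clique activity
    delay: int     # seconds to wait before the first run (assumed unit)
    period: int    # seconds between runs (assumed unit)
    command: str   # which command to run at the sites
    members: List[str] = field(default_factory=list)  # sites in the clique


@dataclass
class AdminToken:
    action: AdminAction
    name: Optional[str] = None              # needed for KILL and UPDATE
    replacement: Optional[JobToken] = None  # new values for UPDATE

    def __post_init__(self):
        # KILL and UPDATE act on a *named* clique activity.
        if self.action in (AdminAction.KILL, AdminAction.UPDATE) and not self.name:
            raise ValueError(f"{self.action.value} needs the name of a clique activity")
        if self.action is AdminAction.UPDATE and self.replacement is None:
            raise ValueError("update needs a replacement job token")
```

Keeping the admin functions in an enumeration means a server could reject an unknown function before acting on it; whether the real server works this way is not stated in the slides.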

PCP Clique

• The clique represents a group of sites, all of which are required to run a particular activity at particular intervals
• Example: we will look at a clique with three sites, A, B and C ...

[Figure: six sites, Site A to Site F, grouped into overlapping cliques]
  – Clique 1: Sites A and B
  – Clique 2: Sites C, D, E and F
  – Clique 3: Sites B, C and F

Example PCP Token

# Lines beginning with # are ignored as comments
#
name:PJG-EPCC-PCP_TEST
member:sitea.epcc.ed.ac.uk
member:siteb.epcc.ed.ac.uk
member:sitec.epcc.ed.ac.uk
period:1800
timeout:0
delay:300
command:pcp_test
owner:somebody@epcc.ed.ac.uk
lockDependent:true
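
A token in this form is easy to read mechanically. The Python sketch below shows one way it might be parsed; it is not the parser used by PCP. It assumes that every non-comment line is `key:value` split at the first colon, that `member` may repeat, and that blank lines can be skipped. The function name `parse_token` is hypothetical.

```python
def parse_token(text):
    """Parse 'key:value' lines into a dict; repeated 'member' lines form a list."""
    token = {"member": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments (from the slide) and blank lines (assumed) are skipped
        key, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            raise ValueError(f"not a key:value line: {line!r}")
        key, value = key.strip(), value.strip()
        if key == "member":
            token["member"].append(value)
        else:
            token[key] = value
    return token


EXAMPLE = """\
# Lines beginning with # are ignored as comments
#
name:PJG-EPCC-PCP_TEST
member:sitea.epcc.ed.ac.uk
member:siteb.epcc.ed.ac.uk
member:sitec.epcc.ed.ac.uk
period:1800
timeout:0
delay:300
command:pcp_test
owner:somebody@epcc.ed.ac.uk
lockDependent:true
"""

token = parse_token(EXAMPLE)
print(token["name"], token["period"], token["member"])
```

Values are left as strings here; converting `period`, `timeout` and `delay` to numbers would be a natural next step once their units are confirmed.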

PCP normal operation

[Figure: timeline from 15:00 to 15:40 with one column each for Site A, Site B and Site C, showing the token being passed from site to site. Event labels shown at each site: "Token registered. Pause for delay seconds."; "Run pcp_test"; "Token" (passed to the next site); "Lock job"; "Token arrives. Unlock job. Pause until (time last run + period)"; "... and so on".]
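
The event labels in this timeline suggest a simple rule: a site runs the command only while it holds the token, and not before its own delay or period has elapsed. The Python sketch below simulates that rule to show why measurements cannot overlap. It is an illustration under stated assumptions, not the PCP scheduler: the token is assumed to visit members in list order, all sites are assumed to register the token at time 0, and `run_time` (how long a run takes before the token moves on) is an invented parameter.

```python
def simulate(members, delay, period, run_time, rounds):
    """Return (start_time, site) pairs for `rounds` passes of the token round the clique."""
    last_run = {site: None for site in members}
    schedule = []
    token_time = 0  # time at which the token reaches the next site
    for _ in range(rounds):
        for site in members:
            # Token arrives: the job is unlocked but may still have to wait.
            if last_run[site] is None:
                earliest = delay                     # first run: wait `delay` after registration
            else:
                earliest = last_run[site] + period   # later runs: time last run + period
            start = max(token_time, earliest)
            schedule.append((start, site))
            last_run[site] = start
            # After the run the job is locked again and the token is passed on.
            token_time = start + run_time
    return schedule


def hhmm(seconds, start_hour=15):
    """Format seconds after start_hour:00 as HH:MM."""
    minutes = seconds // 60
    return f"{start_hour + minutes // 60:02d}:{minutes % 60:02d}"


for start, site in simulate(["A", "B", "C"], delay=300, period=1800, run_time=300, rounds=2):
    print(hhmm(start), "Site", site, "runs pcp_test")
```

With these assumed values the runs start at 15:05, 15:10 and 15:15, then at 15:35, 15:40 and 15:45. Because each start time is never earlier than the moment the token arrives, no two sites run at once.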

PCP Site failure operation

[Figure: timeline from 16:00 to 16:40 with one column each for Site A, Site B and Site C, showing recovery when a site fails while holding the token. Event labels as in normal operation ("Run pcp_test"; "Token"; "Lock job"; "Token arrives. Unlock job. Pause until (time last run + period)"), plus: "Site down. Token is lost"; "Token should have arrived!"; "Timeout! Unlock job. Generate replacement token."; "Site restored." followed by "Token registered. Pause for delay seconds."; "... and so on".]
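
The recovery step in this timeline ("Timeout! Unlock job. Generate replacement token.") can be sketched as a deadline check at the site that is waiting for the token. The Python below is a hedged illustration, not PCP's logic: the class `SiteState`, the meaning given to `timeout` (extra seconds to wait beyond last run + period) and the counter `token_generation` are all assumptions. The slides do not say how the deadline is computed or how a late duplicate token is handled, and the sketch does not address duplicates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteState:
    last_run: float            # when this site last ran the command
    period: float              # seconds between runs
    timeout: float             # extra seconds to wait for a late token (assumed meaning)
    locked: bool = True        # job stays locked while the token is elsewhere
    token_generation: int = 0  # incremented when a replacement token is generated

    def deadline(self) -> float:
        """Latest time by which the token should have come back (assumed formula)."""
        return self.last_run + self.period + self.timeout

    def on_token(self, now: float) -> float:
        """Token arrives: unlock the job and return the time of the next run."""
        self.locked = False
        return max(now, self.last_run + self.period)

    def on_tick(self, now: float) -> Optional[float]:
        """Periodic check: if the token is overdue, unlock and generate a replacement."""
        if self.locked and now > self.deadline():
            self.locked = False
            self.token_generation += 1  # stands in for creating a replacement token
            return now                  # run straight away (assumption)
        return None
```

For example, a site that last ran at time 0 with `period=1800` and `timeout=300` would do nothing when checked at 2000 seconds but would generate a replacement token when checked at 2101 seconds.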

PCP Lock operation

• Individual sites may wish to drop out of a clique temporarily
• Previously required inter-site coordination to stop/restart commands
• Enabled via a locking mechanism
  – Administrator sets the lock
  – Lock-dependent tokens are not allowed to execute
  – Lock either expires or is removed by administrator
  – The site operates normally as part of the clique
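
The administrator lock described above can be sketched as a flag with an optional expiry time that is consulted before a lock-dependent token's command is executed. This Python example is illustrative only; `SiteLock` and `may_execute` are hypothetical names, and the `lock_dependent` argument is assumed to correspond to the `lockDependent` field in the example token.

```python
from typing import Optional


class SiteLock:
    """Administrator lock for one site, with an optional expiry time."""

    def __init__(self) -> None:
        self._set = False
        self._expires_at: Optional[float] = None

    def set(self, now: float, duration: Optional[float] = None) -> None:
        """Set the lock; with no duration it stays until removed."""
        self._set = True
        self._expires_at = None if duration is None else now + duration

    def remove(self) -> None:
        """Administrator removes the lock."""
        self._set = False
        self._expires_at = None

    def is_locked(self, now: float) -> bool:
        if self._set and self._expires_at is not None and now >= self._expires_at:
            self.remove()  # the lock has expired
        return self._set


def may_execute(lock_dependent: bool, lock: SiteLock, now: float) -> bool:
    """Only lock-dependent tokens are held back while the lock is set."""
    return not (lock_dependent and lock.is_locked(now))
```

A token that is not lock-dependent is unaffected by the lock in this sketch, which matches the slide's wording that only lock-dependent tokens are prevented from executing.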

PCP Features

• For NPM, prevents overlapping measurements
  – Probe will not run until token received
• Extensible “plug-in” design
• Communication
  – TCP/IP
• Security
  – VOMS/X.509 based authentication
  – Limited set of commands can be run
• Logging
  – Configurable to various levels
  – Security-related messages straightforwardly distinguishable
• Portable
  – Pure Java
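
The "limited set of commands" point pairs naturally with the token's command field being an indicator rather than a command line: the site, not the token, decides what is actually executed. A minimal Python sketch of that idea follows. The table `ALLOWED`, its contents and the function `resolve_command` are invented for illustration and say nothing about how PCP's plug-ins are configured.

```python
# Hypothetical per-site table mapping command indicators to fixed argument lists.
ALLOWED = {
    "pcp_test": ["/bin/echo", "pcp_test ran"],
}


def resolve_command(name, allowed):
    """Return the argument list for a permitted command, or refuse."""
    try:
        return list(allowed[name])  # copy, so callers cannot alter the table
    except KeyError:
        raise PermissionError(f"command {name!r} is not permitted at this site") from None


print(resolve_command("pcp_test", ALLOWED))
```

Because the token only names an entry in the table, a token cannot make a site run an arbitrary program in this scheme.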

Summary

• Protocol provides a means for scheduling regular tasks at multiple sites with minimal overheads for both users and administrators
• Software is:
  – Portable
  – Secure
  – Robust
  – Extensible
• Available for download: http://www.egee-npm.org/pcp/

Any questions?

Thank you

p.graham@epcc.ed.ac.uk
