Paul Graham, Software Architect, EPCC
p.graham@epcc.ed.ac.uk
+44 131 650 4992

PCP – The Probes Coordination Protocol
A secure, robust framework for scheduling and coordinating regular tasks across multiple sites




AHM 2008

Overview

• Background

• Motivation

• The Probes Coordination Protocol

• New implementation

• PCP implementation features

• Summary


Background

• Work has spanned three projects
  – European Data Grid (EDG) 2001-2004
  – Enabling Grids for eScience (EGEE/EGEE-II) 2004-2008
  – Joint Information Systems Committee (JISC) NPM 2008-2009

• Network performance measurements
  – The collection of monitoring data in a Grid environment
  – Grid users want to know the expected performance of their network-based application
  – e2emonit, gridmon


Motivation

• Issues for collecting monitoring data
  – Different measurement types
    – End to end
    – Backbone
  – Different tools
  – Different formats
  – Heterogeneous environments
    – Grid!
    – Many administrative domains
    – Different user groups


The problem - sites

• Deployment of monitoring tools is not so easy
  – There has to be a clear benefit to the site before they install tools
  – This benefit is not obvious until after an incident has occurred, by which time it is too late…
  – Firewall changes may be difficult
    – Technically or politically
  – Tools need to be trivial to install and robust when running
    – Sys-admins very busy
  – Need to carefully consider scheduling for end-to-end tests
    – Overlapping measurements
    – Network overload


The problem - users

• Users need to be able to start, stop and adjust the measurements
  – Potentially on remote administrative domains

• Traditionally system administrators manually set up, start and stop cron jobs for the tools
  – This caused various problems for scalability, coordination and basic practicalities


Solution: The Probes Coordination Protocol

• Developed to solve the management overhead of running active measurement probes

• Token-based mechanism to co-ordinate periodic execution of monitoring tasks
  – But has other applications

• Initially developed as part of EDG (Robert Harakaly et al.)
  – Prototype implementation in C: usable but lacking some features

• Re-engineered and extended by EPCC to address these issues


PCP Operation

• Client/Server model

• Based on a system of tokens passed between sites

• Client submits tokens to a site

• Server acts upon the arrival of a token
  – Registers and monitors job tokens
  – Performs function defined by an admin token

• Sites are grouped into cliques


PCP Token

• Trigger for activity at a site

• Job token
  – Name: an identifier
  – Delay: time to wait before executing the job for the first time
  – Period: interval between successive runs of the command
  – Command: indicator of which command to run at the sites
  – Member(s): sites in the clique to run the command

• Admin token
  – List: retrieves data about the activities currently registered at a site
  – Kill: destroys the named clique activity
  – Clear: removes (i.e. deregisters) all the activities from a site
  – Update: modifies the named clique activity with the new token message (enables changes to values such as the period)
  – Exit: stops the PCP server at the given site

• Can also include security information
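
The admin operations listed above can be pictured as actions on a per-site registry of clique activities. The following is a minimal sketch in Python, not the PCP server itself (which the slides describe as pure Java): the `Site` class, its method names and the lower-case operation strings are all hypothetical, and the sketch assumes that Update simply replaces the stored token fields.

```python
class Site:
    """Hypothetical in-memory registry of the activities registered at one site."""

    def __init__(self):
        self.activities = {}  # clique activity name -> token fields
        self.running = True

    def register(self, name, token):
        """Register a job token under its name."""
        self.activities[name] = dict(token)

    def admin(self, op, name=None, token=None):
        """Apply one admin operation (names assumed, modelled on the slide)."""
        if op == "list":
            return sorted(self.activities)
        if op == "kill":
            self.activities.pop(name, None)   # destroy the named activity
        elif op == "clear":
            self.activities.clear()           # deregister everything
        elif op == "update":
            if name not in self.activities:
                raise KeyError(name)
            self.activities[name] = dict(token)  # e.g. a new period
        elif op == "exit":
            self.running = False              # stop the server at this site
        else:
            raise ValueError(f"unknown admin operation: {op!r}")
        return None


site = Site()
site.register("PJG-EPCC-PCP_TEST", {"period": "1800"})
site.admin("update", "PJG-EPCC-PCP_TEST", {"period": "3600"})
listed = site.admin("list")
```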


PCP Clique

• The clique represents a group of sites, all of which are required to run a particular activity at particular intervals

• Example: we will look at a clique with three sites, A, B and C ...

[Figure: six sites, Site A to Site F, grouped into overlapping cliques]

Clique 1: Sites A and B
Clique 2: Sites C, D, E and F
Clique 3: Sites B, C and F


Example PCP Token

# Lines beginning with # are ignored as comments
#
name:PJG-EPCC-PCP_TEST
member:sitea.epcc.ed.ac.uk
member:siteb.epcc.ed.ac.uk
member:sitec.epcc.ed.ac.uk
period:1800
timeout:0
delay:300
command:pcp_test
owner:[email protected]
lockDependent:true
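
The token above is a list of `key:value` lines, with `#` comments and a repeated `member` key. As a rough illustration of how such a file could be read, here is a small Python sketch. It is not the PCP parser: `JobToken` and `parse_token` are hypothetical names, all values are kept as text, and the `owner` and `timeout` lines are left out of the sample input for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class JobToken:
    """Hypothetical container for the fields of a job token."""
    name: str = ""
    members: list = field(default_factory=list)
    fields: dict = field(default_factory=dict)  # all other keys, kept as text


def parse_token(text: str) -> JobToken:
    """Parse key:value lines, skipping blank lines and # comments."""
    token = JobToken()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            raise ValueError(f"expected key:value, got {line!r}")
        key, value = key.strip(), value.strip()
        if key == "member":
            token.members.append(value)  # member may appear several times
        elif key == "name":
            token.name = value
        else:
            token.fields[key] = value
    return token


EXAMPLE = """\
# Lines beginning with # are ignored as comments
#
name:PJG-EPCC-PCP_TEST
member:sitea.epcc.ed.ac.uk
member:siteb.epcc.ed.ac.uk
member:sitec.epcc.ed.ac.uk
period:1800
delay:300
command:pcp_test
lockDependent:true
"""

token = parse_token(EXAMPLE)
```

After parsing, `token.members` holds the three site names in file order and `token.fields["period"]` is the string `"1800"`.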


PCP normal operation

[Figure: timeline (15:00 to 15:40) of a token passing between Site A, Site B and Site C. Labels at each site: "Token registered. Pause for delay seconds."; then, at each site in turn, "Run pcp_test", the token passed to the next site, and "Lock job"; when the token returns, "Token arrives. Unlock job. Pause until (time last run + period)" ... and so on.]
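
The two waiting rules named in the timeline ("Pause for delay seconds" and "Pause until (time last run + period)") amount to simple arithmetic. The sketch below works through them in Python with the values from the example token (delay 300, period 1800). The function names are hypothetical, times are seconds since midnight, and the token arrival time of 15:15 is chosen for illustration only.

```python
def first_run_time(registered_at: int, delay: int) -> int:
    """A newly registered job waits `delay` seconds before its first run."""
    return registered_at + delay


def next_run_time(token_arrival: int, last_run: int, period: int) -> int:
    """After the token arrives, wait until last_run + period.

    If the token arrives later than that, the job can run as soon as
    the token is there, hence the max().
    """
    return max(token_arrival, last_run + period)


def hhmm(seconds: int) -> str:
    """Format seconds since midnight as HH:MM."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}"


registered = 15 * 3600                      # token registered at 15:00
first = first_run_time(registered, 300)     # delay:300
arrival = 15 * 3600 + 15 * 60               # token back at 15:15 (illustrative)
second = next_run_time(arrival, first, 1800)  # period:1800

print(hhmm(first), hhmm(second))
```

With these inputs the first run falls at 15:05 and the second at 15:35, that is, one period after the first run rather than one period after the token came back.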


PCP Site failure operation

[Figure: timeline (16:00 to 16:40) of a token passing between Site A, Site B and Site C when one site fails. Labels: "Run pcp_test", "Lock job"; "Site down. Token is lost"; "Token should have arrived!"; "Timeout! Unlock job. Generate replacement token."; "Site restored."; "Token registered. Pause for delay seconds."; "Token arrives. Unlock job. Pause until (time last run + period)" ... and so on.]
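
The recovery step in the timeline ("Timeout! Unlock job. Generate replacement token.") can be read as a three-way decision made by a site that is waiting for the token. The Python sketch below shows that decision only. The slides do not say how the deadline is derived, so `deadline` is simply an assumed input, and `Action` and `check_token` are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    WAIT = "keep waiting"
    UNLOCK = "unlock job"
    REGENERATE = "unlock job and generate replacement token"


def check_token(now: float, deadline: float, token_arrived: bool) -> Action:
    """Decide what a waiting site does at time `now`.

    `deadline` is the latest time by which the token should have
    arrived; how PCP computes it is not given in the slides.
    """
    if token_arrived:
        return Action.UNLOCK
    if now >= deadline:
        return Action.REGENERATE  # the token is presumed lost
    return Action.WAIT


on_time = check_token(now=100.0, deadline=200.0, token_arrived=True)
still_waiting = check_token(now=100.0, deadline=200.0, token_arrived=False)
timed_out = check_token(now=200.0, deadline=200.0, token_arrived=False)
```

Treating a late token as lost keeps the remaining sites running their measurements even while one clique member is down.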


PCP Lock operation

• Individual sites may temporarily wish to drop out of a clique

• Previously required inter-site coordination to stop/restart commands

• Enabled via a locking mechanism
  – Administrator sets the lock
  – Lock dependent tokens are not allowed to execute
  – Lock either expires or is removed by administrator
  – The site then operates normally again as part of the clique
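
The locking steps above can be sketched as a small state check. This is an illustration under stated assumptions rather than the PCP implementation: `SiteLock`, `acquire`, `remove` and `may_execute` are hypothetical names, times are plain numbers, and the `lock_dependent` flag is assumed to correspond to the `lockDependent` field of a token.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteLock:
    """Hypothetical administrator lock at one site."""
    active: bool = False
    expires_at: Optional[float] = None  # None: holds until removed

    def acquire(self, expires_at: Optional[float] = None) -> None:
        """Administrator sets the lock, optionally with an expiry time."""
        self.active = True
        self.expires_at = expires_at

    def remove(self) -> None:
        """Administrator removes the lock."""
        self.active = False
        self.expires_at = None

    def is_locked(self, now: float) -> bool:
        """True while the lock is set and has not yet expired."""
        if not self.active:
            return False
        if self.expires_at is not None and now >= self.expires_at:
            self.active = False  # the lock has expired
            return False
        return True


def may_execute(lock_dependent: bool, lock: SiteLock, now: float) -> bool:
    """Lock-dependent tokens may not execute while the lock is held."""
    return not (lock_dependent and lock.is_locked(now))


lock = SiteLock()
lock.acquire(expires_at=100.0)
blocked = may_execute(True, lock, now=50.0)       # lock held
independent = may_execute(False, lock, now=50.0)  # not lock dependent
after_expiry = may_execute(True, lock, now=100.0) # lock has expired
```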


PCP Features

• For NPM, prevents overlapping measurements
  – Probe will not run until token received

• Extensible “plug-in” design

• Communication
  – TCP/IP

• Security
  – VOMS/X.509 based authentication
  – Limited set of commands can be run

• Logging
  – Configurable to various levels
  – Security-related messages straightforwardly distinguishable

• Portable
  – Pure Java
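
One common way to ensure that only a limited set of commands can be run is to treat the token's `command` value as a key into a site-local allow-list, rather than as something to execute directly. The sketch below illustrates that idea in Python. It is an assumption about the approach, not a description of the PCP code, and the `ALLOWED_COMMANDS` table, its path and `resolve_command` are hypothetical.

```python
# Hypothetical site-local table: command indicator -> argument vector.
ALLOWED_COMMANDS = {
    "pcp_test": ["/opt/pcp/bin/pcp_test"],  # illustrative path only
}


def resolve_command(indicator: str) -> list:
    """Map a token's command indicator to an argv, or refuse it."""
    try:
        return list(ALLOWED_COMMANDS[indicator])  # copy, so callers cannot edit the table
    except KeyError:
        raise PermissionError(f"command not permitted at this site: {indicator!r}")


argv = resolve_command("pcp_test")
```

Because the token carries only an indicator, a remote user cannot make a site run an arbitrary program, which fits the multi-domain setting described earlier.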


Summary

• Protocol provides a means for scheduling regular tasks at multiple sites with minimal overheads for both users and administrators

• Software is:
  – Portable
  – Secure
  – Robust
  – Extensible

• Available for download: http://www.egee-npm.org/pcp/

Any questions?

Thank you

p.graham@epcc.ed.ac.uk