Paul Graham
Software Architect, EPCC
[email protected]
+44 131 650 4992
PCP – The Probes Coordination Protocol
A secure, robust framework for scheduling and coordinating regular tasks across multiple sites
AHM 2008 2
Overview
• Background
• Motivation
• The Probes Coordination Protocol
• New implementation
• PCP implementation features
• Summary
Background
• Work has spanned three projects
– European Data Grid (EDG) 2001-2004
– Enabling Grids for eScience (EGEE/EGEE-II) 2004-2008
– Joint Information Systems Committee (JISC) NPM 2008-2009
• Network performance measurements
– The collection of monitoring data in a Grid environment
– Grid users want to know the expected performance of their network-based application
– e2emonit, gridmon
Motivation
• Issues for collecting monitoring data
– Different measurement types
  – End to end
  – Backbone
– Different tools
– Different formats
– Heterogeneous environments
  – Grid!
  – Many administrative domains
  – Different user groups
The problem - sites
• Deployment of monitoring tools is not so easy
– There has to be a clear benefit to the site before they install tools
– This benefit is not obvious until after an incident has occurred, by which time it is too late…
– Firewall changes may be difficult
  – Technically or politically
– Tools need to be trivial to install and robust when running
  – Sys-admins very busy
– Need to carefully consider scheduling for end-to-end tests
  – Overlapping measurements
  – Network overload
The problem - users
• Users need to be able to start, stop and adjust the measurements
– Potentially on remote administrative domains
• Traditionally system administrators manually set up, start and stop cron jobs for the tools
– This caused various problems for scalability, coordination and basic practicalities
Solution: The Probes Coordination Protocol
• Developed to solve the management overhead of running active measurement probes
• Token-based mechanism to co-ordinate periodic execution of monitoring tasks
– But has other applications
• Initially developed as part of EDG (Robert Harakaly et al.)
– Prototype implementation in C: usable but lacking some features
• Re-engineered and extended by EPCC to address these issues
PCP Operation
• Client/Server model
• Based on a system of tokens passed between sites
• Client submits tokens to a site
• Server acts upon the arrival of a token
– Registers and monitors job tokens
– Performs function defined by an admin token
• Sites are grouped into cliques
PCP Token
• Trigger for activity at a site
• Job token
– Name – an identifier
– Delay – time to wait before executing the job for the first time
– Period – interval between runs of the command
– Command – indicator of which command to run at the sites
– Member(s) – sites in the clique to run the command
• Admin token
– List – retrieves data about the activities currently registered at a site
– Kill – destroys the named clique activity
– Clear – removes (i.e. deregisters) all the activities from a site
– Update – modifies the named clique activity with the new token message (enables changes to values such as the period)
– Exit – stops the PCP server at the given site
• Can also include security information
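The admin actions can be pictured as operations on a per-site registry of activities. What follows is a minimal sketch only: the in-memory `registry` dictionary, the `handle_admin` function and the `StopServer` exception are hypothetical names chosen for illustration, not the PCP server's actual code or message format.

```python
from typing import Dict, List, Optional

# Hypothetical registry: activity (token) name -> token fields.
Registry = Dict[str, Dict[str, str]]


class StopServer(Exception):
    """Raised by the sketch when an Exit admin action is received."""


def handle_admin(registry: Registry, action: str,
                 name: Optional[str] = None,
                 new_fields: Optional[Dict[str, str]] = None) -> List[str]:
    """Apply one admin action and return the names still registered."""
    action = action.lower()
    if action == "list":
        pass                                   # report only, change nothing
    elif action == "kill":
        registry.pop(name, None)               # destroy the named activity
    elif action == "clear":
        registry.clear()                       # deregister every activity
    elif action == "update":
        if name in registry and new_fields:
            registry[name].update(new_fields)  # e.g. a new period
    elif action == "exit":
        raise StopServer("server asked to stop")
    else:
        raise ValueError(f"unknown admin action: {action}")
    return sorted(registry)


registry = {"PJG-EPCC-PCP_TEST": {"period": "1800", "command": "pcp_test"}}
handle_admin(registry, "update", "PJG-EPCC-PCP_TEST", {"period": "900"})
print(handle_admin(registry, "list"))
```

Keeping every action as a change to one dictionary makes it easy to see that Update is the only action that alters an activity without removing it.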
PCP Clique
• The clique represents a group of sites, all of which are required to run a particular activity at particular intervals
• Example: will look at a clique with three sites, A, B and C ...
[Figure: six sites, A to F, grouped into three overlapping cliques]
– Clique 1: Sites A and B
– Clique 2: Sites C, D, E and F
– Clique 3: Sites B, C and F
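Because a site can belong to more than one clique, it can help to look the membership up in both directions. A minimal sketch using the clique membership from the figure; the `CLIQUES` dictionary and the `cliques_for` helper are illustrative names, not part of PCP.

```python
from typing import Dict, List, Set

# Membership as listed in the figure; the data layout is an assumption.
CLIQUES: Dict[str, Set[str]] = {
    "Clique 1": {"A", "B"},
    "Clique 2": {"C", "D", "E", "F"},
    "Clique 3": {"B", "C", "F"},
}


def cliques_for(site: str) -> List[str]:
    """Return the names of every clique that contains the given site."""
    return sorted(name for name, members in CLIQUES.items() if site in members)


for site in "ABCDEF":
    print(site, cliques_for(site))
```

Sites B, C and F each sit in two cliques, so each of them would hold tokens for two independent activities.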
Example PCP Token
# Lines beginning with # are ignored as comments
#
name:PJG-EPCC-PCP_TEST
member:sitea.epcc.ed.ac.uk
member:siteb.epcc.ed.ac.uk
member:sitec.epcc.ed.ac.uk
period:1800
timeout:0
delay:300
command:pcp_test
owner:[email protected]
lockDependent:true
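The token above is a plain list of `key:value` lines, so reading it takes only a few lines of code. The sketch below is an illustration, not the PCP parser: `parse_token` is a hypothetical name, and it assumes that `member` is the only key that may repeat and that all values are kept as strings.

```python
from typing import Dict, List, Union

TokenValue = Union[str, List[str]]
REPEATABLE = {"member"}  # assumption: only "member" may appear several times


def parse_token(text: str) -> Dict[str, TokenValue]:
    """Parse key:value lines, skipping blank lines and '#' comments."""
    token: Dict[str, TokenValue] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")   # split on the first colon only
        if not sep:
            raise ValueError(f"not a key:value line: {line!r}")
        key, value = key.strip(), value.strip()
        if key in REPEATABLE:
            token.setdefault(key, []).append(value)
        else:
            token[key] = value
    return token


EXAMPLE = """\
# Lines beginning with # are ignored as comments
name:PJG-EPCC-PCP_TEST
member:sitea.epcc.ed.ac.uk
member:siteb.epcc.ed.ac.uk
member:sitec.epcc.ed.ac.uk
period:1800
delay:300
command:pcp_test
lockDependent:true
"""

token = parse_token(EXAMPLE)
print(token["name"], token["member"], token["period"])
```

Splitting on the first colon only means a value may itself contain colons without breaking the parse.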
PCP normal operation
[Figure: sequence diagram for Sites A, B and C, time axis 15:00 to 15:40]
• Token Registered at each site. Pause for delay seconds.
• A site runs pcp_test, passes the Token to the next site and locks the job
• Token arrives. Unlock job. Pause until (time last run + period)
• … and so on
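The token-passing cycle in the diagram can be imitated with a small simulation. This is a sketch under stated assumptions rather than the PCP algorithm itself: token transfer is treated as instantaneous, every run is assumed to take a fixed `run_time`, and the token is assumed to move on only after the run finishes. The `simulate` function and its parameters are hypothetical names, and the sketch does not try to reproduce the exact times in the figure.

```python
from typing import Dict, List, Tuple


def simulate(sites: List[str], start: int, delay: int, period: int,
             run_time: int, rounds: int) -> List[Tuple[int, str]]:
    """Return (start_time_in_seconds, site) for every run of the job."""
    last_run: Dict[str, int] = {}
    schedule: List[Tuple[int, str]] = []
    now = start
    for _ in range(rounds):
        for site in sites:
            # Token arrives at this site: the job is unlocked, but must
            # still wait for the initial delay or for last run + period.
            if site in last_run:
                earliest = last_run[site] + period
            else:
                earliest = start + delay
            now = max(now, earliest)
            schedule.append((now, site))
            last_run[site] = now
            # Job runs, is locked again, and the token is passed on.
            now += run_time
    return schedule


for when, site in simulate(["A", "B", "C"], start=0, delay=300,
                           period=1800, run_time=300, rounds=2):
    print(f"t={when:5d}s  run pcp_test at site {site}")
```

Since only the current token holder may run, two runs in this model can never overlap, which is the property the protocol relies on for network measurements.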
PCP Site failure operation
[Figure: sequence diagram for Sites A, B and C, time axis 16:00 to 16:40]
• A site runs pcp_test, passes the Token on and locks the job
• Site down. Token is lost
• Token should have arrived! Timeout! Unlock job. Generate replacement token.
• Token arrives. Unlock job. Pause until (time last run + period)
• Site restored. Token Registered. Pause for delay seconds.
• … and so on
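The recovery step in the diagram amounts to a watchdog decision at a site that is waiting for the token. The sketch below is only one way to express it: `check_token`, the `Action` values and the `grace` parameter are hypothetical, and the slides do not say how the expected arrival time or the token's `timeout` field enter the real calculation.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    UNLOCK = "unlock job"
    UNLOCK_AND_REGENERATE = "unlock job and generate replacement token"


def check_token(now: int, expected_arrival: int, grace: int,
                token_arrived: bool) -> Action:
    """Decide what a waiting site does at time `now` (all times in seconds)."""
    if token_arrived:
        return Action.UNLOCK                 # normal case: token came round
    if now >= expected_arrival + grace:
        return Action.UNLOCK_AND_REGENERATE  # token presumed lost
    return Action.WAIT                       # still within the allowed window


print(check_token(now=100, expected_arrival=90, grace=30, token_arrived=False))
print(check_token(now=130, expected_arrival=90, grace=30, token_arrived=False))
```

A nonzero `grace` in this sketch trades slower recovery for fewer spurious replacement tokens when the original is merely late.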
PCP Lock operation
• Individual sites may wish to drop out of a clique temporarily
• Previously required inter-site coordination to stop/restart commands
• Enabled via a locking mechanism
– Administrator sets the lock
– Lock dependent tokens are not allowed to execute
– Lock either expires or is removed by administrator
– The site then operates normally as part of the clique again
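The locking rule can be written as a single check made before a job runs. A minimal sketch, assuming a hypothetical `SiteLock` record with an optional expiry time and a hypothetical `may_execute` helper; the `lock_dependent` flag stands for the token's `lockDependent` field, and none of this is taken from the PCP code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteLock:
    set_at: float
    expires_at: Optional[float] = None  # None: held until the admin removes it

    def active(self, now: float) -> bool:
        """True while the lock is in force at time `now`."""
        if now < self.set_at:
            return False
        return self.expires_at is None or now < self.expires_at


def may_execute(lock_dependent: bool, lock: Optional[SiteLock],
                now: float) -> bool:
    """Lock dependent jobs are held back while a site lock is active."""
    if lock is None or not lock.active(now):
        return True          # no lock in force: run as normal
    return not lock_dependent


lock = SiteLock(set_at=0, expires_at=3600)
print(may_execute(True, lock, now=1800))   # held back while the lock is active
print(may_execute(True, lock, now=7200))   # runs again once the lock has expired
```

Passing `lock=None` models the administrator removing the lock by hand, after which the site behaves like any other clique member.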
PCP Features
• For NPM, prevents overlapping measurements
– Probe will not run until token received
• Extensible “plug-in” design
• Communication
– TCP/IP
• Security
– VOMS/X.509 based authentication
– Limited set of commands can be run
• Logging
– Configurable to various levels
– Security-related messages straightforwardly distinguishable
• Portable
– Pure Java
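Restricting a token to a limited set of commands is commonly done with an allowlist: the token carries only a command name, and the site maps that name to something it has agreed to run. The sketch below illustrates that general idea and is not PCP's plug-in mechanism; `ALLOWED_COMMANDS`, `resolve_command` and the file path are all hypothetical.

```python
from typing import Dict, List

# Hypothetical site configuration: command name -> fixed argument vector.
ALLOWED_COMMANDS: Dict[str, List[str]] = {
    "pcp_test": ["/usr/local/bin/pcp_test"],
}


def resolve_command(name: str) -> List[str]:
    """Return the argv for a permitted command name, or refuse."""
    try:
        return list(ALLOWED_COMMANDS[name])   # copy so callers cannot edit it
    except KeyError:
        raise PermissionError(f"command not permitted: {name!r}") from None


print(resolve_command("pcp_test"))
```

Because the token never supplies a path or arguments in this scheme, an unexpected value such as a shell string is simply rejected.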
Summary
• Protocol provides a means for scheduling regular tasks at multiple sites with minimal overheads for both users and administrators
• Software is:
– Portable
– Secure
– Robust
– Extensible
• Available for download: http://www.egee-npm.org/pcp/
Any questions?
Thank you