Grid Checkpointing Architecture
Grid Checkpointing Architecture
Radosław Januszewski
CoreGrid Summer School 2007
European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies 2
motivation
- Grids are complex and therefore prone to errors.
- The distributed nature of the Grid makes scheduling of system maintenance hard.
- Each uncoordinated power-down or failure results in the loss of currently running applications.
- Loss of computation time means additional cost!
goal
To enhance the reliability, fault-tolerance and robustness of the Grid computing environment.
the solution
Grid Checkpoint Architecture (GCA): a proposal for the placement, functionality and interaction schemes of checkpointing services in the Grid environment
grid - model
[Figure: the Grid model. Tiers: User Tier, GRID Tier, Cluster Tier. Components: User Interface, Grid Broker, Globally Accessible Storage (Data Management), Local Resource Manager, Computing Nodes (three, each running an Operating System).]
GCA in the Grid
[Figure: the Grid model extended with the GCA. Tiers: User Tier, GRID Tier, Cluster Tier. Components: User Interface, Grid Broker, Grid Checkpoint Service, Checkpoint Translation Service (CTS), Globally Accessible Storage (Data Management), Local Resource Manager, Computing Nodes (three, each running an Operating System with a Core Service).]
Proof of concept – the goals
• check whether the GCA survives contact with reality
• prepare the PoC on the basis of a real-life installation
• the Grid with the GCA should provide additional value compared with the "traditional" approach
GCA proof of concept installation
[Figure: the proof-of-concept installation. Tiers: User Tier, GRID Tier, Cluster Tier. Components: Command Line Client, GridSphere interface, Migrating Desktop Client, GRMS, WS GRAM, PBS JobManager, Torque/PBS Pro, Checkpoint script, Computing Nodes (three, each running Linux with SGIckpt), NFS shared space.]
involved elements
• GUI: command line, GridSphere, Migrating Desktop
• Broker: GRMS
• Local Resource Manager: Globus + TORQUE
• Core service: SGIckpt
Bottom-up approach
How to make the checkpointer work with the local resource manager?
[Figure: the proof-of-concept installation diagram, with the same components as on the "GCA proof of concept installation" slide.]
pbs/torque special features
action checkpoint
action restart
action checkpoint_abort
config
$action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path
$action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid
$restart_transmogrify true
$action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path
A detailed description is available at http://checkpointing.psnc.pl
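To make the structure of these directives easier to see, here is a minimal Python sketch that splits each $action line into its parts. The MomAction record and the parse_actions helper are hypothetical names introduced only for illustration, and reading the numeric field (0 here) as a timeout is an assumption on our part, not something the slides state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MomAction:
    event: str        # "checkpoint", "restart" or "checkpoint_abort"
    timeout: int      # numeric field after the event name (assumed to be a timeout)
    script: str       # script path, without the leading "!"
    args: List[str]   # placeholders such as %jobid, passed on to the script


def parse_actions(config_text: str) -> List[MomAction]:
    """Collect the $action directives from a MOM config fragment."""
    actions = []
    for line in config_text.splitlines():
        fields = line.split()
        # Skip anything that is not an $action line, e.g. $restart_transmogrify.
        if len(fields) < 4 or fields[0] != "$action":
            continue
        event, timeout, script = fields[1], int(fields[2]), fields[3]
        actions.append(MomAction(event, timeout, script.lstrip("!"), fields[4:]))
    return actions


CONFIG = """\
$action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path
$action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid
$restart_transmogrify true
$action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path
"""

for action in parse_actions(CONFIG):
    print(action.event, action.script, action.args)
```

The sketch only reads the configuration; what the scripts themselves do with the substituted values is described on the site referenced above.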
Broker – local RM connectivity
[Figure: the proof-of-concept installation diagram, with the same components as on the "GCA proof of concept installation" slide.]
problem
The checkpointer: a service or resource?
job description with checkpointing
<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrixi">
        <url>gsiftp://xxx.xxx.xxx.xxxl//home/user/povray</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
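The checkpoint-related part of this description is the pair of elements checkpointable and period inside other. As a rough illustration, the Python sketch below reads those two fields with the standard xml.etree.ElementTree module. The checkpoint_settings helper is a hypothetical name, and the unit of period is not given in the slides, so the value is returned as a bare integer.

```python
import xml.etree.ElementTree as ET

# Job description copied from the slide.
JOB_XML = """
<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrixi">
        <url>gsiftp://xxx.xxx.xxx.xxxl//home/user/povray</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
"""


def checkpoint_settings(job_xml: str) -> dict:
    """Map each taskid to its checkpointing flag and period."""
    root = ET.fromstring(job_xml.strip())
    settings = {}
    for task in root.findall("task"):
        other = task.find("other")
        flag = other.findtext("checkpointable", "false") if other is not None else "false"
        period = other.findtext("period") if other is not None else None
        settings[task.get("taskid")] = {
            "checkpointable": flag.strip() == "true",
            "period": int(period) if period else None,  # unit not stated in the slides
        }
    return settings


print(checkpoint_settings(JOB_XML))
# {'matrix': {'checkpointable': True, 'period': 1}}
```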
the end-user point of view
manual scenario
[Figure: the proof-of-concept installation diagram, with the same components as on the "GCA proof of concept installation" slide.]
manual scenario - restart
[Figure: the proof-of-concept installation diagram, with the same components as on the "GCA proof of concept installation" slide. Additional labels: Application, Failure!]
<grmsJob appid="matrix_demo_resume">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <hostname>node-03.checkpointing.psnc.pl</hostname>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrix_long">
        <url>gsiftp://xxx.xxx.xxx.xxx//home/xxxxxx/test_apps/matrix_long</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <recovery>true</recovery>
      <ckpt_id>1179315947518_matrix_demo_submit_0459</ckpt_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
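Compared with a submit description, this resume description adds a hostname under resource and the elements recovery and ckpt_id under other. The Python sketch below shows one way such a description could be derived from a submit description. The make_resume_description helper and the trimmed SUBMIT_XML sample are assumptions made for illustration, and whether GRMS cares about element order is not known from the slides; the sketch simply mirrors the order shown above.

```python
import xml.etree.ElementTree as ET

# Trimmed submit description (executable section omitted for brevity).
SUBMIT_XML = """
<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <localrmname>pbs</localrmname>
    </resource>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
"""


def make_resume_description(submit_xml: str, ckpt_id: str,
                            hostname: str, resume_appid: str) -> str:
    """Return a copy of the submit description with the recovery fields added."""
    root = ET.fromstring(submit_xml.strip())
    root.set("appid", resume_appid)
    for task in root.findall("task"):
        # Assumes every task already has <resource> and <other> sections.
        host = ET.Element("hostname")
        host.text = hostname
        task.find("resource").insert(0, host)

        recovery = ET.Element("recovery")
        recovery.text = "true"
        ckpt = ET.Element("ckpt_id")
        ckpt.text = ckpt_id
        other = task.find("other")
        other.insert(1, recovery)  # directly after <grms_id>
        other.insert(2, ckpt)
    return ET.tostring(root, encoding="unicode")


print(make_resume_description(
    SUBMIT_XML,
    ckpt_id="1179315947518_matrix_demo_submit_0459",
    hostname="node-03.checkpointing.psnc.pl",
    resume_appid="matrix_demo_resume",
))
```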
failure – end-user view
problem
This semi-automatic solution is not optimal.
How to introduce automatic job failure handling without introducing new functionality in the Broker?
Use the workflows!
the workflow
submit job description
job finished successfully?
- yes: send results to user
- no: submit "restart scenario" job
job finished successfully?
- yes: send results to user
- no: return error description
Problem: using this broker we are not able to model loops
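The control flow of this workflow can be sketched in a few lines of Python. Because loops cannot be modelled, the restart branch is taken at most once. JobResult, run_with_one_restart and the two callables are hypothetical stand-ins for the broker's job submission, not part of GRMS.

```python
from typing import Callable, NamedTuple, Tuple


class JobResult(NamedTuple):
    ok: bool       # did the job finish successfully?
    payload: str   # results on success, error description on failure


def run_with_one_restart(submit: Callable[[], JobResult],
                         restart: Callable[[], JobResult]) -> Tuple[str, str]:
    """Submit a job; on failure, try the restart scenario exactly once."""
    first = submit()                      # submit job description
    if first.ok:
        return ("results", first.payload)     # send results to user
    second = restart()                    # submit "restart scenario" job
    if second.ok:
        return ("results", second.payload)    # send results to user
    return ("error", second.payload)          # return error description


# Usage: the first attempt fails, the restart from the checkpoint succeeds.
outcome = run_with_one_restart(
    submit=lambda: JobResult(False, "node failure"),
    restart=lambda: JobResult(True, "matrix result"),
)
print(outcome)
```

A second failure ends the workflow with an error instead of another restart, which is the limitation the slide points out.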
automatic scenario
[Figure: the proof-of-concept installation diagram, with the same components as on the "GCA proof of concept installation" slide. Additional labels: Application, Failure!]
end-user point of view
the benefits
user: more robust and fault-tolerant Grid environment
sysadmin: much easier system management thanks to the automatic checkpoint and recovery mechanism
Thank you!