
Rochester Institute of Technology

Job Submission

Andrew Pangborn & Myles Maxfield

04/20/23 Service Oriented Cyberinfrastructure Lab, http://grid.rit.edu 1

The Grid

• <Insert some structural picture of grid>?


The Problem

• At one end are computing resources managed by batch queuing systems and other middleware

• At the other end are end-users and their jobs/applications

• Need software and protocols for submitting jobs to the computing resources


Job Submission

• More motivation stuff?


Batch Queuing Systems

• Submitting a job directly to the batch queuing system

• One or more queues
  – Priorities

• Two common architectures
  – Client/server
  – Dynamic offloading

• User credential (delegation)

• Jobs have states (e.g. Pending, Running)
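
As a rough illustration of queues, priorities, and job states, here is a minimal Python sketch of a single priority queue. The class and field names (`BatchQueue`, `Job`, `JobState`) are hypothetical and do not come from any real batch system; real schedulers also track resources, users, and many more states.

```python
import heapq
import itertools
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"


@dataclass
class Job:
    name: str
    priority: int = 0
    state: JobState = JobState.PENDING


class BatchQueue:
    """Toy single queue: highest priority first, FIFO among equal priorities."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves submission order

    def submit(self, job):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-job.priority, next(self._counter), job))
        return job

    def start_next(self):
        """Move the best pending job to Running, or return None if the queue is empty."""
        if not self._heap:
            return None
        _, _, job = heapq.heappop(self._heap)
        job.state = JobState.RUNNING
        return job


queue = BatchQueue()
low = queue.submit(Job("low", priority=1))
high = queue.submit(Job("high", priority=5))
first = queue.start_next()
print(first.name, first.state.value)  # high Running
```

The counter in each heap entry keeps jobs of equal priority in submission order and ensures two `Job` objects are never compared directly.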


Batch Queuing Systems

• Important examples:
  – Portable Batch System
  – TORQUE
  – Xgrid
  – Sun Grid Engine
  – Load Sharing Facility
  – Condor


Portable Batch System (PBS)

• Originally developed for NASA

• Client/server architecture

• Server: pbs_server

• Client: pbs_mom

• Works with MPI: the job script can read environment variables that PBS sets for it (e.g. PBS_NODEFILE)


PBS Example

litherum@gras:~$ cat test.sh
#!/bin/sh
#testpbs
echo This is a test
echo today is `date`
echo This is `hostname`
echo The current working directory is `pwd`
ls -alF /home
uptime


PBS Example

litherum@gras:~$ qsub test.sh
6.gras.carrion.rit.edu
litherum@gras:~$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.gras                    test.sh          litherum        00:00:00 C batch
litherum@gras:~$ cat test.sh.o6
This is a test
today is Sat Jan 17 18:20:20 EST 2009
This is carrion02
The current working directory is /home/litherum
total 20
drwxr-xr-x 31 litherum litherum 4096 Jan 17 18:19 litherum/
 18:20:20 up 131 days, 21:20,  0 users,  load average: 0.00, 0.00, 0.00
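
A script that polls job status usually needs the qstat table in structured form. The following Python sketch parses the default six-column layout shown in the session above. The function name `parse_qstat` is hypothetical, and the sketch assumes job names contain no spaces, which the qstat output format does not guarantee in general.

```python
def parse_qstat(text):
    """Parse the default qstat table into a list of dicts, one per job."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row (8 words), and the dashed separator row.
        if len(fields) != 6 or set(fields[0]) == {"-"}:
            continue
        job_id, name, user, time_use, state, queue = fields
        jobs.append(
            {
                "id": job_id,
                "name": name,
                "user": user,
                "time_use": time_use,
                "state": state,  # single letter, e.g. Q (queued), R (running), C (completed)
                "queue": queue,
            }
        )
    return jobs


SAMPLE = """\
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.gras                    test.sh          litherum        00:00:00 C batch
"""

jobs = parse_qstat(SAMPLE)
print(jobs[0]["id"], jobs[0]["state"])  # 6.gras C
```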

Torque

• Built on top of PBS

• Supports reservations, where you can reserve specific resources for specific times.

• Supports partitions, where you can partition a cluster into smaller sub-clusters.


Torque

litherum@gras:~$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

     0 Active Jobs       0 of    4 Processors Active (0.00%)
                         0 of    2 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0


Xgrid

• Apple

• Essentially the same as Condor

• GUI! =)

• Client/server model

http://upload.wikimedia.org/wikipedia/en/6/62/XgridAdminTool.jpg


Sun Grid Engine

• Open source, like everything new Sun puts out

• Supports
  – Reservations
  – Job dependencies
  – Checkpointing
  – Multiple scheduling algorithms
  – Web interface

• Professional!


Load Sharing Facility

• Used by GRAM, which we’ll talk about later


Condor

• More about this later, but it implements its own scheduler


Challenging!

• These queuing systems are hard to use

• There may be many systems employed in a given grid

• Wouldn’t it be nice if all this were unified in a single implementation?


Condor

• A tool for pooling and “scavenging” computing resources and distributing jobs

• Similar to a batch queuing system [2]
  – Job management
  – Scheduling policy
  – Priority scheme
  – Resource monitoring
  – Resource management

• Also focuses on high-throughput and “opportunistic computing” [2]


Condor image from: http://www.cs.wisc.edu/condor/

Condor Universes [1]

• Standard

• Vanilla
  – Simpler; can run unmodified binaries (they do not need to be “condor compiled”)
  – No support for partial execution (checkpointing) or job relocation

• Others
  – PVM
  – MPI
  – Java


Condor Submission File Example [1]

# hello.sub
# Condor job file example
Universe   = Vanilla
Executable = hello
Output     = hello.out
Input      = hello.in
Error      = hello.err
Log        = hello.log
Queue
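
When many similar jobs are submitted, the submit file is often generated rather than written by hand. The Python sketch below renders a file like the one above from a dictionary. The helper name `render_submit` is hypothetical, and the sketch only emits `key = value` lines followed by a `Queue` statement; it does not validate the keys.

```python
def render_submit(settings, queue_count=1):
    """Render a Condor submit description from a mapping of settings.

    `settings` keeps insertion order, so the file lists keys as given.
    `queue_count` above 1 emits "Queue N", which queues N copies of the job.
    """
    lines = [f"{key} = {value}" for key, value in settings.items()]
    lines.append("Queue" if queue_count == 1 else f"Queue {queue_count}")
    return "\n".join(lines) + "\n"


hello = {
    "Universe": "vanilla",
    "Executable": "hello",
    "Output": "hello.out",
    "Input": "hello.in",
    "Error": "hello.err",
    "Log": "hello.log",
}

text = render_submit(hello)
print(text)
```

The resulting text can be written to a file such as hello.sub and passed to condor_submit.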


Condor Commands

• condor_submit <job_file.sub>


Condor Daemons

• On all Condor-deployed machines
  – Master
  – Startd
  – Schedd

• On the Condor pool master
  – Collector
  – Negotiator


GRAM [4]

• Globus Resource Allocation Manager (GRAM)
  – Resource allocation
  – Process creation
  – Monitoring
  – Management
  – Maps requests expressed in a Resource Specification Language (RSL) into commands to local schedulers and computers


GRAM

• Pluggable!

• Job description format has changed between versions (RSL in pre-WS GRAM, an XML job description in WS-GRAM)

• Will submit jobs to:
  – Condor
  – LSF
  – PBS/Torque
  – ???

• Unified interface: an identifier (e.g. -factory-type PBS) selects which cluster/service to use
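
The pluggable idea can be sketched in a few lines of Python: one scheduler-neutral job description, and one adapter per local scheduler that turns it into that scheduler's job file and submit command. This is only an illustration of the pattern, not how GRAM itself is implemented; the job dictionary layout and the names `to_pbs`, `to_condor`, `ADAPTERS`, and `translate` are all hypothetical.

```python
import shlex


def to_pbs(job):
    """Render a shell script to hand to qsub."""
    parts = [job["executable"], *job.get("arguments", [])]
    command = " ".join(shlex.quote(part) for part in parts)
    return "qsub", f"#!/bin/sh\n{command}\n"


def to_condor(job):
    """Render a submit description to hand to condor_submit."""
    lines = ["Universe = vanilla", f"Executable = {job['executable']}"]
    if job.get("arguments"):
        lines.append("Arguments = " + " ".join(job["arguments"]))
    lines.append("Queue")
    return "condor_submit", "\n".join(lines) + "\n"


# The scheduler identifier plays the role of "-factory-type" in globusrun-ws.
ADAPTERS = {"PBS": to_pbs, "Condor": to_condor}


def translate(job, scheduler):
    """Return (submit_command, job_file_text) for the chosen scheduler."""
    try:
        adapter = ADAPTERS[scheduler]
    except KeyError:
        raise ValueError(f"no adapter for scheduler {scheduler!r}") from None
    return adapter(job)


job = {"executable": "/bin/hostname"}
print(translate(job, "PBS"))
print(translate(job, "Condor"))
```

Supporting another scheduler means adding one function and one dictionary entry; callers keep using the same job description.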


GRAM Example

maxfield@tg-login1:~> globusrun-ws -submit -factory https://tg-login.ornl.teragrid.org:8444/wsrf/services/ManagedJobFactoryService -factory-type PBS -streaming -job-command /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:89538014-e4f2-11dd-81df-0010180bb4e6
Termination time: 01/18/2009 23:57 GMT
Current job state: Pending
Current job state: Active
tg-c15
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
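
The sequence of states a job passes through can be pulled out of a captured session like the one above. The Python sketch below assumes each state is reported on its own line as "Current job state: <name>", as in this transcript; the function name `job_states` is hypothetical, and other globusrun-ws versions may format their output differently.

```python
import re

STATE_LINE = re.compile(r"^Current job state: (?P<state>\S+)$")


def job_states(output):
    """Return the reported job states in the order they appear."""
    states = []
    for line in output.splitlines():
        match = STATE_LINE.match(line.strip())
        if match:
            states.append(match.group("state"))
    return states


SESSION = """\
Submitting job...Done.
Current job state: Pending
Current job state: Active
tg-c15
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
"""

states = job_states(SESSION)
print(states)  # ['Pending', 'Active', 'CleanUp-Hold', 'CleanUp', 'Done']
```

Lines that are job output (such as the hostname tg-c15) or progress messages are ignored because they do not match the pattern.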


Condor-G [4]

• Condor-G is a Globus-enabled version of the Condor scheduler. It uses Globus to handle inter-organizational problems like:
  – Security
  – Resource management for supercomputers
  – Executable staging

• The same Condor tools that access local resources are now able to use the Globus protocols to access resources at multiple sites.

• It communicates with these resources and transfers files to and from these resources using Globus mechanisms, such as:
  – GSI
  – GRAM protocol for job submission

• Condor-G can be used to submit jobs to systems managed by Globus.

• Globus tools can be used to submit jobs to systems managed by Condor.


Condor-G


UNICORE

• <couple slides on UNICORE>


Upperware

• Talk about motivation for upperware applications


GridShell


References

1. http://www.linuxjournal.com/node/9058/print - Getting started with Condor

2. Thain, D., Tannenbaum, T., & Livny, M. (2005). Distributed computing in practice: the Condor experience.

3. http://grid.rit.edu/seminar/lib/exe/fetch.php/users:jeremy_espenshade:condorjobsubmission.ppt - Jeremy Espenshade’s Condor job submission presentation

4. http://iag.iucc.ac.il/presentations/front2.ppt
