
Page 1: Batch Systems

Batch Systems

• In a number of scientific computing environments, multiple users must share a compute resource:
  – research clusters
  – supercomputing centers

• On multi-user HPC clusters, the batch system is a key component for aggregating compute nodes into a single, sharable computing resource

• The batch system becomes the “nerve center” for coordinating the use of resources and controlling the state of the system in a way that must be “fair” to its users

• As current and future expert users of large-scale compute resources, you need to be familiar with the basics of a batch system

Page 2: Batch Systems

Batch Systems

• The core functionality of all batch systems is essentially the same, regardless of the size or specific configuration of the compute hardware:
  – Multiple Job Queues:
    • queues provide an orderly environment for managing a large number of jobs
    • queues are defined with a variety of limits on maximum run time, memory usage, and processor count; they are often assigned different priority levels as well
    • queues may be interactive or non-interactive
  – Job Control:
    • submission of individual jobs to do some work (e.g., serial or parallel HPC applications)
    • simple monitoring and manipulation of individual jobs, and collection of resource usage statistics (e.g., memory usage, CPU usage, and elapsed wall-clock time per job); a typical command sequence is sketched after this list
  – Job Scheduling:
    • policy which decides priority between individual user jobs
    • allocates resources to scheduled jobs
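As a concrete illustration of the job-control functions above, a typical command sequence on an LSF-based system (covered in detail later in these slides) might look like the sketch below; the script name and job ID are only placeholders:

    bsub < job.script      # submit the job script
    bjobs                  # monitor your queued and running jobs
    bhist -l 11821         # detailed history / resource usage for a job
    bkill 11821            # delete the job if it is no longer needed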

Page 3: Batch Systems

Batch Systems

• Job Scheduling Policies:
  – the scheduler must decide how to prioritize all the jobs on the system and allocate the necessary resources for each job (processors, memory, file systems, etc.)
  – the scheduling process can be simple or non-trivial depending on the size of the system and the desired functionality
    • first-in, first-out (FIFO) scheduling: jobs are simply scheduled in the order in which they are submitted
    • political scheduling: some users are given higher priority than others
    • fairshare scheduling: the scheduler ensures users have equal access over time
  – Additional features may also impact scheduling order:
    • advanced reservations: resources can be reserved in advance for a particular user or job
    • backfill: can be combined with any of the scheduling paradigms to allow smaller jobs to run while waiting for enough resources to become available for larger jobs
  – backfill of smaller jobs helps maximize overall resource utilization
  – backfill can be your friend for short-duration jobs (see the sketch below)
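As an illustration (the executable name and sizes are hypothetical), a small job with a modest processor count and a short, accurate wall-clock request is a good backfill candidate, because the scheduler can slot it into nodes that would otherwise sit idle while a larger job waits for its full allocation:

    #!/bin/csh
    #BSUB -n 4          # small processor count
    #BSUB -W 0:30       # short, realistic run-time estimate (30 minutes)
    #BSUB -q normal
    ibrun ./small_job   # hypothetical executable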

Page 4: Batch Systems

Batch Systems

• Common batch systems you may encounter in scientific computing:
  – Platform LSF
  – PBS
  – Loadleveler (IBM)
  – SGE

• All have similar functionality but different syntax

• It is reasonably straightforward to convert your job scripts from one system to another

• All of the above accept batch-system-specific directives which can be placed in a shell script to request certain resources (processors, queues, etc.); a rough side-by-side sketch follows this list

• We will focus on LSF primarily since it is the system running on Lonestar
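As a rough sketch of the syntax differences (resource-request details vary between sites and versions, so treat the PBS column as illustrative only; the node/processor split on the PBS side is an assumption), the same request might be written as:

    # LSF                          # PBS (roughly equivalent)
    #BSUB -J myjob                 #PBS -N myjob
    #BSUB -n 32                    #PBS -l nodes=16:ppn=2
    #BSUB -q normal                #PBS -q normal
    #BSUB -W 0:15                  #PBS -l walltime=00:15:00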

Page 5: Batch Submission Process

Batch Submission Process

(Diagram: jobs travel over the internet to the head/server node, wait in a queue, and are dispatched to compute nodes C1–C4.)

• Submission: bsub < job
• Queue: the job script waits for resources on the server
• Master: the compute node that executes the job script and launches ALL MPI processes
• Launch: ssh to each compute node to start the executable (e.g., a.out)
  – mpirun -np # ./a.out
  – ibrun ./a.out

Page 6: LSF Batch System

LSF Batch System

• Lonestar uses Platform LSF for both the batch queuing system and the scheduling mechanism (provides similar functionality to PBS, but requires different commands for job submission and monitoring)

• LSF includes global fairshare, a mechanism for ensuring no one user monopolizes the computing resources

• Batch jobs are submitted on the front end and are subsequently executed on compute nodes as resources become available

• Order of job execution depends on a variety of parameters:

– Submission Time

– Queue Priority: some queues have higher priorities than others

– Backfill Opportunities: small jobs may be back-filled while waiting for bigger jobs to complete

– Fairshare Priority: users who have recently used a lot of compute resources will have a lower priority than those who are submitting new jobs

– Advanced Reservations: jobs may be blocked in order to accommodate advanced reservations (for example, during maintenance windows)

– Number of Actively Scheduled Jobs: there are limits on the maximum number of concurrent processors used by each user

Page 7: Lonestar Queue Definitions

Lonestar Queue Definitions

Queue Name    Max Runtime   Min/Max Procs   SU Charge Rate   Use
normal        24 hours      2/256           1.0              Normal usage
high          24 hours      2/256           1.8              Higher priority usage
development   15 min        1/32            1.0              Debugging and development; allows interactive jobs
hero          24 hours      >256            1.0              Large job submission; requires special permission
serial        12 hours      1/1             1.0              For serial jobs; no more than 4 jobs/user
request       -             -               -                Special requests
spruce        -             -               -                Debugging & development; special priority; urgent computing environment
systest       -             -               -                System use (TACC staff only)

Page 8: Lonestar Queue Definitions

Lonestar Queue Definitions

• Additional Queue Limits
  – In the normal and high queues, a maximum of 512 processors can be used at one time. Jobs requiring more processors are deferred for possible scheduling until running jobs complete. For example, a single user can have the following job combinations eligible for scheduling:
    • Run 2 jobs requiring 256 procs each
    • Run 4 jobs requiring 128 procs each
    • Run 8 jobs requiring 64 procs each
    • Run 16 jobs requiring 32 procs each

– A maximum of 25 queued jobs per user is allowed at one time

Page 9: LSF Fairshare

LSF Fairshare

• A global fairshare mechanism is implemented on Lonestar to provide fair access to its substantial compute resources

• Fairshare computes a dynamic priority for each user and uses this priority in making scheduling decisions

• Dynamic priority is based on the following criteria:
  – Number of shares assigned
  – Resources used by jobs belonging to the user:
    • Number of job slots reserved
    • Run time of running jobs
    • Cumulative actual CPU time (not normalized), adjusted so that recently used CPU time is weighted more heavily than CPU time used in the distant past
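For orientation only (the weighting factors are configured per site, so this is a sketch of the general form described in Platform LSF documentation, not the exact formula used on Lonestar), the dynamic priority behaves roughly as:

    dynamic priority ≈ number_of_shares /
        (cpu_time × CPU_TIME_FACTOR + run_time × RUN_TIME_FACTOR + (1 + job_slots) × RUN_JOB_FACTOR)

Heavier recent usage therefore increases the denominator and lowers the priority.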

Page 10: LSF Fairshare

LSF Fairshare

• bhpart: Command to see current fairshare priority. For example:

lslogin1--> bhpart -r
HOST_PARTITION_NAME: GlobalPartition
HOSTS: all

SHARE_INFO_FOR: GlobalPartition/
USER/GROUP   SHARES   PRIORITY   STARTED   RESERVED   CPU_TIME    RUN_TIME
avijit       1        0.333      0         0          0.0         0
chona        1        0.333      0         0          0.0         0
ewalker      1        0.333      0         0          0.0         0
minyard      1        0.333      0         0          0.0         0
phaa406      1        0.333      0         0          0.0         0
bbarth       1        0.333      0         0          0.0         0
milfeld      1        0.333      0         0          2.9         0
karl         1        0.077      0         0          51203.4     0
vmcalo       1        0.000      320       0          2816754.8   7194752

(The PRIORITY column shows each user's current dynamic fairshare priority.)

Page 11: Commonly Used LSF Commands

Commonly Used LSF Commands

bhosts     Displays configured compute nodes and their static and dynamic resources (including job slot limits)
lsload     Displays dynamic load information for compute nodes (average CPU usage, memory usage, available /tmp space)
bsub       Submits a batch job to LSF
bqueues    Displays information about available queues
bjobs      Displays information about running and queued jobs
bhist      Displays historical information about jobs
bstop      Suspends unfinished jobs
bresume    Resumes one or more suspended jobs
bkill      Sends a signal to kill, suspend, or resume unfinished jobs
bhpart     Displays global fairshare priority
lshosts    Displays hosts and their static resource configuration
lsuser     Shows user job information

Note: most of these commands support a "-l" argument for long listings. For example, bhist -l <jobID> will give a detailed history of a specific job. Consult the man pages for each of these commands for more information.


Page 12: LSF Batch System

LSF Batch System

• LSF-Defined Environment Variables:

LSB_ERRORFILE name of the error file

LSB_JOBID batch job id

LS_JOBPID process id of the job

LSB_HOSTS list of hosts assigned to the job. Multi-cpu hosts will appear more than once (may get truncated)

LSB_QUEUE batch queue to which job was submitted

LSB_JOBNAME name user assigned to the job

LS_SUBCWD directory of submission, i.e. this variable is set equal to $cwd when the job is submitted

LSB_INTERACTIVE set to 'y' when the -I option is used with bsub
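A minimal sketch of how these variables might be used inside a job script (the executable and output file names are only placeholders):

    #!/bin/csh
    echo "Job $LSB_JOBID ($LSB_JOBNAME) running in queue $LSB_QUEUE"
    echo "Assigned hosts: $LSB_HOSTS"
    cd $LS_SUBCWD                       # return to the directory of submission
    ibrun ./a.out > output.$LSB_JOBID   # tag output files with the job id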

Page 13: LSF Batch System

LSF Batch System

• Comparison of LSF, PBS and Loadleveler commands that provide similar functionality

LSF        PBS           Loadleveler
bresume    qrls | qsig   llhold -r
bsub       qsub          llsubmit
bqueues    qstat         llclass
bjobs      qstat         llq
bstop      qhold         llhold
bkill      qdel          llcancel

Page 14: Batch System Concerns

Batch System Concerns

• Submission (need to know):
  – Required Resources
  – Run-time Environment
  – Directory of Submission
  – Directory of Execution
  – Files for stdout/stderr Return
  – Email Notification

• Job Monitoring

• Job Deletion
  – Queued Jobs
  – Running Jobs

Page 15: LSF: Basic MPI Job Script

LSF: Basic MPI Job Script

#!/bin/csh
#BSUB -n 32          # total number of processes
#BSUB -J hello       # job name
#BSUB -o %J.out      # stdout output file name (%J = jobID)
#BSUB -e %J.err      # stderr output file name
#BSUB -q normal      # submission queue
#BSUB -P A-ccsc      # your project name
#BSUB -W 0:15        # max run time (15 minutes)

# echo pertinent environment info
echo "Master Host = "`hostname`
echo "LSF_SUBMIT_DIR: $LS_SUBCWD"
echo "PWD_DIR: "`pwd`

# execution command: ibrun is the parallel application manager / mpirun
# wrapper script that launches the executable
ibrun ./hello

Page 16: LSF: Extended MPI Job Script

LSF: Extended MPI Job Script

#!/bin/csh
#BSUB -n 32                # total number of processes
#BSUB -J hello             # job name
#BSUB -o %J.out            # stdout output file name (%J = jobID)
#BSUB -e %J.err            # stderr output file name
#BSUB -q normal            # submission queue
#BSUB -P A-ccsc            # your project name
#BSUB -W 0:15              # max run time (15 minutes)
#BSUB -w 'ended(1123)'     # dependency on job <1123>
#BSUB -u [email protected]   # email address
#BSUB -B                   # email when job begins execution
#BSUB -N                   # email job report information upon completion

echo "Master Host = "`hostname`
echo "LSF_SUBMIT_DIR: $LS_SUBCWD"

ibrun ./hello
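As a sketch of how the -w option can chain jobs (the script names, job ID, and executables here are hypothetical): submit the first job, note the job ID that bsub reports, and reference that ID in the dependent script:

    lslogin1% bsub < prep.job      # suppose LSF reports: Job <1123> is submitted ...

    # dependent script: starts only after job 1123 has ended
    #BSUB -n 32
    #BSUB -J solve
    #BSUB -w 'ended(1123)'
    ibrun ./solver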

Page 17: LSF: Job Script Submission

LSF: Job Script Submission

• When submitting jobs to LSF using a job script, redirection is required for bsub to read the commands. Consider the following script:

lslogin1> cat job.script
#!/bin/csh
#BSUB -n 32
#BSUB -J hello
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
#BSUB -W 0:15
echo "Master Host = "`hostname`
echo "LSF_SUBMIT_DIR: $LS_SUBCWD"
echo "PWD_DIR: "`pwd`

ibrun ./hello

• To submit the job:

lslogin1% bsub < job.script

Redirection is required!

Page 18: LSF: Interactive Execution

LSF: Interactive Execution

• Several ways to run interactively:

  – Submit the entire command to bsub directly:

    > bsub -q development -I -n 2 -W 0:15 ibrun ./hello

    Your job is being routed to the development queue
    Job <11822> is submitted to queue <development>.
    <<Waiting for dispatch ...>>
    <<Starting on compute-1-0>>
     Hello, world!
     --> Process # 0 of 2 is alive. ->compute-1-0
     --> Process # 1 of 2 is alive. ->compute-1-0

  – Submit using a normal job script and include the additional -I option:

> bsub -I < job.script

Page 19: Batch Script Suggestions

Batch Script Suggestions

• Echo commands as they are issued ("set -x" for ksh, "set echo" for csh)

• Avoid absolute pathnames
  – Use relative path names or environment variables ($HOME, $WORK)

• Abort the job when a critical command fails

• Print the environment
  – Include the "env" command if your batch job doesn't execute the same way as an interactive run

• Use the "./" prefix for executing commands in the current directory
  – The dot means to look for commands in the present working directory. Not all systems include "." in your $PATH variable (usage: ./a.out)

• Track your CPU time
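A minimal csh sketch pulling these suggestions together (the program and file names are placeholders):

    #!/bin/csh
    set echo                        # echo each command as it executes
    cd $LS_SUBCWD                   # work relative to the submission directory
    env > env.$LSB_JOBID            # record the environment for debugging

    ./prepare input.dat             # "./" prefix: run from the current directory
    if ($status != 0) then          # abort when a critical command fails
        echo "prepare failed"
        exit 1
    endif

    time ibrun ./a.out              # track CPU and elapsed time for the main run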

Page 20: LSF Job Monitoring (showq utility)

LSF Job Monitoring (showq utility)

lslogin1% showq
ACTIVE JOBS--------------------
JOBID    JOBNAME        USERNAME   STATE     PROC   REMAINING   STARTTIME
11318    1024_90_96x6   vmcalo     Running   64     18:09:19    Fri Jan  9 10:43:53
11352    naf            phaa406    Running   16     17:51:15    Fri Jan  9 10:25:49
11357    24N            phaa406    Running   16     18:19:12    Fri Jan  9 10:53:46
  23 Active jobs    504 of 556 Processors Active (90.65%)

IDLE JOBS----------------------
JOBID    JOBNAME        USERNAME   STATE     PROC   WCLIMIT     QUEUETIME
11169    poroe8         xgai       Idle      128    10:00:00    Thu Jan  8 10:17:06
11645    meshconv019    bbarth     Idle      16     24:00:00    Fri Jan  9 16:24:18
  3 Idle jobs

BLOCKED JOBS-------------------
JOBID    JOBNAME        USERNAME   STATE      PROC   WCLIMIT     QUEUETIME
11319    1024_90_96x6   vmcalo     Deferred   64     24:00:00    Thu Jan  8 18:09:11
11320    1024_90_96x6   vmcalo     Deferred   64     24:00:00    Thu Jan  8 18:09:11
  17 Blocked jobs

Total Jobs: 43   Active Jobs: 23   Idle Jobs: 3   Blocked Jobs: 17

Page 21: LSF Job Monitoring (bjobs command)

LSF Job Monitoring (bjobs command)

lslogin1% bjobs
JOBID   USER     STAT   QUEUE    FROM_HOST   EXEC_HOST        JOB_NAME     SUBMIT_TIME
11635   bbarth   RUN    normal   lonestar    2*compute-8      *shconv009   Jan  9 16:24
                                             2*compute-9-22
                                             2*compute-3-25
                                             2*compute-8-30
                                             2*compute-1-27
                                             2*compute-4-2
                                             2*compute-3-9
                                             2*compute-6-13
11640   bbarth   RUN    normal   lonestar    2*compute-3      *shconv014   Jan  9 16:24
                                             2*compute-6-2
                                             2*compute-6-5
                                             2*compute-3-12
                                             2*compute-4-27
                                             2*compute-7-28
                                             2*compute-3-5
                                             2*compute-7-5
11657   bbarth   PEND   normal   lonestar                     *shconv028   Jan  9 16:38
11658   bbarth   PEND   normal   lonestar                     *shconv029   Jan  9 16:38
11662   bbarth   PEND   normal   lonestar                     *shconv033   Jan  9 16:38
11663   bbarth   PEND   normal   lonestar                     *shconv034   Jan  9 16:38
11667   bbarth   PEND   normal   lonestar                     *shconv038   Jan  9 16:38
11668   bbarth   PEND   normal   lonestar                     *shconv039   Jan  9 16:38

Note: Use “bjobs -u all” to see jobs from all users.

Page 22: LSF Job Monitoring (lsuser utility)

LSF Job Monitoring (lsuser utility)

lslogin1$ lsuser -u vap
JOBID    QUEUE    USER   NAME            PROCS   SUBMITTED
547741   normal   vap    vap_hd_sh_p96   14      Tue Jun  7 10:37:01 2005

HOST            R15s   R1m   R15m   PAGES    MEM     SWAP    TEMP
compute-11-11   2.0    2.0   1.4    4.9P/s   1840M   2038M   24320M
compute-8-3     2.0    2.0   2.0    1.9P/s   1839M   2041M   23712M
compute-7-23    2.0    2.0   1.9    2.3P/s   1838M   2038M   24752M
compute-3-19    2.0    2.0   2.0    2.6P/s   1847M   2041M   23216M
compute-14-19   2.0    2.0   2.0    2.1P/s   1851M   2040M   24752M
compute-3-21    2.0    2.0   1.7    2.0P/s   1845M   2038M   24432M
compute-13-11   2.0    2.0   1.5    1.8P/s   1841M   2040M   24752M

Page 23: LSF Job Manipulation/Monitoring

LSF Job Manipulation/Monitoring

• To kill a running or queued job (takes ~30 seconds to complete):
    bkill <jobID>
    bkill -r <jobID>   (use when bkill alone won't delete the job)

• To suspend a queued job:
    bstop <jobID>

• To resume a suspended job:
    bresume <jobID>

• To see more information on why a job is pending:
    bjobs -p <jobID>

• To see a historical summary of a job:
    bhist <jobID>

    lslogin1> bhist 11821
    Summary of time in seconds spent in various states:
    JOBID   USER   JOB_NAME   PEND   PSUSP   RUN   USUSP   SSUSP   UNKWN   TOTAL
    11821   karl   hello      131    0       127   0       0       0       258