© 2014 IBM Corporation

SLURM for Yorktown Bluegene/Q

SLURM on Wat2q

Goals:

• Set up a scheduler for the Yorktown Bluegene system to increase research utilization of the system.

• Become familiar with the Bluegene/Q SRM (system resource manager) interfaces, as they are a model for future HPC control APIs.

• Divide the Yorktown system into multiple submidplane blocks.
  – Develop scripts to allow users (optionally) to land on a specific submidplane block.

• Get SLURM to run the bgas.pl script automatically, based on information in the SLURM sbatch command used to queue a job.
  – This requires that jobs be limited to running on complete partitions.
  – By default, SLURM will attempt to run a job on part of a submidplane partition if that partition is already booted.
  – The restriction to complete partitions is enforced with prolog scripts.


SLURM Scheduling Jobs


SLURM Allocation Vs. Task Placement

Allocation is the selection of the resources needed for the job.
– Each job includes zero or more job steps (srun).
– Each job step comprises one or more tasks.
– Allocation is done by the “sbatch” command.

Task placement is the process of assigning a subset of the job’s allocated resources (CPUs) to each task.
– This is handled by the SLURM “srun” command invoked from within the script scheduled by “sbatch”.


Effectively this becomes a game of Tetris


Slurm documentation

Slurm docs can be found here:
– http://slurm.schedmd.com/documentation.html

Typical commands:

sacct      Display accounting data for all jobs and job steps in the SLURM job accounting log.
sbatch     Submit a batch job to SLURM.
scancel    Signal jobs or job steps that are under the control of SLURM.
scontrol   View and modify SLURM configuration and state.
sinfo      View information about SLURM nodes and partitions.
smap       Graphically view information about SLURM jobs and partitions, and set configuration parameters.
squeue     View information about jobs located in the SLURM scheduling queue.
srun       Run parallel jobs.
sstat      Display various status information of a running job/step.
sview      Graphical user interface to view and modify SLURM state.


SLURM functions

SLURMD carries out five key tasks and has five corresponding subsystems:

– Machine Status
  • Responds to SLURMCTLD requests for machine state information and sends asynchronous reports of state changes to help with queue control.

– Job Status
  • Responds to SLURMCTLD requests for job state information and sends asynchronous reports of state changes to help with queue control.

– Remote Execution
  • Starts, monitors, and cleans up after a set of processes (usually shared by a parallel job), as decided by SLURMCTLD (or by direct user intervention).

– Stream Copy Service
  • Handles all STDERR, STDIN, and STDOUT for remote tasks. This may involve redirection, and it always involves locally buffering job output to avoid blocking local tasks.

– Job Control
  • Propagates signals and job-termination requests to any SLURM-managed processes (often interacting with the Remote Execution subsystem).


Slurm software

SLURM daemons don’t execute directly on the compute nodes.

SLURM gets system state from the Bluegene/Q control system and allocates resources through it.

This interface is entirely contained in a SLURM plugin (src/plugins/select/bluegene).

The user interacts with the Bluegene through the following SLURM commands:
– sbatch
– srun
– scontrol
– squeue


Slurm Architecture for Bluegene/Q


Job Launch Process


Sview of BlueGene system


Slurm naming conventions

Bgq name    Slurm name
R00-M0      bgq0000
R00-M1      bgq0001
R01-M0      bgq0010
R01-M1      bgq0011
R00         bgq[0000x0001]
R01         bgq[0010x0011]
R00R01      bgq[0000x0011]

• Slurm names things with torus coordinates.
• Top-level names use 4-dimension midplane coordinates.
• Submidplane partitions use 5-dimension torus coordinates.

Bgq name      Slurm name
R00-M0-N00    bgq0000[00000x11111]
R00-M0-N01    bgq0000[00200x11311]
R00-M0-N02    bgq0000[00020x11131]
R00-M0-N03    bgq0000[00220x11331]
R00-M0-N04    bgq0000[20000x31111]
R00-M0-N05    bgq0000[20200x31311]
R00-M0-N06    bgq0000[20020x31131]
R00-M0-N07    bgq0000[20220x31331]
R00-M0-N08    bgq0000[02000x13111]
R00-M0-N09    bgq0000[02200x13311]
R00-M0-N10    bgq0000[02020x13131]
R00-M0-N11    bgq0000[02220x13331]
R00-M0-N12    bgq0000[22000x33111]
R00-M0-N13    bgq0000[22200x33311]
R00-M0-N14    bgq0000[22020x33131]
R00-M0-N15    bgq0000[22220x33331]

Example larger block:

Bgq name          Slurm name
R01-M0-N00-128    bgq0010[00000x11331]
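The midplane-level naming rule can be captured in a small lookup helper, which is convenient when building a --nodelist value from a Bgq location. This is only a sketch: the function name midplane_to_slurm is hypothetical, and it covers just the four midplanes listed in the table above.

```shell
# Hypothetical helper: map a Bgq midplane location to its Slurm node name.
# Only the four Yorktown midplanes from the naming table are known.
midplane_to_slurm() {
    case "$1" in
        R00-M0) echo bgq0000 ;;
        R00-M1) echo bgq0001 ;;
        R01-M0) echo bgq0010 ;;
        R01-M1) echo bgq0011 ;;
        *)      echo "unknown midplane: $1" >&2; return 1 ;;
    esac
}

# Example use (not executed here):
#   sbatch --nodelist="$(midplane_to_slurm R00-M1)" --nodes=64 rj01.sh
```

A lookup table is used rather than arithmetic on the rack and midplane numbers because the slides only document these four names.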


Slurm queuing a JOB.

Use the sbatch command to queue a script that will run one or more jobs.

Within the script passed to the sbatch command, issue one or more “srun” commands.
– Each srun command eventually causes a runjob command to be created.

For example:
– This schedules the script rj01.sh to run when a 64-node block on the partition “prod” is booted:
    sbatch --nodes=64 --partition=prod rj01.sh

– Inside rj01.sh we have:
    #!/bin/bash
    srun --chdir=/bgusr/home1/bvt_scratch /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf

– The srun will call runjob as follows:
    runjob --exe /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf --block RMP28Ap122959767 --cwd /bgusr/home1/bvt_scratch


Queuing a job with only one script.

Using sbatch/srun to queue a job typically requires two scripts: one to queue the job (sbatch) and one to run one or more jobs (srun) once the block is allocated.

One can do this with a single script using this simple boilerplate. When the user runs the script, SLURM_JOBID is unset, so the script submits itself; when SLURM later runs it, SLURM_JOBID is set and the srun branch executes.

    #!/bin/bash
    if [ -z "$SLURM_JOBID" ]; then
        # Not yet inside a SLURM job: queue this same script.
        sbatch --gid=bqluan --time=5:00 --nodes=128 --ntasks-per-node=32 -O --qos=umax-128 "$0"
    else
        # Running under SLURM: launch the application on the allocated block.
        srun --chdir=/gpfs/DDNgpfs2/bqluan/mushroomP \
            --output=equilibrate-4V-21-new.out --error=equilibrate-4V-21-new.err \
            /gpfs/DDNgpfs1/smts/bin/bgq/namd2.9 equilibrate-4V-21-new.namd
    fi

The above script is a re-expression of the following (original) runjob script:

    runjob --block R01-M0-N04-128 --ranks-per-node 32 --cwd /gpfs/DDNgpfs2/bqluan/mushroomP \
        --exe /gpfs/DDNgpfs1/smts/bin/bgq/namd2.9 \
        --args equilibrate-4V-21-new.namd > equilibrate-4V-21-new.out 2> equilibrate-4V-21-new.err &


Srun/runjob decoder

Runjob option        Srun option
--cwd                --chdir
--exe                (first field without an option)
--label xx           --label=xx
--verbose            --verbose
--ranks-per-node     --ntasks-per-node
All other options    --launcher-opts=

• Launcher options is a catch-all for all other runjob options.
• For example:
    --launcher-opts="--timeout 300 --strace"
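The decoder table can be applied mechanically. The sketch below rewrites a runjob argument list as an srun command line using only the mappings in the table; the function name runjob_to_srun is hypothetical, --verbose is treated as taking no argument because that is how the table shows it, and every unrecognised word is passed through --launcher-opts.

```shell
# Hypothetical helper: print the srun command line equivalent to the given
# runjob arguments, following the decoder table literally.
runjob_to_srun() {
    srun_opts=""        # options that have a direct srun equivalent
    launcher_opts=""    # everything else, handed to --launcher-opts
    exe=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --cwd)            srun_opts="$srun_opts --chdir=$2"; shift 2 ;;
            --ranks-per-node) srun_opts="$srun_opts --ntasks-per-node=$2"; shift 2 ;;
            --label)          srun_opts="$srun_opts --label=$2"; shift 2 ;;
            --verbose)        srun_opts="$srun_opts --verbose"; shift ;;
            --exe)            exe=$2; shift 2 ;;
            *)                launcher_opts="${launcher_opts:+$launcher_opts }$1"; shift ;;
        esac
    done
    cmd="srun$srun_opts"
    if [ -n "$launcher_opts" ]; then
        cmd="$cmd --launcher-opts=\"$launcher_opts\""
    fi
    # The executable is the first srun field without an option.
    printf '%s\n' "$cmd $exe"
}
```

For instance, `runjob_to_srun --cwd /tmp/run --ranks-per-node 32 --exe ./a.out --timeout 300` prints `srun --chdir=/tmp/run --ntasks-per-node=32 --launcher-opts="--timeout 300" ./a.out`. The sketch does not handle --block, because under SLURM the block is chosen by the scheduler.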


Partitions (SLURM queue names).

We have set up multiple basic SLURM queues (partitions).
– prod: regular production nodes (R00-M0, R00-M1, R01-M0, R01-M1).
– bgas: full system bgas allocation (R00-M0, R00-M1, R01-M0, R01-M1).

There are a couple of midplane-level reservations set up to run each day.
– bgas_daily: active 3 am to 3:30 pm.
– bgas_full: active 3:30 pm to 6 pm.

The default queue/partition is the “prod” queue.

The queue/partition name is used by the prolog script to determine whether it is necessary to switch the IO nodes to either BGAS or production.


SLURM small block divisions.

Block divisions as of May 2014:
– bgq0000 (R00-M0): divided into 16 32-way blocks.
– bgq0001 (R00-M1): divided into 32-, 64-, 128-, and 256-way (overlapping) blocks.
– bgq0010 (R01-M0): divided into 64-, 128-, and 256-way (overlapping) blocks.
– bgq0011 (R01-M1): divided into 64-, 128-, and 256-way (overlapping) blocks.

• The sbatch option “--nodes=xx”, where xx is 32, 64, 128, or 256, will cause a job to land on one of the small block partitions. SLURM will pick which small block to run it on.

• Prolog scripts ensure that partial blocks are not used (i.e. two 32-way jobs running on the same 64-way block at the same time).

• You can restrict which midplane SLURM selects its blocks from with the --nodelist=xxxx option, where xxxx is bgq0000, bgq0001, bgq0010, or bgq0011.


Getting SLURM to run on a specific node card/block

To get SLURM to land on a specific block we use the prolog script together with the “nodelist” and “constraint” options for sbatch.

For example:
    sbatch --partition=prod --nodelist=bgq0000 --nodes=32 --constraint=N00-32

NOTE:
– The --nodes option and the constraint must agree as to the size.
– A sub-block of that size MUST exist on the nodelist requested.

Valid constraints are:
– Nxx-32, where xx is 00-15
– Nxx-64, where xx is 00, 02, 04, 06, 08, 10, 12, 14
– Nxx-128, where xx is 00, 04, 08, 12
– Nxx-256, where xx is 00, 08
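The constraint rules above reduce to two checks: the size must equal the --nodes value, and the node card number must be a multiple of size/32 in the range 00-15. A minimal sketch of a pre-submission check follows; the function name valid_constraint is hypothetical, and it does not check whether the chosen midplane actually contains a sub-block of that size.

```shell
# Hypothetical helper: succeed if CONSTRAINT (Nxx-SIZE) is one of the valid
# constraints and agrees with the NODES value passed to --nodes.
valid_constraint() {
    constraint=$1
    nodes=$2
    case "$constraint" in
        N[0-9][0-9]-32|N[0-9][0-9]-64|N[0-9][0-9]-128|N[0-9][0-9]-256) ;;
        *) return 1 ;;
    esac
    card=${constraint#N}
    card=${card%%-*}            # two-digit node card number, e.g. "04"
    size=${constraint##*-}      # block size, e.g. "128"
    [ "$size" = "$nodes" ] || return 1
    card=${card#0}              # drop a leading zero so "08" is not read as octal
    [ "$card" -le 15 ] || return 1
    stride=$(( $size / 32 ))    # number of node cards spanned by the block
    [ $(( $card % $stride )) -eq 0 ]
}
```

For example, `valid_constraint N04-128 128` succeeds, while `valid_constraint N03-64 64` and `valid_constraint N00-32 64` both fail.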

If the block cannot be scheduled, the job will be canceled and a message will appear in the stdout file (slurm-$jobid.out).

Using the higher-numbered Nxx cards for 32-way and 64-way jobs is discouraged, because the system tries the lower-numbered cards first and works through each node card in turn until it reaches the requested card.


SLURM Job order.

If the user uses the --constraint parameter to select a specific node card, the order in which jobs were submitted may not be respected.

This is because the prolog scripts can reject the node SLURM first selects, either because SLURM is trying to run on a block larger than requested or because of a constraint.

– When the job is rejected on a specific node, it gets re-queued, and this causes some reordering.

If job order is required, one can use the --dependency=singleton and --job-name options as follows. A singleton job starts only after earlier jobs with the same job name and user have finished.
– sbatch --job-name=a --dependency=singleton -N32 --constraint=N01-32 rj01.s

Another way to do this is with the other “--dependency” types:
– after:job_id[:jobid...] : This job can begin execution after the specified jobs have begun execution.
– afterany:job_id[:jobid...] : This job can begin execution after the specified jobs have terminated.
– afternotok:job_id[:jobid...] : This job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc).
– afterok:job_id[:jobid...] : This job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
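When chaining jobs from a script, the dependency value has to be assembled from the ids of the jobs already submitted. The sketch below builds that value; the function name dependency_arg and the script names in the comments are hypothetical, and the usage assumes the installed sbatch supports --parsable (which prints just the job id).

```shell
# Hypothetical helper: build the value for sbatch --dependency from a
# dependency type and one or more job ids, e.g. "afterok:101:102".
dependency_arg() {
    dep=$1
    shift
    for id in "$@"; do
        dep="$dep:$id"
    done
    printf '%s\n' "$dep"
}

# Example use (not executed here):
#   first=$(sbatch --parsable step1.sh)
#   sbatch --dependency="$(dependency_arg afterok "$first")" step2.sh
```

Here `dependency_arg afterok 101 102` prints `afterok:101:102`.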


SLURM – reservations.

SLURM can reserve an entire midplane for jobs that use a specific reservation id.

The current version can only reserve entire midplane blocks (not sub-midplane).
– The September release of SLURM is supposed to have better sub-midplane capabilities for both node selection and reservations.

Creating a reservation:
    scontrol create reservation user=myid starttime=now duration=120 \
        nodes=bgq0001

– This will reply with a reservation id as follows:
    Reservation created: myid_5

Using the reservation:
    sbatch --reservation=myid_5 --nodes=64 my.script
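In a script, the reservation id can be picked out of the scontrol reply and handed to sbatch. This is a minimal sketch that assumes the reply has exactly the “Reservation created: <id>” form shown above; the function name reservation_id is hypothetical.

```shell
# Hypothetical helper: read scontrol output on stdin and print the id from
# a line of the form "Reservation created: <id>". Other lines are ignored.
reservation_id() {
    sed -n 's/^Reservation created: *//p'
}

# Example use (not executed here):
#   resv=$(scontrol create reservation user=myid starttime=now \
#              duration=120 nodes=bgq0001 | reservation_id)
#   sbatch --reservation="$resv" --nodes=64 my.script
```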

This web page outlines reservations in more detail

https://computing.llnl.gov/linux/slurm/reservations.html


Reservation Time-limit interaction.

Each job in the queue has an execution time limit imposed on it.

The default normally comes from the queue (partition).
– It can be overridden at various levels, such as on the sbatch command line.
– The initial default for the SLURM queues is 1 hour; to override it, use the --time parameter on sbatch as follows:
    sbatch --time=xxx nameofscript.sh
  • The xxx value is in minutes; other time formats are described in the sbatch man page (“man sbatch”).

The job will not run if its time limit overlaps a node reservation.
– For example, if there is a reservation every day at 3:30 pm for the entire machine and the job’s time limit would overlap that full-system reservation, the job will not run until after the reservation is over.
– If the time limit exceeds the queue/partition time limit, the job will be left in the pending state indefinitely.
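The overlap rule is simple arithmetic: a job can start only if the current time plus its time limit ends no later than the start of the next reservation. The sketch below illustrates that check for times within a single day; the function names to_minutes and fits_before_reservation are hypothetical, and this is not how SLURM itself implements the test.

```shell
# Convert HH:MM to minutes since midnight. Leading zeros are stripped so
# that values such as "08" are not parsed as octal.
to_minutes() {
    hh=${1%%:*}
    mm=${1##*:}
    hh=${hh#0}
    mm=${mm#0}
    echo $(( ${hh:-0} * 60 + ${mm:-0} ))
}

# Succeed if a job starting at NOW (HH:MM) with a time limit of LIMIT
# minutes would end no later than RES_START (HH:MM) on the same day.
fits_before_reservation() {
    now=$(to_minutes "$1")
    res_start=$(to_minutes "$2")
    limit=$3
    [ $(( $now + $limit )) -le "$res_start" ]
}
```

With the default 1 hour limit and a reservation starting at 15:30, `fits_before_reservation 14:00 15:30 60` succeeds, while `fits_before_reservation 15:00 15:30 60` fails; in the second case a shorter --time value would let the job start.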


QOS settings.

QOS (quality of service) settings are used by SLURM to control limits on the amount of resources a given user/group/account/job can consume at any one time.

Our initial deployment of SLURM will associate a default QOS setting with each user, limiting them to the total number of compute nodes that they previously had as a static allocation.

This keeps users from consuming all of the machine by submitting multiple sbatch commands, but still allows a user to run up to four 32-way jobs if their normal allocation was 128 nodes.

Each user will have a “default QOS” setting associated with their ID, as well as a list of QOS settings they are allowed to use.

– umax-32: user max nodes = 32
– umax-64: user max nodes = 64
– umax-128: user max nodes = 128
– …

One can select one of the authorized QOS settings on the sbatch command line as follows:
    sbatch --qos=umax-128 --nodes=32 xx.sh

– The above command would allow the user to run four 32-way jobs in parallel before the queue backs up their jobs behind other work.
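The number of jobs a umax-N setting admits at once is just N divided by the job size. A small sketch of that arithmetic follows; the function name max_parallel_jobs is hypothetical, and it assumes the QOS name has the umax-N form listed above.

```shell
# Hypothetical helper: number of jobs of JOB_NODES nodes each that can run
# at the same time under a QOS named umax-N.
max_parallel_jobs() {
    qos_limit=${1#umax-}     # "umax-128" -> "128"
    job_nodes=$2             # value passed to --nodes
    echo $(( $qos_limit / $job_nodes ))
}
```

For the command above, `max_parallel_jobs umax-128 32` prints 4.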
