
Page 1

Runtime environment: SLURM Basics

HPC 101 Shaheen II Training Workshop

Dr Samuel Kortas, Computational Scientist

KAUST Supercomputing Laboratory, [email protected]

19 September 2017

Page 2

Outline

• Resources available
• Why scheduling?
• What are the main steps of an HPC computation?
• The scheduling process and the most often used SLURM commands
• How to place my program on the allocated resources?
• SLURM's interesting features

→ comprehensive example in the application talk!


Page 4

Which types of resources are available?

• Computing power
• Fast access storage (Burst Buffer)
• Software licenses (commercial applications)
• Mass storage (Lustre)
• KSL expertise

Page 5

One at a time… to guarantee performance and turnaround time

• Computing power (1 full node per user)
• Fast access storage (reserved space)
• Software licenses (tokens)
• Mass storage (Lustre filesystem)
• KSL expertise (us!)

Page 6

It's our job to help you… but we need input from you.

• Computing power = core × hours = project application
• Fast access storage (SSD) = Burst Buffer application
• Mass storage (Lustre) = in project application
• KSL expertise = in project application

[email protected], [email protected], [email protected], [email protected], …

Page 7

Outline

• Resources available
• Why scheduling?
• What are the main steps of an HPC computation?
• The scheduling process and the most often used SLURM commands
• How to place my program on the allocated resources?
• SLURM's interesting features

→ comprehensive example in the application talk!

Page 8

Why scheduling? …to optimize shared resource use.

[Diagram: jobs of different sizes (e.g., a 4-node, 20-minute job) packed over TIME onto the 6,174 available nodes.]

On Shaheen II and Dragon: SLURM. The same principle applies with PBS, LoadLeveler, Sun Grid Engine, …

Page 9

The concept of backfilling

[Diagram: a schedule of jobs over TIME. Resources available: nodes, memory, disk space, …]

In order to run the big job, the scheduler has to make some room on the machine, so some empty space appears.

Only jobs small enough will fit in the gap (in the diagram, jobs of 10 H, 18 H and 24 H are considered; only the one short enough fits).

It is in your best interest to set a realistic job duration.

Page 10

The concept of backfilling (continued)

[Diagram: the same schedule on the 6,174 available nodes once the short job has been backfilled into the gap.]

Page 11

Shaheen II's Resources

• Shaheen II has 6,174 compute nodes
• Each node has 2 sockets
• Each socket has 16 cores
• Each core has 2 hyperthreads (the choice by default)
• → 197,568 cores
• → twice as many hyperthreads, and 1 SLURM CPU = 1 hyperthread

[Diagram: 36 racks of compute nodes forming the workq partition.]

Page 12

Shaheen II's Resources: scheduler view

Slurm counts in threads: 1 Slurm CPU is a thread.

Page 13

Outline

• Resources available
• Why scheduling?
• What are the main steps of an HPC computation?
• The scheduling process and the most often used SLURM commands
• How to place my program on the allocated resources?
• SLURM's interesting features

→ comprehensive example in the application talk!

Page 14

Steps to run a job

1. Listing the available resources: sinfo
2. Acquiring resources, i.e. asking SLURM to make available what you need: sbatch
3. Waiting for the resources to be scheduled: squeue
4. Gathering information on the resources: scontrol
5. Using the resources: srun
6. Releasing the resources: scancel
7. Tracking my resource use: sacct
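A minimal end-to-end session following these steps might look like this (the job ID and script name are illustrative):

$ sinfo                        # 1. list partitions and node states
$ sbatch job.sh                # 2. ask for the resources described in job.sh
Submitted batch job 123456
$ squeue -u $USER              # 3. watch the job wait, then run
$ scontrol show job 123456     # 4. inspect the allocation
                               # 5. srun is called inside job.sh
$ scancel 123456               # 6. release the resources early if needed
$ sacct -j 123456              # 7. check what the job consumed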

Page 15

Listing resources: sinfo

A Slurm CPU is a hyperthread. sinfo shows the name of each partition, the status of each node in the partition, and the job limits in size and duration.

Example: among all the nodes available, 30 of them will accept 1-node computation jobs for a duration of up to 72 hours.
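As an illustration, a sinfo listing matching this description could look as follows (partition names, node counts and node lists are made up):

PARTITION AVAIL  TIMELIMIT   NODES  STATE  NODELIST
workq*       up 1-00:00:00    6144   idle  nid[00064-06207]
72hours      up 3-00:00:00      30   idle  nid[00001-00030]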

Page 16

Acquiring resources: sbatch

• sbatch job.sh
• Minimum request:
  • 1 full node (the same node cannot be used by two different jobs)
  • 1 minute

File job.sh:

#!/bin/bash
#SBATCH --partition=workq
#SBATCH --job-name=my_job
#SBATCH --output=out.txt
#SBATCH --error=err.txt
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=32
#SBATCH --time=hh:mm:ss        # or dd-hh:mm

echo hello                     # only executed on the first node

cd WORKING_DIR
srun -n 1024 a.out             # parallel run on the compute nodes
…

Page 17

Acquiring interactive resources: salloc

• salloc --ntasks=4 --time=10:00
• Used to get a resource allocation and use it interactively from your computer's terminal
• → example given in the application talk
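A minimal interactive session, assuming the 4-task, 10-minute request above (output lines are illustrative):

$ salloc --ntasks=4 --time=10:00
salloc: Granted job allocation 123457
$ srun -n 4 ./a.out            # runs on the allocated compute nodes
$ exit                         # release the allocation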

Page 18

Monitoring resources: squeue

• squeue: lists all queued jobs (can be long…)
• squeue -u <my_login> (only my jobs, much faster)
• squeue -l (queue displayed in a more detailed format)
• squeue --start (displays the expected starting time)
• squeue -i60 (reports the currently active jobs every 60 seconds)

→ Warning 1: 1 SLURM CPU = 1 hyperthread
→ Warning 2: large CPU counts are truncated in the display: 197,568 CPUs will read as 19758
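For reference, the default squeue listing starts with this header (one line per job follows):

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)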

Page 19

Monitoring resources: scontrol show job

• scontrol show job <job_id>
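An excerpt of typical scontrol output (all values illustrative):

$ scontrol show job 123456
   JobId=123456 JobName=my_job
   UserId=user(1001) Account=k0000
   JobState=RUNNING Reason=None
   RunTime=00:10:42 TimeLimit=02:00:00
   NumNodes=32 NumCPUs=2048 Partition=workq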

Page 20

Managing already allocated resources

scancel <job_id>
  kills a given job

scancel --user=<login> --state=pending
  kills all pending jobs of the user <login>

scancel --name=<job_name>
  kills all jobs of the user <login> with a given name

sacct
  reports accounting information job by job (core-hours)

sb kxxx
  reports accounting information for the whole project
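For instance, to see one job's elapsed time and node count (the job ID is illustrative; the --format fields are standard sacct options):

$ sacct -j 123456 --format=JobID,JobName,Elapsed,NNodes,State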

Page 21

Outline

• Resources available
• Why scheduling?
• What are the main steps of an HPC computation?
• The scheduling process and the most often used SLURM commands
• How to place my program on the allocated resources?
• SLURM's interesting features

→ comprehensive example in the application talk!

Page 22

1 Shaheen II node

Page 23

1 Shaheen II node, 2 hyperthreads per core

[Diagram: the node's hyperthreads; each cell is 1 hyperthread.]

1 node × 2 sockets × 16 cores × 2 hyperthreads = 64 hyperthreads

Page 24

1 Shaheen II node, 2 hyperthreads per core

srun my_program.exe

Page 25

srun → spawning my program on the resource

srun my_program.exe

[Diagram: one MPI process placed on one hyperthread of the node.]

Page 26

1 Shaheen II node, 2 hyperthreads per core

srun --ntasks 32 my_program.exe

Page 27

srun → spawning my program on the resource

srun --ntasks 32 my_program.exe

[Diagram: 32 MPI processes placed on the node's hyperthreads.]


Page 29

srun → spawning my program on the resource

srun --hint=nomultithread --ntasks 32 my_program.exe

[Diagram: with --hint=nomultithread, the 32 MPI processes are placed one per physical core.]

Page 30

srun → pinning my program to the resource

srun --hint=nomultithread --ntasks 8 my_program.exe

[Diagram: 8 MPI processes placed on the node's cores.]


Page 33

srun → pinning my program to the resource

srun --hint=nomultithread --cpu-bind=cores --ntasks 8 my_program.exe

[Diagram: with --cpu-bind=cores, each of the 8 MPI processes stays pinned to its own core.]


Page 35

srun → spreading my program on the resource

srun --hint=nomultithread --cpu-bind=cores --ntasks 8 my_program.exe

[Diagram: the node's memory shown as 8 × 16 GB regions, with the program ("prog") and its variables mapped onto them.]

Page 36

srun → spreading my program on the resource

srun --hint=nomultithread --ntasks-per-socket=4 --ntasks 8 my_program.exe

[Diagram: with --ntasks-per-socket=4, the 8 tasks and their memory use are spread evenly across the two sockets' 8 × 16 GB memory regions.]
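Putting the placement options above together, a sketch of a fully controlled launch (the program name is illustrative):

srun --hint=nomultithread \      # use only one hyperthread per core
     --cpu-bind=cores \          # pin each task to its core
     --ntasks-per-socket=4 \     # balance the tasks over the two sockets
     --ntasks 8 my_program.exe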

Page 37

Job.sh

Slurm counts in threads: 1 Slurm CPU is a thread.

[Screenshot of job.sh, with callouts for the Slurm control parameters at the top and the script commands below.]

Page 38

Job.sh

Slurm counts in threads: 1 Slurm CPU is a thread.

[Screenshot callouts for the #SBATCH lines:]
• Name of the job in the queue
• Output file name for the job
• Error file name for the job
• Requested number of tasks
• Requested elapsed time
• Partition
• Group to be charged


Page 40

Job.out

Slurm counts in threads: 1 Slurm CPU is a thread.

[Screenshot of the job's standard output file.]

Page 41

Job.err

Slurm counts in threads: 1 Slurm CPU is a thread.

[Screenshot of the job's standard error file.]


Page 44

Outline

• Resources available
• Why scheduling?
• What are the main steps of an HPC computation?
• The scheduling process and the most often used SLURM commands
• How to place my program on the allocated resources?
• SLURM's interesting features

→ comprehensive example in the application talk!

Page 45

Environment Variables
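A sketch of standard SLURM environment variables a job script can read (the echo script itself is illustrative; all the variables shown are set by SLURM at run time):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
echo "Job ID:     $SLURM_JOB_ID"
echo "Job name:   $SLURM_JOB_NAME"
echo "Nodes:      $SLURM_JOB_NUM_NODES ($SLURM_JOB_NODELIST)"
echo "Tasks:      $SLURM_NTASKS"
echo "Submit dir: $SLURM_SUBMIT_DIR"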

Page 46

SLURM’s interesting features

• Job arrays
• Job dependency (a sketch follows this list)
• Email on job state transition
• Interactive jobs with salloc
• Reservations (with the help of the sysadmins)
• Flexible scheduling: the job will run with the set of parameters with which it can start soonest
  • sbatch --partition=debug,batch
  • sbatch --nodes=16-32 ...
  • sbatch --time=10:00:00 --time-min=4:00:00 ...
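A sketch of the dependency and email features (job IDs, script names and the address are illustrative):

$ sbatch pre.sh
Submitted batch job 1000
$ sbatch --dependency=afterok:1000 \
         --mail-type=END,FAIL --mail-user=me@example.com main.sh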

Page 47

So much more…

• How to submit workflows?
• How to handle thousands of jobs?
• How to manage dependencies between jobs?
• How to tune scheduling?
• How to cope with hardware failures?

→ Ask us or check our KSL monthly seminars!

Page 48

That’s all Folks!

Questions?

[email protected]

[email protected]

Page 49

Launching hundreds of jobs…

Submit and manage collections of similar jobs easily. To submit a 50,000-element job array:

sbatch --array=1-50000 -N1 -i my_in_%a -o my_out_%a job.sh

"%a" in a file name is mapped to the array task ID (1-50000)

Default standard output: slurm-<job_id>_<task_id>.out

Only supported for batch jobs (a minimal script sketch follows)
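A minimal array job script using the task ID (the array size and input file naming are illustrative):

#!/bin/bash
#SBATCH --array=1-100
#SBATCH --nodes=1
# each array task sees its own value of SLURM_ARRAY_TASK_ID
srun ./a.out my_in_${SLURM_ARRAY_TASK_ID}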

Page 50

• The squeue and scancel commands, plus some scontrol options, can operate on an entire job array or on selected task IDs
• The squeue -r option prints each task ID separately