Runtime environment: SLURM Basics
HPC 101 Shaheen II Training Workshop
Dr Samuel Kortas, Computational Scientist
KAUST Supercomputing Laboratory, [email protected]
19 September 2017
Outline

• Resources available
• Why scheduling?
• What are the main steps of an HPC computation?
• The scheduling process: the most often used SLURM commands
• How to place my program on the allocated resources?
• SLURM's interesting features
• → comprehensive example in the application talk!
Which types of resources are available?

• Computing power
• Fast access storage (Burst Buffer)
• Software licenses (commercial applications)
• Mass storage (Lustre)
• KSL expertise
One at a time… to guarantee performance and turnaround time

• Computing power (1 full node per user)
• Fast access storage (reserved space)
• Software licenses (tokens)
• Mass storage (Lustre filesystem)
• KSL expertise (us!)
It’s our job to help you… but we need input from you.

• Computing power = core × hour = project application
• Fast access storage (SSD) = Burst Buffer application
• Mass storage (Lustre) = in the project application
• KSL expertise = in the project application
Why scheduling? …to optimize the use of shared resources.
[Figure: a timeline of the machine; jobs of different shapes (e.g. 4 nodes for 20 mins) are packed over TIME onto the 6,174 available nodes.]

On Shaheen II and Dragon the scheduler is SLURM; the same principle applies with PBS, LoadLeveler, Sun Grid Engine, etc.
The concept of backfilling

In order to run the big job, the scheduler has to make some room on the machine, so some empty space appears.

[Figure: resources available (nodes, memory, disk space…) plotted against TIME; a gap of 18 H opens in front of the big job. A 10 H job fits in the gap; a 24 H job does not.]

Only jobs small enough will fit in the gap… it is in your best interest to set a realistic job duration.
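The fit test behind backfilling can be sketched in plain bash, using the durations from this slide (an 18 H gap, and waiting jobs of 10 H and 24 H):

```shell
# Backfilling in miniature: a waiting job may jump ahead of the big job
# only if its requested duration fits inside the current gap.
gap_hours=18
for requested in 10 24; do
  if [ "$requested" -le "$gap_hours" ]; then
    echo "${requested} H job fits the ${gap_hours} H gap: backfilled"
  else
    echo "${requested} H job exceeds the ${gap_hours} H gap: must wait"
  fi
done
```

This is why a realistic --time value matters: the shorter your stated duration, the more gaps your job can slide into.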
� Shaheen II has � 6,174 compute nodes� Each node having 2 sockets� Each socket having 16 Cores� Each core having 2
HyperThreads(choice by default)
� à 197, 568 cores� à Double of threads� = SLURM CPU
nnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnn
…
nn
36 racks of compute nodes
workq
nnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnn
Shaheen II’s Resources
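The totals on this slide follow directly from the per-node layout; a quick bash check:

```shell
# Shaheen II sizing, recomputed from the node layout on the slide.
nodes=6174
sockets_per_node=2
cores_per_socket=16
threads_per_core=2
physical_cores=$((nodes * sockets_per_node * cores_per_socket))
slurm_cpus=$((physical_cores * threads_per_core))
echo "${physical_cores} physical cores"   # 197568
echo "${slurm_cpus} SLURM CPUs"           # 395136
```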
Shaheen II’s resources: scheduler view

Slurm counts in threads: 1 Slurm CPU is a thread.
Steps to run a job

1. Listing available resources: sinfo
2. Acquiring resources, i.e. asking SLURM to make available what you need: sbatch
3. Waiting for the resources to be scheduled: squeue
4. Gathering information on the resources: scontrol
5. Using the resources: srun
6. Releasing the resources: scancel
7. Tracking my resource use: sacct
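The steps above map onto a command sequence like the following (a walkthrough, not something that runs outside a SLURM cluster; the <jobid> is the number printed by sbatch):

```shell
# The lifecycle of one job, command by command:
#
#   sinfo                        # 1. list partitions, node states and limits
#   sbatch job.sh                # 2. submit; prints "Submitted batch job <jobid>"
#   squeue -u $USER              # 3. watch the job wait, then run
#   scontrol show job <jobid>    # 4. inspect the allocation in detail
#                                # 5. srun inside job.sh uses the resources
#   scancel <jobid>              # 6. release the resources early if needed
#   sacct -j <jobid>             # 7. accounting once the job has finished
```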
Listing resources: sinfo

[Screenshot of sinfo output, annotated: a Slurm CPU is a hyperthread; the columns show the name of each partition, the status of each node in the partition, and the job limits in size and duration.]

Among all the nodes available, 30 of them will accept 1-node jobs for a duration of up to 72 hours.
Acquiring resources: sbatch

• sbatch job.sh
• Minimum request:
  • 1 full node (the same node cannot be used by two different jobs)
  • 1 minute

File job.sh:

```shell
#!/bin/bash
#SBATCH --partition=workq
#SBATCH --job-name=my_job
#SBATCH --output=out.txt
#SBATCH --error=err.txt
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=32
#SBATCH --time=hh:mm:ss        # or dd-hh:mm

echo hello                     # only executed on the first node
cd WORKING_DIR
srun -n 1024 a.out             # parallel run on the compute nodes
```
Acquiring interactive resources: salloc

• salloc --ntasks=4 --time=10:00
• Used to get a resource allocation and use it interactively from your computer terminal
• → example given in the application talk
Monitoring resources: squeue

• squeue: lists all queued jobs (can be long…)
• squeue -u <my_login>: only my jobs, much faster
• squeue -l: queue displayed in a more detailed format
• squeue --start: displays the expected starting time
• squeue -i60: reports the currently active jobs every 60 seconds

→ Warning 1: 1 SLURM CPU = 1 hyperthread.
→ Warning 2: high CPU counts are truncated in the display: 197,568 CPUs will read as 19758.
Monitoring resources: scontrol show job

• scontrol show job <jobID>
Managing already allocated resources

• scancel <job_id>: kills a given job
• scancel --user=<login> --state=pending: kills all pending jobs of the user <login>
• scancel --name=<job_name>: kills all of the user's jobs with a given name
• sacct: reports accounting information (core-hours) job by job
• sb kxxx: reports accounting information for the whole project
1 Shaheen II node, 2 hyperthreads per core

1 node × 2 sockets × 16 cores × 2 hyperthreads = 64 hyperthreads

srun → spawning my program to the resource

srun my_program.exe

[Figure: the node drawn as a grid of hyperthreads; each srun task is one MPI process placed on a hyperthread.]
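A minimal check of the per-node arithmetic behind the figure:

```shell
# One node's capacity: sockets x cores x hyperthreads.
sockets=2
cores_per_socket=16
threads_per_core=2
echo "$((sockets * cores_per_socket)) physical cores per node"                      # 32
echo "$((sockets * cores_per_socket * threads_per_core)) hyperthreads (SLURM CPUs)" # 64
```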
srun --ntasks 32 my_program.exe

[Figure: the 32 MPI processes are placed on the node's hyperthreads; by default they can end up two per core, on both hyperthreads of the first 16 cores.]

srun --hint=nomultithread --ntasks 32 my_program.exe

[Figure: with --hint=nomultithread, only one hyperthread per core is used, so the 32 MPI processes land on the 32 physical cores.]
srun → pinning my program to the resource

srun --hint=nomultithread --ntasks 8 my_program.exe

[Figure: the 8 MPI processes each get a core, but without binding they may move between cores during the run.]

srun --hint=nomultithread --cpu-bind=cores --ntasks 8 my_program.exe

[Figure: with --cpu-bind=cores, each MPI process stays pinned to its own core.]
srun → spreading my program on the resource

srun --hint=nomultithread --cpu-bind=cores --ntasks 8 my_program.exe

[Figure: by default the 8 MPI processes are packed onto the first socket; the node's memory is drawn as 8 segments of 16 GB.]

srun --hint=nomultithread --ntasks-per-socket=4 --ntasks 8 my_program.exe

[Figure: with --ntasks-per-socket=4, the 8 processes are spread 4 per socket, balancing access to the node's memory.]
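A toy sketch (plain bash, not SLURM itself) of the two placements above: default packing versus --ntasks-per-socket=4. The task and socket labels are illustrative.

```shell
# Map 8 tasks onto the 2 sockets of a node, two ways.
ntasks=8
tasks_per_socket=4
packed=""
spread=""
for t in $(seq 0 $((ntasks - 1))); do
  packed="${packed}t${t}:socket0 "                          # packing fills socket 0 first
  spread="${spread}t${t}:socket$((t / tasks_per_socket)) "  # 4 tasks, then the next socket
done
echo "packed: ${packed}"
echo "spread: ${spread}"
```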
![Page 37: Runtime environment SLURM Basics · Runtime environment SLURM Basics HPC 101 Shaheen II Training Workshop Dr Samuel Kortas Computational Scientist KAUST Supercomputing Laboratory](https://reader030.vdocuments.mx/reader030/viewer/2022040407/5ead3c0e2d0239422909014d/html5/thumbnails/37.jpg)
Job.sh
Slurm counts in threads: 1 Slurm CPU is a Thread
Slurm Control
parameters
ScriptCommand
![Page 38: Runtime environment SLURM Basics · Runtime environment SLURM Basics HPC 101 Shaheen II Training Workshop Dr Samuel Kortas Computational Scientist KAUST Supercomputing Laboratory](https://reader030.vdocuments.mx/reader030/viewer/2022040407/5ead3c0e2d0239422909014d/html5/thumbnails/38.jpg)
Job.sh
Slurm counts in threads: 1 Slurm CPU is a Thread
Name of the job in the queueOutput file name for the jobError file name for the jobRequested number of tasksRequested ellapse timepartitionGroup to be charged
Job.out and Job.err

[Screenshots: the standard output and error files produced by the example job.]
Environment variables

[Screenshot: the environment variables SLURM sets inside a job.]
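The slide's screenshot did not survive extraction; as an illustration, a job script can read SLURM's standard environment variables (these names are part of SLURM; outside a job they are simply unset):

```shell
# Print a few of the variables SLURM sets inside an allocation.
# The ":-unset" default lets this snippet also run outside a job.
echo "job id:    ${SLURM_JOB_ID:-unset}"
echo "node list: ${SLURM_JOB_NODELIST:-unset}"
echo "nodes:     ${SLURM_JOB_NUM_NODES:-unset}"
echo "tasks:     ${SLURM_NTASKS:-unset}"
```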
SLURM’s interesting features

• Job arrays
• Job dependency
• Email on job state transitions
• Interactive jobs with salloc
• Reservations (with the help of a sysadmin)
• Flexible scheduling: the job will run with the set of parameters with which it can start soonest:
  • sbatch --partition=debug,batch
  • sbatch --nodes=16-32 ...
  • sbatch --time=10:00:00 --time-min=4:00:00 ...
So much more…

• How to submit workflows?
• How to handle thousands of jobs?
• How to manage dependencies between jobs?
• How to tune scheduling?
• How to cope with hardware failures?

→ Ask us, or check the KSL monthly seminars!
Launching hundreds of jobs…

Submit and manage collections of similar jobs easily. To submit a 50,000-element job array:

sbatch --array=1-50000 -N1 -i my_in_%a -o my_out_%a job.sh

• "%a" in a file name is mapped to the array task ID (1-50000)
• Default standard output: slurm-<job_id>_<task_id>.out
• Only supported for batch jobs
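A minimal array job script along these lines (the my_in_%a / my_out_%a naming is the slide's; the per-task work is hypothetical):

```shell
#!/bin/bash
#SBATCH --array=1-50000
#SBATCH --output=my_out_%a
# Each array element sees its own SLURM_ARRAY_TASK_ID;
# default to 1 so the script can also be tried outside SLURM.
task="${SLURM_ARRAY_TASK_ID:-1}"
echo "task ${task}: reading my_in_${task}, writing my_out_${task}"
```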
• The squeue and scancel commands, plus some scontrol options, can operate on an entire job array or on selected task IDs
• squeue -r prints each task ID separately