Slurm

Slurm's architecture

Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node.

Node

In Slurm, a node is a compute resource, usually defined by its consumable resources such as cores and memory.

Partitions

A partition (or queue) is a set of nodes, usually with common characteristics and/or limits. Partitions group nodes into logical sets, and a node can belong to more than one partition.
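The relationship between nodes and partitions can be pictured as overlapping sets. The following is a minimal sketch of that idea only: the `Node` and `Partition` classes, the node names, and the resource figures are all hypothetical and are not part of any Slurm API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A compute resource described by its consumable resources."""
    name: str
    cores: int
    memory_gb: int


@dataclass
class Partition:
    """A named, logical set of nodes (a queue)."""
    name: str
    nodes: set = field(default_factory=set)

    def total_cores(self) -> int:
        # Sum the cores of every node that belongs to this partition.
        return sum(node.cores for node in self.nodes)


# Hypothetical nodes and partitions, invented for illustration only.
node_a = Node("node-a", cores=32, memory_gb=128)
node_b = Node("node-b", cores=32, memory_gb=128)
node_c = Node("node-c", cores=64, memory_gb=512)

short = Partition("short", {node_a, node_b})
bigmem = Partition("bigmem", {node_b, node_c})

# node-b appears in both partitions: nodes are shareable between partitions.
shared = short.nodes & bigmem.nodes
print(sorted(node.name for node in shared))
```

On a real cluster the actual node and partition layout is reported by sinfo rather than defined by the user.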

Jobs

A job is an allocation of consumable resources on the nodes, assigned to a user under the specified conditions.

Job Steps

A job step is a single task within a job. Each job can contain multiple steps, which may run one after another or in parallel.
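To make the job and job step distinction concrete, the sketch below generates the text of a batch script in which every `srun` line starts one job step inside the job's allocation. The helper `render_batch_script`, the job name, and the step commands are hypothetical placeholders; the `#SBATCH` options shown (`--job-name`, `--ntasks`, `--time`) are standard Slurm options, but the values are assumptions and should be adapted to the local cluster.

```python
def render_batch_script(job_name, ntasks, time_limit, steps):
    """Return the text of a batch script whose job steps are launched with srun."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        "",
    ]
    # Each srun line starts one job step inside the job's allocation.
    lines += [f"srun {command}" for command in steps]
    return "\n".join(lines) + "\n"


script = render_batch_script(
    job_name="demo",
    ntasks=4,
    time_limit="00:10:00",
    steps=["hostname", "./my_program input.dat"],  # placeholder commands
)
print(script)
```

The generated text could be saved to a file and submitted as a single job; the two `srun` lines would then run as two consecutive steps of that job.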

Common user commands:

sacct: report job accounting information about running or completed jobs.
salloc: allocate resources for a job in real time. It is typically used to allocate resources and spawn a shell, from which commands are then run to launch parallel tasks.
sbatch: submit a job script for later execution. The script typically contains the tasks plus the environment definitions needed to execute the job.
scancel: cancel a pending or running job or job step.
sinfo: show an overview of the resources (nodes and partitions).
squeue: report the state of running and pending jobs.
srun: submit a job for execution or initiate job steps in real time. srun also lets users request consumable resources.
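The commands above are normally typed in a shell, but they can also be driven from a script. The sketch below shows one possible way to do so, not a definitive wrapper: it builds argument lists for a few of the commands and extracts the job ID from the "Submitted batch job <id>" line that sbatch prints. The function names, the script path, the user name, and the partition name are hypothetical; `submit` only calls sbatch when it is found on the PATH.

```python
import re
import shutil
import subprocess


def sbatch_argv(script_path, partition=None):
    """Build the argument list for submitting a batch script."""
    argv = ["sbatch"]
    if partition is not None:
        argv.append(f"--partition={partition}")
    argv.append(script_path)
    return argv


def squeue_argv(user):
    """List the pending and running jobs of one user."""
    return ["squeue", f"--user={user}"]


def sacct_argv(job_id):
    """Show accounting data for one job."""
    return ["sacct", f"--jobs={job_id}"]


def scancel_argv(job_id):
    """Cancel a pending or running job."""
    return ["scancel", str(job_id)]


def parse_job_id(sbatch_output):
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None


def submit(script_path, partition=None):
    """Submit the script if sbatch is installed; return the job ID or None."""
    if shutil.which("sbatch") is None:
        return None  # not running on a Slurm system
    result = subprocess.run(
        sbatch_argv(script_path, partition),
        capture_output=True, text=True, check=True,
    )
    return parse_job_id(result.stdout)


# Hypothetical values, shown without contacting a cluster.
print(sbatch_argv("job.sh", partition="short"))
print(squeue_argv("alice"))
```

Keeping the argument construction separate from the `subprocess.run` call makes it possible to inspect the exact command line before anything is sent to the scheduler.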
