Slurm

Slurm's architecture

Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node.

Nodes

In Slurm, a node is a compute resource, usually defined by its consumable resources, e.g. cores and memory.

Partitions

A partition (or queue) is a set of nodes that usually share common characteristics and/or limits. Partitions group nodes into logical sets, and a node can belong to more than one partition.
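The relationship between nodes and partitions can be sketched as a small data model. This is only an illustration, not how Slurm represents them internally; the node names, partition names, resource figures and time limits are all made up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A compute node described by its consumable resources."""
    name: str
    cores: int
    memory_mb: int


@dataclass
class Partition:
    """A named set of nodes sharing a common limit (here: a maximum run time)."""
    name: str
    max_time_minutes: int
    nodes: set = field(default_factory=set)


def partitions_of(node, partitions):
    """Return the sorted names of all partitions that contain `node`."""
    return sorted(p.name for p in partitions if node in p.nodes)


# Hypothetical cluster: three nodes, two overlapping partitions.
n1 = Node("node01", cores=32, memory_mb=128_000)
n2 = Node("node02", cores=32, memory_mb=128_000)
n3 = Node("node03", cores=64, memory_mb=512_000)

short = Partition("short", max_time_minutes=60, nodes={n1, n2})
long_ = Partition("long", max_time_minutes=7 * 24 * 60, nodes={n2, n3})

# node02 is shared: it belongs to both partitions.
print(partitions_of(n2, [short, long_]))
```

On a real cluster the same information comes from sinfo rather than from hand-written objects.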

Jobs

A job is an allocation of consumable resources from the nodes, assigned to a user under the specified conditions.

Job Steps

A job step is a set of (possibly parallel) tasks within a job. Each job can have multiple steps, which may run one after another or in parallel.
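To make the job/step distinction concrete, here is a minimal sketch that generates the text of a batch script in which each srun line starts one job step inside the job's allocation. The helper render_batch_script, the partition name and the program names are hypothetical; the #SBATCH options used (--job-name, --partition, --ntasks, --time) are standard ones.

```python
def render_batch_script(job_name, partition, ntasks, time_limit, steps):
    """Return the text of a batch script whose steps are launched with srun.

    `steps` is a list of (ntasks, command) pairs; each pair becomes one job step.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        "",
    ]
    for step_tasks, command in steps:
        # A step runs inside the job's allocation, so it cannot exceed it.
        if step_tasks > ntasks:
            raise ValueError("a step cannot use more tasks than the job allocated")
        lines.append(f"srun --ntasks={step_tasks} {command}")
    return "\n".join(lines) + "\n"


script = render_batch_script(
    job_name="demo",
    partition="short",        # hypothetical partition name
    ntasks=4,
    time_limit="00:10:00",
    steps=[
        (1, "./prepare_input"),    # hypothetical programs
        (4, "./simulate"),
        (1, "./collect_results"),
    ],
)
print(script)
```

The resulting script describes one job (four tasks for ten minutes) containing three steps that run one after another; the middle step uses all four tasks in parallel.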

Common user commands:

  • sacct: report accounting information about running or completed jobs.

  • salloc: allocate resources for a job in real time. Typically used to allocate resources and spawn a shell, which is then used to run srun commands that launch parallel tasks.

  • sbatch: submit a job script for later execution. The script typically contains the tasks and the environment definitions needed to execute the job.

  • scancel: cancel a pending or running job or job step.

  • sinfo: report an overview of the resources (nodes and partitions).

  • squeue: report the state of running and pending jobs.

  • srun: submit a job for execution or initiate job steps in real time. srun also lets users request consumable resources.
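A typical submit, monitor, cancel cycle with the commands above can be sketched as follows. The helper functions are hypothetical wrappers that only build the argument lists, so the example runs without a cluster; job.sh and the job ID are placeholders. The sketch assumes sbatch is called with --parsable, which makes it print the job ID, optionally followed by ";" and a cluster name.

```python
def sbatch_cmd(script_path):
    # --parsable: print only the job ID (optionally ";cluster") on submission.
    return ["sbatch", "--parsable", script_path]


def squeue_cmd(job_id):
    # State of the job while it is pending or running.
    return ["squeue", f"--jobs={job_id}"]


def sacct_cmd(job_id):
    # Accounting information, also available after the job has completed.
    return ["sacct", f"--jobs={job_id}", "--format=JobID,State,Elapsed"]


def scancel_cmd(job_id):
    # Cancel a pending or running job.
    return ["scancel", str(job_id)]


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from `sbatch --parsable` output."""
    return int(sbatch_output.strip().split(";")[0])


# Placeholder values: no job is actually submitted here.
job_id = parse_job_id("12345;cluster1\n")
for cmd in (sbatch_cmd("job.sh"), squeue_cmd(job_id),
            sacct_cmd(job_id), scancel_cmd(job_id)):
    print(" ".join(cmd))
```

On a machine with Slurm installed, each list could be passed to subprocess.run(cmd, check=True, capture_output=True, text=True) to execute the command and capture its output.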


Revision #8
Created Thu, Nov 28, 2019 10:54 AM by Joao Machado
Updated Tue, May 25, 2021 10:43 PM by Joao Machado