# Manage Slurm jobs

How to manage jobs with the Slurm batch system, which is used at the Minho, ISEC and Lisbon data centers.

# Slurm

## Slurm's architecture

Slurm is made of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node.

**Node**

In Slurm, a node is a compute resource, usually defined by its consumable resources, e.g. cores and memory.

**Partitions**

A partition (or queue) is a set of nodes with usually common characteristics and/or limits. Partitions group nodes into logical sets. Nodes are shareable between partitions.

**Jobs**

A job is an allocation of consumable resources from the nodes, assigned to a user under the specified conditions.

**Job Steps**

A job step is a set of (possibly parallel) tasks within a job. Each job can have multiple steps, which may run one after another or in parallel.


## Common user commands:

- [__sacct__](https://wiki.incd.pt/books/manage-jobs/page/sacct): report job accounting information about running or completed jobs.

- [__salloc__](#salloc): allocate resources for a job in real time. Typically used to allocate resources and spawn a shell. Then the shell is used to execute commands to launch parallel tasks.

- [__sbatch__](#sbatch): submit a job script for later execution. The script typically contains the tasks plus the environment definitions needed to execute the job.

- [__scancel__](https://wiki.incd.pt/books/manage-jobs/page/scancel): cancel a pending or running job or job step.

- [__sinfo__](https://wiki.incd.pt/books/manage-jobs/page/sinfo): overview of the resources (nodes and partitions).

- [__squeue__](https://wiki.incd.pt/books/manage-jobs/page/seueue): used to report the state of running and pending jobs.

- [__srun__](https://wiki.incd.pt/books/manage-jobs/page/srun): submit a job for execution or initiate job steps in real time. `srun` allows users to request consumable resources.

# Jobs information

#### List all current jobs for a user:

```squeue -u <username>```

#### List all running jobs for a user:

```squeue -u <username> -t RUNNING```

#### List all pending jobs for a user:

```squeue -u <username> -t PENDING```

#### List all current jobs in the shared partition for a user:
```squeue -u <username> -p shared ```

#### List detailed information for a job (useful for troubleshooting):

```scontrol show jobid -dd <jobid>```

#### List status info for a currently running job:
```sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps```

#### Additional information for completed jobs (not available during the run):

```sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed```

#### To view information for all jobs of a user:

```sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed```
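
The `squeue` filters above can be combined into a quick per-state summary. The following is a minimal sketch, not part of the cluster tooling: the helper name `count_job_states` is hypothetical, and it assumes `squeue` accepts `-h` (no header) and `-o "%T"` (full state name) as described in `man squeue`.

```shell
# Count how many jobs are in each state.
# Reads one state name per line on stdin (hypothetical helper).
count_job_states() {
    sort | uniq -c | awk '{ printf "%s=%d\n", $2, $1 }'
}

# Only query the scheduler when squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "${USER:-$(id -un)}" -o "%T" | count_job_states
fi
```

With two running jobs and one pending job, the helper would print one `STATE=count` line per state.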

# My first slurm job

## Examples

### Submit a simple MPI job

* In this example we run a small MPI application, following these steps:
  * Create a submission file
  * Submit the job to the default partition
  * Execute a simple MPI code
  * Check the status of the job
  * Read the output


* Download source code
```
wget --no-check-certificate https://wiki.incd.pt/attachments/71 -O cpi.c
```

* Create a submission file
```
vi my_first_slurm_job.sh
```
* Edit the file

```
#!/bin/bash

#SBATCH --job-name=MyFirstSlurmJob
#SBATCH --time=0:10:0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16

# Be sure to request the correct partition, otherwise the job may be held in the queue:
#	on CIRRUS-B (Minho)  choose for example HPC_4_Days
#	on CIRRUS-A (Lisbon) choose for example hpc
#SBATCH --partition=hpc

# Used to guarantee that the environment does not have any other loaded module
module purge

# Load software modules. Please check the software section for details
module load gcc63/openmpi/4.0.3

# Prepare
src='cpi.c'
exe="./cpi.$SLURM_JOB_ID"

# Compile application
echo "=== Compiling ==="
mpicc -o $exe $src

# Run the application. Note that the number of cores used by MPI is set by the SBATCH directives.
echo "=== Running ==="
if [ -e $exe ]; then
    chmod u+x $exe
    mpiexec -np $SLURM_NTASKS $exe
    rm -f $exe
fi

echo "Finished with job $SLURM_JOBID"
```
* Submit the job

```
sbatch my_first_slurm_job.sh
```

* Check status of the job

```
$ squeue

JOBID PARTITION     NAME     USER     ST       TIME      NODES NODELIST(REASON)
1171  HPC_4_Days    MyFirstS username PD       0:00      1     wn075
```

  * Check further details about your job (very long output)

```
scontrol show job 1171
```


* Read the output of the job:

If no output file name is specified, Slurm by default writes the output of your run to a file named `slurm-{job_id}.out`, e.g. `slurm-1171.out`.


* Cancel your job

```
$ scancel 1171
```
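
The submit, check and read steps above can be chained in a small wrapper. This is only a sketch under assumptions: the helper names `parse_job_id` and `is_active_state` are hypothetical, `sbatch --parsable` is assumed to print the job ID optionally followed by `;cluster`, and the 30 second polling interval is arbitrary.

```shell
# Strip the optional ";cluster" suffix from the output of `sbatch --parsable`.
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

# Succeed while the given state means the job is still in the queue.
is_active_state() {
    case "$1" in
        PENDING|CONFIGURING|RUNNING|SUSPENDED|COMPLETING) return 0 ;;
        *) return 1 ;;
    esac
}

# Only submit when sbatch and the script from this example are present.
if command -v sbatch >/dev/null 2>&1 && [ -f my_first_slurm_job.sh ]; then
    job_id=$(parse_job_id "$(sbatch --parsable my_first_slurm_job.sh)")
    # Poll the job state until the job leaves the queue.
    while is_active_state "$(squeue -h -j "$job_id" -o '%T' 2>/dev/null)"; do
        sleep 30
    done
    # Default output file, valid when no output name is set in the script.
    cat "slurm-${job_id}.out"
fi
```

Polling `squeue` too often puts load on the scheduler, so a longer interval is usually preferable for long jobs.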

### MPI examples:

#### Hello World:
```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
    processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
    return 0;
}
```

#### PI calculation
```
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
 *  (C) 2001 by Argonne National Laboratory.
 *      See COPYRIGHT in top-level directory.
 */

#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc,char *argv[])
{
    long int    n, i;
    int    myid, numprocs;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int    namelen;
    char   processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Get_processor_name(processor_name,&namelen);

    n = 100000000000;		/* default # of rectangles */
    if (myid == 0) {
        startwtime = MPI_Wtime();
    }

    /* n is a long int, so it must be broadcast as MPI_LONG */
    MPI_Bcast(&n, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    h   = 1.0 / (double) n;
    sum = 0.0;

    /* A slightly better approach starts from large i and works back */
    for (i = myid + 1; i <= n; i += numprocs)
    {
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x*x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0) {
    	endwtime = MPI_Wtime();
    	printf("pi=%.16f, error=%.16f, ncores %d, wall clock time = %f\n", pi, fabs(pi - PI25DT), numprocs, endwtime-startwtime);
    	fflush(stdout);
    }

    MPI_Finalize();
    return 0;
}
```

# Overview of the resources offered

### `sinfo` : overview of the resources offered by the cluster

By default, `sinfo` lists each partition's name, availability, time limit, number of nodes, node __state__ and node list. A partition is a set of compute nodes.

The default output of `sinfo`:
```
$ sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      5  down* wn[075,096,105,110,146]
all*         up   infinite      6  drain wn[077,091,101,117,143,148]
all*         up   infinite      2    mix wn[079,097]
all*         up   infinite     33  alloc wn[081-089,092-095,099-100,104,108,112,115,118,124,135-139,144-145,151,155-158]
all*         up   infinite     40   idle wn[071-073,076,080,090,098,102-103,106-107,109,111,113-114,116,120-123,125-128,130-134,140-142,147,149-150,152-154,159-160]
all*         up   infinite      4   down wn[074,078,119,129]
debug        up   infinite      8   idle wn[060-063,065-067,069]
debug        up   infinite      3   down wn[064,068,070]

```

The command `sinfo --Node` lists the nodes individually with their current state.

```
$ sinfo --Node

NODELIST   NODES PARTITION STATE
wn071          1      all* alloc
wn072          1      all* drain
wn073          1      all* alloc
wn074          1      all* down
wn075          1      all* down*
wn076          1      all* alloc

```

The command `sinfo --summarize` reports the node counts of each partition in the form "allocated/idle/other/total".

```
$ sinfo --summarize

PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
all*         up   infinite       36/7/47/90  wn[071-160]
debug        up   infinite         2/6/3/11  wn[060-070]


```

The command `sinfo --long` provides more information than plain `sinfo`, including the OverSubscribe setting (OVERSUBS). All the queues are defined with OVERSUBS=NO, so none of the partitions (queues) allows requests over the limit of the consumable resources.

```
$ sinfo --long

PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
all*         up   infinite 1-infinite   no       NO        all      5       down* wn[075,096,105,110,146]
all*         up   infinite 1-infinite   no       NO        all     38     drained wn[072-073,076-077,080,090-091,098,101-103,106-107,109,113-114,116-117,120-123,125-128,130,133-134,136,140-141,143,147-148,150,152,159]
all*         up   infinite 1-infinite   no       NO        all      4       mixed wn[079,094,097,137]
all*         up   infinite 1-infinite   no       NO        all     32   allocated wn[071,081-089,092-093,095,099-100,104,108,112,115,118,124,131-132,135,138-139,144,151,155-158]
all*         up   infinite 1-infinite   no       NO        all      7        idle wn[111,142,145,149,153-154,160]


```

With `sinfo` you can also filter the nodes/partitions by specific criteria. In this example we list the nodes that are either idle or down:

```
$ sinfo --states=idle,down

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      5  down* wn[075,096,105,110,146]
all*         up   infinite      8   idle wn[113,116,121-122,126,140-141,143]
all*         up   infinite      4   down wn[074,078,119,129]
debug        up   infinite      7   idle wn[060-063,065-067]
debug        up   infinite      3   down wn[064,068,070]


```

>**For more detailed information, please see manual `man sinfo`**


### __states__:

- **mix** : consumable resources partially allocated
- **idle** : available for requests of consumable resources
- **drain** : unavailable for use per system administrator request
- **drng** :  currently executing a job, but will not be allocated to additional jobs. The node will be changed to state DRAINED when the last job on it completes
- **alloc** : consumable resources fully allocated
- **down** : unavailable for use. Slurm can automatically place nodes in this state if some failure occurs.
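
The states listed above can be totalled across partitions to get a quick picture of the cluster. A minimal sketch, assuming the `sinfo` format fields `%P` (partition), `%D` (node count) and `%t` (compact state) from `man sinfo`; the helper name `nodes_per_state` is hypothetical.

```shell
# Sum node counts per state.
# Expects "partition nodes state" lines on stdin (hypothetical helper).
nodes_per_state() {
    awk '{ total[$3] += $2 } END { for (s in total) print s, total[s] }' | LC_ALL=C sort
}

# Only query the scheduler when sinfo is actually available.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -o "%P %D %t" | nodes_per_state
fi
```

Note that a node shared between partitions is counted once per partition, so the totals are an upper bound on such clusters.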

# Show job accounting data

`sacct:` displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database

If you run the command without any parameters, it shows the accounting data of your jobs since midnight of the current day, which includes the jobs that are currently running.

```
$ sacct

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1127         omp-bkp-o+      debug       incd         16    RUNNING      0:0
1128         omp-bkp-o+      debug       incd         16    RUNNING      0:0
1128.0            a.out                  incd         16    RUNNING      0:0
1129         omp-bkp-o+      debug       incd         16    RUNNING      0:0
1129.0            a.out                  incd         16    RUNNING      0:0
1130         omp-bkp-o+      debug       incd         16    RUNNING      0:0
1156         run_zacar+ HPC_4_Days       root          1    RUNNING      0:0

```

You can select the job whose data you would like to view by using the `-j` flag.

```
$ sacct -j 1156

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1156         run_zacar+ HPC_4_Days       root          1    RUNNING      0:0
```

You can list jobs by user, by adding the `-u` flag and choosing the user.

```
$ sacct -u jprmachado

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1127         omp-bkp-o+      debug       incd         16    RUNNING      0:0
1128         omp-bkp-o+      debug       incd         16    RUNNING      0:0
1128.0            a.out                  incd         16    RUNNING      0:0
1129         omp-bkp-o+      debug       incd         16    RUNNING      0:0
1129.0            a.out                  incd         16    RUNNING      0:0
1130         omp-bkp-o+      debug       incd         16    RUNNING      0:0

```

You can also filter or create your own custom reports by using the `--format` flag and choosing what data to show.

```
$ sacct --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist

     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- ---------------
jprmacha+ 1127         omp-bkp-o+      debug    RUNNING 20-20:00:+ 2019-11-20T11:44:28             Unknown 9-04:00:00                              1         16           wn018
jprmacha+ 1128         omp-bkp-o+      debug    RUNNING 20-20:00:+ 2019-11-20T11:46:43             Unknown 9-03:57:45                              1         16           wn019
          1128.0            a.out               RUNNING            2019-11-20T11:46:43             Unknown 9-03:57:45                              1         16           wn019
jprmacha+ 1129         omp-bkp-o+      debug    RUNNING 20-20:00:+ 2019-11-20T11:51:30             Unknown 9-03:52:58                              1         16           wn020
          1129.0            a.out               RUNNING            2019-11-20T11:51:31             Unknown 9-03:52:57                              1         16           wn020
jprmacha+ 1130         omp-bkp-o+      debug    RUNNING 20-20:00:+ 2019-11-20T11:52:37             Unknown 9-03:51:51                              1         16           wn012
     root 1156         run_zacar+ HPC_4_Days    RUNNING 8-00:00:00 2019-11-27T13:40:02             Unknown 2-02:04:26                              1          1           wn035

```

You can also filter your custom report by user and date by adding the `-u` and `--start` flags.

```
$ sacct --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist -u zbenta --start 2019-11-28

     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- ---------------
   zbenta 1163           clover32     stage2    TIMEOUT   04:00:00 2019-11-28T13:22:31 2019-11-28T17:22:46   04:00:15                              8        128     wn[022-029]
          1163.batch        batch             CANCELLED            2019-11-28T13:22:31 2019-11-28T17:22:47   04:00:16     40152K    186176K        1         16           wn022
          1163.0            orted                FAILED            2019-11-28T13:22:35 2019-11-28T17:22:46   04:00:11     38104K    254748K        7          7     wn[023-029]

```

To restrict the report to a specific time interval, use the `--start` and `--end` flags.

```
$ sacct --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist -u zbenta --start 2019-10-07 --end 2019-10-11

     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- ---------------
   zbenta 15            Run_PRISM      debug     FAILED 365-00:00+ 2019-10-07T11:05:58 2019-10-07T11:06:09   00:00:11                              2         32     wn[018-019]
          15.batch          batch                FAILED            2019-10-07T11:05:58 2019-10-07T11:06:09   00:00:11                              1         16           wn018
          15.0              orted             COMPLETED            2019-10-07T11:06:02 2019-10-07T11:06:07   00:00:05                              1          1           wn019
   zbenta 20            Run_PRISM      debug CANCELLED+  UNLIMITED 2019-10-08T11:42:01 2019-10-08T12:12:03   00:30:02                              2         32     wn[018-019]
          20.batch          batch             CANCELLED            2019-10-08T11:42:01 2019-10-08T12:12:05   00:30:04   2626556K    186140K        1         16           wn018
          20.0              orted                FAILED            2019-10-08T11:42:05 2019-10-08T12:12:08   00:30:03   2594880K    292116K        1          1           wn019
   zbenta 28            Run_PRISM      debug     FAILED  UNLIMITED 2019-10-11T14:33:06 2019-10-11T14:33:06   00:00:00                              2         32     wn[003,015]
          28.batch          batch                FAILED            2019-10-11T14:33:06 2019-10-11T14:33:06   00:00:00                              1         16           wn003
```


>**For more detailed information, please see the manual `man sacct`**
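
The column-aligned reports above are convenient to read but awkward to process. The sketch below assumes the `sacct` options `-n` (no header), `-P` (pipe-separated output), `-S` and `-E` (start and end time) from `man sacct`; the helper name `unsuccessful_jobs` is hypothetical, and the dates are only the ones used in the example above.

```shell
# Print "jobid state" for every record that is neither COMPLETED nor RUNNING.
# Expects pipe-separated "JobID|State|Elapsed" records on stdin (hypothetical helper).
unsuccessful_jobs() {
    awk -F'|' '$2 != "COMPLETED" && $2 != "RUNNING" { print $1, $2 }'
}

# Only query the accounting database when sacct is actually available.
if command -v sacct >/dev/null 2>&1; then
    sacct -n -P --format=JobID,State,Elapsed -S 2019-10-07 -E 2019-10-11 | unsuccessful_jobs
fi
```

This is a quick way to spot failed, cancelled or timed-out jobs and job steps in a given period.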

# Stop or cancel jobs

## `scancel` : used to signal jobs or job steps that are under the control of Slurm



The command `scancel` is used to signal or __cancel__ __jobs__, __job arrays__ or __job steps__. A job or job step can only be signaled by the __owner__ of that job or by user root. If an unauthorized user attempts to signal a job or job step, an error message is printed and the job is not signaled.


```
$ scancel <jobid>

$ squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
33416       all Hexadeca    fcruz  R    3:26:11      2 wn[131-132]
33434     debug  OFBuild  lmendes  R    1:50:42      1 wn069
33437       all       FE ngalamba  R      58:07      1 wn094
33439       all       FE ngalamba  R      29:43      1 wn097
33440       all       FE ngalamba  R      29:13      1 wn137
33441       all       FE ngalamba  R      13:43      1 wn126
33442       all       FE ngalamba  R       1:58      1 wn071
33443       all       FE ngalamba  R       1:41      1 wn071
33445       all       FE ngalamba  R       0:12      1 wn079

```

You can cancel all your jobs (running and pending):

```
$ scancel --user <username>

```

You may also cancel only the jobs that match a specific criterion, e.g. state or partition:

```
$ scancel --state PENDING --user <username>

```

Jobs can also be cancelled using the job name:

```
$ scancel --name <jobname>

```

>**For more detailed information, please see `man scancel`**

# Show jobs information in queue

`squeue:` view information about jobs located in the Slurm scheduling queue.

`gqueue:` squeue alias formatted to show specific job information

#### general usage
If you run the command without any parameters, it shows the jobs currently in the queue, both running and pending.

```
$ squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
1127     debug omp-bkp- jprmacha  R 9-04:38:00      1 wn018
1128     debug omp-bkp- jprmacha  R 9-04:35:45      1 wn019
1129     debug omp-bkp- jprmacha  R 9-04:30:58      1 wn020
1130     debug omp-bkp- jprmacha  R 9-04:29:51      1 wn012
1156 HPC_4_Day run_zaca     root  R 2-02:42:26      1 wn035
```

#### view jobs from a specific user
You can filter by user, using the `--user` flag

```
$ squeue --user root

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
1156 HPC_4_Day run_zaca     root  R 2-02:44:28      1 wn035
```

#### view particular jobs
You can also filter by job ID, using the `-j` flag.

```
$ squeue -j 1127

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
1127     debug omp-bkp- jprmacha  R 9-04:41:26      1 wn018
```
It is possible to provide multiple job IDs separated by commas.

#### format the command output

The user may select the output fields with the format option *"-o"*, for example to show the number of requested CPUs:

	$ squeue -o "%.7i %.9P %.8j %.8u %.2t %.10M %.6D %C %N" -u jmartins
    JOBID PARTITION     NAME     USER ST       TIME  NODES CPUS NODELIST
	192427    debug   cpi.sh jmartins  R       0:06      1   64 hpc047

#### gqueue alias

The user interfaces have an alias for the ***squeue*** command called ***gqueue***, which shows some useful fields

	$ gqueue
    JOBID  PARTITION NAME    USER    ST TIME     NODES CPUS TRES_PER_NODE  NODELIST
    184472 gpu       gpu-job gpuuser R  18:34:54 1     1    gpu            hpc058

>**For more detailed information, please see the manual `man squeue`**

# How to run parallel jobs with srun

### `srun` : used to submit/initiate a job or job step

Typically, `srun` is invoked from a Slurm job script, but it can also be run directly from the command line, in which case `srun` first creates a resource allocation for running the parallel job (the `salloc` is implicit).

```
srun -N 1 -c 16 -p HPC_4_Days --time=1:00:00 --pty /bin/bash
```
This command requests 16 cores (`-c`) on one node (`-N`) for 1 hour in the partition (`-p`) HPC_4_Days. Please note that this is subject to node availability: if no nodes are available, your request is put in the queue to wait for resources.

`srun` may also be executed inside a shell script.
```
#!/bin/bash

#SBATCH -N 3
#SBATCH -p HPC_4_Days

echo Starting job $SLURM_JOB_ID
echo SLURM assigned me these nodes
srun -l hostname
```

This batch job will result in the following output:
```
Starting job 51057
SLURM assigned me these nodes
0: wn054.b.incd.pt
1: wn055.b.incd.pt
2: wn057.b.incd.pt
```
The 3 allocated nodes are released once `srun` finishes and the batch script ends.

By default `srun` uses `pmi2`, but you may consult the full list of available MPI types:

```
$ srun --mpi=list

srun: MPI types are...
srun: pmi2
srun: openmpi
srun: none

```
To use a different MPI type, pass it explicitly, e.g. `srun --mpi=openmpi`.


>**For more detailed information, please see man srun**

# Preparing the Environment

There are lots of little tweaks needed to prepare the environment for running specific software.
We describe the ones we use most regularly, so that it is easier for users to work with them.

## mvapich
**Version 2.3.3 compiled with Intel 2020**
```
module load intel/mvapich2/2.3.3
source $I_MPI_ROOT/intel64/bin/mpivars.sh intel64 -ofi_internal=0
export LD_PRELOAD="libmpi.so"
```
## mpich
**Version 3.3.2 compiled with Intel 2020**
```
module load intel/mpich/3.3.2
export LD_PRELOAD="libmpi.so"
```

## OpenMPI 4.0.3
**Version 4.0.3 compiled with Intel 2019**
```
module load intel/openmpi/4.0.3
export I_MPI_PMI_LIBRARY=/lib64/libpmi.so
```

## openfoam

**Version 1912 compiled with Intel 2020**
```
module load intel/openfoami20/1912
source /cvmfs/sw.el7/ar/ix_es2680/i20/openfoami20/1912/build01/OpenFOAM-v1912/etc/bashrc
. /cvmfs/sw.el7/ar/ix_es2680/i20/openfoami20/1912/build01/OpenFOAM-v1912/bin/tools/RunFunctions
```

**Version 1906 compiled with Intel 2020**
```
module load intel/openfoami20/1906
source /cvmfs/sw.el7/ar/ix_es2680/i20/openfoami20/1906/build01/OpenFOAM-v1912/etc/bashrc
. /cvmfs/sw.el7/ar/ix_es2680/i20/openfoami20/1906/build01/OpenFOAM-v1912/bin/tools/RunFunctions
```

## gromacs

**intel/gromacs/2020.2**
```
module load gcc-6.3
source /cvmfs/sw.el7/ar/ix_es2680/i20/gromacs/2020.2/build01/bin/GMXRC.bash
source /cvmfs/sw.el7/intel/2020/bin/compilervars.sh intel64
module load intel/gromacs/2020.2
```

**intel/gromacs/2020.20-i20**
```
module load gcc-7.5
source /cvmfs/sw.el7/ar/ix_es2680/i20/gromacs/2020.2/build02/bin/GMXRC.bash
source /cvmfs/sw.el7/intel/2020/bin/compilervars.sh intel64
module load intel/gromacs/2020.2
```

**gromacs-4.6.7**
```
module load gromacs-4.6.7
module load gcc63/openmpi/4.0.3
export GMX_MAXBACKUP=-1
mpirun -np 10 mdrun -s  benchMEM.tpr -nsteps 500000 -maxh 3.0 -resethway
```

**Version 2020.2 compiled with Intel 2020**
```
module load gcc-6.3
source /cvmfs/sw.el7/ar/ix_es2680/i20/gromacs/2020.2/build02/bin/GMXRC.bash
source /cvmfs/sw.el7/intel/2020/bin/compilervars.sh intel64
module load intel/gromacs/2020.2
```

# Job with GPU usage

## Requesting a GPU


Requesting a GPU for a job is very simple and only requires the partition option (`--partition` | `-p`), for example:

#### Create a submission script

	[user@cirrus01 ~]$ cat gpu.sh
    #!/bin/bash
    #SBATCH -p gpu
    # load local cuda tool kit
    module load cuda
	# do your work here...
    
    [user@cirrus01 ~]$ sbatch gpu.sh

For user convenience, the ***cuda*** ***module environment*** points to the most recent version available in the local software. Users can inspect the available versions with the following command, but be advised that not all versions are present on the execution nodes; only the ***cuda*** environment is guaranteed to be operational:

	[user@cirrus01 ~]$ module avail
    ---------- /cvmfs/sw.el7/modules/soft -----------
    ...
    conn-R2018b                  macs-1.4.2
	cuda                         matlab/R2018a
	cuda-10.1                    matlab/R2018b
	cuda-10.2                    matlab/R2019b

#### Submit the job

For the time being the submission does not require extra options:

	[user@cirrus01 ~]$ sbatch gpu.sh
    Submitted batch job 3415

In the near future we will receive new hardware with GPUs, and the submission will then include requests for GPU types, with options like:

	[user@cirrus01 ~]$ sbatch --gres=gpu:1 gpu.sh

In this example we are requesting one GPU regardless of the actual model.

#### Check status of the job

	[user@cirrus01 ~]$ qstat
    Job id              Name             Username        Time Use S Queue          
	------------------- ---------------- --------------- -------- - ---------------
    3415                gpu.sh           user            00:00:00 R gpu

or

	[user@cirrus01 ~]$ squeue
    JOBID PARTITION     NAME     USER     ST       TIME      NODES NODELIST(REASON)
     3415       gpu   gpu.sh     user      R       0:00      1     hpc050


#### Check further details about your job (very long output)

	[user@cirrus01 ~]$ scontrol show job 3415

#### Check the output of your job

Read the output of the job. If no output file name is specified, Slurm by default creates a file with the output of your run:

	[user@cirrus01 ~]$ ls -l
	-rw-r-----+ 1 user hpc 227 Ago 19 11:45 gpu.sh
	-rw-r-----+ 1 user hpc 747 Ago 19 11:46 slurm-3415.out

If your job produces output files, they will also be present in the working directory.

# Interactive Sessions

Slurm allows interactive sessions on the worker nodes using ssh, but only within a valid job allocation; plain ssh access is disabled. Interactive sessions can be created in the normal partitions, but those jobs have the same priority as regular jobs.

There is a limit of 1 job and 1 task per node on partitions ***hpc*** and ***gpu***. We encourage users to close their sessions as soon as possible, to give everyone a good chance to use the resources.

> *FCT* grant users should use the partition *fct* instead in the examples below.

## Starting srun Session

The simplest way to start an interactive session is:

	[user@cirrus01 ~]$ srun -p hpc --job-name "my_interactive" --pty bash -i
	srun: job 72791 queued and waiting for resources
	srun: job 72791 has been allocated resources
	[user@hpc059 ~]$

You will have an ssh session on a worker node where other users are running jobs or interactive sessions as well. Try not to disturb them with unsolicited interactions, and exit the session when you are finished.

>*FCT* call users should target the *partition* **fct** and the *QOS* associated with the user, e.g. ***"srun -p fct -q cpcaXXXX2020 ..."***, where *XXXX* is the call ID.

The ***srun*** command has the same restrictions as a normal *job* and will be aborted or refused when the system limits are exceeded. If you run ***squeue*** you will see your interactive job listed like any other job:

	[user@hpc059 ~]$ squeue 
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
    72818       hpc my_inter     user  R       2:03      1 hpc059 

## Starting salloc Session

***salloc*** is set up to behave like the ***srun*** command, for example:

	[user@cirrus01 ~]$ salloc -p hpc --job-name "my_interactive" 
	salloc: Pending job allocation 72818
	salloc: job 72818 queued and waiting for resources
	salloc: job 72818 has been allocated resources
	salloc: Granted job allocation 72818
	salloc: Waiting for resource configuration
	salloc: Nodes hpc059 are ready for job
	[user@hpc059 ~]$

>Once again, *FCT* call users should target the *partition* **fct** and the *QOS* associated with the user
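
Because an interactive session looks like an ordinary shell, it can help to check whether the current shell is running inside a job allocation. A minimal sketch, assuming Slurm exports `SLURM_JOB_ID` inside the allocation, as the batch script examples in this documentation rely on; the function name `in_slurm_job` is hypothetical.

```shell
# Succeed when the current shell runs inside a Slurm job allocation.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_slurm_job; then
    echo "Inside job ${SLURM_JOB_ID}: remember to exit when finished"
else
    echo "Not inside a job allocation"
fi
```

A check like this could be placed in a shell startup file to act as a reminder to release the node.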

# Getting Available Resources

There are a few ways to find out the available cluster resources such as *CPUs* or *GPUs*. The [List of HPC and HTC servers](https://wiki.incd.pt/books/compute-node-specs-and-information/page/list-of-hpc-and-htc-servers) is one source of information, but the best way is to check the resources actually available with system command line tools such as ***sinfo***, ***squeue***, ***sacct*** or ***[pestat](https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat)*** (a very handy third party tool from *Ole Holm Nielsen*).

### sinfo

***sinfo*** run without options shows a concise list, for example:

	[user@cirrus01 ~]$ sinfo 
	PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
	hpc*         up 4-00:00:00      1   drng hpc049
	hpc*         up 4-00:00:00      2    mix hpc[047,059]
	hpc*         up 4-00:00:00      2  alloc hpc[046,048]
	gpu          up 4-00:00:00      4   idle hpc[050-051],hpc[057-058]
    fct          up 4-00:00:00      4   idle hpc[060-063]

In order to show the available GPU's we should provide format options, for example:

	[user@cirrus01 ~]$ sinfo -o "%10P %.6D %.11T %11G %N"
	PARTITION   NODES       STATE GRES        NODELIST
	hpc*            1    draining (null)      hpc049
	hpc*            2       mixed (null)      hpc[047,059]
	hpc*            2   allocated (null)      hpc[046,048]
    gpu             4        idle gpu:t4:1    hpc[050-051],hpc[057-058]
    fct             2        idle gpu:v100s:1 hpc060,hpc062
    fct             2        idle gpu:t4:1    hpc061,hpc063

We can see eight nodes with one GPU's each, six nodes have a NVIDIA Tesla-T4 (abreviated as ***gpu:t4:1***) and two nodes provide a NVIDIA Tesla-V100S (abreviated as ***gpu:v100s:1***). The ***gpu*** tag stands for the provided resource, a GPU in this case, the second field ***t4*** stands for the resource type, a Tesla-T4 abreviated as ***t4***, the third field stands for the number of the specified resource available on selected node, one on all the examples.
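
The three colon-separated fields described above are easy to split in a script. A minimal sketch, assuming entries of exactly the form `resource:type:count` as in the output shown here (entries such as `(null)` or without a type are not handled); the helper name `parse_gres` is hypothetical.

```shell
# Split a GRES entry such as "gpu:t4:1" into resource, type and count.
# Assumes exactly three colon-separated fields (hypothetical helper).
parse_gres() {
    printf '%s\n' "$1" | awk -F: '{ printf "resource=%s type=%s count=%s\n", $1, $2, $3 }'
}

parse_gres "gpu:t4:1"
# prints: resource=gpu type=t4 count=1
```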

This listing only shows which GPUs exist; it does not tell us whether they are in use, even when the respective nodes are allocated. The command line tool ***squeue*** can provide that information, as seen below.

The user interface ***cirrus.ncg.ingrid.pt*** has an *alias* named **ginfo** defined for user convenience, to avoid typing the long list of command line options:

	[user@cirrus01 ~]$ ginfo
	PARTITION   NODES       STATE GRES        NODELIST
	hpc*            1    draining (null)      hpc049
	hpc*            2       mixed (null)      hpc[047,059]
	hpc*            2   allocated (null)      hpc[046,048]
    gpu             4        idle gpu:t4:1    hpc[050-051],hpc[057-058]
    fct             2        idle gpu:v100s:1 hpc060,hpc062
    fct             2        idle gpu:t4:1    hpc061,hpc063
	
Check the man page for more details:

	[user@cirrus01 ~]$ man sinfo

### squeue

Like ***sinfo***, the ***squeue*** command run without options shows a concise list of jobs and omits the *generic resources* details. To get this information we have to supply a format option, as follows:

	[user@cirrus01 ~]$ squeue -O jobid:6,partition:10,name:10,username:9,state:3,timeused:11,numnodes:6,numcpus:5,gres:14,nodelist
    JOBID PARTITION NAME      USER   ST  STATIME       NODES CPUS TRES_PER_NODE NODELIST 
	1170  fct       nbody.sh  user5  R   RUN1-16:10:41 1     2    gpu:v100s     hpc060
	1171  fct       nbody.sh  user5  R   RUN1-15:26:42 1     2    gpu:t4        hpc061   
	1175  fct       teste.sh  user5  R   RUN0:00       1     1    N/A           hpc060

In this example we can see two GPUs allocated to user ***user5*** on partition ***fct***, by jobs *1170* and *1171*. Job *1175* is running on one of the same nodes but is not using any GPU.

We defined an alias named **gqueue** for user convenience, to avoid typing the long argument list:

	[user@cirrus01 ~]$ gqueue
    JOBID PARTITION NAME      USER     ST STATIME       NODES CPUS TRES_PER_NODE NODELIST 
	1170  fct       nbody.sh  user5    R  RUN1-16:10:41 1     2    gpu:v100s     hpc060
	1171  fct       nbody.sh  user5    R  RUN1-15:26:42 1     2    gpu:t4        hpc061
    1175  fct       teste.sh  user5    R  RUN0:00       1     1    N/A           hpc060

The ***squeue*** command, as well as the **gqueue** alias, accepts extra options to refine the job list. For example, we can target only the running jobs:

	[user@cirrus01 ~]$ gqueue -t R
    JOBID PARTITION NAME      USER   ST  STATIME       NODES CPUS TRES_PER_NODE NODELIST 
	1170  fct       nbody.sh  user5  R   RUN1-16:10:41 1     2    gpu:v100s     hpc060
	1171  fct       nbody.sh  user5  R   RUN1-15:26:42 1     2    gpu:t4        hpc061
    1175  fct       teste.sh  user5  R   RUN0:00       1     1    N/A           hpc060

or jobs in other states. Check the man page for more details:

	[user@cirrus01 ~]$ man squeue

### sacct

Like the previous commands, ***sacct*** run without options does not show the *generic resources* list, but it does list the current day's jobs, including completed or failed ones, along with other information not needed in most cases.

As with the other commands, we defined a convenient *alias* named ***gacct*** for cluster users that also includes the *generic resources*:

	[user@cirrus01 ~]$ gacct 
       JobID    JobName  Partition    Account  AllocCPUS      ReqGRES    AllocGRES      State  ExitCode 
	------------ ---------- ---------- ---------- ---------- ------------ ------------ ---------- -------- 
	1170       nbody.sh        fct        hpc          2  gpu:v100s:1        gpu:1    RUNNING  0:0 
	1171       nbody.sh        fct        hpc          2     gpu:t4:1        gpu:1    RUNNING  0:0 
	1175       teste.sh        fct        hpc          1                              RUNNING  0:0 

The results of the ***sacct*** command, and of the **gacct** alias, can also be refined. Check the man page for more details:

	[user@cirrus01 ~]$ man sacct

### pestat

The ***[pestat](https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat)*** command line tool shows a comprehensive list of all available resources, including GPUs. We can see occupied and free resources and some other useful parameters such as CPU load or memory:

	[user@cirrus01 ~]$ pestat
    Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  GRES/       Joblist
                                State Use/Tot              (MB)     (MB)  node        JobId User GRES/job ...
	hpc046              hpc*    alloc  64  64   41.59*   512000   412358  (null)      83704 user3  83673 user3  83146 user4
	hpc047              hpc*      mix  32  64    9.03*   512000   498828  (null)      83733 user3  83147 user4
	hpc048              hpc*     idle   0  64    0.05    256000   504534  (null)       
	hpc049              hpc*     drng* 16  64    1.01*   512000   465488  (null)      81028 user4   
	hpc050               gpu    alloc  16  16    1.06*    32000    15818  (null)      83672 user3   
	hpc051               gpu     idle   0  16    0.02     32000    19415  (null)       
	hpc057               gpu     idle   0  16    0.03     32000    19659  (null)       
	hpc058               gpu     idle   0  16    0.02     32000    19307  (null)       
	hpc059              hpc*      mix  55  64   30.57*   512000   455164  (null)      80228 user1  80229 user2
	hpc060               fct      mix   2  96    1.02*   512000   491288  gpu:v100s:1 80230 user5   
	hpc061               fct      mix   2  96    1.02*   512000   497421  gpu:t4:1    80231 user5   
	hpc062               fct     idle   0  96    0.13    512000   508454  gpu:v100s:1  
	hpc063               fct     idle   0  96    0.01    512000   455017  gpu:t4:1     

The command output may be refined with extra options, check the command help for details:

	[user@cirrus01 ~]$ pestat --help

# Consumption Monitoring

## `report` command

We have created a script for monitoring the consumption of a group and its constituent users.
Please note that the first line in the chart refers to the total consumption of resources by the entire group.
[![](https://wiki.incd.pt/uploads/images/gallery/2023-10/scaled-1680-/7P9s25fTLoCo77Wg-image-1697458111895.png)](https://wiki.incd.pt/uploads/images/gallery/2023-10/7P9s25fTLoCo77Wg-image-1697458111895.png)

The script can report data from a specific starting time: pass the flag -s followed by the start date. Similarly, to specify the end date, pass the flag -e followed by the end date.
**Note:** Be mindful that the date format must be *'MM/DD/YY'*.
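
Since *MM/DD/YY* is easy to get wrong, a small conversion helper may be useful. The sketch below is an illustration only: the function name `report_date` is hypothetical, and it assumes GNU `date` (the `-d` option) is available on the login node.

```bash
# Convert an ISO date (YYYY-MM-DD) into the MM/DD/YY format expected by report.
# Assumption: GNU date is installed, since -d is not portable to every system.
report_date() {
    date -d "$1" +%m/%d/%y
}
```

With it, a report for the first half of October 2023 could be requested as `report -s "$(report_date 2023-10-01)" -e "$(report_date 2023-10-16)"`.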

[comment]: <> (This command will present the user with information regarding the group that he belongs to, starting from the date he passed as and argument until the current day, as you can see in the example bellow.)

# Job pipeline using slurm dependencies

Sometimes we need to launch a list of jobs that execute in sequence, one after another.
In those cases we use the **--dependency** option of **sbatch**. Check the manual page for more details; here we only present simple examples.

## Simple example

Suppose we need to submit the script ***my_first_job.sh*** and then ***my_second_job.sh***, which should run after the first one:

	[user@cirrus01 ~]$ sbatch my_first_job.sh
    Submitted batch job 1843928

	[user@cirrus01 ~]$ sbatch --dependency=afterany:1843928 my_second_job.sh
    Submitted batch job 1843921

    [user@cirrus01 ~]$ squeue
    JOBID   PARTITION              NAME USER ST TIME NODES NODELIST(REASON) 
    1843928       hpc   my_first_job.sh user  R 0:11     1 hpc046
    1843921       hpc  my_second_job.sh user PD 0:00     1 hpc047

In this case the second job stays pending until the first job has terminated, and then runs even if the first job failed for some reason. Note that the plain ***after*** dependency type only waits for the first job to *start*, so ***afterany*** is the type to use when jobs must run one after another.

## Typical example

In a real case we may need to ensure that the first job terminated successfully. For example, the first job may produce an output file needed as input by the second job:

	[user@cirrus01 ~]$ sbatch my_first_job.sh
    Submitted batch job 1843922

	[user@cirrus01 ~]$ sbatch --dependency=afterok:1843922 my_second_job.sh
    Submitted batch job 1843923

The ***afterok*** parameter states that the second job starts only if the previous job terminated with no errors.
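
Copying job IDs by hand becomes tedious for longer pipelines. The sketch below shows one possible way to automate it; the function names `job_id_from` and `submit_chain` are hypothetical, and it assumes **sbatch** prints the `Submitted batch job <id>` message seen in the examples above.

```bash
# Extract the numeric job ID from sbatch's "Submitted batch job <id>" message.
job_id_from() {
    awk '/^Submitted batch job/ { print $4 }' <<< "$1"
}

# Submit the scripts given as arguments so that each one starts only after
# the previous one has finished successfully (afterok).
submit_chain() {
    local dep="" script out id
    for script in "$@"; do
        # The first job has no dependency; later ones wait for the previous ID.
        out=$(sbatch ${dep:+--dependency=afterok:$dep} "$script") || return 1
        id=$(job_id_from "$out")
        echo "$script -> job $id"
        dep=$id
    done
}
```

A call such as `submit_chain my_first_job.sh my_second_job.sh` would then reproduce the typical example. The **sbatch** option `--parsable`, which prints just the job ID, is an alternative to parsing the message.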

## Complex cases

Check the ***sbatch*** manual page for more details:

	 [user@cirrus01 ~]$ man sbatch

and search for the explanation of the ***-d, --dependency=<dependency_list>*** option.

# Use of user QOS for CPU jobs

To use a QOS you need to have a user QOS assigned. In the following example the user submits a job to the fct partition using a specifically created QOS, cpca097822021.

```
#!/bin/bash
#SBATCH --job-name=prod01
#SBATCH --time=0:10:0
#SBATCH --partition=fct
#SBATCH --qos=cpca097822021
#SBATCH --output=%x.o%j
#SBATCH --error=%x.o%j
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16

### Prepare the environment
module purge
module load gcc83/openmpi/4.1.1 cuda-11.2

hostname
```

**Not all queues allow QOS; please follow the guidance provided by the INCD team when the QOS is assigned.**

# Job Submission Hints

### Provide job time limit when possible

Users can provide the expected execution time for jobs with option

	sbatch|srun --time=<time> ...

to the **sbatch** or **srun** commands, or instead insert in the job script

	#SBATCH --time=<time>

The **\<time\>** accepted formats are "**minutes**",  "**minutes:seconds**", "**hours:minutes:seconds**",  "**days-hours**", "**days-hours:minutes**" and "**days-hours:minutes:seconds**".
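
To compare limits written in these different formats it can help to normalize them to seconds. The function below is a minimal sketch covering the six formats listed above; the name `slurm_time_to_seconds` is hypothetical and the function is not part of Slurm.

```bash
# Convert a Slurm time specification to seconds.
# Accepted: minutes, minutes:seconds, hours:minutes:seconds,
#           days-hours, days-hours:minutes, days-hours:minutes:seconds
slurm_time_to_seconds() {
    local spec=$1 days=0 h=0 m=0 s=0 a b c
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}
        # After "days-" the fields are hours[:minutes[:seconds]].
        IFS=: read -r h m s <<< "${spec#*-}"
    else
        IFS=: read -r a b c <<< "$spec"
        if [[ -z $b ]]; then m=$a              # minutes
        elif [[ -z $c ]]; then m=$a; s=$b      # minutes:seconds
        else h=$a; m=$b; s=$c                  # hours:minutes:seconds
        fi
    fi
    # 10# forces base 10 so that values like "08" are not read as octal.
    echo $(( 10#${days:-0} * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}
```

For instance, `slurm_time_to_seconds 4-00:00:00` and `slurm_time_to_seconds 96:00:00` both print `345600`.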

Job scheduling is based on the *First Come First Served* principle in conjunction with the *Backfill* algorithm; the waiting time of a pending job may decrease by up to 20% if the expected execution time is provided[***1***]. If the time limit is not provided, the partition maximum execution time is used, **96 hours**. If the user expects the job to run in a shorter time and says so, the system may be able to find a "hole" in the scheduled jobs and execute it long before it would otherwise start.

If the user is unable to estimate the job execution time, the best practice is to launch a first test job in order to check the elapsed time.

### Provide job memory if jobs abort with OUT_OF_MEMORY

If your jobs finish with a "**memory out**" error then you may need to request more memory than the default setting.

You can tell that a job ended because of "**memory out**" by checking the job's standard output files or by running the command **sacct**, for example:

	$ sacct -j <job_id> -X
    JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
	------------ ---------- ---------- ---------- ---------- ---------- --------
    <job_id>          myjob        hpc       csys          1 OUT_OF_ME+    0:125

The "*State*" column shows *OUT_OF_ME+*, the truncated form of "*OUT_OF_MEMORY*".

The first step is to check how much memory was allocated to your job, again with the command **sacct**, as follows:

	$ sacct -j <job_id> -X -o AllocTRES%40
                                   AllocTRES 
	---------------------------------------- 
            billing=1,cpu=1,mem=5000M,node=1

As we can see, the maximum memory allocated to the job was *5000MB* and the job used one *CPU core*; in this case the system selected *5000MB* per *CPU core* automatically, which is the default for this partition. The memory rate per CPU depends on the *SLURM* partition, but in all cases if you select two *CPU cores*, for example, you get twice the rate, here *2 x 5000 = 10000 MB*.

Now we can proceed to find out the memory required by this job. You can submit test jobs in sequence, increasing the requested memory in *5000MB* increments, for example, or doubling the value on each iteration. Keep sending jobs until one succeeds or you reach the maximum RAM of the worker nodes.
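
The doubling strategy can be sketched as a short helper that lists the values to try. The function name `mem_candidates` is hypothetical, and the starting value must be a positive number of megabytes.

```bash
# Print the memory sizes (in MB) to try, doubling from a start value
# up to, and never beyond, a maximum value.
mem_candidates() {
    local mem=$1 max=$2
    while (( mem <= max )); do
        echo "$mem"
        (( mem *= 2 ))
    done
}
```

For example, `mem_candidates 5000 40000` prints 5000, 10000, 20000 and 40000, one per line; each value can then be passed to a test job with `sbatch --mem=<value>M`, stopping at the first job that succeeds.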

To request *10000MB* add the option 

	sbatch|srun --mem=10000M ...

or 

	sbatch|srun --mem-per-cpu=10000M ...
    
to the **sbatch** or **srun** commands; you can instead insert one of the following directives in the job script:

	#SBATCH --mem=10000M
    
or

	#SBATCH --mem-per-cpu=10000M

Note that if you use the option *--mem-per-cpu* and you request more than one *CPU core* with the option *--ntasks=N*, then the total requested memory is *10000 x N MB*.

> **Please note**
>
> You may request any amount of memory up to the maximum hardware limit, as long as it is required for your computations, but if the memory rate per cpu core is exceeded then the system will select extra cpu cores automatically in order to respect the default partition memory rate per cpu core.
>
> The reason for this behavior is to account for all committed resources; for example, if a job requests the maximum memory of a worker node but uses only one cpu core, the machine becomes unavailable for other jobs, so the whole node should be included in the job accounting even if the job doesn't use all cpu cores.
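
The rule in the note above can be illustrated with a small calculation. This is only a sketch of the described behavior, not the scheduler's actual code: the function name `charged_cpus` is hypothetical and the default rate of 5000 MB per core is taken from the partition in the example above.

```bash
# Number of CPU cores committed to a job: the requested cores, or enough
# cores to cover the requested memory at the partition rate, whichever is larger.
# Arguments: requested memory (MB), requested cores, rate per core (MB, default 5000).
charged_cpus() {
    local mem_mb=$1 cpus=$2 rate_mb=${3:-5000}
    local by_mem=$(( (mem_mb + rate_mb - 1) / rate_mb ))   # ceiling division
    echo $(( by_mem > cpus ? by_mem : cpus ))
}
```

With these assumptions, `charged_cpus 20000 1` prints `4`: a single-core job asking for 20000 MB is accounted as four cores.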

### Job execution time longer than the partition limit

Some jobs take much more time to execute than the partition limit of 96 hours. Affected users usually ask us to increase the partition limit, but we cannot comply: the system would become unmanageable and less fair for most users.

One possible solution is to parallelize your code with MPI or OpenMP and reduce the elapsed time by executing on multiple CPUs and nodes. If the code is not already parallelized, adapting it will not be easy, but it could be the only way to carry out the job.

Some simpler problems can be broken into multiple independent parts and executed simultaneously, without any interaction between the parts. This kind of problem is usually easier to adapt, and the parts can run in parallel as separate jobs.
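
As an illustration of splitting independent work, the helper below computes which slice of the items a given part should process. The function name `chunk_range` is hypothetical, and how the items are numbered is up to the application.

```bash
# Print "first last" (0-based, inclusive) for part <index> out of <parts>
# near-equal slices of <total> items.
chunk_range() {
    local total=$1 parts=$2 index=$3
    local first=$(( index * total / parts ))
    local last=$(( (index + 1) * total / parts - 1 ))
    echo "$first $last"
}
```

One possible way to use it, which this page does not otherwise cover, is a Slurm job array: after submitting with `sbatch --array=0-3 job.sh`, each task can run `read -r first last < <(chunk_range 1000 4 "$SLURM_ARRAY_TASK_ID")` to pick its own quarter of 1000 items.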

If the code cannot be broken up and must be executed sequentially, you can implement a checkpoint procedure. This may not be simple and depends on the application, but it is worth the effort when it is possible. In this approach the code suspends its execution when the partition time limit is near, saves its present state on disk and exits. Next, a new job is launched, parametrized to resume the execution at the point where the previous job stopped. This can be repeated as many times as needed until the whole computation is finished. The chain of jobs can be launched in one go, as shown in ***[Job pipeline using slurm dependencies](https://wiki.incd.pt/books/manage-jobs/page/job-pipeline-using-slurm-dependencies)***.
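
A minimal sketch of such a checkpoint procedure is shown below, for work that can be expressed as numbered steps. Everything application-specific is an assumption: the state file name, the number of steps and the `do_step` placeholder are hypothetical. The **sbatch** option `--signal=B:USR1@300` asks Slurm to send `SIGUSR1` to the batch script about 300 seconds before the time limit.

```bash
#!/bin/bash
#SBATCH --job-name=ckpt-demo
#SBATCH --time=96:00:00
#SBATCH --signal=B:USR1@300     # warn this script 300 s before the time limit

STATE_FILE=${STATE_FILE:-checkpoint.state}   # hypothetical checkpoint file
TOTAL_STEPS=${TOTAL_STEPS:-1000}             # hypothetical amount of work
stop_requested=0
trap 'stop_requested=1' USR1                 # only record the warning; the loop reacts

# Next step to run: the saved one, or 0 on a fresh start.
load_step() { if [[ -f $STATE_FILE ]]; then cat "$STATE_FILE"; else echo 0; fi; }
save_step() { echo "$1" > "$STATE_FILE"; }

# Placeholder for one unit of real work; keep each step well below 300 s.
do_step() { sleep 1; }

run() {
    local step
    step=$(load_step)
    while (( step < TOTAL_STEPS )); do
        do_step "$step"
        (( ++step ))
        save_step "$step"                    # state on disk is always resumable
        if (( stop_requested )); then
            echo "time limit near, stopped after step $step"
            return 0
        fi
    done
    echo "all $TOTAL_STEPS steps done"
}

# Run only inside a Slurm job, so the functions can be sourced and tested elsewhere.
if [[ -n ${SLURM_JOB_ID:-} ]]; then run; fi
```

Because the script exits normally after saving its state, several copies of it can be chained with `--dependency=afterok`; a copy that starts after the work is complete finds nothing left to do and ends immediately.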

***

[***1***] **Leonenkov, Sergei** and **Zhumatiy, Sergei**. "*Introducing new backfill-based scheduler for SLURM resource manager*", Procedia Computer Science, volume 66, pages 661-669, 2015