How to Run a Job with a GPU

Let's run the gravitational N-body simulation from the CUDA toolkit samples on a GPU. This example is suited for a standard INCD user eligible to use the hpc and gpu partitions.

The fct partition and its resources are meant for users with an FCT grant; although GPUs are requested in the same way, these users have specific instructions to follow, available at FCT Calls.

The GPUs are only available on the CIRRUS-A infrastructure in Lisbon.

Log in to the user interface cirrus.ncg.ingrid.pt

$ ssh -l user cirrus.ncg.ingrid.pt
[user@cirrus01 ~]$ _

Prepare your working directory

Prepare your environment in a dedicated directory in order to avoid interference between jobs, and create a submission batch script (note: this example only works with CUDA 10.2):

[user@cirrus01 ~]$ mkdir myworkdir
[user@cirrus01 ~]$ cd myworkdir
[user@cirrus01 ~]$ cat nbody.sh
#!/bin/bash

#SBATCH --partition=gpu
#SBATCH --gres=gpu
#SBATCH --mem=8192MB

# paths to the CUDA toolkit samples installed on the nodes
COMMON=/usr/local/cuda/samples/common
SAMPLE=/usr/local/cuda/samples/5_Simulations/nbody

# copy the common headers and the nbody sample into the work area, if not already there
[ -d ../common ] || cp -r $COMMON ..
[ -d nbody     ] || cp -r $SAMPLE .

# load the CUDA environment and build the sample
module load cuda
cd nbody
make clean
make

# run the benchmark only if the build produced the executable
if [ -e nbody ]; then
	chmod u+x nbody
	./nbody -benchmark -numbodies=2560000
fi

In this example we copy the N-body CUDA toolkit sample simulation to the working directory, load the CUDA environment, build the simulation and run it.
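The script loads the default cuda module; if several CUDA versions are installed you can list them and pin the version you need (the exact module names and version strings depend on the cluster, so check the module avail output first):

[user@cirrus01 ~]$ module avail cuda
[user@cirrus01 ~]$ module load cuda/10.2    # hypothetical version string, adjust to what "module avail" shows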

Requesting the partition

Standard INCD users at CIRRUS-A have access to the gpu partition providing NVIDIA Tesla-T4 GPUs. In order to access these GPUs request the gpu partition with the directive:

#SBATCH --partition=gpu

The fct partition provides several types of NVIDIA GPUs: T4 and V100S (please check the current resources available page). As a general rule, and depending on the application, the Tesla-V100S performs the same work in about half the time of the Tesla-T4. Nevertheless, if you request a Tesla-V100S you may have to wait for resource availability while a free Tesla-T4 is ready to go.

If you only want any free GPU allocated to your job then the #SBATCH --gres=gpu form would be the best choice.
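For FCT-grant users who do want a specific model, a sketch of a job header for the fct partition requesting one Tesla-V100S could look like this (adjust to the resources actually available):

#SBATCH --partition=fct
#SBATCH --gres=gpu:v100s:1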

Requesting the Tesla-T4 GPU

We request the allocation of one NVIDIA Tesla-T4 GPU through the option:

#SBATCH --gres=gpu:t4

Standard INCD users can access only NVIDIA Tesla-T4 GPUs, so we can simplify the request:

#SBATCH --gres=gpu

This way we ask for a GPU of any type; the same is valid on partitions with more than one type of GPU, whenever we do not care which type of GPU is allocated to our job.
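Whichever form you use, you can confirm which GPU was actually allocated by adding an nvidia-smi call near the top of the batch script, for example (a minimal sketch; Slurm normally exports CUDA_VISIBLE_DEVICES for the granted GPU):

# print the allocated GPU(s) into the job output
nvidia-smi --query-gpu=name,memory.total --format=csv
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"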

Requesting memory

Ensure enough memory for your simulation; follow the tips on the Determining Memory Requirements(page_to_be) page.

In our example 8GB is sufficient to run the simulation:

#SBATCH --mem=8192M
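If you are unsure how much memory to request, one approach is to run the job once with a generous limit and then inspect the peak usage reported by sacct (a sketch using the job ID from this example):

[user@cirrus01 ~]$ sacct -j 1176 --format=JobID,JobName,ReqMem,MaxRSS,State

MaxRSS is reported per job step, so add a safety margin on top of the largest value when setting --mem.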

Submit the simulation

[user@cirrus01 ~]$ sbatch nbody.sh
Submitted batch job 1176
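If you submit from a script and want to keep the job ID for later monitoring, the --parsable option makes sbatch print only the numeric ID (a small convenience sketch):

[user@cirrus01 ~]$ JOBID=$(sbatch --parsable nbody.sh)
[user@cirrus01 ~]$ echo $JOBID
1176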

Monitor your job

You can use the squeue command line tool, or the gqueue wrapper shown below:

[user@cirrus01 ~]$ gqueue 
JOBID PARTITION NAME      USER     ST STATE   TIME       NODES CPUS TRES_PER_NODE NODELIST 
1176  gpu       nbody.sh  user5    R  RUNNING 0-00:02:33 1     1    gpu:t4        hpc058

Or use the sacct command (the gacct wrapper below); the job has finished when the State field shows COMPLETED.

[user@cirrus01 ~]$ gacct 
   JobID    JobName  Partition    Account  AllocCPUS      ReqGRES    AllocGRES      State ExitCode 
------------ ---------- ---------- ---------- ---------- ------------ ------------ ---------- -------- 
1170       nbody.sh        fct        hpc          2  gpu:v100s:1        gpu:1  COMPLETED      0:0 
1171       nbody.sh        fct        hpc          2     gpu:t4:1        gpu:1  COMPLETED      0:0 
1175       teste.sh        fct        hpc          1                            COMPLETED      0:0 
1176       nbody.sh        gpu        hpc          1        gpu:1        gpu:1  COMPLETED      0:0
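If the gqueue and gacct wrappers are not available in your environment, plain Slurm commands give equivalent information (a sketch using the job ID from this example; field names such as ReqTRES and AllocTRES depend on the Slurm version):

[user@cirrus01 ~]$ squeue -j 1176
[user@cirrus01 ~]$ sacct -j 1176 --format=JobID,JobName,Partition,AllocCPUS,ReqTRES,AllocTRES,State,ExitCode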

If the state is different from COMPLETED or RUNNING then check your simulation or request help through the email address helpdesk@incd.pt, providing the JOBID, the submission script, the relevant Slurm output files, e.g. slurm-1176.out, and any other remarks you think may be helpful.

Check the results at job completion

[user@cirrus01 ~]$ ls -l
-rw-r-----+ 1 user hpc  268 Oct 22 13:56 nbody.sh
drwxr-x---+ 3 user hpc 4096 Oct 20 18:09 nbody
-rw-r-----+ 1 user hpc  611 Oct 22 13:41 slurm-1176.out


[user@cirrus01 ~]$ cat slurm-1176.out
...
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Turing" with compute capability 7.5

> Compute 7.5 CUDA device: [Tesla T4]
number of bodies = 2560000
2560000 bodies, total time for 10 iterations: 308586.156 ms
= 212.375 billion interactions per second
= 4247.501 single-precision GFLOP/s at 20 flops per interaction