How to Run a Job with a GPU
Let's run the gravitational N-body simulation from the CUDA toolkit samples on a GPU. This example is suited for a standard INCD user eligible to use the hpc and gpu partitions.
The fct partition and its resources are meant for users with an FCT grant. Although GPUs are requested in the same way, these users have specific instructions to follow, found at FCT Calls.
GPUs are only available at the CIRRUS-A infrastructure in Lisbon.
Log in to the user interface cirrus.ncg.ingrid.pt
$ ssh -l user cirrus.ncg.ingrid.pt
[user@cirrus01 ~]$ _
Prepare your working directory
Prepare your environment in a dedicated directory, to protect it from interference between jobs, and create a submission batch script:
[user@cirrus01 ~]$ mkdir myworkdir
[user@cirrus01 ~]$ cd myworkdir
[user@cirrus01 ~]$ cat nbody.sh
#!/bin/bash

#SBATCH --partition=gpu
#SBATCH --gres=gpu
#SBATCH --mem=8192MB
#SBATCH --ntasks=1

COMMON=/usr/local/cuda/samples/common
SAMPLE=/usr/local/cuda/samples/5_Simulations/nbody

[ -d ../common ] || cp -r $COMMON ..
[ -d nbody ] || cp -r $SAMPLE .

module load cuda

cd nbody
make clean
make

if [ -e nbody ]; then
    chmod u+x nbody
    ./nbody -benchmark -numbodies=2560000
fi
In this example we copy the N-body simulation from the CUDA toolkit samples to the working directory, load the CUDA environment, build the simulation and run it.
Requesting the partition
Standard INCD users at CIRRUS-A have access to the gpu partition, which provides NVIDIA Tesla-T4 GPUs. To access these GPUs, request the gpu partition with the directive:
#SBATCH --partition=gpu
The fct partition provides two types of NVIDIA Tesla GPUs: T4 and V100S. This partition is exclusive to FCT grant users. As a general rule, and depending on the application, the two types of GPU available on the cluster are very similar, but the Tesla-V100S performs the same work in half the time when compared with the Tesla-T4. Nevertheless, if you request a Tesla-V100S you may have to wait for one to become available even while a free Tesla-T4 is ready to go. If you only want any free GPU allocated to your job, then the #SBATCH --gres=gpu form is the best choice.
Requesting the GPU
We request the allocation of one NVIDIA Tesla-T4 GPU through the option:
#SBATCH --gres=gpu:t4
Standard INCD users can access only NVIDIA Tesla-T4 GPUs, so we can simplify the request:
#SBATCH --gres=gpu
This way we ask for a GPU of any type. The same is valid on partitions with more than one type of GPU, if we do not care which type of GPU is allocated to our job.
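To illustrate a request for a specific GPU type, the header below is a minimal sketch for a job on the fct partition that asks for one Tesla-V100S. The type name v100s is taken from the ReqGRES column of the accounting output shown later on this page; the exact type names accepted on a given partition are an assumption to confirm, and FCT grant users have additional instructions to follow.

```shell
#!/bin/bash
# Sketch only: request one GPU of a specific type.
# Assumption: "v100s" is the type name configured on the fct partition.
#SBATCH --partition=fct
#SBATCH --gres=gpu:v100s:1
```

Dropping the type, as in --gres=gpu, lets the scheduler pick whichever GPU is free first.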
Ensure enough memory for your simulation; follow the tips on the Determining Memory Requirements page.
In our example 8 GB is sufficient to run the simulation:
#SBATCH --mem=8192MB
You should also plan the number of tasks to use. This depends on the application; for our simulation we need only one task, or CPU:
#SBATCH --ntasks=1
We could omit this directive, since the default is 1 task.
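To confirm what a job actually received, a few lines like the following could be added to the batch script before the simulation starts. This is a sketch, not part of the original example: report_allocation is a hypothetical helper name, the SLURM_* names are standard Slurm job environment variables, and CUDA_VISIBLE_DEVICES is assumed to be set by the scheduler for GPU jobs, as is common but not confirmed for this cluster.

```shell
#!/bin/bash
# Print the resources Slurm granted to the current job.
# Outside a job the variables are unset, so a placeholder is shown instead.
report_allocation() {
    echo "job id:    ${SLURM_JOB_ID:-unset}"
    echo "partition: ${SLURM_JOB_PARTITION:-unset}"
    echo "tasks:     ${SLURM_NTASKS:-unset}"
    echo "memory MB: ${SLURM_MEM_PER_NODE:-unset}"
    echo "GPU ids:   ${CUDA_VISIBLE_DEVICES:-unset}"
    # List the visible GPUs when the NVIDIA tool is available on the node.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L || true
    fi
}

report_allocation
```

The report ends up in the job's slurm-JOBID.out file, next to the simulation output.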
Submit the simulation
[user@cirrus01 ~]$ sbatch nbody.sh
Submitted batch job 1176
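When the submission is scripted, it helps to keep the job ID for the monitoring steps. The sketch below parses the confirmation message printed by sbatch; extract_job_id is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch's confirmation message
# ("Submitted batch job <id>"): the ID is the last word on the line.
extract_job_id() {
    awk '/^Submitted batch job/ { print $NF }'
}

# Usage in a login session (not executed here):
#   jobid=$(sbatch nbody.sh | extract_job_id)
#   echo "submitted job $jobid"
```

As an alternative, sbatch --parsable prints only the job ID (followed by the cluster name on multi-cluster setups), which avoids parsing the message.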
Monitor your job
You can use the squeue command line tool
[user@cirrus01 ~]$ gqueue
JOBID PARTITION NAME     USER  ST STATIME       NODES CPUS TRES_PER_NODE NODELIST
1176  gpu       nbody.sh user5 R  RUN0-00:02:33 1     1    gpu:t4        hpc058
Or use the sacct command; the job is finished when the State field shows COMPLETED.
[user@cirrus01 ~]$ gacct
       JobID    JobName  Partition    Account  AllocCPUS      ReqGRES    AllocGRES      State ExitCode
------------ ---------- ---------- ---------- ---------- ------------ ------------ ---------- --------
1170           nbody.sh        fct        hpc          2  gpu:v100s:1        gpu:1  COMPLETED      0:0
1171           nbody.sh        fct        hpc          2     gpu:t4:1        gpu:1  COMPLETED      0:0
1175           teste.sh        fct        hpc          1                            COMPLETED      0:0
1176           nbody.sh        gpu        hpc          1        gpu:1        gpu:1  COMPLETED      0:0
If the state is anything other than COMPLETED or RUNNING, check your simulation or request help through the email address firstname.lastname@example.org, providing the JOBID, the submission script, the relevant Slurm output files (e.g. slurm-1176.out), and any other remarks you think may be helpful.
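The check described above can be automated. The following is a minimal sketch that polls sacct until the job is no longer pending or running; classify_state and wait_for_job are hypothetical helper names, and the 30 second default interval is an arbitrary choice. It calls the standard sacct command rather than the gacct form shown in the transcript.

```shell
#!/bin/bash
# Map a Slurm job state to one of three coarse outcomes.
classify_state() {
    case "$1" in
        COMPLETED) echo "finished" ;;
        # An empty state means the job is not in the accounting records yet.
        PENDING|RUNNING|COMPLETING|"") echo "active" ;;
        *) echo "attention" ;;
    esac
}

# Poll sacct until the job leaves the active states.
# Usage: wait_for_job <jobid> [poll interval in seconds]
# Note: an ID that never appears in accounting keeps the loop waiting.
wait_for_job() {
    local jobid="$1" interval="${2:-30}" state outcome
    while true; do
        # -X: allocation line only; --parsable2 avoids truncated state names.
        state=$(sacct -j "$jobid" -X --noheader --parsable2 --format=State |
                awk 'NR == 1 { print $1 }')
        outcome=$(classify_state "$state")
        if [ "$outcome" != "active" ]; then
            echo "job $jobid: $state ($outcome)"
            break
        fi
        sleep "$interval"
    done
    # Succeed only when the job completed normally.
    [ "$outcome" = "finished" ]
}
```

For example, wait_for_job 1176 60 || echo "check slurm-1176.out" would wait for the job above and print a reminder if it ended in any state other than COMPLETED.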
Check the results at job completion
[user@cirrus01 ~]$ ls -l
-rw-r-----+ 1 user hpc  268 Oct 22 13:56 gpu.sh
drwxr-x---+ 3 user hpc 4096 Oct 20 18:09 nbody
-rw-r-----+ 1 user hpc  611 Oct 22 13:41 slurm-1176.out
[user@cirrus01 ~]$ cat slurm-1176.out
...
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Turing" with compute capability 7.5
> Compute 7.5 CUDA device: [Tesla T4]
number of bodies = 2560000
2560000 bodies, total time for 10 iterations: 308586.156 ms
= 212.375 billion interactions per second
= 4247.501 single-precision GFLOP/s at 20 flops per interaction
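As a sanity check on the output, the reported throughput can be recomputed from the number of bodies, the iteration count and the total time. The sketch below assumes that each iteration evaluates every pair of bodies (N x N interactions), which is consistent with the figures printed above; nbody_throughput is a hypothetical helper name.

```shell
#!/bin/bash
# Recompute the benchmark figures from the values printed in the job output.
# Usage: nbody_throughput <bodies> <iterations> <total time in ms>
nbody_throughput() {
    awk -v n="$1" -v iters="$2" -v ms="$3" 'BEGIN {
        # Assumption: every iteration evaluates n * n pairwise interactions.
        rate = n * n * iters / (ms / 1000) / 1e9   # billions per second
        printf "%.3f billion interactions per second\n", rate
        # 20 flops per interaction is the factor stated in the job output.
        printf "%.3f GFLOP/s at 20 flops per interaction\n", rate * 20
    }'
}

nbody_throughput 2560000 10 308586.156
```

With the values from slurm-1176.out this gives 212.375 billion interactions per second and 4247.501 GFLOP/s, matching the job output.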