Deep Learning Example
The INCD-Lisbon facility provides a few GPUs; check the Compute Node Specs page for details.
Log in to the submit node
Log in to the cluster submission node; check the How to Access page for more information:
$ ssh -l <username> cirrus8.a.incd.pt
[username@cirrus08 ~]$ _
Alternatives to run the Deep Learning example
There are alternative ways to run the Deep Learning example, or any other Python-based script:
- prepare a user Python virtual environment in the home directory and launch a batch job;
The next three sections show how to run the example with each method.
1) Run a Deep Learning job using a prepared CVMFS python virtual environment
Instead of preparing a user Python virtual environment, we can use the environment already available on the system, named python/3.10.13. Check that it is available with the command:
[username@cirrus08 ~]$ module avail
---------------- /cvmfs/sw.el8/modules/hpc/main ------------------
...
intel/oneapi/2023 python/3.8 udocker/alphafold/2.3.2
julia/1.6.7 python/3.10.13 (D)
...
Other Python versions are also available, namely 3.7 and 3.8, but these versions do not contain the tensorflow module due to Python version incompatibility.
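After loading a module, a quick way to confirm that its interpreter really provides TensorFlow is to ask Python whether the package can be located. The following is a minimal sketch and not part of the cluster documentation; the helper name `has_module` is hypothetical.

```python
import importlib.util
import sys


def has_module(name: str) -> bool:
    """Return True if the top-level package `name` can be found by this interpreter."""
    # find_spec only locates the package; it does not import it,
    # so the check is fast even for heavy packages such as tensorflow.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print("tensorflow available:", has_module("tensorflow"))
```

Running this script once after `module load` of each Python version shows which of them can be used for the example.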
We will change the submit script dl.sh to the following:
[username@cirrus08 dl]$ vi dl.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu
#SBATCH --mem=64G
module load python/3.10.13
python run.py
[username@cirrus08 dl]$ ls -l
-rwxr-----+ 1 username usergroup 124 Feb 26 16:44 dl.sh
-rw-r-----+ 1 username usergroup 1417 Feb 26 16:46 run.py
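Because dl.sh requests a GPU with --gres=gpu, it can be useful for the Python script to report which devices the job was given. The sketch below assumes that Slurm exports the CUDA_VISIBLE_DEVICES variable for GPU jobs, which is common but depends on the site configuration; the function name `visible_gpus` is hypothetical and is not part of run.py.

```python
import os


def visible_gpus(env=None):
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES, or [] if none."""
    # Assumption: the scheduler sets CUDA_VISIBLE_DEVICES for jobs that request a GPU.
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [dev.strip() for dev in raw.split(",") if dev.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print("GPUs assigned to this job:", ", ".join(gpus))
    else:
        print("No GPU listed in CUDA_VISIBLE_DEVICES")
```

An empty list inside a batch job would suggest that the GPU request in the submit script did not take effect.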
Submit the Job
[username@cirrus08 dl]$ sbatch dl.sh
Submitted batch job 15135448
[username@cirrus08 dl]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS TRES_PER_NODE NODELIST
15290034 gpu dl.sh jpina PD 0:00 1 1 gres/gpu
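The submit and monitor steps can also be scripted. The following sketch extracts the job ID from the line printed by sbatch and asks squeue for the state of that job; the function names are hypothetical, and the squeue options used (-h, -j, -o "%T") are standard Slurm options whose output is assumed to be the bare state name.

```python
import re
import subprocess


def parse_job_id(sbatch_output: str) -> int:
    """Extract the job ID from a line such as 'Submitted batch job 15135448'."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def job_state(job_id: int) -> str:
    """Return the job state (e.g. PENDING, RUNNING), or '' once it has left the queue."""
    result = subprocess.run(
        ["squeue", "-h", "-j", str(job_id), "-o", "%T"],
        capture_output=True,
        text=True,
        check=False,  # squeue may exit non-zero for jobs that already finished
    )
    return result.stdout.strip()


if __name__ == "__main__":
    job_id = parse_job_id("Submitted batch job 15135448")
    print(job_id, job_state(job_id) or "not in queue")
```

A state of PENDING corresponds to the PD shown in the ST column of the squeue listing.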
Check Job results
On completion, check the results in the standard output and error files:
[username@cirrus08 dl]$ ls -l
-rwxr-----+ 1 username usergroup 124 Feb 26 16:44 dl.sh
-rw-r-----+ 1 username usergroup 1417 Feb 26 16:46 run.py
-rw-r-----+ 1 username usergroup 18000 Feb 26 18:51 slurm-15135448.out
and proceed as in the previous example.
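Since the Slurm output file can be long, a small script can help spot failures quickly. This is a rough sketch under the assumption that problems show up as lines containing words such as "Traceback", "Error" or "Killed"; the pattern and the function name `find_problem_lines` are illustrative, and TensorFlow may log harmless lines that also match.

```python
import re
from pathlib import Path

# Assumed markers of a failed run; adjust to the messages your job actually prints.
PROBLEM_PATTERN = re.compile(r"traceback|error|killed", re.IGNORECASE)


def find_problem_lines(text: str):
    """Return (line_number, line) pairs for lines that match PROBLEM_PATTERN."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if PROBLEM_PATTERN.search(line)
    ]


if __name__ == "__main__":
    log = Path("slurm-15135448.out")  # output file of the job submitted above
    if log.exists():
        for number, line in find_problem_lines(log.read_text(errors="replace")):
            print(f"{log.name}:{number}: {line}")
    else:
        print(f"{log} not found")
```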