Deep Learning Example

The INCD-Lisbon facility provides a few GPUs; check the Compute Node Specs page for details.

$ ssh -l <username> cirrus8.a.incd.pt
[username@cirrus08 ~]$ _

Alternatives to run the Deep Learning example

There are three alternatives to run the Deep Learning example, or any other Python-based script:

prepare a user Python virtual environment in the home directory and launch a batch job;
use the Python virtual environment already prepared on the system and run the same example on it;
use a udocker container, also available on the system.

The next sections show how to run the example with each method.

1) Run a Deep Learning job using a user python virtual environment

Prepare a python virtual environment

We will create a Python virtual environment and install the needed components in it, because users do not have permission to install modules on the operating system.

[username@cirrus08 ~]$ python3 -m venv ~/pvenv
[username@cirrus08 ~]$ . ~/pvenv/bin/activate
[username@cirrus08 ~]$ pip3 install --upgrade pip
[username@cirrus08 ~]$ pip3 install --upgrade setuptools
[username@cirrus08 ~]$ pip3 install tensorflow
[username@cirrus08 ~]$ pip3 install keras

This operation is performed only once; the Python virtual environment will be reused by all your jobs.
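
As a quick sanity check after activation, the short sketch below reports whether the running interpreter belongs to a virtual environment. It relies only on the standard library; the function name in_virtualenv is an illustrative choice, not part of the facility's tooling.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the system installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

Run with the environment activated, the interpreter path should point inside ~/pvenv.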

Check the python virtual environment

You may check if the python virtual environment is working as expected, for example:

[username@cirrus08 ~]$ . ~/pvenv/bin/activate
[username@cirrus08 ~]$ python --version
Python 3.6.8
[username@cirrus08 ~]$ pip3 list
Package              Version   
-------------------- ----------
...
Keras                   2.6.0
Keras-Preprocessing     1.1.2
...
setuptools              59.6.0
...
tensorboard             2.6.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.1
tensorflow              2.6.2
tensorflow-estimator    2.6.0
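
The same check can be scripted instead of reading the pip3 list output by eye. The sketch below only tests whether the named modules can be located by the interpreter, without importing them; package_status is a hypothetical helper name, and the module list is taken from the install step above.

```python
import importlib.util


def package_status(names):
    """Map each top-level module name to True if the interpreter can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Modules installed into the virtual environment in the previous step.
    for name, found in package_status(["tensorflow", "keras"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```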

Prepare your code

Choose a working directory for your code. For the purpose of this example we will run a deep learning Python script named run.py. Create a submit script as well:

[username@cirrus08 ~]$ mkdir dl
[username@cirrus08 ~]$ cd dl
[username@cirrus08 dl]$ cp /cvmfs/sw.el8/share/deep_learning/run.py .

[username@cirrus08 dl]$ vi dl.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu
#SBATCH --mem=64G
. ~/pvenv/bin/activate
module load cuda-10.2
python run.py

[username@cirrus08 dl]$ ls -l
-rwxr-----+ 1 username usergroup  124 Feb 26 16:44 dl.sh
-rw-r-----+ 1 username usergroup 1417 Feb 26 16:46 run.py
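
Before running the full example, it can be useful to confirm that TensorFlow actually sees the GPU allocated to the job. The following is a minimal sketch under stated assumptions: the file name check_gpu.py is hypothetical, it would temporarily replace run.py in the submit script, and the CUDA_VISIBLE_DEVICES variable is assumed to be set by the scheduler for GPU jobs, which depends on the site configuration.

```python
import os


def visible_gpus():
    """Return the GPU device names TensorFlow reports, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    # Assumption: the scheduler exports this variable for jobs that request a GPU.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
    gpus = visible_gpus()
    if gpus is None:
        print("tensorflow is not importable in this environment")
    elif gpus:
        print("GPUs visible to TensorFlow:", gpus)
    else:
        print("TensorFlow found no GPU; the job would run on CPU")
```

An empty list on a GPU node usually points at a mismatch between the loaded CUDA module and the installed tensorflow version.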

Submit the Job

[username@cirrus08 dl]$ sbatch dl.sh
Submitted batch job 15135448

[username@cirrus08 dl]$ squeue
   JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON) 
15135448       gpu    dl.sh username  R  0:01      1 hpc062
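
When submissions are scripted, the job ID in the confirmation line is what ties a submission to its output file. The sketch below extracts the ID from a line of the form shown above and derives the output file name, assuming the submit script sets no custom output option so that the slurm-<jobid>.out naming seen in this example applies. Both function names are illustrative.

```python
import re


def parse_job_id(submit_output):
    """Extract the numeric job ID from the submission confirmation line."""
    match = re.search(r"Submitted batch job (\d+)", submit_output)
    if match is None:
        raise ValueError(f"no job ID found in: {submit_output!r}")
    return int(match.group(1))


def default_output_file(job_id):
    """Output file name used when the submit script sets no output option."""
    return f"slurm-{job_id}.out"


if __name__ == "__main__":
    job_id = parse_job_id("Submitted batch job 15135448")
    print(job_id, default_output_file(job_id))
```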

Check Job results

On completion, check the results in the job's standard output and error files:

[username@cirrus08 dl]$ ls -l
-rwxr-----+ 1 username usergroup   124 Feb 26 16:44 dl.sh
-rw-r-----+ 1 username usergroup  1417 Feb 26 16:46 run.py
-rw-r-----+ 1 username usergroup 18000 Feb 26 18:51 slurm-15135448.out
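
Training logs can be long, so a first pass that only surfaces suspicious lines may save time. The sketch below is one possible way to do that; the keyword list and the sample text are assumptions chosen for illustration and do not reflect the actual output of run.py.

```python
def find_problem_lines(text, keywords=("error", "traceback", "killed", "out of memory")):
    """Return (line_number, line) pairs whose text contains one of the keywords."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line))
    return hits


if __name__ == "__main__":
    # Hypothetical log excerpt; in practice, read the slurm-<jobid>.out file instead.
    sample = "Epoch 1/5\nloss: 0.31\nTraceback (most recent call last):\nMemoryError"
    for number, line in find_problem_lines(sample):
        print(f"{number}: {line}")
```

To scan a real job, the sample string would be replaced with the contents of the output file, for example open("slurm-15135448.out").read().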

2) Run a Deep Learning job using the CVMFS python virtual environment

Instead of preparing a user Python virtual environment, we can use the environment already available on the system, named python/3.10.13. Check that it is available with the command:

[username@cirrus08 ~]$ module avail
---------------- /cvmfs/sw.el8/modules/hpc/main ------------------
...
intel/oneapi/2023    python/3.8          udocker/alphafold/2.3.2
julia/1.6.7          python/3.10.13 (D)
...

Other Python versions are also available, namely 3.7 and 3.8; these versions do not contain the tensorflow module due to Python version incompatibility.

We will change the submit script dl.sh to the following:

[username@cirrus08 dl]$ vi dl.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu
#SBATCH --mem=64G

module load python/3.10.13
python run.py

[username@cirrus08 dl]$ ls -l
-rwxr-----+ 1 username usergroup   124 Feb 26 16:44 dl.sh
-rw-r-----+ 1 username usergroup  1417 Feb 26 16:46 run.py

Then proceed as in the previous example.