Deep Learning Example
The INCD-Lisbon facility provides a few GPUs; check the Compute Node Specs page.
Login on the submit node
Log in to the cluster submission node; check the How to Access page for more information:
$ ssh -l <username> cirrus8.a.incd.pt
[username@cirrus08 ~]$ _
Alternatives to run the Deep Learning example
We have three alternatives to run the Deep Learning example, or any other python-based script:
- prepare a user python virtual environment in the home directory and launch a batch job;
- use the python virtual environment already prepared on the system and run the same example with it;
- use a udocker container, also available on the system.
The next three sections show how to run the example with each method.
1) Run a Deep Learning job using a user python virtual environment
List available software collections
[username@cirrus08 ~]$ scl list-collections
gcc-toolset-12
gcc-toolset-13
Prepare a python virtual environment
We will create a python virtual environment and include the needed components, since users do not have permission to install modules on the operating system:
[username@cirrus08 ~]$ scl enable gcc-toolset-13 bash
[username@cirrus08 ~]$ python3 -m venv ~/pvenv
[username@cirrus08 ~]$ . ~/pvenv/bin/activate
[username@cirrus08 ~]$ pip3 install --upgrade pip
[username@cirrus08 ~]$ pip3 install --upgrade setuptools
[username@cirrus08 ~]$ pip3 install tensorflow
[username@cirrus08 ~]$ pip3 install keras
This operation is performed only once; the python virtual environment will be reused across all your jobs.
Submit a Job to install TensorFlow and Keras on the python virtual environment
Since we do not have direct access to a GPU on the submit node, we submit a single job to install TensorFlow and Keras in our python virtual environment.
Create a submit script as shown below and submit it:
[username@cirrus08 ~]$ vi pip_install.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu
. ~/pvenv/bin/activate
pip3 install tensorflow
pip3 install keras
[username@cirrus08 ~]$ sbatch pip_install.sh
After the job finishes, check its output files for correct completion; if something is wrong, try to solve the problem or request support from helpdesk@incd.pt. You can also include the full python virtual environment preparation, as shown in the previous section, in the job itself if you prefer, as sketched below.
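As a sketch only (the script name venv_setup.sh is hypothetical and the package list simply repeats the steps above; adjust both to your needs), a single job that prepares the whole environment could look like:
[username@cirrus08 ~]$ vi venv_setup.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu
# create the virtual environment only if it does not exist yet
[ -d ~/pvenv ] || python3 -m venv ~/pvenv
# activate it and install the packages used in this example
. ~/pvenv/bin/activate
pip3 install --upgrade pip setuptools
pip3 install tensorflow keras
[username@cirrus08 ~]$ sbatch venv_setup.sh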
Check the python virtual environment
You may check if the python virtual environment is working as expected, for example:
[username@cirrus08 ~]$ . ~/pvenv/bin/activate
[username@cirrus08 ~]$ python --version
Python 3.6.8
[username@cirrus08 ~]$ pip3 list
Package                  Version
------------------------ ----------
...
Keras                    2.6.0
Keras-Preprocessing      1.1.2
...
setuptools               59.6.0
...
tensorboard              2.6.0
tensorboard-data-server  0.6.1
tensorboard-plugin-wit   1.8.1
tensorflow               2.6.2
tensorflow-estimator     2.6.0
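You may also want to confirm that TensorFlow actually sees the allocated GPU from a compute node. A minimal test job, assuming the gpu partition and the ~/pvenv environment from above (the script name gpu_check.sh and the import one-liner are only illustrative), could be:
[username@cirrus08 ~]$ vi gpu_check.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu
# activate the virtual environment and list the GPUs visible to TensorFlow
. ~/pvenv/bin/activate
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
[username@cirrus08 ~]$ sbatch gpu_check.sh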
Prepare your code
Choose a working directory for your code. For the purpose of this example we will run a deep learning python script named run.py; create also a submit script:
[username@cirrus08 ~]$ mkdir dl
[username@cirrus08 ~]$ cd dl
[username@cirrus08 dl]$ cp /cvmfs/sw.el8/share/deep_learning/run.py .
[username@cirrus08 dl]$ vi dl.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu
#SBATCH --mem=64G
. ~/pvenv/bin/activate
module load cuda-10.2
python run.py
[username@cirrus08 dl]$ ls -l
-rwxr-----+ 1 username usergroup  124 Feb 26 16:44 dl.sh
-rw-r-----+ 1 username usergroup 1417 Feb 26 16:46 run.py
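The exact content of the run.py provided on CVMFS is not reproduced here. Purely as an illustration of the kind of script you could run in its place, a minimal Keras example (it downloads the MNIST dataset, so it needs outbound network access from the node) could be created as follows:
[username@cirrus08 dl]$ cat > run.py <<'EOF'
import tensorflow as tf

# show which GPUs TensorFlow can see inside the job
print("GPUs:", tf.config.list_physical_devices("GPU"))

# load and normalise the MNIST digits
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# small fully connected classifier
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# short training run followed by evaluation on the test set
model.fit(x_train, y_train, epochs=2)
model.evaluate(x_test, y_test)
EOF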
Submit the Job
[username@cirrus08 dl]$ sbatch dl.sh
Submitted batch job 15135448
[username@cirrus08 dl]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
15135448 gpu dl.sh username R 0:01 1 hpc062
Check Job results
On completion, check the results in the standard output and error files:
[username@cirrus08 dl]$ ls -l
-rwxr-----+ 1 username usergroup   124 Feb 26 16:44 dl.sh
-rw-r-----+ 1 username usergroup  1417 Feb 26 16:46 run.py
-rw-r-----+ 1 username usergroup 18000 Feb 26 18:51 slurm-15135448.out
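A quick way to inspect the Slurm output file (its name follows the slurm-<jobid>.out pattern shown above) is, for example:
[username@cirrus08 dl]$ tail -n 20 slurm-15135448.out
[username@cirrus08 dl]$ grep -i error slurm-15135448.out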
2) Run a Deep Learning job using the CVMFS python virtual environment
Instead of preparing a user python virtual environment we can use the environment already available on the system, named python/3.10.13; check it with the command
[username@cirrus08 ~]$ module avail
---------------- /cvmfs/sw.el8/modules/hpc/main ------------------
...
oneapi/2023      python/3.8       alphafold/2.3.2
julia/1.6.7      python/3.10.13 (D)
...
You will find other python versions, namely 3.7 and 3.8; these versions do not contain the tensorflow module due to python version incompatibilities.
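If in doubt, you can confirm on the submit node that the module provides TensorFlow; the import one-liner below is only an illustration:
[username@cirrus08 ~]$ module load python/3.10.13
[username@cirrus08 ~]$ python -c "import tensorflow as tf; print(tf.__version__)"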
We will change the submit script dl.sh to the following:
[username@cirrus08 dl]$ vi dl.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu
#SBATCH --mem=64G
module load python/3.10.13
python run.py
[username@cirrus08 dl]$ ls -l
-rwxr-----+ 1 username usergroup  124 Feb 26 16:44 dl.sh
-rw-r-----+ 1 username usergroup 1417 Feb 26 16:46 run.py
and proceed as in the previous example.
3) Run a Deep Learning job using a container available on CVMFS
Another alternative is to use a tensorflow container available on the CVMFS volume; this container is accessed through a wrapper, check the UDocker Containers page for more details. The environment is named udocker/tensorflow/gpu/2.4.1, check it with:
[username@cirrus01 ~]$ module avail
----------------------- /cvmfs/sw.el7/modules/hpc -----------------------
...
hdf4/4.2.15    netcdf-fortran/4.5.2    udocker/tensorflow/gpu/2.4.1
...
In this case we edit the submit script dl.sh to load a different environment and invoke the run.py script through the u_wrapper wrapper, as shown below:
[username@cirrus01 dl]$ vi dl.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu
#SBATCH --mem=64G
module load udocker/tensorflow/gpu/2.4.1
u_wrapper python run.py
[username@cirrus01 dl]$ ls -l
-rwxr-----+ 1 username usergroup 157 Jan 5 13:42 dl.sh
-rw-r-----+ 1 username usergroup 1378 Jan 5 15:42 run.py
Then proceed as before to submit the job and collect the results. A quick interactive test of the container environment is sketched below.
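As a sketch only (assuming the gpu partition accepts short non-batch runs through srun; the import check is merely an illustration), you can test the wrapped container before submitting the full job:
[username@cirrus01 dl]$ module load udocker/tensorflow/gpu/2.4.1
[username@cirrus01 dl]$ srun -p gpu --gres=gpu u_wrapper python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"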