
Deep Learning Example

The INCD-Lisbon facility provides a few GPUs; check the Compute Node Specs page for details.

Login on the submit node

Log in to the cluster submission node; see the How to Access page for more information:

$ ssh -l <username> cirrus.ncg.ingrid.pt
[username@cirrus01 ~]$ _
Prepare a python virtual environment

The default Python version on CentOS 7.x is 2.7.5, which is not suitable for our example because it relies on Python 3.6 or later. We will therefore create a Python virtual environment and install the needed components in it:

[username@cirrus01 ~]$ scl enable rh-python36 bash
[username@cirrus01 ~]$ python -m venv ~/pvenv
[username@cirrus01 ~]$ . ~/pvenv/bin/activate
[username@cirrus01 ~]$ pip install --upgrade pip
[username@cirrus01 ~]$ pip install --upgrade setuptools
[username@cirrus01 ~]$ pip install tensorflow-gpu
[username@cirrus01 ~]$ pip install keras

This operation is performed only once; the Python virtual environment will be reused by all your jobs.

Submit a Job to install TensorFlow and Keras on the python virtual environment

Since the submit node has no direct access to a GPU, we have to submit one job, and only one, to install TensorFlow and Keras in our Python virtual environment.

Create a submit script like the one shown below and submit it:

[username@cirrus01 ~]$ vi pip_install.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu

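# Enable Python 3.6 in the current shell. "scl enable rh-python36 bash" would
# start a sub-shell instead, and the commands below would run outside it.
source scl_source enable rh-python36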
. ~/pvenv/bin/activate
pip install tensorflow-gpu
pip install keras

[username@cirrus01 ~]$ sbatch pip_install.sh

When the job finishes, check its output files to confirm that it completed correctly. If something went wrong, try to solve the problem or request support from helpdesk@incd.pt. If you prefer, the job can also include the full Python virtual environment preparation shown in the previous section.

Check the python virtual environment

You may check if the python virtual environment is working as expected, for example:

[username@cirrus01 ~]$ python --version
Python 2.7.5
[username@cirrus01 ~]$ scl enable rh-python36 bash
[username@cirrus01 ~]$ python --version
Python 3.6.9
[username@cirrus01 ~]$ . ~/pvenv/bin/activate
[username@cirrus01 ~]$ pip list
Package              Version   
-------------------- ----------
...
Keras                2.3.1
Keras-Applications   1.0.8
Keras-Preprocessing  1.1.0
...
setuptools           44.0.0
...
tensorboard          2.0.2     
tensorflow-estimator 2.0.1     
tensorflow-gpu       2.0.0
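
The same checks can also be scripted. The following is only a minimal sketch and not part of the original example: the function names in_virtualenv and missing_packages are made up for illustration, and the module names tensorflow and keras are assumed to be the import names of the packages listed by pip.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version :", sys.version.split()[0])
    print("Python >= 3.6  :", sys.version_info >= (3, 6))
    print("Inside a venv  :", in_virtualenv())
    # Assumption: these are the import names of the packages installed earlier.
    print("Missing modules:", missing_packages(["tensorflow", "keras"]))
```

Run it with the virtual environment activated; an empty "Missing modules" list suggests both packages can be located by the interpreter.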
Prepare your code

Choose a working directory for your code. For this example we will run a deep learning Python script named run.py; we also create a submit script, dl.sh:

[username@cirrus01 ~]$ mkdir dl
[username@cirrus01 ~]$ cd dl
[username@cirrus01 dl]$ wget --no-check-certificate https://wiki.incd.pt/attachments/7079 -O run.py

[username@cirrus01 dl]$ vi dl.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu
#SBATCH --mem=64G

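# Enable Python 3.6 in the current shell. "scl enable rh-python36 bash" would
# start a sub-shell instead, and the commands below would run outside it.
source scl_source enable rh-python36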
. ~/pvenv/bin/activate
module load cuda-10.2
python run.py

[username@cirrus01 dl]$ ls -l
-rwxr-----+ 1 username usergroup  514 Jan  5 13:42 dl.sh
-rw-r-----+ 1 username usergroup 1378 Jan  5 15:42 run.py
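
Before launching a long training run, it can be useful to confirm that TensorFlow actually sees the GPU allocated to the job. The sketch below is a hypothetical helper, not the contents of run.py: the function name visible_gpus and the suggested file name check_gpu.py are assumptions made for illustration. It falls back to the tf.config.experimental call because the environment above installs TensorFlow 2.0.

```python
def visible_gpus():
    """Return the GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # TensorFlow 2.0 exposes the call under tf.config.experimental;
    # newer releases also provide tf.config.list_physical_devices.
    list_devices = getattr(tf.config, "list_physical_devices", None)
    if list_devices is None:
        list_devices = tf.config.experimental.list_physical_devices
    return list_devices("GPU")


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    else:
        print("GPUs visible to TensorFlow:", len(gpus))
        for gpu in gpus:
            print(" ", gpu.name)
```

Saved as check_gpu.py, it could temporarily replace run.py in the last line of dl.sh. On the submit node, which has no GPU, it is expected to report zero devices.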
Submit the Job
[username@cirrus01 dl]$ sbatch dl.sh
Submitted batch job 2027497

[username@cirrus01 dl]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
2027497       gpu    dl.sh username  R       0:01      1 hpc062
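
To follow a job from a script rather than by eye, the default squeue table can be parsed into records. This is a rough sketch under stated assumptions: the helper names parse_squeue and my_jobs are hypothetical, the parser expects the default column layout (JOBID, PARTITION, NAME, USER, ST, TIME, NODES, NODELIST(REASON)), and job names containing spaces would confuse it.

```python
import subprocess


def parse_squeue(text):
    """Turn the default `squeue` table into a list of dicts, one per job."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Split on whitespace, but keep the last column whole because
        # NODELIST(REASON) may contain spaces.
        fields = line.strip().split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


def my_jobs(user):
    """Run `squeue -u <user>` and parse its output (needs Slurm on PATH)."""
    result = subprocess.run(
        ["squeue", "-u", user],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_squeue(result.stdout)


if __name__ == "__main__":
    # Sample text copied from the squeue listing of this example.
    sample = """\
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
2027497       gpu    dl.sh username  R       0:01      1 hpc062
"""
    for job in parse_squeue(sample):
        print(job["JOBID"], job["ST"], job["NODELIST(REASON)"])
```

A state of R means the job is running; PD means it is still pending, in which case the last column holds the reason in parentheses instead of a node name.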
Check Job results

On completion, check the results in the job output file. Unless the submit script says otherwise, Slurm writes both standard output and standard error to slurm-<jobid>.out:

[username@cirrus01 dl]$ ls -l
-rwxr-----+ 1 username usergroup   514 Jan  5 13:42 dl.sh
-rw-r-----+ 1 username usergroup  1378 Jan  5 15:42 run.py
-rw-r-----+ 1 username usergroup 14009 Jan  6 13:44 slurm-2027497.out
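
Long training logs are tedious to read in full, so a first pass can simply flag suspicious lines. The sketch below is illustrative only: the function name find_problem_lines and the keyword list are assumptions, and the demonstration runs on a throw-away file with made-up content rather than on real job output.

```python
import os
import tempfile


def find_problem_lines(path, keywords=("Traceback", "Error", "Killed")):
    """Return (line_number, text) pairs for lines containing any keyword."""
    hits = []
    with open(path, errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if any(keyword in line for keyword in keywords):
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Demonstration on a temporary file; for a real job, pass the path of
    # its output file instead, for example "slurm-2027497.out".
    with tempfile.NamedTemporaryFile("w", suffix=".out", delete=False) as tmp:
        tmp.write("Epoch 1/2\n"
                  "Traceback (most recent call last):\n"
                  "ValueError: bad shape\n")
    try:
        for number, text in find_problem_lines(tmp.name):
            print(number, text)
    finally:
        os.remove(tmp.name)
```

An empty result does not prove the run succeeded; it only means none of the chosen keywords appeared, so the end of the file is still worth reading.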