Deep Learning Example

The INCD-Lisbon facility provides a few GPUs; check the Compute Node Specs page for details.

Login on the submit node

$ ssh -l <username> cirrus8.a.incd.pt
[username@cirrus08 ~]$ _

Alternatives to run the Deep Learning example

There are alternative ways to run the Deep Learning example, or any other Python-based script:

prepare a user Python virtual environment in the home directory and launch a batch job;

use a Python virtual environment already prepared on CVMFS and launch a batch job.

The following section shows how to run the example with the prepared CVMFS environment.

1) Run a Deep Learning job using a prepared CVMFS python virtual environment

Instead of preparing a user Python virtual environment, we can use the environment already available on the system, named python/3.10.13. Check that it is listed with the command:

[username@cirrus08 ~]$ module avail
---------------- /cvmfs/sw.el8/modules/hpc/main ------------------
...
intel/oneapi/2023    python/3.8          udocker/alphafold/2.3.2
julia/1.6.7          python/3.10.13 (D)
...

Other Python versions are also available, namely 3.7 and 3.8, but they do not include the tensorflow module because of Python version incompatibility.
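To confirm that a loaded Python module actually provides tensorflow, the package can be probed from the interpreter without importing it. The following is a minimal sketch; the helper name has_module is hypothetical and not part of the INCD environment.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the named top-level package can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run this after `module load python/<version>` to see whether that
    # environment ships tensorflow.
    print("tensorflow available:", has_module("tensorflow"))
```

Running it once per loaded Python module shows which environments can be used for the example.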

We will change the submit script dl.sh to the following:

[username@cirrus08 dl]$ vi dl.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu
#SBATCH --mem=64G

module load python/3.10.13
python run.py

[username@cirrus08 dl]$ ls -l
-rwxr-----+ 1 username usergroup   124 Feb 26 16:44 dl.sh
-rw-r-----+ 1 username usergroup  1417 Feb 26 16:46 run.py
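
The content of run.py is not reproduced here. As an illustration only, a script of this kind could begin by reporting which GPUs the job was granted. The sketch below assumes that Slurm exports CUDA_VISIBLE_DEVICES for jobs submitted with --gres=gpu, which is the usual behaviour but depends on the site configuration; the helper name visible_gpus is hypothetical.

```python
import os


def visible_gpus(env=None):
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES (empty list if unset)."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    # SLURM_JOB_ID is set by Slurm inside a running job.
    print("SLURM_JOB_ID:", os.environ.get("SLURM_JOB_ID", "(not in a job)"))
    print("GPUs visible to this job:", gpus if gpus else "none")
```

An empty list in the job output would suggest that the job did not receive a GPU, or that the variable is not exported on this cluster.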

Submit the Job

[username@cirrus08 dl]$ sbatch dl.sh
Submitted batch job 15135448
[username@cirrus08 dl]$ squeue
JOBID    PARTITION NAME       USER        ST TIME       NODES CPUS TRES_PER_NODE  NODELIST
15135448 gpu       dl.sh      username    PD 0:00       1     1    gres/gpu
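
When the submission is scripted, the job ID can be extracted from the sbatch message and used to predict the name of the output file. The sketch below assumes the default Slurm output pattern slurm-<jobid>.out, that is, no --output option in the submit script; the function names are hypothetical.

```python
import re


def parse_job_id(sbatch_output: str) -> str:
    """Extract the numeric job ID from the message printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError("no job ID found in sbatch output")
    return match.group(1)


def default_output_file(job_id: str) -> str:
    """Name of the file where Slurm writes job output by default."""
    return f"slurm-{job_id}.out"


if __name__ == "__main__":
    job_id = parse_job_id("Submitted batch job 15135448")
    print(job_id, default_output_file(job_id))
```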

Check Job results

When the job completes, check the results in the job output file, which by default collects both standard output and standard error:

[username@cirrus08 dl]$ ls -l
-rwxr-----+ 1 username usergroup   124 Feb 26 16:44 dl.sh
-rw-r-----+ 1 username usergroup  1417 Feb 26 16:46 run.py
-rw-r-----+ 1 username usergroup 18000 Feb 26 18:51 slurm-15135448.out

and proceed as in the previous example.
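
Training output can be long, so a quick scan of the output file for signs of failure may help before reading it in full. The following is a minimal sketch under assumptions: the keyword list is only a guess at what a failed run might print, and the function name suspicious_lines is hypothetical.

```python
from pathlib import Path


def suspicious_lines(path, keywords=("Traceback", "Error", "CANCELLED")):
    """Return (line_number, text) pairs for lines containing any of the keywords."""
    hits = []
    with Path(path).open(errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if any(word in line for word in keywords):
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    out_file = Path("slurm-15135448.out")
    # Only scan the file if it exists in the current directory.
    if out_file.exists():
        for number, text in suspicious_lines(out_file):
            print(f"{number}: {text}")
```

An empty result does not prove the run succeeded; it only means none of the assumed keywords appeared.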
