Manage SoGE jobs (deprecated)

How to manage jobs using the SoGE batch system at the old INCD-Lisbon cluster. HPC users should instead follow Manage SLURM jobs at https://wiki.incd.pt/books/manage-jobs/chapter/manage-slurm-jobs.

My first job

Examples

Submit a simple MPI job

vi my_first_job_submit.sh

#!/bin/bash

# Call the MPI environment, selecting two cores 
#$ -pe mpi 2

# Choose the queue hpc (OPTIONAL). If you omit this option, the system will use the default queue
#$ -q hpc

# Load software modules (Open MPI 2.1.0 and GCC compiler 7.3). Please check the software section of the wiki for details
source /etc/profile.d/modules.sh
module load openmpi-2.1.0 
module load gcc-7.3


# Compile application 
echo "=== Compiling ===" 
mpicc -o cpi cpi.c

# Run application. Please note that the number of cores used by MPI must match the number of slots requested.
echo "=== Running ==="
mpirun -np $NSLOTS cpi

qsub my_first_job_submit.sh

Your job 75463 ("my_first_job_submit.sh") has been submitted
 
 [jpina@hpc7 ~]$ qstat 
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  75463 0.15016 my_first_j jpina        qw    04/30/2019 11:55:55                                    2        
[jpina@hpc7 ~]$ 


State qw means the job is waiting in the queue

You can check the details of a specific job with qstat -j 75463; running plain qstat again shows the job once it has started:
[jpina@hpc7 ~]$ qstat 
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  76877 0.68130 my_first_j jpina        r     04/30/2019 12:03:38 hpc@hpc046.ncg.ingrid.pt           2        

State r means the job is running

[jpina@hpc7 ~]$ qstat 

If the output is empty there are no jobs queued or running under your user (they have all finished)
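
Once a job has finished and no longer appears in qstat, its accounting record (exit status, CPU time, memory usage) can still be retrieved with qacct, assuming job accounting is enabled on the cluster:

qacct -j 76877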

For each job you will have an output file and an error file produced (if you use a parallel environment you will receive an additional pair of files) in the directory where you submitted your job; a sketch for choosing your own output file names follows the example output below:

-rw-r-----+  1 jpina csys        0 Apr 30 13:48 my_first_job_submit.sh.e77285
-rw-r-----+  1 jpina csys      186 Apr 30 13:48 my_first_job_submit.sh.o77285
-rw-r-----+  1 jpina csys        0 Apr 30 13:48 my_first_job_submit.sh.pe77285
-rw-r-----+  1 jpina csys        0 Apr 30 13:48 my_first_job_submit.sh.po77285

Reading the output of the job:

[jpina@hpc7 jpina]$ cat my_first_job_submit.sh.o77285
=== Compiling ===
=== Running ===
Hello world from processor hpc046.ncg.ingrid.pt, rank 0 out of 2 processors
Hello world from processor hpc046.ncg.ingrid.pt, rank 1 out of 2 processors
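
By default these files are named after the submission script. If you prefer fixed names, the standard SGE output directives can be added near the top of my_first_job_submit.sh; a minimal sketch (the file names are hypothetical):

# Keep job output in the directory the job was submitted from
#$ -cwd
# Write standard output and standard error to fixed file names
#$ -o my_first_job.out
#$ -e my_first_job.err
# (alternatively, '#$ -j y' merges standard error into standard output)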

Other useful commands

[jpina@hpc7 ~]$ qdel 75463
jpina has deleted job 75463 
qstat -u '*'     (list the jobs of all users)
qstat -g c       (summary of the cluster queues)

My MPI example:

[jpina@hpc7 jpina]$ cat cpi.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    
    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);
    
    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
    processor_name, world_rank, world_size);
    
    // Finalize the MPI environment.
    MPI_Finalize();
}


INCD uses SoGE (Son of Grid Engine) for job submission and scheduling. We are currently planning a migration to a new batch system, SLURM.

Quick syntax guide

Command   Comments
qstat     Standard queue status command (see man qstat for details)
qdel      Delete your jobs from the queues; the job ID is returned by qsub at submission time
qsub      Submit jobs to the queues (see man qsub for details)
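
The same options that appear as #$ directives inside a submission script can also be passed on the qsub command line, where they take precedence over the values in the script. A short sketch using the example script from above:

# request 4 slots of the mpi parallel environment and the hpc queue at submission time
qsub -q hpc -pe mpi 4 my_first_job_submit.sh

# list only your own jobs
qstat -u $USER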



How to find your job output information

Summary table

File                                Default path                       Comments
<file_submission_name>.e<jobID>     /home/<groupname>/<username>/      Error output of the job
<file_submission_name>.o<jobID>     /home/<groupname>/<username>/      Standard output of the job
<file_submission_name>.pe<jobID>    /home/<groupname>/<username>/      Error output of the parallel environment
<file_submission_name>.po<jobID>    /home/<groupname>/<username>/      Standard output of the parallel environment
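
The <file_submission_name> prefix is simply the job name, so it can be changed by naming the job explicitly with the -N directive (a small sketch; 'myrun' is an arbitrary name):

#$ -N myrun

With this directive the output files become myrun.o<jobID>, myrun.e<jobID>, and so on.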

How to find available resources

Global parameters

Global resources

qstat -g c -q hpcgrid
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE  
--------------------------------------------------------------------------------
hpcgrid                           0.55    160      0     48    264      0     56 

Special resources

 qhost -l mem_total=300g 
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
hpc044                  lx-amd64       64    2   64   64 60.40  378.6G   20.3G  128.0G   34.6M
hpc045                  lx-amd64       64    2   64   64 64.34  378.6G   20.3G  447.1G   44.3M
hpc046                  lx-amd64      128    2   64  128  0.03  377.6G    6.4G  128.0G     0.0
hpc047                  lx-amd64      128    2   64  128  0.04  377.6G    5.6G  128.0G     0.0
hpc048                  lx-amd64      128    2  128  128 77.03  378.6G    6.5G  128.0G     0.0
The full list of resources available on each host can be obtained with:

qhost -F
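
To land on one of the large-memory hosts shown above, the corresponding resource can be requested at submission time with -l (a sketch; it assumes mem_total is configured as a requestable complex on this cluster):

qsub -l mem_total=300g my_first_job_submit.sh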

How to use job dependencies

Example

$ cat dependencies.sh

#!/bin/sh

# Usage: ./dependencies.sh <start_index> <number_of_jobs> <cores_per_job>
j=$1
NJOBS=$2
NPROC=$3

jfinal=`expr $j + $NJOBS`
first="true"

while [ $j -lt $jfinal ]; do
   j=`expr $j + 1`

   # Generate the submission script for step $j.
   # $NPROC is expanded now; \$NSLOTS is escaped so SGE expands it at run time.
   cat - > $0.$j << EOF
#!/bin/bash

# Call MPI environment with $NPROC cores
#$ -pe mpi $NPROC

# Load modules
source /etc/profile.d/modules.sh
module load gcc44/openmpi-1.4.1

# Compile application
echo "=== Compiling ==="
mpicc -o cpi cpi.c

# Execute application
echo "=== Running ==="
mpirun -np \$NSLOTS cpi
EOF

   if [ "X$first" = "Xtrue" ]; then
      # First job of the chain: submit without any dependency
      JID=`qsub $0.$j | awk '{print $3}'`
      if [ -n "$JID" ]; then
         first="false"
         echo "Submitting job for $j (JID=$JID)"
      else
         echo "Problem submitting job for $j"
         exit 1
      fi
   else
      # Following jobs: hold until the previous job (OLDJID) has finished
      OLDJID=$JID
      JID=`qsub -hold_jid $OLDJID $0.$j | awk '{print $3}'`
      if [ -z "$JID" ]; then
         echo "Problem submitting job $j"
         exit 1
      else
         echo "Submitting job for j=$j (JID=$JID). Depends on $OLDJID"
      fi
   fi

   rm -f $0.$j
done

Run it, for example, to submit a chain of 100 dependent 8-core jobs:

./dependencies.sh 1 100 8
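
The same mechanism can also be used directly, without a generator script: capture the job ID returned by qsub and pass it to -hold_jid on the next submission (a sketch; step1.sh and step2.sh are hypothetical submission scripts):

JID=$(qsub step1.sh | awk '{print $3}')
qsub -hold_jid $JID step2.sh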

How to interpret the job status

qstat

Status       Meaning
qw           job in the queue, waiting for available resources
hqw          job in the queue, waiting, with a system hold
r            job is running
t            job is being transferred to the execution host
Eqw, Ehqw    job pending in an error state
dr           running job for which deletion has been requested
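
If a job is stuck in the Eqw state and the underlying problem has been fixed (for example a missing directory or permission), the error flag can usually be cleared so the scheduler considers the job again (a sketch; assumes you own the job):

qmod -cj jobID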

Full details

qstat -j jobID

or

qstat -j jobID | less 

Example

qstat -j 190407
job_number:                 190407    (jobID number)
exec_file:                  job_scripts/190407  
submission_time:            Thu May  9 13:01:35 2019  (job submission time)
owner:                      biomed015 (username)
uid:                        3060015  (userID)
group:                      biomed (userGROUP)
gid:                        3060000  (usergroupID)
sge_o_home:                 /home/biomed/biomed015  (home of the username)
sge_o_log_name:             biomed015  (SGE internal parameters)
sge_o_path:                 /opt/sge/bin/lx-amd64:/sbin:/bin:/usr/sbin:/usr/bin (SGE internal parameters)
sge_o_shell:                /sbin/nologin   (SGE internal parameters)
sge_o_workdir:              /var/tmp  (SGE internal parameters)
sge_o_host:                 ce06 (submitting host name)
account:                    sge  (SGE internal parameters)
mail_list:                  biomed015@ce06.ncg.ingrid.pt   (SGE internal parameters)
notify:                     FALSE (email job notifier ON / OFF) -> at INCD this feature is not available
job_name:                   cream_940554311 (name of job: job submission script) 
jobshare:                   0 (jobSHARE: Equal to everyone)
hard_queue_list:            hpc (queue name)
shell_list:                 NONE:/bin/bash 
env_list:                   ..... (very long list). ....
script_file:                /tmp/cream_940554311 (SGE internal parameters)
project:                    BiomedGrid 
binding:                    NONE 
job_type:                   NONE 
scheduling info:            queue instance "csyslip@wn216.ncg.ingrid.pt" dropped because it is temporarily not available 

Explanation

job_number:                 job ID number
exec_file:                  script needed to run the job; created by SGE, no action from the user required
submission_time:            job submission time
owner:                      username
uid:                        user ID
group:                      user group
gid:                        user group ID
sge_o_home:                 home directory of the user
sge_o_log_name:             SGE internal parameters
sge_o_path:                 SGE internal parameters
sge_o_shell:                SGE internal parameters
sge_o_workdir:              SGE internal parameters
sge_o_host:                 hostname from which the job was submitted
account:                    SGE internal parameters
mail_list:                  SGE internal parameters
notify:                     email notification when the job finishes, ON/OFF (this feature is not available at INCD)
job_name:                   name of the job (the job submission script)
jobshare:                   job share (equal for everyone)
hard_queue_list:            queue to which the job was submitted
shell_list:                 shell used (taken from the user account information)
env_list:                   FULL list of environment variables
script_file:                path of the submitted script (SGE internal parameters)
project:                    project to which the user is assigned (SGE internal parameters)
binding:                    not used at INCD
job_type:                   not used at INCD
scheduling info:            long list of resources available to SGE; note that SGE first looks at ALL available resources and only later checks whether the user is entitled to run there
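
Since the full qstat -j output is long, it is often convenient to filter it down to the fields of interest (a simple sketch using grep):

qstat -j 190407 | grep -E 'job_number|submission_time|hard_queue_list|scheduling info'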

How to submit interactive jobs

INCD HPC clusters(*) allow interactive jobs

  1. Login to your usual login machine
  2. Type the following commands:
$ qlogin
JSV "/opt/ge-tools/submit/jsv.sh" has been started
JSV "/opt/ge-tools/submit/jsv.sh" has been stopped
Your job 977402 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 977402 has been successfully scheduled.
Establishing /opt/sge/util/resources/wrappers/qlogin_wrapper session to host hpc017.ncg.ingrid.pt ...
Warning: Permanently added '[hpc017.ncg.ingrid.pt]:58283,[10.193.5.17]:58283' (RSA) to the list of known hosts.
[username@hpc017 ~]$ 
[username@hpc017 ~]$ hostname
hpc017.ncg.ingrid.pt
  3. You can now run your application

(*) only available at the INCD-Lisbon cluster
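
qlogin accepts the same resource options as qsub, so an interactive session can request a specific queue or parallel environment (a sketch; adjust the queue and PE names to what is available on the cluster):

qlogin -q hpc -pe mpi 4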

Why short jobs are better than long jobs

There are many reasons why short jobs have an advantage over long jobs and, in particular, why really long jobs should be broken up into smaller jobs when possible: