Concepts

General presentation

The Slurm Workload Manager is the software used to schedule compute resources between multiple users and multiple compute nodes. It lets users submit jobs, which are scheduled according to policies and priorities, without having to stay in front of a screen. Jobs wait in a priority queue until resources become available.

The priority of a job depends on many factors: the higher the priority, the sooner your job will start. Some of the priority factors taken into account on GLiCID are:

  • The QoS used

  • The wall time

  • The partition

  • The number of resources requested
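
The factors above are combined by Slurm into a single priority number per job. A minimal sketch of how to inspect this yourself, assuming the standard Slurm sprio command is installed on the login nodes (the guard lets the snippet degrade gracefully off-cluster):

```shell
# Show the priority breakdown Slurm computed for your pending jobs.
# sprio is a standard Slurm command; the exact columns shown depend
# on how the site weights each factor.
if command -v sprio >/dev/null 2>&1; then
    sprio -u "${USER:-$(whoami)}"
else
    echo "sprio not available (not on a Slurm cluster)"
fi
```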

In some rare cases, you might need to allocate a node for an interactive job. This is done with salloc, which allocates a job on a compute node. This is, however, not how we intend GLiCID to be used: while doable, we won’t provide directives on how to use salloc on GLiCID.

Mandatory options

When launching jobs on GLiCID, you MUST provide the following mandatory options. They are already present in the example scripts, but remember to add them when writing your own:

  1. --account=bla : Counts the hours used against the project bla. Specifying a wrong account results in the Slurm error: Invalid account or account/partition combination specified

  2. --time=00:05:00 : Specifies the maximum time your job may take. Omitting it results in the Slurm error: Time limit specification required, but not provided
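
The two options above can also be embedded directly in a batch script. A minimal sketch, assuming the placeholder project name bla from above (replace it with your own project account) and a trivial job body:

```shell
#!/bin/bash
# Minimal batch script carrying the two mandatory options.
# "bla" is the placeholder project name; replace it with your own.
#SBATCH --account=bla
#SBATCH --time=00:05:00

# The actual work of the job goes below the #SBATCH header lines
echo "Running on $(hostname)"
```

Submit it with sbatch; Slurm reads the #SBATCH lines as options, so they must appear before the first non-comment command of the script.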

The Slurm Queue

Because GLiCID is a shared resource, all jobs sit in a queue until the requested resources are available. For each job in the queue, you will see its status (the ST column in the glicid_squeue output). The two main statuses are R (Running) and PD (Pending), with a reason attached to each pending job. Here are common reasons why your job may be pending:

Table 1. Example pending reasons for a job

  Reason                    Explanation
  ------                    -----------
  Resources                 There are not enough available cores/nodes to satisfy your job
  Priority                  Your job is eligible to run, but other higher-priority jobs are ahead of it
  QOSMaxJobsPerUserLimit    You are already running too many jobs in parallel. See the QoS documentation to understand why
  QOSMaxGRESPerUser         You are already using too many GPUs for the QoS that you are using. See the QoS documentation to understand why
  Dependency                Your job is waiting for another one to complete. The dependency feature of Slurm is an advanced topic and is not detailed here
  JobArrayTaskLimit         Your job array already has the maximum number of running tasks
  BadConstraints            You specified constraints for your job that cannot be satisfied
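
To see these reasons for your own jobs, a minimal sketch using standard squeue format specifiers (%i = job id, %t = state, %r = reason); glicid_squeue is the GLiCID wrapper mentioned above, and plain squeue accepts the same options:

```shell
# List your own jobs with their state and pending reason.
# The guard lets the snippet degrade gracefully off-cluster.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "${USER:-$(whoami)}" --format="%.12i %.3t %r"
else
    echo "squeue not available (not on a Slurm cluster)"
fi
```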