Concepts
General presentation
The Slurm Workload Manager is software used to schedule compute resources between multiple users and multiple compute nodes. It allows users to launch jobs, which run according to site policies and priorities, without having to stay in front of a screen. Pending jobs are kept in a priority queue.
The priority of a job depends on many factors: the higher the priority, the sooner your job will start. Several priority factors are taken into account on GLiCID.
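If you want to see how these factors combine for your own jobs, Slurm's standard sprio utility prints the per-factor breakdown. The command itself is stock Slurm; whether GLiCID exposes it is an assumption here, not something the documentation above confirms:

    # Show the per-factor priority breakdown (age, fair-share, QoS, ...)
    # for your pending jobs. sprio is stock Slurm tooling; its
    # availability on GLiCID is assumed, not confirmed.
    sprio -u $USER -l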
In some rare cases, you might need to allocate a node for an interactive job. This is done with salloc, which allocates a job on a compute node.
This is, however, not the way we intend you to use GLiCID. While doable, we won’t provide directives on how to use salloc on GLiCID.
Mandatory options
When launching jobs on GLiCID, you MUST provide the following mandatory options. They are already present in the example scripts provided, but remember to add them whenever you create your own scripts (a minimal example script follows the list):
- --account=bla: This is used to count the hours on the project bla. Specifying a wrong one will result in the Slurm error: Invalid account or account/partition combination specified
- --time=00:05:00: This is used to specify the maximum time your job can take. Not specifying it will result in the Slurm error: Time limit specification required, but not provided
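As a minimal sketch, a job script carrying both mandatory options might look like this (bla stands in for your real project name, and the echo line is a placeholder workload):

    #!/bin/bash
    #SBATCH --account=bla        # mandatory: project whose hours are charged
    #SBATCH --time=00:05:00      # mandatory: maximum wall-clock time (HH:MM:SS)

    # Placeholder workload: replace with your actual commands.
    echo "Running on $(hostname)"

You would then submit it with sbatch, the standard Slurm submission command, e.g. sbatch myscript.slurm.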
The Slurm Queue
Because GLiCID is a shared resource, all jobs sit in a queue until the requested resources are available.
For each job in the queue, you will see its status (the ST column in the glicid_squeue output).
The two main statuses are R (Running) and PD (Pending), the latter with a reason attached. Here are common reasons why your job may be pending:
Reason | Meaning
Resources | There are not enough available cores/nodes to satisfy your job
Priority | Your job is eligible and has the resources to run, but another job with a higher priority is above yours in the queue
QOSMaxJobsPerUserLimit | You are already running too many jobs in parallel. See the QoS documentation to understand why
QOSMaxGRESPerUser | You are already using too many GPUs for the QoS that you are using. See the QoS documentation to understand why
Dependency | Your job is waiting for another one to complete. The dependency feature of Slurm is an advanced topic and is not detailed here
JobArrayTaskLimit | Your job already has the maximum number of running tasks per job array
BadConstraints | You specified wrong constraints for your job
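To check the state and pending reason of your own jobs, you can use glicid_squeue (the GLiCID wrapper mentioned above) or plain squeue. The flags below are stock Slurm, and JOBID is a placeholder, not a real job ID:

    # List your own jobs: the ST column shows the state (R, PD, ...),
    # and the NODELIST(REASON) column shows the pending reason.
    squeue -u $USER

    # Query the reason for one specific job (JOBID is a placeholder).
    squeue -j JOBID -O JobID,StateCompact,ReasonList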