Slurm QOS and constraints
1. Using QOS
You have to specify a Quality of Service (QOS) for each job submitted to Slurm. The QOSs are defined in the Slurm database.
The Slurm configuration for resource sharing is not yet stabilized.
The glicid_qos command allows you to display the restrictions that are applied:
    $ glicid_qos

Result :
   Name   Priority     MaxWall                Flags MaxJobsPU MaxJobsAccruePU            MaxTRESPU
---------- ---------- ----------- -------------------- --------- --------------- --------------------
    normal          1    00:05:00          DenyOnLimit         1               0    cpu=12,gres/gpu=1
     short         50  1-00:00:00          DenyOnLimit        10              10   cpu=512,gres/gpu=2
    medium         40  3-00:00:00          DenyOnLimit        10              10   cpu=512,gres/gpu=2
      long         30  8-00:00:00          DenyOnLimit         5              10   cpu=512,gres/gpu=2
 unlimited         10                      DenyOnLimit         1
     debug        100    00:20:00          DenyOnLimit         2               5   cpu=512,gres/gpu=2
  priority        200  8-00:00:00          DenyOnLimit
     quick        100    03:00:00          DenyOnLimit        50              10   cpu=512,gres/gpu=4
      gpus         70    03:00:00          DenyOnLimit         1              10

For example, the debug QOS is limited to 20 minutes and has a higher priority than long; a user can run at most 2 jobs on debug at the same time and can use at most 512 cores and 2 GPUs at the same time.
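As an illustration, here is a minimal batch script sketch that requests the short QOS listed above (the job name and workload are placeholders to adapt):

    #!/bin/bash
    #SBATCH --job-name=qos_demo     # placeholder job name
    #SBATCH --qos=short             # one of the QOS listed above
    #SBATCH --time=02:00:00         # must stay within the QOS MaxWall (1-00:00:00 for short)
    #SBATCH --ntasks=4

    srun hostname                   # replace with your actual workload

The QOS can also be set or overridden at submission time, for example sbatch --qos=debug job.sh.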
1.1. Effects
The QOS associated with a job will affect the job in three key ways: scheduling priority, preemption, and resource limits. For more details, refer to the Slurm documentation.
2. Using constraints
To easily launch jobs from any front end, we have also implemented constraints. These constraints allow you to target the desired nodes, especially if you are not on the target cluster. To use them, add the Slurm --constraint=<constraint_name> option to your job submission.
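For example, assuming a batch script named job.sh (a placeholder name), you can target AMD nodes like this:

    # on the command line
    sbatch --constraint=cpu_amd job.sh

    # or inside the batch script itself
    #SBATCH --constraint=cpu_amd

Several constraints can be combined with & (AND) or | (OR); examples are given in the subsections below.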
Below is the list of Slurm constraints. It can evolve along with the hardware installed in the clusters.
To see which constraints are set on a specific node, you can query Slurm directly with the scontrol command.
Example: to list the constraints on gnode2 of the nautilus cluster:
scontrol --cluster=nautilus show node gnode2 |grep ActiveFeatures
Result :
ActiveFeatures=loc_ecn,cpu_amd,cpu_zen4,cpu_genoa,cpu_9474,net_ib,net_100g,gpu_a100,gpu_a100_80
2.1. Location of equipment
These constraints target the nodes hosted in a specific machine room. They are useful when the resources you need are mainly available at that location.
| constraint_name | meaning | 
| loc_maths | historic room of CCIPL | 
| loc_dc | Nantes University data center | 
| loc_ecn | machine room of École Centrale de Nantes | 
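For instance, to restrict a job to nodes hosted in the Nantes University data center, add the following line to your batch script (only the relevant directive is shown):

    #SBATCH --constraint=loc_dc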
2.1.1. CPU type
These constraints target either the processor brand or its characteristics.
2.1.1.2. Extensions
| constraint_name | meaning | 
| cpu_avx | avx extensions are required | 
| cpu_avx2 or cpu_avx256 | avx2 extensions are required | 
| cpu_avx512 | avx512 extensions are required | 
2.1.1.3. Generations
| constraint_name | meaning | 
| cpu_westmere, cpu_X5650 | Intel, Westmere generation | 
| cpu_broadwell, cpu_e5_2530v4 or cpu_e5_2640v4 | Intel, Broadwell generation | 
| cpu_skylake, cpu_silver_4114 | Intel, Skylake generation | 
| cpu_cascadelake, cpu_silver_4210 or cpu_silver_4210r or cpu_silver_4216 | Intel, Cascade Lake generation | 
| cpu_zen2, cpu_rome, cpu_7282 | AMD Zen2 | 
| cpu_zen3, cpu_milan, cpu_7213 | AMD Zen3 | 
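When several generations are acceptable, they can be combined with the | (OR) operator, for example:

    # accept either Skylake or Cascade Lake nodes (illustrative choice)
    #SBATCH --constraint="cpu_skylake|cpu_cascadelake"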
2.1.2. High-speed interconnect network
There are three types of network:
- Infiniband (IB)
- Omnipath (OPA)
- RoCE (Roce)

These three networks are incompatible with each other. For multi-node jobs, it is therefore important to target a specific network type. It is recommended to use a universal MPI implementation that can drive any of these networks efficiently (OpenMPI).
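As a sketch, a multi-node MPI job pinned to the Infiniband network could look like this (the module name and binary are assumptions to adapt to your environment):

    #!/bin/bash
    #SBATCH --nodes=2                  # multi-node job: stay on a single network type
    #SBATCH --ntasks-per-node=24
    #SBATCH --constraint=net_ib        # all allocated nodes will be on Infiniband
    #SBATCH --time=01:00:00

    module load openmpi                # assumed module name
    srun ./my_mpi_app                  # hypothetical MPI binary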
2.1.3. Network type
| constraint_name | meaning | 
| net_ib | Infiniband | 
| net_opa | Omnipath | 
| net_roce | RoCE | 
| net_dr | dual rail (whatever the technology) | 
2.1.4. Interconnect speed
| constraint_name | meaning | 
| net_25g | 25 Gbit/s (RoCE) | 
| net_40g | 40 Gbit/s (Infiniband QDR) | 
| net_50g | 50 Gbit/s (RoCE in dual-rail) | 
| net_100g | 100 Gbit/s (Omnipath 100 or RoCE 100) | 
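Network type and speed constraints can be combined with the & (AND) operator, for example to require dual-rail RoCE at 50 Gbit/s:

    #SBATCH --constraint="net_roce&net_50g"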
2.2. Waves constraints table
Table of constraints for the Waves nodes (some purely proprietary nodes are omitted). Where an "or" appears, only a subset of these nodes has the corresponding properties.
| machines | index | loc_ | cpu_ | net_ | gpu_ | hw_ | 
| chezine | 001-078 | maths | intel, westmere, x5650 | ib, 40g | | sgi | 
| nazare | 001-128 | dc | intel, broadwell, avx, avx2, e52630v4 | opa, opa100, 100g | | asus or dell | 
| cribbar | 001-100 | dc | intel, (skylake, silver4114) or (cascadelake, silver4210 or silver4210r), avx, avx2, avx512 | (opa, opa100, 100g) or (roce, (roce25, 25g) or (roce50, dr, 50g)) | | dell | 
| cloudbreak | 001-040 | dc | amd, rome or milan, zen2 or zen3, 7282 or 7353 | roce, (roce25, 25g) or (roce50, dr, 50g) | | dell or hpe | 
| budbud | 001-023 | dc or ecn | (intel, (broadwell, e52640v4) or (skylake, silver4114) or (cascadelake, silver4210)) or (amd, zen3, milan, 7313) | (opa, opa100, 100g) or (roce, (roce25, 25g) or (roce100, 100g)) | k40 or p100 or t4 or a40 or a100 | dell | 
2.3. Nautilus constraints table
| Type | Name | Cores per node | RAM per node | GPU | Constraints | 
| Standard | cnode[301-340] | 96 | 384 GB | None | loc_ecn cpu_amd cpu_zen4 cpu_genoa cpu_9474 net_ib net_100g mem_low | 
| BigMem | cnode[701-708] | 96 | 768 GB | None | loc_ecn cpu_amd cpu_zen4 cpu_genoa cpu_9474 net_ib net_100g mem_high | 
| GPU | gnode[1-4] | 96 | 768 GB | 4 * Nvidia Tesla A100 80G | loc_ecn cpu_amd cpu_zen4 cpu_genoa cpu_9474 net_ib net_100g gpu_a100 gpu_a100_80 | 
| Visualisation | visu[1-4] | 96 | 768 GB | 2 * Nvidia Tesla A40 48G | loc_ecn cpu_amd cpu_zen4 cpu_genoa cpu_9474 net_ib net_100g gpu_a40 |
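As a sketch, a GPU job targeting the Nautilus A100 nodes could be submitted as follows (the QOS, walltime, module and binary are illustrative assumptions):

    #!/bin/bash
    #SBATCH --qos=gpus                 # assumed QOS choice, see the QOS table above
    #SBATCH --gres=gpu:1               # request one GPU
    #SBATCH --constraint=gpu_a100      # land on gnode[1-4]
    #SBATCH --time=02:00:00

    module load cuda                   # assumed module name
    srun ./my_gpu_app                  # hypothetical GPU binary

If you submit from another front end, you can also add --clusters=nautilus on the sbatch command line, as with the scontrol example above.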