2. Brief technical characteristics
The GLiCID infrastructure will eventually make several clusters available within the same Slurm cluster federation. The Nautilus cluster is the first one available. The Waves cluster, the former CCIPL cluster, will gradually migrate into the GLiCID infrastructure.
2.1. Nautilus cluster
As of December 15, 2023, the GLiCID “Nautilus” cluster offers the following resources:
Cores                                 | 4992
GPUs                                  | 24
Nodes                                 | 56
Total RAM                             | 22 TB
/home storage (NFS), per user         | 3 TB
/scratch storage (GPFS)               | 428 TB
Research data storage (Ceph)          | 2.8 PB raw
Very fast volatile storage (NVMe SSD) | 43 TB raw
HPC interconnect                      | InfiniBand HDR/EDR, 100 Gb/s
Main operating system                 | Red Hat Enterprise Linux 8.7
Batch system                          | Slurm 22.05.9
3. Login Nodes
There are two types of login nodes:
- The guix-devel-00[1-3] login nodes, which give access to the GLiCID clusters from virtual machines running the Guix System OS. These virtual machines have very limited resources (8 vCPUs and 8 GB of RAM) and must not be used for compilation or pre-/post-processing.
- The nautilus-devel-00[1-2] login nodes, which also give access to the GLiCID clusters and run Red Hat Enterprise Linux 8.7. These nodes can be used to launch computations, but also compilations or short pre-/post-processing tasks. A connection sketch for both node types is shown at the end of this section.
Technical characteristics of nautilus-devel-00[1-2]:
- 2x AMD EPYC Genoa 9474F 48-core
- 384 GB of DDR5 memory @ 4800 MT/s
- 2x 960 GB SSD
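A minimal connection sketch, assuming SSH access is already configured for your account (the hostnames and username below are illustrative, not authoritative):

# Guix login node: light interactive work only (8 vCPUs / 8 GB RAM)
$ ssh john-doe@guix-devel-001

# Nautilus login node: compilation and short pre-/post-processing allowed
$ ssh john-doe@nautilus-devel-001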
4. Resource reservation
4.1. Partition
The squeue command lets you see the status of all jobs on the cluster; the glicid_squeue command shows only your own jobs rather than the global list.
$ glicid_squeue
result:
CLUSTER: nautilus
  JOBID PARTITION  NAME   USER         ST  TIME     NODES  CPUS  QOS     PRIORITY  FEATURES  NODELIST(REASON)
3587556 all        bdscf  john-doe@un  R   6:30:51  1      96    medium  48164     (null)    cnode313

CLUSTER: waves
  JOBID PARTITION  NAME   USER         ST  TIME     NODES  CPUS  QOS     PRIORITY  FEATURES  NODELIST(REASON)
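If you prefer the standard Slurm tooling, squeue itself can be narrowed down with its usual options; a short sketch using standard Slurm flags (not GLiCID-specific):

$ squeue --me                       # only your own jobs
$ squeue -u john-doe -M nautilus    # jobs of user john-doe, nautilus cluster only

The -M/--clusters option is what restricts the output to a single member of the federation.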
4.2. QOS
The Slurm configuration for resource sharing is not yet stabilized. The glicid_qos command lets you view the restrictions currently applied.
$ glicid_qos
result:
      Name   Priority     MaxWall                Flags MaxJobsPU MaxJobsAccruePU            MaxTRESPU
---------- ---------- ----------- -------------------- --------- --------------- --------------------
    normal          1    00:05:00          DenyOnLimit        12               0   cpu=500,gres/gpu=2
     short         50  1-00:00:00          DenyOnLimit         8              10   cpu=500,gres/gpu=2
    medium         40  3-00:00:00          DenyOnLimit         8              10   cpu=500,gres/gpu=2
      long         30  8-00:00:00          DenyOnLimit         4              10   cpu=500,gres/gpu=2
 unlimited         10                      DenyOnLimit         1
     debug        100    00:20:00          DenyOnLimit         2               5   cpu=500,gres/gpu=2 (1)
  priority        200  8-00:00:00          DenyOnLimit
(1) For example, the debug QOS is limited to 20 minutes, has a high priority compared to long, allows a user to run at most 2 jobs at the same time, and restricts each user to 500 cores and 2 GPUs at once.
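As an illustration, a minimal submission script using one of these QOS levels (a sketch; the job name, task count, and program are placeholders, only the QOS names come from the table above):

#!/bin/bash
#SBATCH --job-name=qos_test     # placeholder name
#SBATCH --qos=debug             # 20 min MaxWall, at most 2 such jobs per user
#SBATCH --time=00:15:00         # must stay within the QOS MaxWall
#SBATCH --ntasks=4

srun hostname                   # replace with your actual program

Because the QOS carry the DenyOnLimit flag, a submission that exceeds these limits is rejected rather than queued.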
4.3. Constraint
To make it easy to launch jobs from any front end, we have implemented constraints corresponding to particular cluster and node configurations. These constraints let you target the desired nodes, especially when you are not on the target cluster. To use them, add the Slurm --constraint option, for example --constraint=cpu_amd.
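For example, a batch script targeting AMD CPU nodes from any front end (a sketch; cpu_amd is the constraint named above, while the job name, resources, and program are placeholders):

#!/bin/bash
#SBATCH --job-name=amd_job      # placeholder name
#SBATCH --constraint=cpu_amd    # only run on nodes exposing the cpu_amd feature
#SBATCH --ntasks=96
#SBATCH --time=01:00:00

srun ./my_program               # placeholder executable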