
   Hardware and Software

1. Equipment

2. Brief technical characteristics

The GLiCID infrastructure will eventually make several clusters available within the same Slurm cluster federation. The Nautilus cluster is the first one available. The Waves cluster, which is the former CCIPL cluster, will gradually migrate into the GLiCID infrastructure.

2.1. Nautilus cluster

As of December 15, 2023, the GLiCID “Nautilus” cluster has, overall:

Cores                                      4992
GPUs                                       24
Nodes                                      56
Total RAM                                  22 TB
/home storage (NFS), per user              3 TB
/scratch storage (GPFS)                    428 TB
Research data storage (Ceph)               2.8 PB raw
Very fast volatile storage (NVMe SSD)      43 TB raw
HPC interconnect network                   InfiniBand HDR/EDR 100 Gb/s
Main operating system                      Red Hat Enterprise Linux 8.7
Batch system                               Slurm 22.05.9

2.2. Waves cluster

The migration of the CCIPL cluster "Waves" into the GLiCID infrastructure is in progress.

3. Login Nodes

There are 2 types of login nodes.

  • The guix-devel-00[1-3] login nodes, which give access to the GLiCID clusters from virtual machines running the "guix-system" OS.

These virtual machines have very limited resources (8 vCPUs and 8 GB of RAM) and should not be used to run compilations or pre-/post-processing.
  • The nautilus-devel-00[1-2] login nodes, which also give access to the GLiCID clusters and run Red Hat Enterprise Linux 8.7. These nodes can be used to launch calculations, but also to run compilations or short pre-/post-processing tasks.
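
As a minimal sketch, connecting to one of these login nodes might look like the following; the exact hostname or SSH gateway to use is an assumption here and should be checked against the GLiCID access documentation:

    $ # hostname below is illustrative only; check the GLiCID access documentation for the actual address
    $ ssh <username>@nautilus-devel-001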

Technical characteristics of nautilus-devel-00[1-2]:

  • 2x AMD EPYC Genoa 9474F 48-Core

  • 384 GB of DDR5 memory @ 4800 MT/s

  • 2x 960 GB SSD

4. Resource reservation

4.1. Partition

The squeue command lets you see the status of jobs on the cluster.

The glicid_squeue command shows only your own jobs rather than the global list:
    $ glicid_squeue

Result:

CLUSTER: nautilus
               JOBID PARTITION                           NAME         USER ST         TIME  NODES CPUS QOS        PRIORITY   FEATURES NODELIST(REASON)
             3587556       all                          bdscf john-doe@un  R      6:30:51      1   96 medium        48164     (null) cnode313

CLUSTER: waves
               JOBID PARTITION                           NAME         USER ST         TIME  NODES CPUS QOS        PRIORITY   FEATURES NODELIST(REASON)
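
To see which partitions exist and the state of their nodes, the standard Slurm sinfo command can also be used. This is a generic example; partition names such as all in the output above are specific to GLiCID:

    $ # standard Slurm command: lists available partitions, their time limits and node states
    $ sinfo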

4.2. QOS

The Slurm configuration for resource sharing is not yet stabilized.

The glicid_qos command lets you view the restrictions that are applied:
    $ glicid_qos

Result:

   Name   Priority     MaxWall                Flags MaxJobsPU MaxJobsAccruePU            MaxTRESPU
---------- ---------- ----------- -------------------- --------- --------------- --------------------
    normal          1    00:05:00          DenyOnLimit        12               0   cpu=500,gres/gpu=2
     short         50  1-00:00:00          DenyOnLimit         8              10   cpu=500,gres/gpu=2
    medium         40  3-00:00:00          DenyOnLimit         8              10   cpu=500,gres/gpu=2
      long         30  8-00:00:00          DenyOnLimit         4              10   cpu=500,gres/gpu=2
 unlimited         10                      DenyOnLimit         1
     debug        100    00:20:00          DenyOnLimit         2               5   cpu=500,gres/gpu=2(1)
  priority        200  8-00:00:00          DenyOnLimit
(1) For example, the debug QOS is limited to 20 minutes, has a higher priority than long, lets a user run at most 2 jobs on debug at the same time, and limits each user to 500 cores and 2 GPUs at a time.
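
As an illustrative sketch, a job can request one of these QOS levels with the standard Slurm --qos directive. The script name and resource numbers below are assumptions; only the QOS names and limits come from the glicid_qos output above:

    #!/bin/bash
    #SBATCH --job-name=qos_demo    # hypothetical job name
    #SBATCH --qos=debug            # QOS from the table above (MaxWall 20 minutes, high priority)
    #SBATCH --time=00:15:00        # must stay within the QOS MaxWall
    #SBATCH --ntasks=4             # illustrative request, within the cpu=500 per-user limit
    srun hostname

    $ sbatch qos_demo.sh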

4.3. Constraint

To make it easy to launch jobs from any front end, we have implemented constraints corresponding to particular cluster and node configurations. These constraints let you target the desired nodes, especially if you are not submitting from the target cluster. To use them, add the Slurm --constraint option, for example --constraint=cpu_amd.
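
For example, to target AMD CPU nodes regardless of which front end you submit from (the srun and sbatch usage below is standard Slurm; only the cpu_amd constraint name comes from this documentation):

    $ # run a single task on a node carrying the cpu_amd feature
    $ srun --constraint=cpu_amd --ntasks=1 hostname

or, in a batch script:

    #SBATCH --constraint=cpu_amd    # restrict the job to nodes tagged with the cpu_amd feature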