GLiCID: a multi-cluster mesocenter
Rapidly evolving documentation (end of June 2025)
Additional computing resources are now available in GLiCID through a second cluster, named Waves after the historic CCIPL cluster. From GLiCID, two computing clusters can now be accessed simultaneously, which slightly changes the way jobs are launched. To limit the impact of this change, access to the Waves cluster is currently restricted to members of a test group. We are taking advantage of this pre-production period to evaluate and fine-tune a few settings. To access these new machines, please read this documentation all the way through.
Nautilus and Waves
Two computing clusters are now available:
- Nautilus, installed in 2023 in the ECN machine room, the first phase of the GLiCID supercomputer.
- Waves, installed in the Nantes Université datacenter. More than 50 machines from the former CCIPL and BiRD computing clusters are already available, and this number will continue to increase progressively.
These two clusters can be accessed equally from all front-ends and offer the same software, storage spaces, partitions, QoS and user credentials. Jobs can therefore be launched on the most suitable machines, depending on their computational constraints.
- To specify a given cluster, use the --cluster parameter in Slurm commands. For example, with the sinfo command:
  - sinfo --cluster=waves limits the command to the Waves cluster
  - sinfo --cluster=nautilus limits the command to the Nautilus cluster
  - sinfo --cluster=all considers all available clusters
- If you do not specify which cluster to consider, the home cluster is used:
  - On nautilus-devel machines, the default is the Nautilus cluster.
  - On guix-devel machines, the default is the Waves cluster.
All front-ends access both clusters. By default, Slurm commands only target the front-end's home cluster. To make life easier for users, each Slurm command is mirrored by an equivalent command prefixed with glicid_, which can be used to target both clusters.
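For illustration, here is a hedged sketch of these cluster-aware commands in practice; my_job.sh is a placeholder script name, and glicid_sinfo is only an assumed example of the glicid_-prefixed wrappers mentioned above:

    # Inspect partitions on one cluster, or on both at once
    sinfo --cluster=waves
    sinfo --cluster=nautilus
    sinfo --cluster=all

    # The same selection works with other Slurm commands, e.g. listing your jobs
    squeue --cluster=all --me

    # Submit a job explicitly to the Waves cluster (my_job.sh is a placeholder)
    sbatch --cluster=waves my_job.sh

    # Assumed example of a glicid_-prefixed wrapper targeting both clusters
    glicid_sinfo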
Choice of front-ends
- All front-ends access both clusters equally, but which front-end to use depends on how jobs are developed and launched:
  - If development is done in a RedHat environment with modules, it is preferable to use the nautilus-devel machines (even if jobs are launched on Waves).
  - If development is carried out in a Guix environment, it is preferable to use the guix-devel front-ends (whether jobs are launched on Nautilus or Waves).
- Back-up machines are available when the main nautilus-devel front-ends are unavailable: they can be reached by configuring devel-rh.waves.intra.glicid.fr in your SSH configuration (see the main documentation and the sketch below).
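As a minimal sketch, an ~/.ssh/config entry for reaching this back-up front-end; the bastion host name and login shown here are placeholders, assuming the proxy-jump setup described in the main documentation:

    # Placeholder bastion entry; use the bastion described in the main documentation
    Host glicid-bastion
        HostName bastion.glicid.example       # assumption: replace with the real bastion host
        User your-glicid-login

    # Back-up RedHat front-end hosted on the Waves site
    Host devel-rh-waves
        HostName devel-rh.waves.intra.glicid.fr
        User your-glicid-login
        ProxyJump glicid-bastion              # internal host, reached through the bastion

You can then simply run: ssh devel-rh-waves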
Differences between Waves and Nautilus
Both clusters have been configured so that jobs can run transparently on either of them, but there are some differences. When launching a job, give as many details and constraints as possible in order to control where it will be allocated; otherwise Slurm will place the job wherever space is available. Using constraints to tell Slurm what the job needs therefore helps it to be allocated in the best possible way. The list of available constraints is given at https://doc.glicid.fr/GLiCID-PUBLIC/detailled/slurm/constraint_slurm.html.
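As an illustration only, a minimal sbatch sketch using a constraint; the feature names shown (zen2, zen3) and the executable are assumptions, the real constraint names are those listed at the URL above:

    #!/bin/bash
    #SBATCH --job-name=constraint-demo
    #SBATCH --time=01:00:00
    #SBATCH --ntasks=32
    #SBATCH --constraint=zen3          # hypothetical feature name; see the constraint list above
    # Constraints can be combined, e.g. --constraint="zen2|zen3" to accept either architecture

    srun ./my_solver                   # placeholder executable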
CPU and memory
- Nautilus is a fairly homogeneous cluster: cnode nodes with identical AMD processors and an InfiniBand network.
- Waves is a more heterogeneous cluster (cloudbreak, cribbar and budbud nodes) with AMD or Intel processors and a different network (RoCE).
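To see these differences from the command line, here is a sinfo sketch listing each node with its CPU count, memory (in MB) and advertised features, per cluster (the format flags are standard Slurm; the feature names printed are site-specific):

    sinfo --cluster=nautilus -N -o "%N %c %m %f"
    sinfo --cluster=waves    -N -o "%N %c %m %f"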
Machine families
- Cribbar machines have 2 × 10-core Intel 4210 processors.
- Cloudbreak machines have 2 × 16-core AMD Zen2 processors.
- Budbud machines have 2 × 16-core AMD Zen3 processors and are equipped with 2 × 40 GB NVIDIA A100 GPUs.
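A minimal sketch for requesting one of these A100 GPUs; the resource sizes are arbitrary examples and the GRES specification may need a site-specific type name, so check the main documentation:

    #!/bin/bash
    #SBATCH --job-name=gpu-demo
    #SBATCH --time=02:00:00
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=8
    #SBATCH --gres=gpu:1          # one A100; add the site-specific GRES type name if required

    srun ./my_gpu_code            # placeholder executable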
Hyperthreading
Waves machines currently have hyperthreading enabled, but Slurm is configured not to use it by default, as hyperthreading is often not beneficial in an HPC setting. To enable it for your jobs, you can:
- use the --hint=multithread option
- or --threads-per-core=2
Please test the effectiveness of this option on your code and give us feedback, so that we can determine whether keeping it available is worthwhile.
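As a sketch, the two options above in a batch script (the resource sizes are arbitrary examples and my_code is a placeholder):

    #!/bin/bash
    #SBATCH --job-name=ht-test
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=32
    #SBATCH --hint=multithread         # use both hardware threads of each core
    # ... or, alternatively:
    ##SBATCH --threads-per-core=2      # commented out: use one option or the other

    srun ./my_code                     # compare timings with and without hyperthreading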
Resource heterogeneity management
The combined resources of Nautilus and Waves differ in several respects:
- amount of available memory
- CPU architecture
- network type
Judicious use of constraints lets you narrow down the target machines. Conversely, especially during development, it is sometimes useful to specify as few constraints as possible, so as to target the widest possible range of available nodes and help the job start quickly.
Here are a few examples:
Coming soon.
Preferential scratch space
The two clusters are installed in different machine rooms, which raises questions of bandwidth and network technology. This is why there are two separate scratch spaces: /scratch/nautilus and /scratch/waves.
- If the calculation starts on Nautilus, it is generally preferable to use /scratch/nautilus, as a very fast network connects the cnodes to this scratch, guaranteeing low latency and good throughput. /scratch/waves is also accessible under good conditions from Nautilus and has a large processing capacity; depending on the situation (e.g. massively parallel multi-node calculations), it may be beneficial to use it from Nautilus (to be tested for each calculation).
- If the calculation starts on Waves, it is generally preferable to use /scratch/waves, which is optimal in these conditions. /scratch/nautilus is also available, but does not benefit from the same access conditions as from the cnode machines (non-native network and lower bandwidth). These penalties may nevertheless be negligible, depending on the calculation (to be tested).
In the case of opportunistic use of available nodes (Waves or Nautilus, depending on availability), test which scratch space is preferable, depending on the type of calculation and how often each cluster is used.
Regardless of the scratch space used, move the resulting data out once the calculation campaign is complete (see the sketch below).
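As a hedged sketch, a job script that works in a per-job scratch directory and copies its results out at the end; the per-user layout under /scratch/waves and the destination under /LAB-DATA are assumptions to adapt to your project:

    #!/bin/bash
    #SBATCH --job-name=scratch-demo
    #SBATCH --time=04:00:00
    #SBATCH --ntasks=32

    # Assumed per-user layout; adapt to your project's conventions
    WORKDIR=/scratch/waves/$USER/$SLURM_JOB_ID
    RESULTS=/LAB-DATA/your-lab/your-project/results     # placeholder destination

    mkdir -p "$WORKDIR" "$RESULTS"
    cd "$WORKDIR"

    srun ./my_code                                      # placeholder executable writing to ./output

    # Scratch spaces are not for long-term storage: move the results out when the campaign is over
    cp -r output "$RESULTS"/
    rm -rf "$WORKDIR"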
As /scratch/waves is currently limited in space, please pay particular attention to its use. The available space will be gradually extended.
The other areas (/home and /LAB-DATA) are accessible under the same conditions from both clusters.
Migrating files from Waves to GLiCID
This part only concerns users of the old Waves cluster: the new storage spaces are not available from the old cluster, so you must transfer your data using the following procedure (to be completed):
TODO
Access Waves
- You must have a valid GLiCID account (see docs.glicid.fr).
- Send an e-mail to tech@glicid.fr to sign up for the beta test.