Adastra (CINES)

The Adastra cluster is located at CINES (France). Each node contains 4 AMD MI250X GPUs, each with 2 Graphics Compute Dies (GCDs) for a total of 8 GCDs per node. You can think of the 8 GCDs as 8 separate GPUs, each having 64 GB of high-bandwidth memory (HBM2E).

Introduction

If you are new to this system, please see the following resources:

Adastra user guide
Batch system: Slurm
Production directories:
- $SHAREDSCRATCHDIR: meant for short-term data storage, shared with all members of a project, purged every 30 days (17.6 TB default quota)
- $SCRATCHDIR: meant for short-term data storage, single user, purged every 30 days
- $SHAREDWORKDIR: meant for mid-term data storage, shared with all members of a project, never purged (4.76 TB default quota)
- $WORKDIR: meant for mid-term data storage, single user, never purged
- $STORE: meant for long-term storage, single user, never purged, backed up
- $SHAREDHOMEDIR: meant for scripts and tools, shared with all members of a project, never purged, backed up
- $HOME: meant for scripts and tools, single user, never purged, backed up
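
As a quick orientation, the following sketch prints where each of these variables points in your current session. It assumes only that the variables listed above are exported in your login environment; the helper name show_storage_dirs is hypothetical and not part of the system or of WarpX.

```shell
# Hypothetical helper: print each storage variable from the list above
# together with the directory it points to in the current session.
show_storage_dirs() {
    for name in SHAREDSCRATCHDIR SCRATCHDIR SHAREDWORKDIR WORKDIR \
                STORE SHAREDHOMEDIR HOME; do
        # printenv exits non-zero when the variable is not exported
        if value=$(printenv "$name"); then
            printf '%-17s %s\n' "$name" "$value"
        else
            printf '%-17s (not set)\n' "$name"
        fi
    done
}

show_storage_dirs
```

A variable reported as "(not set)" is simply not exported in the shell you ran this from, for example outside of a regular login session on the cluster.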

Installation

Use the following command to download the WarpX source code:

git clone https://github.com/ECP-WarpX/WarpX.git $HOME/src/warpx

We use the following modules and environments on the system ($HOME/adastra_warpx.profile).

Listing 1 You can copy this file from Tools/machines/adastra-cines/adastra_warpx.profile.example.

# please set your project account
#export proj=your_project_id

# required dependencies
module load craype-accel-amd-gfx90a craype-x86-trento
module load PrgEnv-cray
module load amd-mixed/5.2.3
module load CPE-22.11-cce-15.0.0-softs

# optional: for PSATD in RZ geometry support
module load cray-libsci_acc/22.12.2.1

# optional: for QED lookup table generation support
module load boost/1.80.0-mpi-python3

# optional: for Python bindings or libEnsemble
module load cray-python/3.9.13.1

# fix system defaults: do not escape $ with a \ on tab completion
shopt -s direxpand

# make output group-readable by default
umask 0027

# an alias to request an interactive batch node for one hour
# for parallel execution, start on the batch node: srun <command>
alias getNode="salloc --account=$proj --job-name=warpx --constraint=MI250 --nodes=1 --ntasks-per-node=8 --cpus-per-task=8 --gpus-per-node=8 --threads-per-core=1 --exclusive --time=01:00:00"
# note: to access a compute node, you first need its name (look at the `NODELIST` column)
#    $ squeue -u $USER
# and then to ssh into the node:
#    $ ssh node_name

# GPU-aware MPI
export MPICH_GPU_SUPPORT_ENABLED=1

# optimize ROCm/HIP compilation for MI250X
export AMREX_AMD_ARCH=gfx90a

# compiler environment hints
export CC=$(which cc)
export CXX=$(which CC)
export FC=$(which ftn)
export CFLAGS="-I${ROCM_PATH}/include"
export CXXFLAGS="-I${ROCM_PATH}/include -Wno-pass-failed"
export LDFLAGS="-L${ROCM_PATH}/lib -lamdhip64"

We recommend storing the above lines in a file, such as $HOME/adastra_warpx.profile, and loading it into your shell after each login:

source $HOME/adastra_warpx.profile
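
To confirm that the profile took effect, one option is to test two of the variables it exports. The sketch below is only an illustration: the helper name check_warpx_env is hypothetical, and it checks nothing beyond MPICH_GPU_SUPPORT_ENABLED and AMREX_AMD_ARCH as set in the listing above.

```shell
# Hypothetical helper: confirm that two settings exported by the profile
# above are active in the current shell. Returns 0 only if both match.
check_warpx_env() {
    status=0
    if [ "${MPICH_GPU_SUPPORT_ENABLED:-}" != "1" ]; then
        echo "MPICH_GPU_SUPPORT_ENABLED is not set to 1" >&2
        status=1
    fi
    if [ "${AMREX_AMD_ARCH:-}" != "gfx90a" ]; then
        echo "AMREX_AMD_ARCH is not set to gfx90a" >&2
        status=1
    fi
    return "$status"
}

if check_warpx_env; then
    echo "profile settings found"
else
    echo "profile settings missing: source \$HOME/adastra_warpx.profile"
fi
```

This does not verify the loaded modules; module list shows those.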

Since Adastra does not yet provide modules for c-blosc and ADIOS2, install them yourself:

export CMAKE_PREFIX_PATH=${HOME}/sw/adastra/gpu/c-blosc-1.21.1:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=${HOME}/sw/adastra/gpu/adios2-2.8.3:$CMAKE_PREFIX_PATH

# c-blosc (I/O compression)
git clone -b v1.21.1 https://github.com/Blosc/c-blosc.git src/c-blosc
rm -rf src/c-blosc-pm-build
cmake -S src/c-blosc -B src/c-blosc-pm-build -DBUILD_TESTS=OFF -DBUILD_BENCHMARKS=OFF -DDEACTIVATE_AVX2=OFF -DCMAKE_INSTALL_PREFIX=${HOME}/sw/adastra/gpu/c-blosc-1.21.1
cmake --build src/c-blosc-pm-build --target install --parallel 16

# ADIOS2
git clone -b v2.8.3 https://github.com/ornladios/ADIOS2.git src/adios2
rm -rf src/adios2-pm-build
cmake -S src/adios2 -B src/adios2-pm-build -DADIOS2_USE_Blosc=ON -DADIOS2_USE_Fortran=OFF -DADIOS2_USE_Python=OFF -DADIOS2_USE_ZeroMQ=OFF -DCMAKE_INSTALL_PREFIX=${HOME}/sw/adastra/gpu/adios2-2.8.3
cmake --build src/adios2-pm-build --target install -j 16

Then, cd into the directory $HOME/src/warpx and use the following commands to compile:

cd $HOME/src/warpx
rm -rf build

cmake -S . -B build -DWarpX_COMPUTE=HIP
cmake --build build -j 32

The general cmake compile-time options apply as usual.

That’s it! A 3D WarpX executable is now in build/bin/ and can be run with a 3D example inputs file. Most people execute the binary directly or copy it out to a location in $WORKDIR or $SCRATCHDIR.
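
As an illustration of the general compile-time options mentioned above, the following sketch configures a 2D executable in a separate build directory. It rests on several assumptions: that -DWarpX_DIMS is available in your WarpX version (please check the WarpX build documentation), that the sources live in $HOME/src/warpx as above, and that the hypothetical directory name build_2d suits you. The helper name build_warpx_2d is hypothetical as well.

```shell
# Hypothetical helper: configure and build a 2D executable in a separate
# build directory, so that the 3D build in build/ stays untouched.
# Assumption: -DWarpX_DIMS is one of the general WarpX CMake options.
build_warpx_2d() {
    src="${HOME}/src/warpx"
    # collect the arguments once, so the echoed and executed commands match
    set -- -S "${src}" -B "${src}/build_2d" -DWarpX_COMPUTE=HIP -DWarpX_DIMS=2
    echo "cmake $*"
    # only run when the source tree and cmake are actually available
    if [ -d "${src}" ] && command -v cmake >/dev/null 2>&1; then
        cmake "$@" && cmake --build "${src}/build_2d" -j 32
    fi
}

build_warpx_2d
```

Keeping one build directory per configuration avoids reconfiguring the 3D build each time an option changes.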

Running

MI250X GPUs (2x64 GB)

For non-interactive runs:

Listing 2 You can copy this file from Tools/machines/adastra-cines/submit.sh.

#!/bin/bash
#SBATCH --job-name=warpx
#SBATCH --account=<account_to_charge>
#SBATCH --constraint=MI250
#SBATCH --ntasks-per-node=8 --cpus-per-task=8 --gpus-per-node=8
#SBATCH --threads-per-core=1 # --hint=nomultithread
#SBATCH --exclusive
#SBATCH --output=%x-%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes=2

module purge

# Architecture
module load craype-accel-amd-gfx90a craype-x86-trento
# A compiler to target the architecture
module load PrgEnv-cray
# Some architecture related libraries and tools
module load amd-mixed

export MPICH_GPU_SUPPORT_ENABLED=1

# note
# this environment setting is currently needed to work-around a
# known issue with Libfabric
#export FI_MR_CACHE_MAX_COUNT=0  # libfabric disable caching
# or, less invasive:
export FI_MR_CACHE_MONITOR=memhooks  # alternative cache monitor

# note
# this environment setting is needed to prevent rocFFT from writing a cache in
# the home directory, which does not scale.
export ROCFFT_RTC_CACHE_PATH=/dev/null

export OMP_NUM_THREADS=1
export WARPX_NMPI_PER_NODE=8
export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${WARPX_NMPI_PER_NODE} ))
srun -N${SLURM_JOB_NUM_NODES} -n${TOTAL_NMPI} --ntasks-per-node=${WARPX_NMPI_PER_NODE} \
    ./warpx inputs > output.txt
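
The rank count passed to srun in the script above is the number of nodes times the number of MPI ranks per node. The following sketch reproduces that arithmetic outside of a job; the helper name total_mpi_ranks is hypothetical, and the fallback of 2 nodes is an assumption taken from the --nodes=2 line of the script.

```shell
# Hypothetical helper: total number of MPI ranks for a job,
# computed as <nodes> * <ranks per node>, as in the job script above.
total_mpi_ranks() {
    echo $(( $1 * $2 ))
}

# Outside of a Slurm job, SLURM_JOB_NUM_NODES is unset; fall back to the
# two nodes requested by `#SBATCH --nodes=2`.
nodes="${SLURM_JOB_NUM_NODES:-2}"
ranks_per_node=8   # matches --ntasks-per-node=8 and --gpus-per-node=8
echo "srun would start $(total_mpi_ranks "$nodes" "$ranks_per_node") ranks on ${nodes} node(s)"
```

With the settings of the script, each rank is paired with one of the 8 GCDs of a node. The job script itself is submitted with sbatch submit.sh.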

Post-Processing

Note

TODO: Document any Jupyter or data services.

Known System Issues

Warning

May 16th, 2022: There is a caching bug in Libfabric that causes WarpX simulations to occasionally hang on more than 1 node.

As a work-around, please export the following environment variable in your job scripts until the issue is fixed:

#export FI_MR_CACHE_MAX_COUNT=0  # libfabric disable caching
# or, less invasive:
export FI_MR_CACHE_MONITOR=memhooks  # alternative cache monitor

Warning

Sep 2nd, 2022: rocFFT in ROCm 5.1+ tries to write to a cache in the home area by default. This does not scale; disable it via:

export ROCFFT_RTC_CACHE_PATH=/dev/null

Warning

January, 2023: We discovered a regression in AMD ROCm, leading to 2x slower current deposition (and other slowdowns) in ROCm 5.3 and 5.4. This was reported to AMD and is fixed for the next release of ROCm.

Stay with the ROCm 5.2 module to avoid this regression.
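
If you want to guard a job or build script against accidentally picking up an affected ROCm version, a version string can be tested as sketched below. The helper name rocm_has_deposition_regression is hypothetical, and how you obtain the version string of the loaded module is left open here; the loop only uses sample strings.

```shell
# Hypothetical helper: return success (0) if the given ROCm version string
# falls in the 5.3 or 5.4 series named in the warning above.
rocm_has_deposition_regression() {
    case "$1" in
        5.3|5.3.*|5.4|5.4.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample version strings; 5.2.3 is the one pinned in the profile
# (amd-mixed/5.2.3).
for version in 5.2.3 5.3.0 5.4.3; do
    if rocm_has_deposition_regression "$version"; then
        echo "ROCm $version: affected, prefer the 5.2 module"
    else
        echo "ROCm $version: not in the affected series"
    fi
done
```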