Crusher (OLCF)¶
The Crusher cluster is located at OLCF. Each node contains 4 AMD MI250X GPUs, each with 2 Graphics Compute Dies (GCDs) for a total of 8 GCDs per node. You can think of the 8 GCDs as 8 separate GPUs, each having 64 GB of high-bandwidth memory (HBM2E).
Introduction¶
If you are new to this system, please see the following resources:
- Batch system: Slurm
- $PROJWORK/$proj/: shared with all members of a project, purged every 90 days (recommended)
- $MEMBERWORK/$proj/: single user, purged every 90 days (usually smaller quota)
- $WORLDWORK/$proj/: shared with all users, purged every 90 days

Note that the $HOME directory is mounted as read-only on compute nodes. That means you cannot run in your $HOME.
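As a minimal sketch of the guidance above, the following helper prints a run directory below the project work space instead of $HOME. The function name `warpx_run_dir` and the `runs/<name>` layout are illustrative assumptions, not part of the OLCF or WarpX setup.

```shell
# hypothetical helper: print a run directory below $PROJWORK/$proj
# usage: warpx_run_dir <run name>
warpx_run_dir() {
    name="${1:-}"
    # PROJWORK is provided by the system, proj is set in your profile
    if [ -z "$name" ] || [ -z "${PROJWORK:-}" ] || [ -z "${proj:-}" ]; then
        echo "usage: warpx_run_dir <run name> (PROJWORK and proj must be set)" >&2
        return 1
    fi
    echo "${PROJWORK}/${proj}/runs/${name}"
}

# example (on Crusher): create the directory and change into it
# mkdir -p "$(warpx_run_dir test01)" && cd "$(warpx_run_dir test01)"
```

Keeping inputs and outputs under such a directory avoids writing to the read-only $HOME from compute nodes. Remember that these directories are purged every 90 days.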
Installation¶
Use the following command to download the WarpX source code:
git clone https://github.com/ECP-WarpX/WarpX.git $HOME/src/warpx
We use the following modules and environments on the system ($HOME/crusher_warpx.profile).
# please set your project account
# note: WarpX ECP members use aph114_crusher
#export proj=<yourProject>
# required dependencies
module load cpe/22.08
module load cmake/3.23.2
module load craype-accel-amd-gfx90a
module load rocm/5.2.0
module load cray-mpich
module load cce/14.0.2 # must be loaded after rocm
# optional: faster builds
module load ccache
module load ninja
# optional: just an additional text editor
module load nano
# optional: for PSATD in RZ geometry support (not yet available)
#module load blaspp
#module load lapackpp
export CMAKE_PREFIX_PATH=$HOME/sw/crusher/blaspp-master:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=$HOME/sw/crusher/lapackpp-master:$CMAKE_PREFIX_PATH
export LD_LIBRARY_PATH=$HOME/sw/crusher/blaspp-master/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/sw/crusher/lapackpp-master/lib64:$LD_LIBRARY_PATH
# optional: for QED lookup table generation support
#module load boost/1.78.0-cxx17
# optional: for openPMD support
module load adios2/2.8.1
module load cray-hdf5-parallel/1.12.1.5
# optional: for Python bindings or libEnsemble
module load cray-python/3.9.12.1
# fix system defaults: do not escape $ with a \ on tab completion
shopt -s direxpand
# an alias to request an interactive batch node for one hour
# for parallel execution, start on the batch node: srun <command>
alias getNode="salloc -A $proj -J warpx -t 01:00:00 -p batch -N 1 -c 8 --ntasks-per-node=8"
# an alias to run a command on a batch node for up to 30min
# usage: runNode <command>
alias runNode="srun -A $proj -J warpx -t 00:30:00 -p batch -N 1 -c 8 --ntasks-per-node=8"
# GPU-aware MPI
export MPICH_GPU_SUPPORT_ENABLED=1
# optimize HIP compilation for MI250X
export AMREX_AMD_ARCH=gfx90a
# compiler environment hints
export CC=$(which cc)
export CXX=$(which CC)
export FC=$(which ftn)
export CFLAGS="-I${ROCM_PATH}/include"
export CXXFLAGS="-I${ROCM_PATH}/include"
#export LDFLAGS="-L${ROCM_PATH}/lib -lamdhip64"
We recommend storing the above lines in a file, such as $HOME/crusher_warpx.profile, and loading it into your shell after login:
source $HOME/crusher_warpx.profile
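The getNode and runNode aliases are defined in double quotes, so $proj is expanded when the profile is sourced; if proj is still unset at that point, the aliases silently carry an empty account. A minimal sketch of a sanity check follows. The function name `check_warpx_env` and the choice of variables to check are assumptions for illustration.

```shell
# hypothetical helper: report profile variables that are still unset
check_warpx_env() {
    missing=""
    for var in proj AMREX_AMD_ARCH MPICH_GPU_SUPPORT_ENABLED CC CXX; do
        # indirect lookup of the variable whose name is stored in $var
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            missing="$missing $var"
        fi
    done
    if [ -n "$missing" ]; then
        echo "unset:$missing"
        return 1
    fi
    echo "ok"
}
```

Calling `check_warpx_env` right after sourcing the profile prints `ok` or lists the variables that still need to be set, for example because the `export proj=...` line is still commented out.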
Since Crusher does not yet provide modules for them, install BLAS++ and LAPACK++ manually:
# BLAS++ (for PSATD+RZ)
git clone https://github.com/icl-utk-edu/blaspp.git src/blaspp
rm -rf src/blaspp-crusher-build
cmake -S src/blaspp -B src/blaspp-crusher-build -Duse_openmp=OFF -Dgpu_backend=hip -DCMAKE_CXX_STANDARD=17 -DCMAKE_INSTALL_PREFIX=$HOME/sw/crusher/blaspp-master
cmake --build src/blaspp-crusher-build --target install --parallel 10
# LAPACK++ (for PSATD+RZ)
git clone https://github.com/icl-utk-edu/lapackpp.git src/lapackpp
rm -rf src/lapackpp-crusher-build
cmake -S src/lapackpp -B src/lapackpp-crusher-build -DCMAKE_CXX_STANDARD=17 -Dbuild_tests=OFF -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON -DCMAKE_INSTALL_PREFIX=$HOME/sw/crusher/lapackpp-master
cmake --build src/lapackpp-crusher-build --target install --parallel 10
Then, cd into the directory $HOME/src/warpx and use the following commands to compile:
cd $HOME/src/warpx
rm -rf build
cmake -S . -B build -DWarpX_DIMS=3 -DWarpX_COMPUTE=HIP
cmake --build build -j 10
The general cmake compile-time options apply as usual.
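As a hedged illustration of adding such options, the sketch below only prints a configure command so it can be reviewed before running. The helper name `warpx_configure_cmd` is hypothetical, and the option names `WarpX_PSATD` and `WarpX_OPENPMD` are assumed from the general WarpX build documentation; check them against the version you cloned.

```shell
# hypothetical helper: print the configure command with extra options appended
# usage: warpx_configure_cmd [extra -D options...]
warpx_configure_cmd() {
    # base command as used above
    cmd="cmake -S . -B build -DWarpX_DIMS=3 -DWarpX_COMPUTE=HIP"
    for opt in "$@"; do
        cmd="$cmd $opt"
    done
    echo "$cmd"
}

# dry run: print the command for a build with PSATD and openPMD enabled
warpx_configure_cmd -DWarpX_PSATD=ON -DWarpX_OPENPMD=ON
```

Once the printed line looks right, run it from $HOME/src/warpx and follow it with the same `cmake --build build -j 10` step.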
Running¶
MI250X GPUs (2x64 GB)¶
ECP WarpX project members, use the aph114 project ID.
After requesting an interactive node with the getNode alias above, run a simulation like this, here using 8 MPI ranks on a single node:
runNode ./warpx inputs
Or in non-interactive runs:
#!/usr/bin/env bash
#SBATCH -A <project id>
# note: WarpX ECP members use aph114
#SBATCH -J warpx
#SBATCH -o %x-%j.out
#SBATCH -t 00:10:00
#SBATCH -p batch
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
#SBATCH -N 1
# From the documentation:
# Each Crusher compute node consists of [1x] 64-core AMD EPYC 7A53
# "Optimized 3rd Gen EPYC" CPU (with 2 hardware threads per physical core) with
# access to 512 GB of DDR4 memory.
# Each node also contains [4x] AMD MI250X, each with 2 Graphics Compute Dies
# (GCDs) for a total of 8 GCDs per node. The programmer can think of the 8 GCDs
# as 8 separate GPUs, each having 64 GB of high-bandwidth memory (HBM2E).
# note (5-16-22, OLCFHELP-6888)
# this environment setting is currently needed on Crusher to work-around a
# known issue with Libfabric
#export FI_MR_CACHE_MAX_COUNT=0 # libfabric disable caching
# or, less invasive:
export FI_MR_CACHE_MONITOR=memhooks # alternative cache monitor
# note (9-2-22, OLCFDEV-1079)
# this environment setting is needed to prevent rocFFT from writing a cache to
# the home directory, which does not scale.
export ROCFFT_RTC_CACHE_PATH=/dev/null
export OMP_NUM_THREADS=8
srun ./warpx inputs > output.txt
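The script above starts one MPI rank per GCD. To use more nodes, only the `-N` value needs to change, since `--ntasks-per-node=8` stays the same. The following sketch computes the resulting rank count; the function name `warpx_total_ranks` is hypothetical and serves only as a quick check when sizing a run.

```shell
# hypothetical helper: MPI ranks for a given node count, one rank per GCD
# usage: warpx_total_ranks <number of nodes>
warpx_total_ranks() {
    gcds_per_node=8   # 4 MI250X per node, 2 GCDs each
    echo $(( ${1:-1} * $gcds_per_node ))
}

# inside a job, Slurm sets SLURM_JOB_NUM_NODES; default to 1 elsewhere
nodes="${SLURM_JOB_NUM_NODES:-1}"
echo "nodes: ${nodes}, MPI ranks: $(warpx_total_ranks "${nodes}")"
```

For example, `#SBATCH -N 4` with the settings above corresponds to 32 ranks, which is what `warpx_total_ranks 4` prints.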
Post-Processing¶
For post-processing, most users use Python via OLCF's Jupyter service (Docs).
Please follow the same guidance as for OLCF Summit post-processing.
Known System Issues¶
Warning
May 16th, 2022 (OLCFHELP-6888): There is a caching bug in Libfabric that causes WarpX simulations to occasionally hang on Crusher when running on more than one node.
As a work-around, please export the following environment variable in your job scripts until the issue is fixed:
#export FI_MR_CACHE_MAX_COUNT=0 # libfabric disable caching
# or, less invasive:
export FI_MR_CACHE_MONITOR=memhooks # alternative cache monitor
Warning
Sep 2nd, 2022 (OLCFDEV-1079): rocFFT in ROCm 5.1+ tries to write to a cache in the home area by default. This does not scale; disable it via:
export ROCFFT_RTC_CACHE_PATH=/dev/null