Running on specific platforms¶
Running on Cori KNL at NERSC¶
The batch script below can be used to run a WarpX simulation on 2 KNL nodes on the supercomputer Cori at NERSC. Replace descriptions between chevrons <> by relevant values; for instance, <job name> could be laserWakefield.
#!/bin/bash -l
# Copyright 2019 Maxence Thevenet
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -q regular
#SBATCH -C knl
#SBATCH -S 4
#SBATCH -J <job name>
#SBATCH -A <allocation ID>
#SBATCH -e error.txt
#SBATCH -o output.txt
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
# KNLs have 4 hyperthreads max
export CORI_MAX_HYPERTHREAD_LEVEL=4
# We use 64 cores out of the 68 available on Cori KNL,
# and leave 4 to the system (see "#SBATCH -S 4" above).
export CORI_NCORES_PER_NODE=64
# Typically use 8 MPI ranks per node without hyperthreading,
# i.e., OMP_NUM_THREADS=8
export WARPX_NMPI_PER_NODE=8
export WARPX_HYPERTHREAD_LEVEL=1
# Compute OMP_NUM_THREADS and the thread count (-c option)
export CORI_NHYPERTHREADS_MAX=$(( ${CORI_MAX_HYPERTHREAD_LEVEL} * ${CORI_NCORES_PER_NODE} ))
export WARPX_NTHREADS_PER_NODE=$(( ${WARPX_HYPERTHREAD_LEVEL} * ${CORI_NCORES_PER_NODE} ))
export OMP_NUM_THREADS=$(( ${WARPX_NTHREADS_PER_NODE} / ${WARPX_NMPI_PER_NODE} ))
export WARPX_THREAD_COUNT=$(( ${CORI_NHYPERTHREADS_MAX} / ${WARPX_NMPI_PER_NODE} ))
# for async_io support: (optional)
export MPICH_MAX_THREAD_SAFETY=multiple
srun --cpu_bind=cores -n $(( ${SLURM_JOB_NUM_NODES} * ${WARPX_NMPI_PER_NODE} )) -c ${WARPX_THREAD_COUNT} <path/to/executable> <input file>
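For reference, with the values above the arithmetic resolves as follows (a worked example, not part of the script):
# CORI_NHYPERTHREADS_MAX  = 4 * 64  = 256
# WARPX_NTHREADS_PER_NODE = 1 * 64  = 64
# OMP_NUM_THREADS         = 64 / 8  = 8   (threads per MPI rank)
# WARPX_THREAD_COUNT      = 256 / 8 = 32  (-c option, counted in hyperthread slots)
# so on 2 nodes the srun line expands to:
# srun --cpu_bind=cores -n 16 -c 32 <path/to/executable> <input file>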
To run a simulation, copy the lines above to a file batch_cori.sh and run sbatch batch_cori.sh to submit the job.
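Standard SLURM commands can then be used to follow the job, for example:
squeue -u $USER      # list your queued and running jobs
scancel <job ID>     # cancel the job if needed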
For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on Cori KNL for a well load-balanced problem (in our case, a laser wakefield acceleration simulation in a boosted frame in the quasi-linear regime), the following set of parameters provided good performance:
- amr.max_grid_size=64 and amr.blocking_factor=64, so that the size of each grid is fixed to 64**3 (we are not using load balancing here); see the input file sketch below
- 8 MPI ranks per KNL node, with OMP_NUM_THREADS=8 (that is 64 threads per KNL node, i.e., 1 thread per physical core, and 4 cores left to the system)
- 2 grids per MPI rank, i.e., 16 grids per KNL node
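These grid parameters are set in the WarpX input file; a minimal sketch is given below (the amr.n_cell values are purely illustrative, not from a benchmarked run):
amr.n_cell = 128 128 128    # illustrative; each direction a multiple of 64
amr.max_grid_size = 64      # no grid exceeds 64 cells in any direction
amr.blocking_factor = 64    # every grid side is a multiple of 64
# together, these fix each grid to exactly 64**3 cells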
Running on Cori Haswell at NERSC¶
The batch script below can be used to run a WarpX simulation on 1 Haswell node on the supercomputer Cori at NERSC.
#!/bin/bash -l
# Just increase this number if you need more nodes.
#SBATCH -N 1
#SBATCH -t 03:00:00
#SBATCH -q regular
#SBATCH -C haswell
#SBATCH -J <job name>
#SBATCH -A <allocation ID>
#SBATCH -e error.txt
#SBATCH -o output.txt
# one MPI rank per half-socket (see below)
#SBATCH --tasks-per-node=4
# request all logical (virtual) cores per half-socket
#SBATCH --cpus-per-task=16
# each Cori Haswell node has 2 sockets of Intel Xeon E5-2698 v3
# each Xeon CPU is divided into 2 bus rings that each have direct L3 access
export WARPX_NMPI_PER_NODE=4
# each MPI rank per half-socket has 8 physical cores
# or 16 logical (virtual) cores
# over-subscribing each physical core with 2x
# hyperthreading leads to a slight (3.5%) speedup
# the settings below make sure threads are close to the
# controlling MPI rank (process) per half socket and
# distribute equally over close-by physical cores and,
# for N>8, also equally over close-by logical cores
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=16
# for async_io support: (optional)
export MPICH_MAX_THREAD_SAFETY=multiple
EXE="<path/to/executable>"
srun --cpu_bind=cores -n $(( ${SLURM_JOB_NUM_NODES} * ${WARPX_NMPI_PER_NODE} )) ${EXE} <input file>
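For a single-node job (#SBATCH -N 1), the srun line above resolves to (a worked example):
# srun --cpu_bind=cores -n 4 <path/to/executable> <input file>
# i.e., 4 MPI ranks, each running OMP_NUM_THREADS=16 threads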
To run a simulation, copy the lines above to a file batch_cori_haswell.sh and run sbatch batch_cori_haswell.sh to submit the job.
For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on Cori Haswell for a well load-balanced problem (in our case, a laser wakefield acceleration simulation in a boosted frame in the quasi-linear regime), the following set of parameters provided good performance:
- 4 MPI ranks per Haswell node (2 MPI ranks per Intel Xeon E5-2698 v3), with OMP_NUM_THREADS=16 (which uses 2x hyperthreading)
Running on Summit at OLCF¶
V100 GPUs¶
The batch script below can be used to run a WarpX simulation on 2 nodes on the supercomputer Summit at OLCF. Replace descriptions between chevrons <> by relevant values; for instance, <input file> could be plasma_mirror_inputs. Note that the only option so far is to run with one MPI rank per GPU.
#!/bin/bash
# Copyright 2019-2020 Maxence Thevenet, Axel Huebl
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL
#
# Refs.:
# https://jsrunvisualizer.olcf.ornl.gov/?s4f0o11n6c7g1r11d1b1l0=
# https://docs.olcf.ornl.gov/systems/summit_user_guide.html#cuda-aware-mpi
#BSUB -P <allocation ID>
#BSUB -W 00:10
#BSUB -nnodes 2
#BSUB -alloc_flags smt4
#BSUB -J WarpX
#BSUB -o WarpXo.%J
#BSUB -e WarpXe.%J
module load gcc
module load cuda
export OMP_NUM_THREADS=1
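# jsrun flags, in brief (cf. the jsrunvisualizer link above):
#   -r 6: six resource sets per node, one per GPU
#   -a 1 -g 1 -c 7: per resource set, one MPI rank, one GPU, seven CPU cores
#   -l GPU-CPU -d packed -b rs: latency priority, packed layout, bind to resource set
#   --smpiargs="-gpu": enable CUDA-aware Spectrum MPI (see the OLCF docs above)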
jsrun -r 6 -a 1 -g 1 -c 7 -l GPU-CPU -d packed -b rs --smpiargs="-gpu" <path/to/executable> <input file> > output.txt
To run a simulation, copy the lines above to a file batch_summit.sh and run bsub batch_summit.sh to submit the job.
For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on Summit for a well load-balanced problem (in our case, a laser wakefield acceleration simulation in a boosted frame in the quasi-linear regime), the following set of parameters provided good performance:
- amr.max_grid_size=256 and amr.blocking_factor=128 (see the note below)
- One MPI rank per GPU (e.g., 6 MPI ranks for the 6 GPUs on each Summit node)
- Two `128x128x128` grids per GPU, or one `128x128x256` grid per GPU
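In AMReX, amr.max_grid_size caps the grid size at 256 cells per direction, while amr.blocking_factor requires every grid side to be a multiple of 128; together these yield the 128-based grid shapes listed above.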
A batch script with more options regarding profiling on Summit can be found at Summit batch script.
Power9 CPUs¶
Similar to above, the batch script below can be used to run a WarpX simulation on 1 node on the supercomputer Summit at OLCF, on Power9 CPUs (i.e., the GPUs are ignored).
#!/bin/bash
# Copyright 2019-2020 Maxence Thevenet, Axel Huebl, Michael Rowan
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL
#
# Refs.:
# https://jsrunvisualizer.olcf.ornl.gov/?s1f0o121n2c21g0r11d1b1l0=
#BSUB -P <allocation ID>
#BSUB -W 00:10
#BSUB -nnodes 1
#BSUB -alloc_flags "smt1"
#BSUB -J WarpX
#BSUB -o WarpXo.%J
#BSUB -e WarpXe.%J
export OMP_NUM_THREADS=21
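# jsrun flags, in brief: two resource sets in total (-n 2), both on one node (-r 2),
# each with one MPI rank (-a 1) and 21 CPU cores (-c 21), bound to the
# resource set (-b rs); no GPUs are requested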
jsrun -n 2 -a 1 -c 21 -r 2 -l CPU-CPU -d packed -b rs <path/to/executable> <input file> > output.txt
For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on Summit for a well load-balanced problem, the following set of parameters provided good performance:
- amr.max_grid_size=64 and amr.blocking_factor=64
- Two MPI ranks per node (i.e., 2 resource sets per node; equivalently, 1 resource set per socket)
- 21 physical CPU cores per MPI rank
- 21 OpenMP threads per MPI rank (i.e., 1 OpenMP thread per physical core)
- SMT 1 (Simultaneous Multithreading level 1)
- Sixteen `64x64x64` grids per MPI rank (with default tiling in WarpX, this results in ~49 tiles per OpenMP thread; see the arithmetic below)
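The ~49 tiles per thread can be understood as follows, assuming AMReX's default tile size of 8x8 cells in the y-z plane (tiles span the full x extent): each 64x64x64 grid then holds 8*8 = 64 tiles, so 16 grids * 64 tiles = 1024 tiles per MPI rank, and 1024 / 21 threads ≈ 49 tiles per thread.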
Running on Lassen at LLNL¶
V100 GPUs¶
The batch script below can be used to run a WarpX simulation on 2 nodes on the supercomputer Lassen at LLNL. Replace descriptions between chevrons <> by relevant values; for instance, <input file> could be plasma_mirror_inputs. Note that the only option so far is to run with one MPI rank per GPU.
#!/bin/bash
# Copyright 2020 Axel Huebl
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL
#
# Refs.:
# https://jsrunvisualizer.olcf.ornl.gov/?s4f0o11n6c7g1r11d1b1l0=
# https://hpc.llnl.gov/training/tutorials/using-lcs-sierra-system#quick16
#BSUB -G <allocation ID>
#BSUB -W 00:10
#BSUB -nnodes 2
#BSUB -alloc_flags smt4
#BSUB -J WarpX
#BSUB -o WarpXo.%J
#BSUB -e WarpXe.%J
export OMP_NUM_THREADS=1
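# jsrun flags, in brief: four resource sets per node (-r 4), one per GPU on Lassen,
# each with one MPI rank (-a 1), one GPU (-g 1) and seven CPU cores (-c 7);
# -M "-gpu" passes "-gpu" to Spectrum MPI to enable CUDA-aware MPI
# (the same effect as --smpiargs="-gpu" in the Summit script above)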
jsrun -r 4 -a 1 -g 1 -c 7 -l GPU-CPU -d packed -b rs -M "-gpu" <path/to/executable> <input file> > output.txt
To run a simulation, copy the lines above to a file batch_lassen.sh and run bsub batch_lassen.sh to submit the job.
For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell solver on V100 GPUs for a well load-balanced problem (in our case, a laser wakefield acceleration simulation in a boosted frame in the quasi-linear regime), the following set of parameters provided good performance:
- amr.max_grid_size=256 and amr.blocking_factor=128
- One MPI rank per GPU (e.g., 4 MPI ranks for the 4 GPUs on each Lassen node)
- Two `128x128x128` grids per GPU, or one `128x128x256` grid per GPU
Running on Quartz at LLNL¶
Intel Xeon E5-2695 v4 CPUs¶
The batch script below can be used to run a WarpX simulation on 2 nodes on the supercomputer Quartz at LLNL. Replace descriptions between chevrons <> by relevant values; for instance, <input file> could be plasma_mirror_inputs.
#!/bin/bash -l
# Just increase this number if you need more nodes.
#SBATCH -N 2
#SBATCH -t 24:00:00
#SBATCH -A <allocation ID>
#SBATCH -J WarpX
#SBATCH -q pbatch
#SBATCH --qos=normal
#SBATCH --license=lustre1,lustre2
#SBATCH --export=ALL
#SBATCH -e error.txt
#SBATCH -o output.txt
# one MPI rank per half-socket (see below)
#SBATCH --tasks-per-node=2
# request all logical (virtual) cores per half-socket
#SBATCH --cpus-per-task=18
# each Quartz node has 1 socket of Intel Xeon E5-2695 v4
# each Xeon CPU is divided into 2 bus rings that each have direct L3 access
export WARPX_NMPI_PER_NODE=2
# each MPI rank per half-socket has 9 physical cores
# or 18 logical (virtual) cores
# over-subscribing each physical core with 2x
# hyperthreading led to a slight (3.5%) speedup on Cori's Intel Xeon E5-2698 v3,
# so we do the same here
# the settings below make sure threads are close to the
# controlling MPI rank (process) per half socket and
# distribute equally over close-by physical cores and,
# for N>9, also equally over close-by logical cores
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=18
EXE="<path/to/executable>" # e.g. ./warpx
srun --cpu_bind=cores -n $(( ${SLURM_JOB_NUM_NODES} * ${WARPX_NMPI_PER_NODE} )) ${EXE} <input file>
To run a simulation, copy the lines above to a file batch_quartz.sh and run sbatch batch_quartz.sh to submit the job.